Changes between Version 3 and Version 4 of Projects/Scraper

Timestamp: Dec 28, 2007, 4:09:13 PM
Author: Cory McWilliams
Comment: --

Projects/Scraper

-              v3
+              v4
 The script works under the assumption that there is one ''comic:id'' per comic, and that any other tuples for a URL with a ''comic:id'' are data relevant to that comic.  The template binds to the specific fields it cares about and produces an RSS XML document.
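
A minimal sketch of that assumption, written for Python 3. The tuple layout `(url, predicate, value)`, the predicates other than ''comic:id'', the example URLs, and the function name `group_by_comic` are all hypothetical and only illustrate the grouping; the actual script and DB schema may differ.

```python
from collections import defaultdict

# Hypothetical (url, predicate, value) tuples standing in for rows from the DB.
tuples = [
    ("http://example.com/comic/1", "comic:id", "example"),
    ("http://example.com/comic/1", "comic:image", "http://example.com/img/1.png"),
    ("http://example.com/comic/1", "comic:title", "First"),
    ("http://example.com/about", "page:title", "About"),
]


def group_by_comic(tuples):
    """Collect the fields of every URL that carries a comic:id tuple."""
    by_url = defaultdict(dict)
    for url, predicate, value in tuples:
        by_url[url][predicate] = value
    # URLs without a comic:id are not treated as comic data.
    return {url: fields for url, fields in by_url.items() if "comic:id" in fields}


comics = group_by_comic(tuples)
```

Each resulting dictionary holds the fields a template could bind to when producing an RSS item.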
+== Thoughts on Improvements ==
+I know this thing has a lot of shortcomings, but I think it is well on its way to being what I want.  Here are some of the things I have in mind at the moment.
+ * There should be a web interface for manipulating templates, files, and fetching.  Editing the DB contents by hand is far from ideal, and a web interface could show results very clearly.
+ * Pages that change haven't been accounted for.  For example, the page for the latest comic might not have a ''next'' link until the next comic is available.  The fetcher needs to know to re-fetch that page in those circumstances.
+ * Comic images should be fetched and referenced locally.
+ * The fetcher should be rate-limited.  I currently run it only periodically, so that it fetches just one or two pages per site per run, but a limit should be built in so that it doesn't hammer sites.
+ * The scripts should have a common configuration instead of hardcoded DB connection data in each one.
+ * This needs to be tested with many more comics.
+ * This needs to be tested with something that is entirely unlike comics.
+ * Genshi works great for templating in this specific case, but it might be preferable to allow user-defined templates, which might require a sandboxable template system.
+ * I should learn how badly I'm butchering RDF concepts.
+ * Document templates need to be decoupled from the program which generates documents from them.
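
The rate-limiting item above could be approached with a per-host minimum delay between requests. The following is only a sketch in Python 3, not the fetcher's current behavior: the class name `HostRateLimiter`, the 5-second default, and the injectable `clock` and `sleep` parameters are assumptions made for illustration.

```python
import time
from urllib.parse import urlparse


class HostRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on one host
        self._clock = clock               # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = {}                   # host -> time of the last request

    def wait(self, url):
        """Sleep if needed before fetching url; return the delay applied."""
        host = urlparse(url).netloc
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last[host] = self._clock()
        return delay
```

A fetch loop would call `limiter.wait(url)` immediately before each request, so that requests to different sites are not delayed by one another.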
+[[AddComment]]