
== Examples ==
That was more than enough code to get me started. Here are the templates I used for testing. For each comic I extract an image, alt text, next and previous links, and an id; the links are understood by the fetcher. ''url'' is a pattern describing which pages a template applies to; a MySQL ''LIKE'' handles the matching for now, hence the '%'s. So far I prefer the xpath queries: they seem quite robust for this purpose.

{{{
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
| url                                      | type     | pattern                                           | meaning        | format      |
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@src                  | comic:img      | %(makeurl)s |
| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@alt                  | comic:alt      | NULL        |
| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s |
| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Back"]]/@href                       | comic:previous | %(makeurl)s |
| http://www.penny-arcade.com/comic/%      | xpath    | //input[@name="Date"]/@value                      | comic:id       | NULL        |
| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Next"]/@href)[1]                     | comic:next     | %(makeurl)s |
| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Previous"]/@href)[1]                 | comic:previous | %(makeurl)s |
| http://questionablecontent.net/view.php% | xpath    | //center/img[starts-with(@src, "./comics/")]/@src | comic:img      | %(makeurl)s |
| http://questionablecontent.net/view.php% | urlregex | (\d+)$                                            | comic:id       | NULL        |
| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@src            | comic:img      | %(makeurl)s |
| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@alt            | comic:alt      | NULL        |
| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s |
| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Previous"]]/@href                   | comic:previous | %(makeurl)s |
| http://sinfest.net/archive_page.php%     | urlregex | (\d+)$                                            | comic:id       | NULL        |
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
}}}
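As a rough sketch of how these rows get applied (the sample page and URL are invented; the standard library's ElementTree only handles a subset of XPath, so the actual scraper would want something like lxml for the queries above):

{{{
#!python
import re
import xml.etree.ElementTree as ET

# Invented stand-ins for a fetched page and its URL.
url = 'http://questionablecontent.net/view.php?comic=1050'
html = '<html><body><center><img src="./comics/1050.png"/></center></body></html>'

# An xpath row: ElementTree cannot return @src directly, so select
# the element and read the attribute off it.
doc = ET.fromstring(html)
img_value = doc.find('.//center/img').get('src')

# A urlregex row: the pattern is matched against the page's URL itself.
match = re.search(r'(\d+)$', url)
id_value = match.group(1) if match else None
}}}

Each extracted value would then be stored as a (url, meaning, value) tuple.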

The results? It is working exactly as I expected: I seeded a few URLs for each comic, alternated running the fetcher and the scraper, and my collection of structured data about these comics grew.

== Presentation ==
For this to be useful, I need to be able to easily produce appealing reports of the data. My first attempt produces RSS feeds with [http://genshi.edgewall.org/ genshi].

My no-nonsense template looks like this:
{{{
#!xml
<rss version="2.0"
     xmlns:py="http://genshi.edgewall.org/">
  <channel>
    <title>${title}</title>
    <py:for each="item in items">
      <item>
        <title>${item.alt}</title>
        <description>&lt;img src="${item.img}" alt="${item.alt}" /&gt;</description>
        <guid>${item.url}#${item.id}</guid>
      </item>
    </py:for>
  </channel>
</rss>
}}}

The program to put everything together looks like this:
{{{
#!python
import sys

import MySQLdb
from genshi.template import TemplateLoader

if len(sys.argv) != 3:
    print 'Usage: %s urlpattern title' % sys.argv[0]
    sys.exit(1)
(urlpattern, title) = sys.argv[1:]

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()
fields = db.cursor()

# One row per comic: the comic:id tuples anchor each item.
cursor.execute('SELECT url FROM data WHERE meaning="comic:id" AND url LIKE %s ORDER BY value DESC', (urlpattern,))

items = []
for (url,) in cursor:
    # Gather every comic:* tuple for this url into one item dict.
    fields.execute('SELECT meaning, value FROM data WHERE url=%s AND meaning LIKE "comic:%%"', (url,))
    item = {'url': url}
    for (meaning, value) in fields:
        item[meaning.split(':', 1)[1]] = unicode(value, 'utf8')
    items.append(item)

loader = TemplateLoader('.')
template = loader.load('comicrss.xml')
stream = template.generate(title=title, items=items)
print stream.render('xml')
}}}

The script works under the assumption that there is one ''comic:id'' per comic, and that any other tuples for a URL with a ''comic:id'' are data relevant to that comic. The template binds to the specific fields it cares about and produces an RSS XML document.
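
To make that assumption concrete, here is a toy sketch of the tuple layout using an in-memory SQLite table (the schema is inferred from the script above; the real store is MySQL, and the sample rows are invented):

{{{
#!python
import sqlite3

# Every scraped fact is a (url, meaning, value) tuple; the comic:id
# row anchors an item, and the other comic:* rows for the same url
# fill in its fields.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE data (url TEXT, meaning TEXT, value TEXT)')
db.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    ('http://sinfest.net/archive_page.php?comicID=50', 'comic:id',  '50'),
    ('http://sinfest.net/archive_page.php?comicID=50', 'comic:img', '/comics/50.gif'),
    ('http://sinfest.net/archive_page.php?comicID=50', 'comic:alt', 'Some alt text'),
])

(url,) = db.execute("SELECT url FROM data WHERE meaning = 'comic:id'").fetchone()
item = {'url': url}
for meaning, value in db.execute('SELECT meaning, value FROM data WHERE url = ?', (url,)):
    item[meaning.split(':', 1)[1]] = value
}}}

After this runs, ''item'' holds everything the RSS template needs for one comic.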