zerosum dirt(nap)

evolution through a series of accidents

An Introduction to Scraping with Hpricot

June 30, 2007 by nap

For one of my hobby projects, I’ve been building a comic book release schedule webapp in Ruby. Obviously, a large part of that involves locating data sources for comic book publishers and importing them. Unfortunately, none of the major publishers have seen fit to make their release schedules available in RSS or Atom, or any other structured format for that matter. Sigh.

Fortunately, all is not lost. With the Hpricot gem and a little scraping know-how, we can overcome almost any parsing obstacle, as long as the data is in a somewhat predictably arranged state. Let’s see how it works…

For our example, we’ll consider DC Comics, home of Superman, Batman, Aquaman, and… Super-Chief (Apache Chief? No, he’s different). DC makes their weekly release schedule available through their website at http://www.dccomics.com/comics/. That’s nice and convenient. But it’d certainly be more convenient if they had a feed available. (If they do have a feed available, hidden deep within their website, and I haven’t found it, please let me know!)

As we click through to next/previous weeks, it becomes pretty clear that the dat= parameter controls which listing you get. Note that they display a month at a time, so all you really have to do is ask for the first of the month (a dat value ending in 01, as you’ll see in the code below) every time. We’re going to build a little scraper that just grabs the current month’s books, but armed with the knowledge of how this works, you should find grabbing 3-4 months’ worth of books at a time to be no challenge whatsoever (comic book publishers usually solicit about 3 months in advance).
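If you do want to pull several months at once, the dat values are easy to generate from Ruby’s Date class. Here’s a minimal sketch, assuming the dat=YYMM01 format (the same one the scraper code below produces with strftime):

require 'date'

# one dat value per month: the current month plus the next three
# (Date#>> advances a date by n months)
(0..3).map { |i| (Date.today >> i).strftime('%y%m01') }
# => ["070601", "070701", "070801", "070901"], for example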

OK. So now let’s take a peep at the structure of the document itself. We can do this by just viewing source in a browser. It seems that every comic listed in the release schedule has a link to a full description of the issue, with a cover art preview, a short synopsis, writers/artists listed, etc. And every one of those links seems to have a CSS class of ‘contentLink’. Oh, lucky day.

This is certainly starting to smell like a job for Hpricot, the super fast (and delightful!) HTML parser for Ruby, written by the enigmatic why the lucky stiff. Gem install that sucker!

gem install hpricot

Now let’s fire up irb and chew on some delicious Ruby syntax:

require 'rubygems'
require 'hpricot'
require 'open-uri'

URL_DC = "http://www.dccomics.com/comics/"

def read_comic(title, url)
  puts "#{title} - #{url}"
end

doc = Hpricot(open("#{URL_DC}?dat=#{Time.now.strftime('%y%m01')}"))
books = (doc/"a.contentLink")
books.each { |book| read_comic(book.innerHTML.strip,
  "#{URL_DC}#{book.attributes['href']}") }

Run that, and you’ll get a list of stuff that looks like this:

THE ALL-NEW ATOM #12 - http://www.dccomics.com/comics/?cm=7447
BATMAN: TURNING POINTS - http://www.dccomics.com/comics/?cm=7251

Each output line lists a title and a URL for each comic solicited in the given month. How does it work? Well, first we open the URL and feed it into Hpricot. Then the line books = (doc/"a.contentLink") uses a CSS selector to yank out just the elements that match. We could also have used XPath-style syntax to accomplish the same thing. Anyway, the elements we’re selecting are all the links to comics being released this month. Hpricot hands us an array of these elements, and we iterate over it, calling the read_comic function and passing it the title (the innerHTML of the link, stripped of excess whitespace) and the URL (an absolute URL built from the link’s href attribute).
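As an aside, both selector styles really do come out the same. Here’s a miniature demonstration against a hypothetical fragment shaped like the markup we saw when viewing source (the href is borrowed from the output above):

fragment = Hpricot(%{<a class="contentLink" href="?cm=7447">THE ALL-NEW ATOM #12</a>})

(fragment/"a.contentLink").first.innerHTML              # CSS-style
(fragment/"//a[@class='contentLink']").first.innerHTML  # XPath-style
# both return "THE ALL-NEW ATOM #12"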

Next, let’s beef up the read_comic function to do something useful. We’ll have it make another remote connection, this time to the URL for the detailed comic description, parse out the talent, the description, and some other information about the issue, and stuff it all into a model object. But first, let’s examine the source of one of those pages. The Trials of Shazam! #7 should do nicely.

We note in our examination of the page source that the data we want to scrape is all contained in span tags with different classes, as shown below. Note that this time we’ve chosen to use XPath-style syntax for the selectors. Note also that the span tag with class="display_copy" appears twice: the first contains what appears to be the description of the issue, and the second lists the publication date. So instead of returning a single element, the display_copy selector gives us an Array of 2 (or possibly more) elements.

def read_comic(title, url)
  doc = Hpricot(open(url))
  display_talent = (doc/"span[@class=display_talent]").innerHTML
  display_copy = (doc/"span[@class=display_copy]") # 2 elements
  puts "====="
  puts "title: #{title}"
  puts "talent: #{display_talent}"
  puts "copy (0): #{display_copy[0].innerHTML}"
  puts "copy (1): #{display_copy[1].innerHTML}"
end

Now we’re iterating through each book from the remote source and dumping out its title, the writer and artist responsible for it, a quick synopsis, and some other information (publication date, etc.). Alright. If we just had a Comic model in our application, we could be somewhere!

So let’s make one. In fact, let’s do it in Ruby, with ActiveRecord. First the schema:

DROP DATABASE IF EXISTS comics;
CREATE DATABASE comics;
USE comics;

CREATE TABLE comics (
  id int(11) NOT NULL AUTO_INCREMENT,
  name VARCHAR(255),
  publisher VARCHAR(255),
  talent VARCHAR(255),
  description TEXT,
  published_on DATETIME,
  PRIMARY KEY (id)
);
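
If you save that schema to a file (say comics.sql, a name picked here just for illustration), the stock mysql client will load it for you:

mysql -u root < comics.sql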

Once that’s loaded, add the following code to the top of your comics scraper. In fact, put it all in a file called comics.rb so you can execute it from the command line.

require 'rubygems'
require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :username => 'root',
  :password => '',
  :database => 'comics')

class Comic < ActiveRecord::Base
end

Now we’ve established a connection to the database via ActiveRecord and defined a Comic model that inherits from ActiveRecord::Base, thus wrapping our database schema and giving us some handy getters and setters. Our next step will be to trade in the read_comic function in favor of an import class method on the Comic model.
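First, though, just to see what the wrapper buys us: every column in the comics table becomes an attribute on the model, with no further declarations needed. The titles below are borrowed from the earlier output:

c = Comic.new(:name => "BATMAN: TURNING POINTS", :publisher => "DC")
c.name       # => "BATMAN: TURNING POINTS"
c.publisher  # => "DC"
c.save       # issues the INSERT for us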

class Comic < ActiveRecord::Base
  def self.import(title, url)
    doc = Hpricot(open(url))
    display_talent = (doc/"span[@class=display_talent]").innerHTML
    display_copy = (doc/"span[@class=display_copy]") # 2 elements

    comic = Comic.new(:name => title)
    comic.publisher = "DC"
    comic.talent = display_talent
    comic.description = display_copy[0].innerHTML
    comic.published_on = Date.parse(display_copy[1].innerHTML.
      sub('on sale', ''))

    comic
  end
end

When Comic.import receives a title and a URL, it opens a connection to that URL and fires up Hpricot. It uses Hpricot to parse out the information we’re looking for, and then instantiates an instance of the Comic class. We set the talent and the description (the first of the display_copy spans), and then parse the date out of the second display_copy span.
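For a quick sanity check, you can drive the new method by hand from irb, using one of the detail URLs we saw in the earlier output:

comic = Comic.import("THE ALL-NEW ATOM #12", "#{URL_DC}?cm=7447")
comic.talent        # the writers/artists string
comic.published_on  # a Date parsed out of the 'on sale' copy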

We’ll remove all the output from there and put it in the book loop, since it’s clearly not the job of the model code to be rendering a view of any sort. Our new book loop will use Comic.import on each element of the books Array, creating the model, saving it, and then printing out some attributes. Here’s the final code for comics.rb:

require 'rubygems'
require 'active_record'
require 'open-uri'
require 'hpricot'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :username => 'root',
  :password => '',
  :database => 'comics')

URL_DC = "http://www.dccomics.com/comics/"

class Comic < ActiveRecord::Base
  def self.import(title, url)
    doc = Hpricot(open(url))
    display_talent = (doc/"span[@class=display_talent]").innerHTML
    display_copy = (doc/"span[@class=display_copy]") # 2 elements

    comic = Comic.new(:name => title)
    comic.publisher = "DC"
    comic.talent = display\_talent
    comic.description = display_copy[0].innerHTML
    comic.published_on = Date.parse(display_copy[1].innerHTML.
      sub('on sale', ''))

    comic
  end
end

doc = Hpricot(open("#{URL_DC}?dat=#{Time.now.strftime('%y%m01')}"))
books = (doc/"a.contentLink")
books.each do |book|
  comic = Comic.import(book.innerHTML.strip, 
    "#{URL_DC}#{book.attributes['href']}")
  if comic.save
    puts "====="
    puts "name: #{comic.name}"
    puts "description: #{comic.description}"
    puts "release date: #{comic.published\_on}"
  else
    puts "uh-oh! we should handle errors!"
  end
end

And here’s the final result:

=====
name: TRIALS OF SHAZAM! #7 (OF 12)
description: Freddy must find Hercules for his next trial,
which is considerably more difficult than he expected,
since Herc is behind bars!
release date: 2007-06-13

Obviously we can do a lot more with this. We can build a series model that has_many issues or episodes. We can build a publisher model. We can suck in the images and use RMagick to generate thumbnails. We can discriminate between graphic novels, trade paperbacks, and issues of a standard series book. We can roll this into a Rails application and allow the results to be browsed, let users add comics to their pull lists, create collections, comment on them, rate them, and so on. Actually, that’s exactly what I’m working on for my hobby project (if you’re interested, email me and I’ll let you take a look; I’m hoping to release it relatively soon-ish).
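For instance, the series idea might start out as nothing more than this (model names hypothetical, and it assumes a series_id column on the comics table):

class Series < ActiveRecord::Base
  has_many :comics
end

class Comic < ActiveRecord::Base
  belongs_to :series
end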

To go further with scraping, we’ll need to pay particular attention to error handling, because scraping is an inexact science: with no hard format contract, things are subject to change or break in weird ways. That’s the obvious downside to scraping. But when you have no alternative for automating mass import of data, as in this scenario, it’s certainly a good thing to know how to do.
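A minimal sketch of that, wrapping the book loop so one bad page doesn’t kill the whole run (OpenURI::HTTPError is what open-uri raises on a failed fetch; everything else falls through to the generic rescue):

books.each do |book|
  begin
    comic = Comic.import(book.innerHTML.strip,
      "#{URL_DC}#{book.attributes['href']}")
    puts "couldn't save #{comic.name}" unless comic.save
  rescue OpenURI::HTTPError => e
    puts "fetch failed for #{book.attributes['href']}: #{e.message}"
  rescue => e
    # a selector came back empty, the page layout changed, etc.
    puts "parse failed for #{book.innerHTML.strip}: #{e.message}"
  end
end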

If you want to learn more, _why’s Hpricot site is chock full of useful information, and you may also want to check out scRUBYt, which combines Hpricot and WWW::Mechanize into a full-on web scraping “toolkit”.
