zerosum dirt(nap)

evolution through a series of accidents

An Introduction to Scraping with Hpricot

June 30, 2007 @ 03:49 PM by nap · 0 comments

For one of my hobby projects, I've been building a comic book release schedule webapp in Ruby. Obviously, a large part of that involves locating data sources for comic book publishers and importing those sources. Unfortunately, none of the major publishers have seen fit to make their release schedules available in RSS or Atom or any other structured format for that matter. Sigh.

Fortunately, all is not lost. With the Hpricot gem and a little scraping know-how, we can overcome almost any parsing obstacle, as long as the data is in a somewhat predictably arranged state. Let's see how it works...

For our example, we'll consider DC Comics, home of Superman, Batman, Aquaman, and... Super-Chief (Apache Chief? No, he's different). DC makes their weekly release schedule available through their website at this URL. That's nice and convenient. But it'd certainly be more convenient if they had a feed available. (If they *do* have a feed available, hidden deep within their website, and I haven't found it, please let me know!)

As we click through to next/previous weeks, it becomes pretty clear that passing the dat= parameter gives you the appropriate listing. Note that they display a month at a time, so all you really have to do is ask for the first of the month (a dat= value ending in 01) every time. We're going to build a little scraper that just grabs the current month's books, but armed with the knowledge of how this works, you should find grabbing 3-4 months' worth of books at a time to be no challenge whatsoever (comic book publishers usually solicit about 3 months in advance).
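To make the multi-month idea concrete, here's a rough sketch of how you might build the listing URLs for the current month plus the next couple. The schedule_urls helper is made up for illustration (it isn't part of Hpricot or the DC site), and it assumes dat= keeps the same yymm01 shape that the scraper in this post uses.

```ruby
require 'date'

# Hypothetical helper: returns one listing URL per month, starting with the
# month that contains `start`. Assumes dat= takes a yymm01 value.
def schedule_urls(base_url, start = Date.today, months = 3)
  first_of_month = Date.new(start.year, start.month, 1)
  (0...months).map do |offset|
    # Date#>> advances by whole months and rolls the year over for us
    "#{base_url}?dat=#{(first_of_month >> offset).strftime('%y%m01')}"
  end
end

puts schedule_urls("http://www.dccomics.com/comics/", Date.new(2007, 11, 15), 3)
```

Each of those URLs could then be fed through the same scraping loop, one month at a time.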

OK. So now let's take a peep at the structure of the document itself. We can do this by just viewing source in a browser. It seems that every comic listed in the release schedule has a link to a full description of the issue, with a cover art preview, a short synopsis, writers/artists listed, etc. And every one of those links seems to have a CSS class of 'contentLink'. Oh, lucky day.

This is certainly starting to smell like a job for Hpricot, the super fast (and delightful!) HTML parser for Ruby, written by the enigmatic why the lucky stiff. Gem install that sucker!

gem install hpricot

Now let's fire up IRb and chew on some delicious Ruby syntax:

require 'hpricot'
require 'open-uri'

URL_DC = "http://www.dccomics.com/comics/"

# Define read_comic first so it exists by the time the loop calls it
def read_comic(title, url)
  puts "#{title} - #{url}"
end

doc = Hpricot(open("#{URL_DC}?dat=#{Time.now.strftime('%y%m01')}"))
books = (doc/"a.contentLink")
books.each { |book| read_comic(book.innerHTML.strip,
  "#{URL_DC}#{book.attributes['href']}") }

Run that, and you'll get a list of stuff that looks like this:

THE ALL-NEW ATOM #12 - http://www.dccomics.com/comics/?cm=7447
BATMAN: TURNING POINTS - http://www.dccomics.com/comics/?cm=7251

Each output line lists a title with a URL, one for each comic solicited in a given month. How does it work? Well, first we open the URL and feed it into Hpricot. Then the line books = (doc/"a.contentLink") uses a CSS selector to yank out just the elements that match the selector. We could have also used XPath-style syntax to accomplish the same thing. Anyway, those elements we're selecting are all the links to comics being released this month. Hpricot hands us an array of these elements, and then we iterate over them, calling the read_comic function and passing it the title (the innerHTML of the link, stripped of excess whitespace) and the URL (an absolute URL built by appending the link's href attribute to URL_DC).

Next, let's beef up the read_comic function to do something useful. We'll have it make another remote connection, this time to the URL specified for the detailed comic description, parse out the talent, description, and some other information about the issue and stuff it into a model object. But first let's examine the source of one of those pages. The Trials of Shazam! #7 should do nicely.

We note in our examination of the page source that the data we want to scrape is all contained in span tags with different classes, as shown in the code below. Note that this time we've chosen to use XPath-style syntax for the selectors. Note also that the span tag with class="display_copy" appears twice. The first time, it contains what appears to be the description of the issue, and the second time it lists the publication date. So instead of returning a single element, display_copy gets an Array of 2 (or possibly more) elements.

def read_comic(title, url)
  doc = Hpricot(open(url))
  display_talent = (doc/"span[@class=display_talent]").innerHTML
  display_copy = (doc/"span[@class=display_copy]") # 2 elements
  puts "====="
  puts "title: #{title}"
  puts "talent: #{display_talent}"
  puts "copy (0): #{display_copy[0].innerHTML}"
  puts "copy (1): #{display_copy[1].innerHTML}"
end

Now we're iterating through each book from the remote source, and dumping out its title, the writer and artist responsible for it, a quick synopsis, and some other information (publication date, etc). Alright. If we just had a Comic model in our application, we could be somewhere!

So let's make one. In fact, let's do it in Ruby, with ActiveRecord. First the schema:

DROP DATABASE IF EXISTS comics;
CREATE DATABASE comics;
USE comics;

CREATE TABLE comics (
  id int(11) NOT NULL AUTO_INCREMENT,
  name VARCHAR(255),
  publisher VARCHAR(255),
  talent VARCHAR(255),
  description TEXT,
  published_on DATETIME,
  PRIMARY KEY (id)
);

Load this up and then add the following code to the top of your comics scraper. In fact, put it in a file called comics.rb so you can execute it on the command line.

require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :username => 'root',
  :password => '',
  :database => 'comics')

class Comic < ActiveRecord::Base
end

Now we've established a connection to the database via ActiveRecord and defined a Comic model that inherits from ActiveRecord::Base, thus wrapping our database schema and giving us some handy getters and setters. Our next step will be to trade in the read_comic function in favor of an import class method on the Comic model.

class Comic < ActiveRecord::Base
  def self.import(title, url)
    doc = Hpricot(open(url))
    display_talent = (doc/"span[@class=display_talent]").innerHTML
    display_copy = (doc/"span[@class=display_copy]") # 2 elements

    comic = Comic.new(:name => title)
    comic.publisher = "DC"
    comic.talent = display_talent
    comic.description = display_copy[0].innerHTML
    comic.published_on = Date.parse(display_copy[1].innerHTML.
      sub('on sale', ''))

    comic
  end
end

When Comic.import receives a title and a URL it makes a connection to the URL specified and fires up Hpricot. It uses Hpricot to parse out the information we're looking for, and then creates an instance of the Comic class. We set the publisher, the talent, the description (the first of the display_copy spans) and then parse the date out from the second display_copy span.
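The date step is the most fragile part of that method, so here's a small sketch of how it could be pulled out on its own. The parse_on_sale helper is hypothetical, and the sample text "on sale June 13, 2007" is my guess at the shape of the second display_copy span, not something copied from the page.

```ruby
require 'date'

# Hypothetical helper: pulls a Date out of text such as "on sale June 13, 2007".
# Returns nil instead of raising when the text holds no usable date.
def parse_on_sale(text)
  cleaned = text.to_s.sub(/on sale/i, '').strip
  return nil if cleaned.empty?
  Date.parse(cleaned)
rescue ArgumentError
  nil
end

puts parse_on_sale("on sale June 13, 2007")
```

Returning nil leaves it up to the caller to decide whether a comic without a date should still be saved.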

We'll remove all the output from there and put it in the book loop, since it's clearly not the job of the model code to be rendering a view of any sort. Our new book loop will use Comic.import on each element of the books Array, creating the model, saving it, and then printing out some attributes. Here's the final code for comics.rb:

require 'rubygems'
require 'active_record'
require 'open-uri'
require 'hpricot'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :username => 'root',
  :password => '',
  :database => 'comics')

URL_DC = "http://www.dccomics.com/comics/"

class Comic < ActiveRecord::Base
  def self.import(title, url)
    doc = Hpricot(open(url))
    display_talent = (doc/"span[@class=display_talent]").innerHTML
    display_copy = (doc/"span[@class=display_copy]") # 2 elements?

    comic = Comic.new(:name => title)
    comic.publisher = "DC"
    comic.talent = display_talent
    comic.description = display_copy[0].innerHTML
    comic.published_on = Date.parse(display_copy[1].innerHTML.
      sub('on sale', ''))

    comic
  end
end

doc = Hpricot(open("#{URL_DC}?dat=#{Time.now.strftime('%y%m01')}"))
books = (doc/"a.contentLink")
books.each do |book|
  comic = Comic.import(book.innerHTML.strip, 
    "#{URL_DC}#{book.attributes['href']}")
  if comic.save
    puts "====="
    puts "name: #{comic.name}"
    puts "description: #{comic.description}"
    puts "release date: #{comic.published_on}"
  else
    puts "uh-oh! we should handle errors!"
  end
end

And here's the final result:

=====
name: TRIALS OF SHAZAM! #7 (OF 12)
description: Freddy must find Hercules for his next trial, 
which is considerably more difficult than he expected, 
since Herc is behind bars!
release date: 2007-06-13

Obviously we can do a lot more with this. We can build a series model, that has_many issues or episodes. We can build a publisher model. We can suck in the images and use RMagick to generate thumbnails. We can discriminate between graphic novels, trade paperbacks, and issues of a standard series book. We can roll this into a Rails application, and allow the results to be browsable, users to add comics to their pull lists, create collections, comment on them, rate them, and so on. Actually, that's exactly what I'm working on for my hobby project (if you're interested, email me and I'll let you take a look -- I'm hoping to release it relatively soon-ish).

To go further with scraping, we'll need to pay particular attention to handling errors, because it's an inexact science and, since we have no hard format, things are subject to change or break in weird ways. That's the obvious downside to scraping. But when you have no other alternative for automating mass import of data like in this scenario, it's certainly a good thing to know how to do.
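As a starting point for that error handling, here's one way you might keep a single broken page from killing the whole import. The import_all helper and the stand-in data are invented for this sketch; in the real scraper the block would call Comic.import, and the failures would come from things like open-uri or Date.parse.

```ruby
# Hypothetical helper: runs the block once per item, collecting results and
# recording failures instead of letting one bad page abort the whole run.
def import_all(items)
  imported = []
  failures = []
  items.each do |item|
    begin
      imported << yield(item)
    rescue StandardError => e
      failures << [item, e.message]
    end
  end
  [imported, failures]
end

# Stand-in data and block, just to show the flow
imported, failures = import_all(%w[good bad]) do |title|
  raise ArgumentError, "invalid date" if title == "bad"
  title.upcase
end
puts "imported #{imported.size}, failed #{failures.size}"
```

Keeping the failures around (rather than just printing them) makes it easier to retry or inspect the troublesome pages later.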

If you want to learn more, _why's Hpricot site is chock full of useful information, and you may also want to check out scRUBYt, which combines Hpricot and WWW::Mechanize into a full-on web scraping "toolkit".

Thanks Akismet!

June 22, 2007 @ 07:48 AM by nap · 0 comments

I'm probably a bit behind the game on this one, but huge props are due to Akismet for making my blog life just a bit more pleasant. Before it was installed last week, I was deleting hordes of comment spam every day. Today, none.

In other blog-related news, I'm still planning on moving productions over to Mephisto, but have been hard pressed for time lately. Fortunately (when I get around to it), it has Akismet support baked right in.

UPDATE: finally moved over to a new blogging platform! About time, eh?

Tumbleranting

June 18, 2007 @ 01:46 PM by nap · 0 comments

Seth Godin is a nicer guy than I am (he's probably better dressed, too): I would have ballparked his quote a little higher. Explanation: there's a "don't waste my time" fee.

As a freelancer, this happens to me all the time and it's muy frustrating. I like hearing ideas, I like helping you structure your approach, I love developing solutions, applications, tools for you. I don't even mind giving estimates and free advice. But in order to do that, you need to tell me what it is you want.

And no, "just like digg but with/for xxx" isn't what I'm talking about :-).

JRuby / Goldspike / Glassfish Deployment Diary

June 11, 2007 @ 12:01 PM by nap · 0 comments

Robert Dempsey has written a pretty solid little tutorial on deploying your first JRuby on Rails app with Glassfish. It's powerful stuff. Go read it now, damnit!

The process can still be a bit tricky the first time, especially if you have additional gem dependencies, etc. But once you get it running you'll be blown away by how simple it is to create a .war and deploy it to any of numerous pre-existing Java application servers (Glassfish, for one).

I had a few issues initially (particularly with openssl support), so I figured I'd document them as an addendum to Robert's tutorial in case you're interested. Read on to see my notes...

  • Make sure to set JRUBY_HOME in your environment. It's used by Goldspike.
  • If both ruby and jruby are in your path, you can specify the 'version' of rake to use by doing: jruby -S rake. This will run the specified command from the JRUBY_HOME/bin directory.
  • Edit the goldspike lib/war_config.rb and change the line that references jruby-complete version 0.9.9 to read:

    add_java_library(maven_library('org.jruby', 'jruby-complete', '1.0'))

  • This will look for JRuby 1.0 instead of 0.9.9. As of this writing you'll also have to manually retrieve jruby-complete-1.0.jar as the remote sources don't seem to have it yet. Put it in JRUBY_HOME/lib.

  • I happen to be using SSL in my app, so I had to install the jruby-openssl gem into the JRuby environment using jruby -S gem install jruby-openssl. Make sure the gem ends up in your JRUBY_HOME/lib/ruby/gems hierarchy. Tell goldspike it needs to add this gem with the following line: add_gem('jruby-openssl')

  • If you're doing the jruby-openssl thing you'll also have to retrieve the latest version of the Bouncy Castle Crypto APIs package for whatever version of Java you're using (I'm on OS X, Java 1.5). Put this in your JRUBY_HOME/lib directory and then add the library to your war_config.rb file:

    add_java_library(maven_library('bouncycastle', 'bcprov-jdk15', '136'))

UPDATES:

  • Make sure to copy jdbc_databases.rake from ActiveRecord-JDBC to your lib/tasks directory (for ActiveRecord-related rake tasks). Thanks to Ola Bini for pointing me in the right direction there.

  • Include the following in environment.rb:

    if RUBY_PLATFORM =~ /java/
      require 'rubygems'
      RAILS_CONNECTION_ADAPTERS = %w(jdbc)
    end

JRuby Reaches 1.0

June 10, 2007 @ 09:54 AM by nap · 0 comments

JRuby 1.0 has been officially released. This is huge news. Complete compatibility with Ruby 1.8 is pretty much a done deal. Rails apps run. You can deploy your Rails apps in Glassfish. I mean, how cool is that?

You should check out team member Charles Nutter's blog for more details. In a nutshell, the future looks like tighter integration with Java as an application platform, and a major focus on optimizing performance. Also a lot of interest in Ruby 1.9/2.0. If you attended Railsconf this year, you most likely got pretty excited about JRuby, and for good reason.

If you've never tried it out before, Atlantic Dominion Solutions has put together a great tutorial on getting started with JRuby on OS X. Or, even easier, go get yourself a copy of NetBeans 6.0 M9, which comes bundled with JRuby as the default interpreter for in-IDE development.

NHRuby Meeting Tuesday 06.12

June 08, 2007 @ 10:35 AM by nap · 0 comments

Usually the NHRuby meets are scheduled for the third Tuesday of the month, but we had to change it up this time due to some schedule conflicts.

I'll be speaking about integrating a Flash widget with Rails-based RESTful web services and also giving a brief overview of how to contribute to Rails and other open source projects (I guess you could say Josh Susser's talk at Railsconf inspired me). Scott will be doing an overview of the tabnav plugin and... something else, too.

More information and directions can be found at the NHRuby Wiki. Also, many thanks to our new host/sponsor for the meetings, RMC Research.

LOLCODE

June 06, 2007 @ 02:48 PM by nap · 0 comments

[3:44pm] strager: BUKKIT is the array type.
[3:44pm] zapnap: O RLY?

has_much :confusion, :about => "string #{interpolation}"

June 05, 2007 @ 09:15 AM by nap · 0 comments

No matter how flexible ActiveRecord's associations become, there's always going to be a time when you want to override its baked-in smarts with your own custom SQL query. Fortunately, we can do just that with the :finder_sql option. Use it to manually specify the query that loads the association's records.

There's one gotcha to be aware of though: if you need to do any variable interpolation in the string (and you most likely will), make sure to use single quotes instead of the usual doubles:

has_many :transfers, :finder_sql => 
  'SELECT * FROM transfers ' +
  'WHERE sender_id = #{id} or receiver_id = #{id}'

When you use double quotes, the string interpolation happens immediately (when the class is first loaded), thus obtaining the object ID of the class in memory. Using single quotes, the interpolation occurs within the context of your object instance, which is what you're expecting.
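If the timing difference is hard to picture, here's a plain-Ruby imitation of it. The Account class, the two constants, and the finder_sql method are all invented for this sketch, and I'm not claiming this is how ActiveRecord does it internally; it just shows the same string being interpolated at class load versus later, against an instance.

```ruby
class Account
  attr_reader :id

  def initialize(id)
    @id = id
  end

  # Double quotes: interpolated immediately, while self is still the class
  EAGER_SQL = "SELECT * FROM transfers WHERE sender_id = #{self}"

  # Single quotes: the #{id} text is stored verbatim for later
  LAZY_SQL = 'SELECT * FROM transfers ' +
             'WHERE sender_id = #{id} OR receiver_id = #{id}'

  # Re-evaluates the stored text as a double-quoted string with self set to
  # this instance. Only safe here because LAZY_SQL contains no double quotes.
  def finder_sql
    instance_eval('"' + LAZY_SQL + '"')
  end
end

puts Account::EAGER_SQL
puts Account.new(42).finder_sql
```

The first line printed ends in the class name, which is useless as SQL, while the second picks up the id of the particular instance.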
