NAME
    Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or
    XPath expressions

SYNOPSIS
      use URI;
      use Web::Scraper;
      use Encode;

      # First, create your scraper block
      my $authors = scraper {
          # Parse all TDs inside 'table[width="100%"]', store them into
          # an array 'authors'. We embed other scrapers for each TD.
          process 'table[width="100%"] td', "authors[]" => scraper {
              # And, in each TD,
              # get the URI of the "a" element
              process "a", uri => '@href';
              # get the text inside the "small" element
              process "small", fullname => 'TEXT';
          };
      };

      my $res = $authors->scrape( URI->new("http://search.cpan.org/author/?A") );

      # iterate the array 'authors'
      for my $author (@{$res->{authors}}) {
          # output is like:
          # Andy Adler      http://search.cpan.org/~aadler/
          # Aaron K Dancygier       http://search.cpan.org/~aakd/
          # Aamer Akhter    http://search.cpan.org/~aakhter/
          print Encode::encode("utf8", "$author->{fullname}\t$author->{uri}\n");
      }

    The structure would resemble this (visually):

      {
        authors => [
          { fullname => $fullname, uri => $uri },
          { fullname => $fullname, uri => $uri },
        ]
      }

DESCRIPTION
    Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent
    Scrapi. It provides a DSL-ish interface for traversing HTML documents
    and returning a neatly arranged Perl data structure.

    The *scraper* and *process* blocks provide a way to define which
    segments of a document to extract. They understand HTML and CSS
    selectors as well as XPath expressions.

METHODS
  scraper
      $scraper = scraper { ... };

    Creates a new Web::Scraper object by wrapping the DSL code that will be
    fired when the *scrape* method is called.

  scrape
      $res = $scraper->scrape(URI->new($uri));
      $res = $scraper->scrape($html_content);
      $res = $scraper->scrape(\$html_content);
      $res = $scraper->scrape($http_response);
      $res = $scraper->scrape($html_element);

    Retrieves the HTML from a URI, HTTP::Response, HTML::Tree or text
    string, creates a DOM object, then fires the callback scraper code to
    retrieve the data structure.
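    For instance, a scraper can be run directly against an in-memory HTML
    string with no network access at all. This sketch uses a hypothetical
    fragment (not from the original docs); any decoded Unicode string works:

      use strict;
      use warnings;
      use Web::Scraper;

      # A hypothetical HTML fragment to scrape.
      my $html = <<'HTML';
      <ul>
        <li class="item">Foo</li>
        <li class="item">Bar</li>
      </ul>
      HTML

      my $list = scraper {
          # Collect the text of every <li class="item"> into an array
          process 'li.item', 'items[]' => 'TEXT';
      };

      my $res = $list->scrape($html);
      print join(", ", @{$res->{items}}), "\n";   # Foo, Bar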
    If you pass a URI or HTTP::Response object, Web::Scraper automatically
    guesses the encoding of the content by looking at the Content-Type
    header and META tags. Otherwise, you need to decode the HTML to Unicode
    before passing it to the *scrape* method.

    When you pass the HTML content as a string instead of a URI or
    HTTP::Response, you can optionally pass the base URL as a second
    argument:

      $res = $scraper->scrape($html_content, "http://example.com/foo");

    This way Web::Scraper can resolve the relative links found in the
    document.

  process
      scraper {
          process "tag.class", key => 'TEXT';
          process '//tag[contains(@foo, "bar")]', key2 => '@attr';
          process '//comment()', 'comments[]' => 'TEXT';
      };

    *process* is the method to find matching elements from the HTML with a
    CSS selector or XPath expression, then extract text or attributes into
    the result stash. If the first argument begins with "//" or "id(" it is
    treated as an XPath expression; otherwise it is treated as a CSS
    selector.

      # <span class="date">2008/12/21</span>
      # date => "2008/12/21"
      process ".date", date => 'TEXT';
      # <div class="body"><a href="http://example.com/">foo</a></div>
      # link => URI->new("http://example.com/")
      process ".body > a", link => '@href';

      # <div class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
      # comment => " HTML Comment here "
      #
      # NOTE: Comment nodes are accessible when HTML::TreeBuilder::XPath
      # (version >= 0.14) and/or HTML::TreeBuilder::LibXML (version >=
      # 0.13) is installed.
      process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

      # <div class="body"><a href="http://example.com/">foo</a></div>
      # link => URI->new("http://example.com/"), text => "foo"
      process ".body > a", link => '@href', text => 'TEXT';
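    Putting these pieces together, here is a sketch (using hypothetical
    markup and URLs, not from the original docs) that combines a multi-key
    *process* call with base-URL resolution from the *scrape* method:

      use strict;
      use warnings;
      use Web::Scraper;

      # Hypothetical markup: a list of relative links.
      my $html = <<'HTML';
      <div class="body">
        <a href="/articles/1">First</a>
        <a href="/articles/2">Second</a>
      </div>
      HTML

      my $s = scraper {
          # Two mappings in one process call: the href attribute and the
          # link text, each collected into its own array.
          process '.body a', 'urls[]' => '@href', 'titles[]' => 'TEXT';
      };

      # Passing a base URL as the second argument makes the relative
      # hrefs come back as absolute URI objects.
      my $res = $s->scrape($html, "http://example.com/");
      for my $i (0 .. $#{$res->{titles}}) {
          print "$res->{titles}[$i]\t$res->{urls}[$i]\n";
      }
      # First   http://example.com/articles/1
      # Second  http://example.com/articles/2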