Class Scraper::Base
In: lib/scraper/base.rb
Parent: Object

Methods

array   collect   document   element   extractor   new   option   options   parser   parser_options   prepare   process   process_first   request   result   result   root_element   rules   scrape   scrape   selector   skip   stop   text  

Constants

PageInfo = Struct.new(:url, :original_url, :encoding, :last_modified, :etag)   Information about the HTML page scraped. A structure with the following attributes:
  • url — The URL of the document being scraped. Passed in the constructor but may have changed if the page was redirected.
  • original_url — The original URL of the document being scraped as passed in the constructor.
  • encoding — The encoding of the document.
  • last_modified — Value of the Last-Modified header returned from the server.
  • etag — Value of the ETag header returned from the server.
READER_OPTIONS = [:last_modified, :etag, :redirect_limit, :user_agent, :timeout]
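
As a rough illustration of how the structure might be used, the sketch below rebuilds an equivalent Struct in plain Ruby, so it runs without the library, and turns the caching fields into options for a later request. The helper name caching_options is hypothetical and not part of Scraper::Base.

```ruby
# Stand-in with the same fields as Scraper::Base::PageInfo, so the sketch
# runs without the library loaded.
PageInfo = Struct.new(:url, :original_url, :encoding, :last_modified, :etag)

# Hypothetical helper: collect the caching headers from a previous scrape
# into an options hash, leaving out values the server did not send.
def caching_options(page_info)
  options = {}
  options[:last_modified] = page_info.last_modified if page_info.last_modified
  options[:etag] = page_info.etag if page_info.etag
  options
end

info = PageInfo.new("http://example.com/", "http://example.com", "utf-8",
                    nil, "\"abc123\"")
p caching_options(info)
```

The resulting hash uses the :last_modified and :etag keys listed in READER_OPTIONS, so it could be passed as the options argument of a later scrape.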

Attributes

extracted  [RW]  Set to true when the first extractor returns true.
options  [RW]  Returns the options for this object.
page_info  [RW]  Information about the HTML page scraped. See PageInfo.

Public Class methods

Declares which accessors are arrays. You can declare the accessor here, or use "symbol[]" as the target.

For example:

  array :urls
  process "a[href]", :urls=>"@href"

Is equivalent to:

  process "a[href]", "urls[]"=>"@href"

[Source]

     # File lib/scraper/base.rb, line 473
473:       def array(*symbols)
474:         @arrays ||= []
475:         symbols.each do |symbol|
476:           symbol = symbol.to_sym
477:           @arrays << symbol
478:           begin
479:             self.instance_method(symbol)
480:           rescue NameError
481:             attr_accessor symbol
482:           end
483:         end
484:       end

Returns the element itself.

You can use this method from an extractor, e.g.:

  process "h1", :header=>:element

[Source]

     # File lib/scraper/base.rb, line 373
373:       def element(element)
374:         element
375:       end

Creates an extractor that will extract values from the selected element and place them in instance variables of the scraper. You can pass the result to process.

Example

This example processes a document looking for an element with the class name article. It extracts the attribute id and stores it in the instance variable @id. It extracts the article node itself and puts it in the instance variable @article.

  class ArticleScraper < Scraper::Base
    process ".article", extractor(:id=>"@id", :article=>:element)
    attr_reader :id, :article
  end
  result = ArticleScraper.scrape(html)
  puts result.id
  puts result.article

Sources

Extractors operate on the selected element, and can extract the following values:

  • "elem_name" — Extracts the element itself if it matches the element name (e.g. "h2" will extract only level 2 header elements).
  • "@attr_name" — Extracts the attribute value from the element if specified (e.g. "@id" will extract the id attribute).
  • "elem_name@attr_name" — Extracts the attribute value from the element if specified, but only if the element has the specified name (e.g. "h2@id").
  • :element — Extracts the element itself.
  • :text — Extracts the text value of the node.
  • Scraper — Passing a scraper class creates a scraper that processes the current element and extracts the result. This is useful for handling complex structures.

If you use an array of sources, the first source that matches anything is used. For example, ["abbr@title", :text] extracts the value of the title attribute if the element is abbr, otherwise the text value of the element.

If you use a hash, you can extract multiple values at the same time. For example, {:id=>"@id", :class=>"@class"} extracts the id and class attribute values.

:element and :text are special cases of symbols. You can pass any symbol that matches a class method and that class method will be called to extract a value from the selected element. You can also pass a Proc or Method directly.

You can also pass a static value as the source. This is mostly useful with the :skip target, for example :skip=>false, so that more than one rule can process the same element.
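
The fallback rule for an array of sources can be sketched in plain Ruby. Everything here is a stand-in: ToyElement and first_match are hypothetical names, and the lambdas only imitate what sources such as "abbr@title" and :text are described as doing. Treating nil as "no match" is an assumption of this sketch.

```ruby
# Hypothetical element: a tag name, a hash of attributes and some text.
ToyElement = Struct.new(:name, :attributes, :text)

# Imitates "abbr@title": the title attribute, but only for abbr elements.
ABBR_TITLE = lambda do |element|
  element.attributes["title"] if element.name == "abbr"
end

# Imitates :text.
TEXT = lambda { |element| element.text }

# Try each source in order and return the first value that is not nil.
def first_match(sources, element)
  sources.each do |source|
    value = source.call(element)
    return value unless value.nil?
  end
  nil
end

abbr = ToyElement.new("abbr", { "title" => "2007-03-01" }, "March 1st")
span = ToyElement.new("span", {}, "March 1st")

puts first_match([ABBR_TITLE, TEXT], abbr)  # prints 2007-03-01
puts first_match([ABBR_TITLE, TEXT], span)  # prints March 1st
```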

Targets

Extractors assign the extracted value to an instance variable of the scraper. The instance variable contains the last value extracted.

An accessor is also created for that instance variable, unless a method with that name already exists. For example, :title=>:text creates an accessor for title. However, :id=>"@id" does not create an accessor, since each object already has a method called id.

If you want to extract multiple values into the same variable, use array to declare that accessor as an array.

Alternatively, you can append [] to the variable name. For example:

  process "*", "ids[]"=>"@id"
  result :ids
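
The difference between a plain target and an array target can be pictured with a small stand-in written in plain Ruby. The class ToyTargets and its store method are hypothetical; they only mimic the documented behavior (a plain target keeps the last value, an array target collects every value) and are not the library's implementation.

```ruby
class ToyTargets
  attr_reader :values

  # Names passed here play the role of accessors declared with array.
  def initialize(*arrays)
    @arrays = arrays.map { |name| name.to_sym }
    @values = {}
  end

  # Store a value under a target name. A trailing "[]" or a name declared
  # as an array appends; anything else overwrites.
  def store(target, value)
    name = target.to_s
    as_array = name.end_with?("[]")
    name = name.chomp("[]").to_sym
    if as_array || @arrays.include?(name)
      (@values[name] ||= []) << value
    else
      @values[name] = value
    end
  end
end

declared = ToyTargets.new(:ids)
%w[header footer].each { |id| declared.store(:ids, id) }

suffixed = ToyTargets.new
%w[header footer].each { |id| suffixed.store("ids[]", id) }

plain = ToyTargets.new
%w[header footer].each { |id| plain.store(:id, id) }

p declared.values  # both array forms collect every value
p suffixed.values
p plain.values     # a plain target keeps only the last value
```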

The special target :skip allows you to control whether other rules can apply to the same element. By default a processing rule without a block (or a block that returns true) will skip that element so no other processing rule sees it.

You can change this with :skip=>false.

[Source]

     # File lib/scraper/base.rb, line 283
283:       def extractor(map)
284:         extracts = []
285:         map.each_pair do |target, source|
286:           source = extract_value_from(source)
287:           target = extract_value_to(target)
288:           define_method :__extractor do |element|
289:             value = source.call(element)
290:             target.call(self, value) if !value.nil?
291:           end
292:           extracts << instance_method(:__extractor)
293:           remove_method :__extractor
294:         end
295:         lambda do |element|
296:           extracts.each do |extract|
297:             extract.bind(self).call(element)
298:           end
299:           true
300:         end
301:       end

Create a new scraper instance.

The argument source is a URL, string containing HTML, or HTML::Node. The optional argument options are options passed to the scraper. See Base#scrape for more details.

For example:

  # The page we want to scrape
  url = URI.parse("http://example.com")
  # Skip the header
  scraper = MyScraper.new(url, :root_element=>"body")
  result = scraper.scrape

[Source]

     # File lib/scraper/base.rb, line 715
715:     def initialize(source, options = nil)
716:       @page_info = PageInfo[]
717:       @options = options || {}
718:       case source
719:       when URI
720:         @document = source
721:       when String, HTML::Node
722:         @document = source
723:         # TODO: document and test case these two.
724:         @page_info.url = @page_info.original_url = @options[:url]
725:         @page_info.encoding = @options[:encoding]
726:       else
727:         raise ArgumentError, "Can only scrape URI, String or HTML::Node"
728:       end
729:     end

Returns the options for this class.

[Source]

     # File lib/scraper/base.rb, line 412
412:       def options()
413:         @options ||= {}
414:       end

Specifies which parser to use. The default is :tidy.

[Source]

     # File lib/scraper/base.rb, line 379
379:       def parser(name = :tidy)
380:         self.options[:parser] = name
381:       end

Options to pass to the parser.

For example, when using Tidy, you can use these options to tell Tidy how to clean up the HTML.

This method sets the option for the class. Classes inherit options from their parents. You can also pass options to the scraper object itself using the :parser_options option.

[Source]

     # File lib/scraper/base.rb, line 392
392:       def parser_options(options)
393:         self.options[:parser_options] = options
394:       end

Defines a processing rule. A processing rule consists of a selector that matches elements, and an extractor that does something interesting with their value.

Symbol

Rules are processed in the order in which they are defined. Use rules if you need to change the order of processing.

Rules can be named or anonymous. If the first argument is a symbol, it is used as the rule name. You can use the rule name to position, remove or replace it.

Selector

The first argument is a selector. It selects elements from the document that are potential candidates for extraction. Each selected element is passed to the extractor.

The selector argument may be a string, an HTML::Selector object or any object that responds to the select method. Passing an Array (responds to select) will not do anything useful.

String selectors support value substitution, replacing question marks (?) in the selector expression with values from the method arguments. See HTML::Selector for more information.

Extractor

The last argument or block is the extractor. The extractor does something interesting with the selected element, typically assigning it to an instance variable of the scraper.

Since the extractor is called on the scraper, it can also use the scraper to maintain state, e.g. this extractor counts how many div elements appear in the document:

  process("div") { |element| @count = (@count || 0) + 1 }

The extractor returns true if the element was processed and should not be passed to any other extractor (including any child elements).

The default implementation of result returns self only if at least one extractor returned true. However, you can override result and use extractors that return false.

A block extractor is called with a single element.

You can also use the extractor method to create extractors that assign elements, attributes and text values to instance variables, or pass a Hash as the last argument to process. See extractor for more information.

When using a block, the last statement is the response. Do not use return, use next if you want to return a value before the last statement. return does not do what you expect it to.
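
To make the point about next concrete, here is a plain-Ruby sketch that does not use the library. The block is stored as a proc and next is used to leave it early with a value. The name EARLY_EXIT and the hash standing in for an element are illustrative only.

```ruby
# A block stored for later use. `next` ends this invocation of the block and
# makes its argument the block's value.
EARLY_EXIT = proc do |element|
  next false if element[:name] != "div"  # not interested, leave early
  element[:seen] = true
  true                                    # the last statement is the response
end

p EARLY_EXIT.call({ :name => "span" })  # prints false
p EARLY_EXIT.call({ :name => "div" })   # prints true
```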

Example

  class ScrapePosts < Scraper::Base
    # Select the title of a post
    selector :select_title, "h2"

    # Select the body of a post
    selector :select_body, ".body"

    # All elements with class name post.
    process ".post" do |element|
      title = select_title(element)
      body = select_body(element)
      (@posts ||= []) << Post.new(title, body)
      true
    end

    attr_reader :posts
  end

  posts = ScrapePosts.scrape(html).posts

To process only a single element:

  class ScrapeTitle < Scraper::Base
    process "html>head>title", :title=>:text
    result :title
  end

  puts ScrapeTitle.scrape(html)

[Source]

     # File lib/scraper/base.rb, line 123
123:       def process(*selector, &block)
124:         create_process(false, *selector, &block)
125:       end

Similar to process, but only extracts from the first selected element. Faster if you know the document contains only one applicable element, or if you are only interested in processing the first one.

[Source]

     # File lib/scraper/base.rb, line 132
132:       def process_first(*selector, &block)
133:         create_process(true, *selector, &block)
134:       end

Modifies this scraper to return a single value or a structure. Use in combination with accessors.

When called with one symbol, scraping returns the result of calling that method (typically an accessor). When called with two or more symbols, scraping returns a structure of values, one for each symbol.

For example:

  class ScrapeTitle < Scraper::Base
    process_first "html>head>title", :title=>:text
    result :title
  end

  puts "Title: " + ScrapeTitle.scrape(html)

  class ScrapeDts < Scraper::Base
    process ".dtstart", :dtstart=>["abbr@title", :text]
    process ".dtend", :dtend=>["abbr@title", :text]
    result :dtstart, :dtend
  end

  dts = ScrapeDts.scrape(html)
  puts "Starts: #{dts.dtstart}"
  puts "Ends: #{dts.dtend}"

[Source]

     # File lib/scraper/base.rb, line 449
449:       def result(*symbols)
450:         raise ArgumentError, "Use one symbol to return the value of this accessor, multiple symbols to returns a structure" if symbols.empty?
451:         symbols = symbols.map {|s| s.to_sym}
452:         if symbols.size == 1
453:           define_method :result do
454:             return self.send(symbols[0])
455:           end
456:         else
457:           struct = Struct.new(*symbols)
458:           define_method :result do
459:             return struct.new(*symbols.collect {|s| self.send(s) })
460:           end
461:         end
462:       end

The root element to scrape.

The root element for an HTML document is html. However, if you want to scrape only the header or body, you can set the root_element to head or body.

This method sets the root element for the class. Classes inherit this option from their parents. You can also pass a root element to the scraper object itself using the :root_element option.

[Source]

     # File lib/scraper/base.rb, line 406
406:       def root_element(name)
407:         self.options[:root_element] = name ? name.to_s : nil
408:       end

Returns an array of rules defined for this class. You can use this array to change the order of rules.

[Source]

     # File lib/scraper/base.rb, line 419
419:       def rules()
420:         @rules ||= []
421:       end

Scrapes the document and returns the result.

The first argument provides the input document. It can be one of:

  • URI — Retrieve an HTML page from this URL and scrape it.
  • String — The HTML page as a string.
  • HTML::Node — An HTML node, can be a document or element.

You can specify options for the scraper class, or override these by passing options in the second argument. Some options only make sense in the constructor.

The following options are supported for reading HTML pages:

  • :last_modified — Last-Modified header used for caching.
  • :etag — ETag header used for caching.
  • :redirect_limit — Limits number of redirects to follow.
  • :user_agent — Value for User-Agent header.
  • :timeout — HTTP open connection/read timeouts (in seconds).

The following options are supported for parsing the HTML:

  • :root_element — The root element to scrape, see also root_element.
  • :parser — Specifies which parser to use. (Typically, you set this for the class).
  • :parser_options — Options to pass to the parser.

The result is returned by calling the result method. The default implementation returns self if any extractor returned true, nil otherwise.

For example:

  result = MyScraper.scrape(url, :root_element=>"body")

The method may raise any number of exceptions. HTTPError indicates it failed to retrieve the HTML page, and HTMLParseError that it failed to parse the page. Other exceptions come from extractors and the result method.

[Source]

     # File lib/scraper/base.rb, line 345
345:       def scrape(source, options = nil)
346:         scraper = self.new(source, options);
347:         return scraper.scrape
348:       end

Create a selector method. You can call a selector method directly to select elements.

For example, define a selector:

  selector(:five_divs, "div") { |elems| elems[0..4] }

And call it to retrieve the first five div elements:

  divs = five_divs(element)

Call a selector method with an element and it returns an array of elements that match the selector, beginning with the element argument itself. It returns an empty array if nothing matches.

If the selector is defined with a block, all selected elements are passed to the block and the result of the block is returned.

For convenience, a first_ method is also created that returns (and yields) only the first selected element. For example:

  selector :post, "#post"
  @post = first_post(element)

If the selector is defined with a block, both methods call that block with an array of elements.

The selector argument may be a string, an HTML::Selector object or any object that responds to the select method. Passing an Array (responds to select) will not do anything useful.

String selectors support value substitution, replacing question marks (?) in the selector expression with values from the method arguments. See HTML::Selector for more information.

When using a block, the last statement is the response. Do not use return, use next if you want to return a value before the last statement. return does not do what you expect it to.
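
Since any object that responds to select can serve as a selector, a custom selector can be sketched in plain Ruby. SelNode, NameSelector and nested_divs are hypothetical names. This is not an HTML::Selector. It only provides select and select_first, the two methods the source listing below calls on a selector.

```ruby
# Hypothetical node type standing in for HTML::Node.
SelNode = Struct.new(:name, :children)

# Hypothetical selector object that matches nodes by tag name.
class NameSelector
  def initialize(name)
    @name = name
  end

  # Matching nodes in document order, starting with the node itself.
  def select(node)
    matches = []
    matches << node if node.name == @name
    node.children.each { |child| matches.concat(select(child)) }
    matches
  end

  # The first match, or nil if nothing matches.
  def select_first(node)
    select(node).first
  end
end

# Builds a chain of nested divs, one inside the other.
def nested_divs(depth)
  depth.zero? ? nil : SelNode.new("div", [nested_divs(depth - 1)].compact)
end

root = nested_divs(6)
divs = NameSelector.new("div").select(root)
puts divs.size                # prints 6
puts divs.first.equal?(root)  # prints true
puts divs[0..4].size          # prints 5, what a block like elems[0..4] keeps
```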

[Source]

     # File lib/scraper/base.rb, line 175
175:       def selector(symbol, *selector, &block)
176:         raise ArgumentError, "Missing selector: the first argument tells us what to select" if selector.empty?
177:         if selector[0].is_a?(String)
178:           selector = HTML::Selector.new(*selector)
179:         else
180:           raise ArgumentError, "Selector must respond to select() method" unless selector.respond_to?(:select)
181:           selector = selector[0]
182:         end
183:         if block
184:           define_method symbol do |element|
185:             selected = selector.select(element)
186:             return block.call(selected) unless selected.empty?
187:           end
188:           define_method "first_#{symbol}" do |element|
189:             selected = selector.select_first(element)
190:             return block.call([selected]) if selected
191:           end
192:         else
193:           define_method symbol do |element|
194:             return selector.select(element)
195:           end
196:           define_method "first_#{symbol}" do |element|
197:             return selector.select_first(element)
198:           end
199:         end
200:       end

Returns the text of the element.

You can use this method from an extractor, e.g.:

  process "title", :title=>:text

[Source]

     # File lib/scraper/base.rb, line 355
355:       def text(element)
356:         text = ""
357:         stack = element.children.reverse
358:         while node = stack.pop
359:           if node.tag?
360:             stack.concat node.children.reverse
361:           else
362:             text << node.content
363:           end
364:         end
365:         return text
366:       end

Public Instance methods

Called by scrape after scraping the document, and before calling result. Typically used to run any validation, post-processing steps, resolving referenced elements, etc.

[Source]

     # File lib/scraper/base.rb, line 939
939:     def collect()
940:     end

Returns the document being processed.

If the scraper was created with a URL, this method will attempt to retrieve the page and parse it.

If the scraper was created with a string, this method will attempt to parse the page.

Be advised that calling this method may raise an exception (HTTPError or HTMLParseError).

The document is parsed only the first time this method is called.
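
The parse-once behavior can be pictured with a small plain-Ruby sketch. ToyDocument is hypothetical, and the upcase call merely stands in for real parsing. The point is only that the raw source is replaced by the parsed form on first access, so later calls return it directly.

```ruby
class ToyDocument
  attr_reader :parse_count

  def initialize(source)
    @document = source
    @parsed = false
    @parse_count = 0
  end

  # Parses on the first call and returns the cached form afterwards.
  def document
    unless @parsed
      @parse_count += 1
      @document = @document.upcase  # placeholder for real parsing
      @parsed = true
    end
    @document
  end
end

doc = ToyDocument.new("<p>hello</p>")
doc.document
doc.document
puts doc.parse_count  # prints 1
```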

[Source]

     # File lib/scraper/base.rb, line 856
856:     def document
857:       if @document.is_a?(URI)
858:         # Attempt to read page. May raise HTTPError.
859:         options = {}
860:         READER_OPTIONS.each { |key| options[key] = option(key) }
861:         request(@document, options)
862:       end
863:       if @document.is_a?(String)
864:         # Parse the page. May raise HTMLParseError.
865:         parsed = Reader.parse_page(@document, @page_info.encoding,
866:                                    option(:parser_options), option(:parser))
867:         @document = parsed.document
868:         @page_info.encoding = parsed.encoding
869:       end
870:       return @document if @document.is_a?(HTML::Node)
871:       raise RuntimeError, "No document to process"
872:     end

Returns the value of an option passed to the scraper on creation. If the option was not specified, returns the value set for this scraper class. Options are inherited from the parent class.
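
The lookup order can be sketched in plain Ruby. ToyScraper and ToyChild are hypothetical classes. In particular, copying the parent's options in an inherited hook is only one assumed way to get inheritance and may not match how the library does it.

```ruby
class ToyScraper
  class << self
    # Class-level options, one hash per class.
    def options
      @options ||= {}
    end

    # Assumption for this sketch: a subclass starts with a copy of the
    # options of its parent.
    def inherited(subclass)
      super
      subclass.options.update(options)
    end
  end

  def initialize(options = nil)
    @options = options || {}
  end

  # Instance options win whenever the key is present, even with a nil
  # value; otherwise fall back to the class.
  def option(symbol)
    @options.key?(symbol) ? @options[symbol] : self.class.options[symbol]
  end
end

ToyScraper.options[:root_element] = "body"

class ToyChild < ToyScraper
end

p ToyChild.new.option(:root_element)                           # prints "body"
p ToyChild.new(:root_element => "head").option(:root_element)  # prints "head"
p ToyChild.new(:root_element => nil).option(:root_element)     # prints nil
```

Checking for the key rather than the value means an explicit nil passed to the instance overrides the class setting, which matches the has_key? test in the source listing.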

[Source]

     # File lib/scraper/base.rb, line 967
967:     def option(symbol)
968:       return options.has_key?(symbol) ? options[symbol] : self.class.options[symbol]
969:     end

Called by scrape after creating the document, but before running any processing rules.

You can override this method to do any preparation work.

[Source]

     # File lib/scraper/base.rb, line 932
932:     def prepare(document)
933:     end

[Source]

     # File lib/scraper/base.rb, line 875
875:     def request(url, options)
876:       if page = Reader.read_page(@document, options)
877:         @page_info.url = page.url
878:         @page_info.original_url = @document
879:         @page_info.last_modified = page.last_modified
880:         @page_info.etag = page.etag
881:         @page_info.encoding = page.encoding
882:         @document = page.content
883:       end
884:     end

Returns the result of a successful scrape.

This method is called by scrape after running all the rules on the document. You can also call it directly.

Override this method to return a specific object, perform post-scraping processing, validation, etc.

The default implementation returns self if any extractor returned true, nil otherwise.

If you override this method, implement your own logic to determine if anything was extracted and return nil otherwise. Also, make sure calling this method multiple times returns the same result.
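
One way to follow this advice, sketched in plain Ruby without the library: compute the value once, cache it, and return nil when nothing was extracted. ToyResult is a hypothetical stand-in for a scraper subclass, and @titles stands for whatever the extractors collected.

```ruby
class ToyResult
  attr_reader :builds

  def initialize(titles)
    @titles = titles
    @builds = 0
  end

  # Returns nil when nothing was extracted, otherwise the same cached
  # object on every call.
  def result
    return nil if @titles.empty?
    @result ||= build_result
  end

  private

  # Post-processing step, run at most once.
  def build_result
    @builds += 1
    @titles.map { |title| title.strip }
  end
end

scraper = ToyResult.new(["  First post ", "Second post"])
p scraper.result                         # prints ["First post", "Second post"]
p scraper.result.equal?(scraper.result)  # prints true
p ToyResult.new([]).result               # prints nil
```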

[Source]

     # File lib/scraper/base.rb, line 957
957:     def result()
958:       return self if @extracted
959:     end

Scrapes the document and returns the result.

If the scraper was created with a URL, retrieve the page and parse it. If the scraper was created with a string, parse the page.

The result is returned by calling the result method. The default implementation returns self if any extractor returned true, nil otherwise.

The method may raise any number of exceptions. HTTPError indicates it failed to retrieve the HTML page, and HTMLParseError that it failed to parse the page. Other exceptions come from extractors and the result method.

See also Base#scrape.

[Source]

     # File lib/scraper/base.rb, line 747
747:     def scrape()
748:       # Call prepare with the document, but before doing anything else.
749:       prepare document
750:       # Retrieve the document. This may raise HTTPError or HTMLParseError.
751:       case document
752:       when Array
753:         stack = @document.reverse # see below
754:       when HTML::Node
755:         # If a root element is specified, start selecting from there.
756:         # The stack is empty if we can't find any root element (makes
757:         # sense). However, the node we're going to process may be
758:         # a tag, or an HTML::Document.root which is the equivalent of
759:         # a document fragment.
760:         root_element = option(:root_element)
761:         root = root_element ? @document.find(:tag=>root_element) : @document
762:         stack = root ? (root.tag? ? [root] : root.children.reverse) : []
763:       else
764:         return
765:       end
766:       # @skip stores all the elements we want to skip (see #skip).
767:       # rules stores all the rules we want to process with this
768:       # scraper, based on the class definition.
769:       @skip = []
770:       @stop = false
771:       rules = self.class.rules.clone
772:       begin
773:         # Process the document one node at a time. We process elements
774:         # from the end of the stack, so each time we visit child elements,
775:         # we add them to the end of the stack in reverse order.
776:         while node = stack.pop
777:           break if @stop
778:           skip_this = false
779:           # Only match nodes that are elements, ignore text nodes.
780:           # Also ignore any element that's on the skip list, and if
781:           # found one, remove it from the list (since we never visit
782:           # the same element twice). But an element may be added twice
783:           # to the skip list.
784:           # Note: equal? is faster than == for nodes.
785:           next unless node.tag?
786:           @skip.delete_if { |s| skip_this = true if s.equal?(node) }
787:           next if skip_this
788: 
789:           # Run through all the rules until we process the element or
790:           # run out of rules. If skip_this=true then we processed the
791:           # element and we can break out of the loop. However, we might
792:           # process (and skip) descedants so also watch the skip list.
793:           rules.delete_if do |selector, extractor, rule_name, first_only|
794:             break if skip_this
795:             # The result of calling match (selected) is nil, element
796:             # or array of elements. We turn it into an array to
797:             # process one element at a time. We process all elements
798:             # that are not on the skip list (we haven't visited
799:             # them yet).
800:             if selected = selector.match(node, first_only)
801:               selected = [selected] unless selected.is_a?(Array)
802:               selected = [selected.first] if first_only
803:               selected.each do |element|
804:                 # Do not process elements we already skipped
805:                 # (see above). However, this time we may visit
806:                 # an element twice, since selected elements may
807:                 # be descendants of the current element on the
808:                 # stack. In rare cases two elements on the stack
809:                 # may pick the same descendants.
810:                 next if @skip.find { |s| s.equal?(element) }
811:                 # Call the extractor method with this element.
812:                 # If it returns true, skip the element and if
813:                 # the current element, don't process any more
814:                 # rules. Again, pay attention to descendants.
815:                 if extractor.bind(self).call(element)
816:                   @extracted = true
817:                 end
818:                 if @skip.delete(true)
819:                   if element.equal?(node)
820:                     skip_this = true
821:                   else
822:                     @skip << element
823:                   end
824:                 end
825:               end
826:               first_only if !selected.empty?
827:             end
828:           end
829: 
830:           # If we did not skip the element, we're going to process its
831:           # children. Reverse order since we're popping from the stack.
832:           if !skip_this && children = node.children
833:             stack.concat children.reverse
834:           end
835:         end
836:       ensure
837:         @skip = nil
838:       end
839:       collect
840:       return result
841:     end
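
The traversal used in the listing above (pop a node from the end of a stack, then push its children in reverse) visits nodes in document order. Here is a plain-Ruby sketch, with a hypothetical TreeNode structure standing in for HTML::Node and a skip list compared by identity. It leaves out rules and extractors entirely.

```ruby
# Hypothetical node type standing in for HTML::Node.
TreeNode = Struct.new(:name, :children)

# Visit nodes depth-first in document order. Nodes on the skip list are not
# visited, and neither are their descendants.
def document_order(root, skip = [])
  visited = []
  stack = [root]
  while (node = stack.pop)
    next if skip.any? { |s| s.equal?(node) }
    visited << node.name
    # Reverse so the first child ends up on top of the stack.
    stack.concat(node.children.reverse)
  end
  visited
end

nav  = TreeNode.new("nav",  [TreeNode.new("a", [])])
main = TreeNode.new("main", [TreeNode.new("h1", []), TreeNode.new("p", [])])
body = TreeNode.new("body", [nav, main])

p document_order(body)         # prints ["body", "nav", "a", "main", "h1", "p"]
p document_order(body, [nav])  # prints ["body", "main", "h1", "p"]
```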

Skips processing the specified element(s).

If called with a single element, that element will not be processed.

If called with an array of elements, all the elements in the array are skipped.

If called with no element, skips processing the current element. This has the same effect as returning true.

For convenience this method always returns true. For example:

  process "h1" do |element|
    @header = element
    skip
  end

[Source]

     # File lib/scraper/base.rb, line 907
907:     def skip(elements = nil)
908:       case elements
909:       when Array: @skip.concat elements
910:       when HTML::Node: @skip << elements
911:       when nil: @skip << true
912:       when true, false: @skip << elements
913:       end
914:       # Calling skip(element) as the last statement is
915:       # redundant by design.
916:       return true
917:     end

Stops processing this page. You can call this early on if you discover there is no interesting information on the page, or once you are done extracting all useful information.

[Source]

     # File lib/scraper/base.rb, line 923
923:     def stop()
924:       @stop = true
925:     end
