Module Scraper::Reader
In: lib/scraper/reader.rb

Methods

find_tidy  parse_page  read_page

Classes and Modules

Class Scraper::Reader::HTMLParseError
Class Scraper::Reader::HTTPError
Class Scraper::Reader::HTTPInvalidURLError
Class Scraper::Reader::HTTPNoAccessError
Class Scraper::Reader::HTTPNotFoundError
Class Scraper::Reader::HTTPRedirectLimitError
Class Scraper::Reader::HTTPTimeoutError
Class Scraper::Reader::HTTPUnspecifiedError

Constants

REDIRECT_LIMIT = 3
DEFAULT_TIMEOUT = 30
PARSERS = [:tidy, :html_parser]
TIDY_OPTIONS = { :output_xhtml=>true, :show_errors=>0, :show_warnings=>false, :wrap=>0, :wrap_sections=>false, :force_output=>true, :quiet=>true, :tidy_mark=>false
Page = Struct.new(:url, :content, :encoding, :last_modified, :etag)
Parsed = Struct.new(:document, :encoding)

Public Instance methods

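parse_page(content, encoding = nil, options = nil, parser = :tidy)

Parses an HTML page and returns a Parsed struct holding the root HTML element (document) and the encoding. A charset declared in a meta http-equiv Content-Type tag overrides the encoding argument; if neither is present, the encoding defaults to utf8. Raises HTMLParseError if it cannot parse the HTML.

Options are passed to the parser. For example, when using Tidy you can pass Tidy cleanup options in the hash. The entries in TIDY_OPTIONS are always applied on top of them.

The last argument specifies which parser to use (see PARSERS). Tidy is used by default.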

[Source]

     # File lib/scraper/reader.rb, line 189
     def parse_page(content, encoding = nil, options = nil, parser = :tidy)
       begin
         # Get the document encoding from the meta header.
         if meta = content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
           if meta = meta[0].match(/charset=([\w-]*)/i)
             encoding = meta[1]
           end
         end
         encoding ||= "utf8"
         case (parser || :tidy)
         when :tidy
           # Make sure the Tidy path is set and always apply the default
           # options (these only control things like errors, output type).
           find_tidy
           options = (options || {}).update(TIDY_OPTIONS)
           options[:input_encoding] = encoding.gsub("-", "").downcase
           document = Tidy.open(options) do |tidy|
             html = tidy.clean(content)
             HTML::Document.new(html).find(:tag=>"html")
           end
         when :html_parser
           document = HTML::HTMLParser.parse(content).root
         else
           raise HTMLParseError, "No parser #{parser || "unspecified"}"
         end
         return Parsed[document, encoding]
       rescue Exception=>error
         raise HTMLParseError.new(error)
       end
     end
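
The charset detection at the top of parse_page can be tried in isolation. The following is a minimal sketch that reuses the two regular expressions from the listing; detect_meta_charset is a hypothetical helper name and the sample HTML is made up for illustration.

```ruby
# Hypothetical helper (not part of Scraper::Reader): mirrors the two regular
# expressions parse_page uses to read the charset from a meta tag.
def detect_meta_charset(content, default = "utf8")
  # Find a <meta ... http-equiv="content-type" ...> tag, case-insensitively.
  meta = content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
  # Within that tag, pull out the charset value, if there is one.
  charset = meta && meta[0].match(/charset=([\w-]*)/i)
  charset ? charset[1] : default
end

html = %(<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"></head></html>)
detected = detect_meta_charset(html)

# parse_page removes hyphens and downcases the name before handing it to Tidy
# as :input_encoding.
tidy_name = detected.gsub("-", "").downcase

puts detected   # ISO-8859-1
puts tidy_name  # iso88591
```

A page without such a meta tag falls back to the default, which matches the `encoding ||= "utf8"` line in the listing.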

read_page(url, options = nil)

Reads a Web page and returns its URL, content and cache control headers.

The request reads the Web page at the specified URL, which may be a URI object or a string that URI.parse accepts. Only http and https URLs are supported. It accepts the following options:

  • :last_modified - Last-Modified header value (from a previous request).
  • :etag - ETag header value (from a previous request).
  • :redirect_limit - Number of redirects allowed (default is 3).
  • :user_agent - The User-Agent header to send.
  • :http_timeout - HTTP open connection and read timeout, in seconds (default is 30).
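
As an illustration of what the :last_modified and :etag values are for, here is a minimal sketch of how validators from an earlier response are typically sent back in an HTTP conditional request, using the standard If-Modified-Since and If-None-Match request headers. build_conditional_get is a hypothetical helper, not part of Scraper::Reader, and the URL and header values are made up. The request is only constructed, never sent.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a conditional GET request
# from cache validators saved from a previous response.
def build_conditional_get(uri, options = {})
  request = Net::HTTP::Get.new(uri.request_uri)
  request["User-Agent"] = options[:user_agent] if options[:user_agent]
  # Standard HTTP revalidation headers.
  request["If-Modified-Since"] = options[:last_modified] if options[:last_modified]
  request["If-None-Match"] = options[:etag] if options[:etag]
  request
end

uri = URI.parse("http://example.com/articles?page=2")
request = build_conditional_get(uri,
  :last_modified => "Tue, 15 Nov 1994 12:45:26 GMT",
  :etag => '"abc123"',
  :user_agent => "ExampleScraper/1.0")

puts request.path
request.each_header { |name, value| puts "#{name}: #{value}" }
```

A server that honours these headers can answer 304 Not Modified instead of sending the body again.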

It returns a Page struct with the following members:

  • :url - The URL of the requested page (changes after a permanent redirect)
  • :content - The content of the response (nil if the page has not been modified)
  • :encoding - Document encoding taken from the charset of the Content-Type header (may be nil)
  • :last_modified - Last modified cache control header (may be nil)
  • :etag - ETag cache control header (may be nil)

If the page has not been modified since the last request, the content is nil.
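
A caller therefore has to be ready for a nil content. The sketch below shows one way to handle it, assuming the caller keeps its own copy of the previous content. The Page struct is redefined locally with the same members as in the constants, and current_content is a hypothetical helper.

```ruby
# Same members as the Page struct listed under Constants, redefined here so
# the example is self-contained.
Page = Struct.new(:url, :content, :encoding, :last_modified, :etag)

# Hypothetical helper: use the fresh content when there is some, otherwise
# fall back to the copy kept from the previous request.
def current_content(page, cached_content)
  page.content.nil? ? cached_content : page.content
end

fresh     = Page["http://example.com/", "<html>new</html>", "utf-8", nil, '"v2"']
unchanged = Page["http://example.com/", nil, nil, nil, '"v1"']

puts current_content(fresh, "<html>old</html>")      # fresh body wins
puts current_content(unchanged, "<html>old</html>")  # cached body is reused
```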

Raises HTTPError if an error prevents it from reading the page.

[Source]

     # File lib/scraper/reader.rb, line 109
     def read_page(url, options = nil)
       options ||= {}
       redirect_limit = options[:redirect_limit] || REDIRECT_LIMIT
       raise HTTPRedirectLimitError if redirect_limit == 0
       if url.is_a?(URI)
         uri = url
       else
         begin
           uri = URI.parse(url)
         rescue Exception=>error
           raise HTTPInvalidURLError.new(error)
         end
       end
       raise HTTPInvalidURLError unless uri.scheme =~ /^http(s?)$/
       begin
         http = Net::HTTP.new(uri.host, uri.port)
         http.use_ssl = (uri.scheme == "https")
         http.close_on_empty_response = true
         http.open_timeout = http.read_timeout = options[:http_timeout] || DEFAULT_TIMEOUT
         path = uri.path.dup # required so we don't modify path
         path << "?#{uri.query}" if uri.query
         # TODO: Specify which content types are accepted.
         # TODO: GZip support.
         headers = {}
         headers["User-Agent"] = options[:user_agent] if options[:user_agent]
         headers["Last-Modified"] = options[:last_modified] if options[:last_modified]
         headers["ETag"] = options[:etag] if options[:etag]
         response = http.request_get(path, headers)
         # TODO: Ignore content types that do not map to HTML.
       rescue TimeoutError=>error
         raise HTTPTimeoutError.new(error)
       rescue Exception=>error
         raise HTTPUnspecifiedError.new(error)
       end
       case response
       when Net::HTTPSuccess
         encoding = if content_type = response["Content-Type"]
           if match = content_type.match(/charset=([^\s]+)/i)
             match[1]
           end
         end
         return Page[(options[:source_url] || uri), response.body, encoding,
                     response["Last-Modified"], response["ETag"]]
       when Net::HTTPNotModified
         return Page[(options[:source_url] || uri), nil, nil,
                     options[:last_modified], options[:etag]]
       when Net::HTTPMovedPermanently
         return read_page(response["location"], # New URL takes effect
                          :last_modified=>options[:last_modified],
                          :etag=>options[:etag],
                          :redirect_limit=>redirect_limit-1)
       when Net::HTTPRedirection
         return read_page(response["location"],
                          :last_modified=>options[:last_modified],
                          :etag=>options[:etag],
                          :redirect_limit=>redirect_limit-1,
                          :source_url=>(options[:source_url] || uri)) # Old URL still in effect
       when Net::HTTPNotFound
         raise HTTPNotFoundError
       when Net::HTTPUnauthorized, Net::HTTPForbidden
         raise HTTPNoAccessError
       when Net::HTTPRequestTimeOut
         raise HTTPTimeoutError
       else
         raise HTTPUnspecifiedError
       end
     end
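
The redirect handling in read_page treats permanent and temporary redirects differently: after a permanent redirect the new URL is reported, while after a temporary one the original URL is kept. The sketch below reproduces only that bookkeeping against a table of canned responses, with no network access. RESPONSES, resolve and the example URLs are all hypothetical.

```ruby
# Hypothetical stand-in for HTTP responses: maps a URL to [status, location].
RESPONSES = {
  "http://example.com/old"  => [301, "http://example.com/new"],
  "http://example.com/new"  => [200, nil],
  "http://example.com/tmp"  => [302, "http://example.com/new"],
  "http://example.com/loop" => [302, "http://example.com/loop"]
}

# Follows redirects and returns the URL the caller should remember.
def resolve(url, limit = 3, source_url = nil)
  raise "redirect limit reached" if limit == 0
  status, location = RESPONSES.fetch(url)
  case status
  when 200
    source_url || url
  when 301
    # Permanent redirect: the new URL takes effect, so no source URL is kept.
    resolve(location, limit - 1)
  else
    # Temporary redirect: the old URL stays in effect.
    resolve(location, limit - 1, source_url || url)
  end
end

puts resolve("http://example.com/old")  # http://example.com/new
puts resolve("http://example.com/tmp")  # http://example.com/tmp
```

Because the limit is decremented on every hop and checked on entry, a page that keeps redirecting (such as the /loop entry) ends in an error instead of recursing forever.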

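Protected Instance methods

find_tidy()

Sets Tidy.path to the libtidy library bundled under ../tidy, unless a path is already set. It tries libtidy.so first, then libtidy.dll, then libtidy.dylib.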

[Source]

     # File lib/scraper/reader.rb, line 224
     def find_tidy()
       return if Tidy.path
       begin
         Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.so")
       rescue LoadError
         begin
           Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dll")
         rescue LoadError
           Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dylib")
         end
       end
     end
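
find_tidy works by trial and error, moving to the next file name when loading fails. A different approach, shown here only as a sketch and not as what the library does, is to pick the file name from the host platform. tidy_library_name is a hypothetical helper, and the platform patterns are an assumption about common host_os values.

```ruby
require "rbconfig"

# Hypothetical helper: maps a host OS string to the libtidy file name that
# find_tidy would otherwise discover by trying each one in turn.
def tidy_library_name(host_os = RbConfig::CONFIG["host_os"])
  case host_os
  when /mswin|mingw|cygwin/i then "libtidy.dll"    # Windows builds
  when /darwin/i             then "libtidy.dylib"  # Mac OS X
  else                            "libtidy.so"     # Linux and other Unix
  end
end

puts tidy_library_name("linux-gnu")
puts tidy_library_name("darwin21")
puts tidy_library_name("mingw32")
```

Called without an argument, the helper uses the host_os value of the running Ruby.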
