Ruby Gem Nokogiri XPath Example: Crawling Mashable And TechCrunch

20 Jan

In the past few days, I’ve been working on a web crawler for various WordPress websites. My goal was to get the title, author, date, and link for every single blog post on Mashable and TechCrunch. It took me a few tries of playing around with XPath to get the awesome Ruby Nokogiri gem to work, so I wanted to share my work with others. Here is how I crawled Mashable and TechCrunch:


You can also view these examples on GitHub.
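Before the full scripts, here's a minimal sketch of the core Nokogiri pattern both crawlers use: parse a page, scope to a container with a CSS selector, then run XPath relative to that node and read attributes like a hash. The div.post_content selector is the real one from Mashable's markup; the rest is just for illustration.

require 'open-uri'
require 'nokogiri'

#parse the front page into a searchable document
doc = Nokogiri::HTML(URI.open('http://mashable.com/'))

#CSS to find each post container, then XPath relative to that node
doc.css('div.post_content').each do |post|
  link = post.at_xpath('.//a')        #first <a> inside the post
  puts link['href'] unless link.nil?  #attributes read like hash lookups
end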

Mashable

=begin
  The Mashable crawler crawls Mashable and gets the Title, URL, Date, and Author for each post
=end

require 'open-uri'
require 'nokogiri'
require 'date'
require 'cgi'

private
def get_post_details(doc)
  #an array of hashes with each posts title, url, date, author
  posts = []

  doc.css('div.post_content').each do |post|

    details = post.xpath('.//a')[0].attributes

    #get url
    url = details['href'].value
    #only keep urls on mashable.com (skips feedproxy.google.com links)
    next unless url =~ /mashable\.com/
    puts url

    #get and parse the title; skip posts that don't have one
    next if details['title'].nil?
    title = details['title'].value.gsub(/Permanent Link to /, '')
    title = CGI.escape(title) #make sure the title is URL-encoded
    puts title

    #get the published date; 'sponsored posts' have no <time> tag, so skip them
    time_tag = post.xpath('.//time')[0]
    next if time_tag.nil?

    date_string = time_tag.attributes['datetime'].value
    date = DateTime.parse(date_string) #creates a DateTime object
    puts date.to_s

    #get the author from the text nested inside the post's first <span>
    author_details = post.xpath('.//span')[0].children
    author = author_details.children.inner_text
    puts author

    #store the data in a hash
    posts << {'title' => title, 'date' => date, 'author' => author, 'url' => url}
  end
  posts #returns an array of hashes with details for each post
end

public
def crawl_mashable
  #keeps track of page number
  page = 1

  #loop through each page on Mashable, and get the author, title, url, and date for each post
  loop do
    #get post details per page
    url = "http://mashable.com/page/#{page}/"
    doc = Nokogiri::HTML(URI.open(url)) #URI.open is provided by open-uri
    posts = get_post_details(doc)
    p posts

    #you've crawled all the pages once the returned posts array is empty
    break if posts.empty?

    #move on to the next page
    page += 1
  end
end
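With both methods defined, a run is just a single call at the bottom of the script, something like:

#kick off the crawl; prints each page's posts and stops once a page comes back empty
crawl_mashable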

TechCrunch

=begin
  The TechCrunch crawler crawls TechCrunch and gets the Title, URL, Date, and Author for each post
=end

require 'open-uri'
require 'nokogiri'
require 'date'
require 'cgi'

def get_post_details(doc)

  #get title and url
  title_and_url = []
  doc.css('h2.headline').each do |headline|
    details = headline.xpath('.//a')[0].attributes

    #URL-encode the href
    url = CGI.escape(details['href'].value)

    #title parsing: swap double quotes for single quotes, then URL-encode
    title = details['title'].value
    title = title.gsub(/"/, "'")
    title = CGI.escape(title)
    title_and_url << {'title' => title, 'url' => url}
  end

  #get author and date
  author_and_date = []
  doc.css('div.publication-info').each do |post|
    #author: the markup varies, so fall back one level up if the nested lookup fails
    author_details = post.xpath('.//div')[0].children
    begin
      author = author_details.children[0].content
      author = CGI.escape(author)
    rescue StandardError
      author = author_details[0].content
      author = CGI.escape(author)
    end

    #post date / time
    date_details = post.xpath('.//div')[1].children
    date_string = date_details[0].content
    begin
      date = DateTime.parse(date_string)
    rescue StandardError
      date = Time.now - 86400 #fall back to roughly a day ago if the string won't parse
    end

    author_and_date << {'author' => author, 'date' => date}
  end

  #combine the title and url with the author and date
  posts = []
  title_and_url.each_with_index do |post, i|
    post.store('publication_id', '4f0f8f4b9ff088c9a1000002') #tag each post with its publication_id
    post.update(author_and_date[i])
    posts << post
  end

  #returns an array of hashes with details for each post
  posts
end

public
def crawl_techcrunch
  #keeps track of Techcrunch page number
  page = 1

  #loop through each page on Techcrunch and get the author, title, url, and date for each post
  loop do
    #get post details per page
    url = "http://techcrunch.com/page/#{page}/"
    doc = Nokogiri::HTML(URI.open(url)) #URI.open is provided by open-uri
    posts = get_post_details(doc)
    p posts

    #you've crawled all the pages once the returned posts array is empty
    break if posts.empty?

    #move on to the next page
    page += 1
  end
end
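As with the Mashable version, a single call at the end of the script starts the crawl:

#start crawling TechCrunch; stops when a page returns no posts
crawl_techcrunch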

Let me know in the comments if there is a better or simpler way to do any of these 🙂
