How to most efficiently parse a web page using Node.js

I need to parse a simple web page and extract data from its HTML, such as the values of "src", "data-attr", and similar attributes. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.

P.S. This is the site I'm parsing. I want to get a list of the current tracks and make my own HTML5 app for listening on mobile devices.

Answers


I have done this a lot. You'll want to use PhantomJS if the website that you're scraping is heavily using JavaScript. Note that PhantomJS is not Node.js. It's a completely different JavaScript runtime. You can integrate through phantomjs-node or node-phantom, but they are both kinda hacky. YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches - this includes Zombie.js.

What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.

I wrote a blog post on using Cheerio with Request: "Quick and Dirty Screen Scraping with Node.js". But, again, if the page is JavaScript-intensive, use PhantomJS in conjunction with CasperJS.

Hope this helps.

Snippet using Request and Cheerio:

var request = require('request')
  , cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function(err, resp, body){
  if (err) return console.error(err); // request failed, nothing to parse
  var $ = cheerio.load(body); // declare locally instead of leaking globals
  var links = $('.sb_tlst h3 a'); // use your CSS selector here
  links.each(function(i, link){
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});

You could try PhantomJS. Here's the documentation for using it for screen scraping.


I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.


If you want to go with Phantom, use node-phantom. I have a GitHub repository that uses them together to generate PDF files from HTML, if you want to have a look. But I wouldn't go with Phantom, because it does more than you usually need, and Cheerio is faster.

