How to grab the page source from any dynamically generated webpage and then process it

 

I had an issue where I needed to scrape some data from a dynamically generated webpage. Open this page in a browser and you’ll see what I mean.

I tried a lot of options, and the solution that ended up working for me was Headless Chrome, which lets you launch Chrome without a visible window and control it from within your code. It's recommended to install Chrome Canary, and you can get it from here.

Essentially, the pseudocode for my script is as follows:

  1. launch Chrome in headless mode (no visible window)
  2. load a page and wait until the page is loaded
  3. then wait a little longer until all the JS on the page has completed
  4. then click on something using a querySelector to change the language to English
  5. then grab all the source code from the page
  6. then load that source code into cheerio and perform queries as needed

The full source code for this is below. You'll of course need to install the three packages it depends on too.

> yarn add chrome-remote-interface chrome-launcher cheerio
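As a rough sketch of how the six steps above can fit together (not necessarily identical to my original script), something like the following should work. The function name `scrapePage`, the `delay` helper, and the 2-second default wait are illustrative choices, and the URL and selector you pass in are up to you.

```javascript
// Sketch only: scrapePage, delay, and the default wait are illustrative names/values.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapePage(url, selector, settleMs = 2000) {
  // Required here so the file can be loaded before the packages are installed.
  const chromeLauncher = require('chrome-launcher');
  const CDP = require('chrome-remote-interface');
  const cheerio = require('cheerio');

  // 1. Launch Chrome in headless mode (no visible window).
  const chrome = await chromeLauncher.launch({
    chromeFlags: ['--headless', '--disable-gpu'],
  });
  let client;
  try {
    // Connect to the debugging port that chrome-launcher picked.
    client = await CDP({ port: chrome.port });
    const { Page, Runtime } = client;
    await Page.enable();
    await Runtime.enable();

    // 2. Load the page and wait for its load event.
    await Page.navigate({ url });
    await Page.loadEventFired();

    // 3. Wait a little longer so the page's own JS can finish.
    await delay(settleMs);

    // 4. Click an element (e.g. a language switcher) found via querySelector.
    if (selector) {
      await Runtime.evaluate({
        expression: `document.querySelector(${JSON.stringify(selector)}).click()`,
      });
      await delay(settleMs);
    }

    // 5. Grab the rendered source of the whole document.
    const { result } = await Runtime.evaluate({
      expression: 'document.documentElement.outerHTML',
    });

    // 6. Load the HTML into cheerio so it can be queried like jQuery.
    return cheerio.load(result.value);
  } finally {
    // Always shut down the connection and the browser, even on errors.
    if (client) await client.close();
    await chrome.kill();
  }
}
```

You could then call it as `scrapePage('https://example.com', '#english').then(($) => console.log($('title').text()))`, where both the URL and the `#english` selector are placeholders. A fixed delay is the simplest way to wait for the page's scripts, but it is a guess; pages that load slowly may need a longer value.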

I hope that helps some people! It took me ages to get to the point where I could scrape a dynamically generated webpage from Node.js.

July 8th, 2017 | gist | 2 Comments

About the Author:

Andrew Golightly is the lead web developer here at Golightly+. He is a passionate full-stack JavaScript developer who also creates native apps using React Native. To balance his love for coding, he also works as a counsellor.

2 Comments

  1. Yogesh Bansal, January 29, 2018 at 10:16 pm

    How do I loop through the pages if the website has multiple pages, or if I'd like to pass an array of URLs?
