I had an issue where I needed to scrape some data from a dynamically generated webpage. Open this page in a browser and you’ll see what I mean.
I tried a lot of options, and the solution that ended up working for me was Headless Chrome. It lets you launch Chrome without a visible window and drive it as a tool from within your code. It’s recommended to install Chrome Canary, which you can get from here.
Essentially, the pseudocode for my script goes as follows:
- launch Chrome in headless mode (no visible window)
- load a page and wait until the page is loaded
- then wait a little longer until all the JS on the page has completed
- then click on something using a querySelector to change the language to English
- then grab all the source code from the page
- then load that source code into cheerio and perform queries as needed
The full source code for this is below. You’ll of course need to install the 3 packages too.
> yarn add chrome-remote-interface chrome-launcher cheerio
https://gist.github.com/magician11/a979906401591440bd6140bd14260578
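As a rough outline of the steps listed earlier, here is a minimal sketch. It is not a copy of the gist. The function name `grabPageSource`, the `clickSelector` and `settleMs` options, the two-second default delay and the injectable `deps` object are all assumptions made for illustration, so adjust them for the page you're scraping.

```javascript
// Sketch only: names, options and the default delay are illustrative assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Load the three packages lazily, so stand-ins can be passed in instead.
function loadDeps() {
  return {
    chromeLauncher: require('chrome-launcher'),
    CDP: require('chrome-remote-interface'),
    cheerio: require('cheerio'),
  };
}

async function grabPageSource(url, { clickSelector, settleMs = 2000, deps = loadDeps() } = {}) {
  const { chromeLauncher, CDP, cheerio } = deps;

  // 1. Launch Chrome in headless mode (no visible window).
  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless', '--disable-gpu'] });
  let client;
  try {
    client = await CDP({ port: chrome.port });
    const { Page, Runtime } = client;
    await Page.enable();

    // 2. Load the page and wait for its load event.
    const loaded = Page.loadEventFired();
    await Page.navigate({ url });
    await loaded;

    // 3. Wait a little longer so the page's own JS can finish.
    await sleep(settleMs);

    // 4. Optionally click something (e.g. a language switch) via querySelector.
    if (clickSelector) {
      await Runtime.evaluate({
        expression: `document.querySelector(${JSON.stringify(clickSelector)}).click()`,
      });
      await sleep(settleMs);
    }

    // 5. Grab the rendered source of the page.
    const { result } = await Runtime.evaluate({
      expression: 'document.documentElement.outerHTML',
    });

    // 6. Load that source into cheerio for querying.
    return cheerio.load(result.value);
  } finally {
    // Always clean up, even if one of the steps above throws.
    if (client) await client.close();
    await chrome.kill();
  }
}

// Usage (hypothetical URL and selector):
// grabPageSource('https://example.com', { clickSelector: '.lang-en' })
//   .then(($) => console.log($('title').text()));
```

A fixed delay is the simplest way to "wait a little longer", but it is a guess. If the page is slow, the delay may need to be longer.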
I hope that helps some people! It took me ages to finally get to this place where I could scrape a dynamically generated webpage from Node.js.
2 responses to “How to grab the page source from any dynamically generated webpage and then process it”
How do I loop through if the website has multiple pages, or if I’d like to pass an array of URLs?
Hi Yogesh,
So far I’ve only focused on grabbing one specific page and then processing its source code with cheerio. You could of course encapsulate this code in a function that takes a URL, and then call that function from a loop that iterates through an array of URLs.
You can see the code for a module I built that uses this approach here https://github.com/magician11/sfcinemacity/blob/master/src/index.js
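As a rough sketch of that loop (not the code from the module linked above): `scrapeAll` is a name I've made up here, and `scrapePage` stands in for whatever function wraps the single-page code and takes a URL.

```javascript
// Sketch only: `scrapeAll` and `scrapePage` are hypothetical names.
// `scrapePage(url)` is assumed to return a promise for that page's scraped data.
async function scrapeAll(urls, scrapePage) {
  const results = [];
  for (const url of urls) {
    try {
      // One page at a time, so pages are never scraped in parallel.
      results.push({ url, data: await scrapePage(url) });
    } catch (error) {
      // Record the failure and carry on with the remaining URLs.
      results.push({ url, error: error.message });
    }
  }
  return results;
}

// Usage (hypothetical URLs):
// scrapeAll(['https://example.com/a', 'https://example.com/b'], scrapePage)
//   .then((results) => console.log(results));
```

Going through the URLs one after another is slower than firing them all off at once, but it keeps memory use predictable, since each scrape may launch its own Chrome.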
If I was going to crawl through a site, I would almost definitely be using the function Page.navigate (https://chromedevtools.github.io/devtools-protocol/tot/Page#method-navigate)
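To give an idea of what that could look like, here is a minimal sketch that reuses one Chrome session and calls `Page.navigate` for each page in turn. It assumes `client` is an already connected chrome-remote-interface client on which `Page.enable()` has been called. The `crawl` function, the `extractLinks` callback and the `maxPages` limit are hypothetical names introduced for this example.

```javascript
// Sketch only: `crawl`, `extractLinks` and `maxPages` are illustrative assumptions.
// `client` is assumed to be a connected chrome-remote-interface client with Page enabled.
// `extractLinks(html, url)` is assumed to return an array of URLs to visit next.
async function crawl(client, startUrl, extractLinks, maxPages = 10) {
  const { Page, Runtime } = client;
  const queue = [startUrl];
  const visited = new Set();
  const pages = [];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    // Navigate the existing tab instead of launching Chrome again.
    const loaded = Page.loadEventFired();
    await Page.navigate({ url });
    await loaded;

    // Grab the rendered source of the page we just loaded.
    const { result } = await Runtime.evaluate({
      expression: 'document.documentElement.outerHTML',
    });
    pages.push({ url, html: result.value });

    // Queue any links we have not seen yet.
    for (const link of extractLinks(result.value, url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return pages;
}
```

The `extractLinks` callback is where cheerio would come in, for example by loading the HTML and collecting the `href` of each anchor you care about.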