I needed to scrape some data from a dynamically generated webpage, meaning one whose content is built by JavaScript after the initial HTML loads. Open this page in a browser and you’ll see what I mean.
I tried a lot of options, and the solution that ended up working for me was Headless Chrome. It lets you launch Chrome without a visible window and drive it as a tool from within your code. Installing Chrome Canary is recommended, and you can get it from here.
Essentially, the pseudocode for my script is as follows:
- launch Chrome in headless mode (no visible window)
- load a page and wait until the page is loaded
- then wait a little longer until all the JS on the page has completed
- then click on something using a querySelector to change the language to English
- then grab all the source code from the page
- then load that source code into cheerio and perform queries as needed
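As a rough sketch of how those steps can map onto code (not the exact script), something like the following should be close. The URL, the selector, and the delay are placeholders, and the helper names `sleep`, `clickExpression`, and `scrape` are made up for illustration. It also assumes the installed package versions can be loaded with `require`, and it uses a fixed delay as a simple stand-in for "wait until the JS has completed".

```javascript
// Placeholders for illustration only; swap in the real page and selector.
const TARGET_URL = 'https://example.com/';
const LANGUAGE_SELECTOR = '#lang-en';
const SETTLE_MS = 2000; // arbitrary pause to let the page's JS finish

// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Build the expression that will be evaluated inside the page to click an element.
const clickExpression = (selector) =>
  `document.querySelector(${JSON.stringify(selector)}).click()`;

async function scrape(url, selector) {
  // Loaded here so the helpers above can be used without the packages installed.
  const chromeLauncher = require('chrome-launcher');
  const CDP = require('chrome-remote-interface');
  const cheerio = require('cheerio');

  // Launch Chrome in headless mode (no visible window).
  const chrome = await chromeLauncher.launch({
    chromeFlags: ['--headless', '--disable-gpu'],
  });
  let client;
  try {
    // Connect to the DevTools port that chrome-launcher picked.
    client = await CDP({ port: chrome.port });
    const { Page, Runtime } = client;
    await Page.enable();

    // Start listening for the load event before navigating, then wait for it.
    const loaded = Page.loadEventFired();
    await Page.navigate({ url });
    await loaded;

    // Wait a little longer for the page's own JS to complete.
    await sleep(SETTLE_MS);

    // Click the language switch, then give the page time to re-render.
    await Runtime.evaluate({ expression: clickExpression(selector) });
    await sleep(SETTLE_MS);

    // Grab the rendered source and hand it to cheerio for querying.
    const { result } = await Runtime.evaluate({
      expression: 'document.documentElement.outerHTML',
    });
    return cheerio.load(result.value);
  } finally {
    // Always close the connection and shut Chrome down.
    if (client) await client.close();
    await chrome.kill();
  }
}

// Example usage:
// scrape(TARGET_URL, LANGUAGE_SELECTOR)
//   .then(($) => console.log($('title').text()))
//   .catch(console.error);
```

A fixed pause is the bluntest way to wait for the page to settle. If the page exposes something you can test for, such as an element that only appears once rendering is done, polling for that with `Runtime.evaluate` is likely to be faster and more reliable.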
The full source code for this is below. You’ll also need to install these three packages:
> yarn add chrome-remote-interface chrome-launcher cheerio
I hope that helps some people! It took me ages to finally get to this place where I could scrape a dynamically generated webpage from Node.js.