Principal Engineer, Indix™
paambaati @thundubeedi
Rendering is what a browser does to display the contents of a webpage.
The rendered webpage's source HTML is our target.
Yes and no.
Crawling is fetching simple webpages, and doesn't work with SPAs, PWAs and webpages that use Ajax to lazily load data.
Rendering is intelligently fetching the page source for any kind of webpage.
Webpages are one of our many sources of product information.
Indix has more than ⚡️1 billion products across 62K brands from 2400 websites, and is tracking & discovering more every second.
Chrome Debugging Protocol is a set of APIs you can use to programmatically control Chrome.
An effort is underway to standardize the CDP API across all browsers (see the RemoteDebug project).
It was originally built alongside "headless" mode to make automated testing easier.
They're all too big, too slow, require a dummy display driver and/or don't have low-level network APIs for our use case.
☠️ PhantomJS is no longer actively maintained after headless Chrome came out.
Chrome exposes a remote debugging port, to which you can connect and start sending CDP commands. Some of the cool things you can do include:
Interest in CDP has shot through the roof in the past few months, and so there are a lot of new clients on top of the protocol.
Client | Language |
---|---|
Puppeteer* | Node.js |
Navalia | Node.js |
gcd | Golang |
PyChromeDevTools | Python |
cdp4j | Java |
chrome-reactive-kotlin | Kotlin |
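Under the hood, all of these clients speak the same wire format: JSON messages sent over a WebSocket to Chrome's remote debugging port. A command carries an `id`, a `method` and `params`; a reply echoes the `id`; an event has a `method` but no `id`. The sketch below shows only that framing, with no browser involved. `createSession`, `encode` and `decode` are illustrative names, not part of CDP or of any client listed above.

```javascript
// Minimal sketch of CDP message framing (no network I/O).
// createSession/encode/decode are hypothetical helper names.
function createSession() {
  let nextId = 1;
  const pending = new Map(); // command id -> method name awaiting a reply

  return {
    // Build the JSON text for one command, e.g. Page.navigate.
    encode(method, params = {}) {
      const id = nextId++;
      pending.set(id, method);
      return JSON.stringify({ id, method, params });
    },
    // Classify an incoming frame as a command reply or an event.
    decode(frame) {
      const msg = JSON.parse(frame);
      if (msg.id !== undefined) {
        const method = pending.get(msg.id);
        pending.delete(msg.id);
        return { kind: 'reply', method, result: msg.result, error: msg.error };
      }
      return { kind: 'event', method: msg.method, params: msg.params };
    },
  };
}

const session = createSession();
console.log(session.encode('Page.navigate', { url: 'https://example.com' }));
// A reply carries the id of the command it answers.
console.log(session.decode('{"id":1,"result":{"frameId":"F1"}}'));
// An event arrives unprompted and has no id.
console.log(session.decode('{"method":"Network.requestWillBeSent","params":{}}'));
```

In a real setup, the strings produced by `encode` would be written to the WebSocket and every incoming frame passed through `decode`; the client libraries in the table hide this bookkeeping behind friendlier APIs.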
... is our in-house rendering engine for Ajax-based sites, built on top of CDP.
Shadowfax is an NPM module, and predates the Chrome team's Puppeteer & Rendertron projects by months. It started as an experiment on the then-new headless mode.
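Shadowfax's source isn't shown here, so the following is only a rough sketch of what a CDP-driven render can look like in Node.js. It assumes a hypothetical client object with `send(method, params, sessionId)` and `on(event, handler)`; the `Target.attachToTarget` step and the choice to wait for `Page.loadEventFired` are assumptions of this sketch (a production renderer for Ajax-heavy pages would more likely wait for network activity to settle). `fakeClient` is an in-memory stand-in so the example runs without Chrome.

```javascript
// Sketch of a render flow over CDP against an abstract client.
// The client interface (send/on) is hypothetical; the method and event
// names (Target.*, Page.*, Network.*, Runtime.evaluate) are real CDP names.
async function renderPage(client, url) {
  // Open a blank tab and attach to it so commands can be routed to it.
  const { targetId } = await client.send('Target.createTarget', { url: 'about:blank' });
  const { sessionId } = await client.send('Target.attachToTarget', { targetId, flatten: true });

  try {
    // Count outgoing requests and note when the load event fires.
    let requests = 0;
    client.on('Network.requestWillBeSent', () => { requests += 1; });
    const loaded = new Promise((resolve) => client.on('Page.loadEventFired', resolve));
    await client.send('Network.enable', {}, sessionId);
    await client.send('Page.enable', {}, sessionId);

    // Navigate, then wait for the page's load event.
    await client.send('Page.navigate', { url }, sessionId);
    await loaded;

    // Read the rendered DOM back as a string.
    const { result } = await client.send('Runtime.evaluate', {
      expression: 'document.documentElement.outerHTML',
      returnByValue: true,
    }, sessionId);
    return { html: result.value, requests };
  } finally {
    // Always close the tab, even if a step above failed.
    await client.send('Target.closeTarget', { targetId });
  }
}

// In-memory stand-in for Chrome (assumption: canned replies, one request).
function fakeClient() {
  const handlers = {};
  const calls = [];
  const emit = (event) => (handlers[event] || []).forEach((h) => h({}));
  const replies = {
    'Target.createTarget': { targetId: 'T1' },
    'Target.attachToTarget': { sessionId: 'S1' },
    'Runtime.evaluate': { result: { type: 'string', value: '<html><body>hi</body></html>' } },
  };
  return {
    calls,
    on(event, handler) { (handlers[event] = handlers[event] || []).push(handler); },
    async send(method) {
      calls.push(method);
      if (method === 'Page.navigate') {
        emit('Network.requestWillBeSent');
        emit('Page.loadEventFired');
      }
      return replies[method] || {};
    },
  };
}

renderPage(fakeClient(), 'https://example.com').then((page) => console.log(page));
```

Keeping the flow behind a small `send`/`on` interface means the same function can be exercised against a fake, as here, or wired to any real CDP client.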
1. `createTarget()` command to open a new tab.
2. `navigate(url)` command to navigate to given URL.
3. `requestWillBeSent` events to track the page's network requests.
4. `document.documentElement.outerHTML` with `Runtime.evaluate()` to get full HTML source.
5. `closeTarget()` command to close the tab.

Approach #1
(Shadowfax + Background Chrome) + Docker + Mesos = 🙁
What happened
Approach #2
Shadowfax + Docker + Chrome As A Service + Mesos = 🙁
What happened
Approach #4961
Shadowfax + On-Demand Chrome + Mesos = 😎
What we learned
4x over PhantomJS, 3x over Nightmare.js
2x over PhantomJS & Nightmare.js