Web Scraping with JavaScript and Node.js: Cheerio vs Puppeteer
Lead Scraping Automation Engineer
Key Takeaways:
- The first decision is static vs dynamic, and it sets your whole toolchain. If the data is in the initial HTML, parse it with Cheerio; if JavaScript builds it, you need a real browser, driven by a tool like Puppeteer.
- Cheerio is a parser, not a browser, and that's the point. It loads HTML and gives you jQuery-style selectors with no rendering overhead, for pages whose content is already in the markup.
- Puppeteer renders, so it sees what users see. For client-rendered pages, infinite scroll, and content behind interactions, Puppeteer runs the JavaScript and hands you the finished DOM.
- Both run on the same Scrapeless session. Fetch the HTML through the cloud browser, then either parse it with Cheerio or extract it live with Puppeteer — same anti-detection and residential egress underneath.
- Verified side by side. The same catalog page below yields 20 items through Cheerio and 20 through Puppeteer — proof the two paths agree when the content is present.
- Free to start. New Scrapeless accounts include free Scraping Browser runtime — sign up at app.scrapeless.com.
Introduction: pick the right tool for the page
JavaScript and Node.js are a natural fit for web scraping — the same language the browser runs, with a mature ecosystem for HTTP and HTML. But "scrape with Node" splits immediately into two very different jobs, and picking the wrong one wastes effort.
If the data you want is already in the page's initial HTML, you don't need a browser at all — you need a fast parser. That's Cheerio: load the markup, run selectors, done. If the data is built by JavaScript after load — a React app, an infinite-scroll feed, content that appears only after a click — then a parser sees nothing, because the HTML it parses is an empty shell. That's where Puppeteer (or Playwright) comes in: it runs the page's JavaScript and gives you the rendered DOM.
The practical problem under both is access: real sites fingerprint, rate-limit, and geo-gate. This guide runs both approaches on Scrapeless Scraping Browser — an anti-detection cloud browser — so the fetch succeeds, then shows when to reach for Cheerio and when for Puppeteer. Both paths below were run live against the same page.
Static vs dynamic: how to tell
| | Static (Cheerio) | Dynamic (Puppeteer) |
|---|---|---|
| Where the data lives | In the initial HTML | Built by JS after load |
| Tool | A parser | A real browser |
| Speed | Fast, low overhead | Slower, renders the page |
| Use when | Server-rendered pages, articles, catalogs | SPAs, infinite scroll, post-click content |
The quick test: view the page source (not the inspector). If the data is in the raw HTML, Cheerio is enough. If the source is a near-empty shell and the content only appears in the live DOM, you need Puppeteer.
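The quick test can also be scripted. The sketch below is a rough approximation under stated assumptions: `countMarker` and `looksStatic` are hypothetical helpers (not part of Cheerio or Puppeteer), the check is a plain substring count rather than real selector matching, and it relies on the global `fetch` that ships with Node.js 18+. On protected sites a direct request may be blocked, in which case viewing the source in a browser remains the fallback.

```javascript
// Hypothetical helper: count how often a marker string (for example a class
// name) appears in raw, unrendered HTML. A plain substring count, not a parser.
function countMarker(rawHtml, marker) {
  if (!marker) return 0;
  return rawHtml.split(marker).length - 1;
}

// Hypothetical helper: fetch the raw HTML without a browser and report whether
// the marker is present. Zero hits suggests the content is built by JavaScript.
async function looksStatic(url, marker) {
  const res = await fetch(url); // global fetch ships with Node.js 18+
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const hits = countMarker(await res.text(), marker);
  return { hits, static: hits > 0 };
}

// Usage (assumes each catalog item carries the class "product_pod"):
// looksStatic('https://books.toscrape.com/', 'product_pod').then(console.log);
```

A hit count above zero only shows that the marker string is in the source; it is a hint to try Cheerio first, not a guarantee that every field you need is there.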
Why Scrapeless Scraping Browser
Scrapeless Scraping Browser is a customizable, anti-detection cloud browser designed for web crawlers and AI agents. For Node scraping specifically, it brings:
- A standard Puppeteer connection: Puppeteer.connect() returns a normal Browser, so your code is unchanged.
- Cloud-side JS rendering: dynamic pages actually build their content, so both page.content() (for Cheerio) and live extraction work.
- Residential proxies in 195+ countries: pin egress so the fetch succeeds and stays consistent.
- Anti-detection fingerprinting: the session reads as a real browser, so pages render instead of challenging.
- Session persistence: keep cookies warm across a multi-page run.
Get your API key on the free plan at app.scrapeless.com.
Prerequisites
- Node.js 18 or newer
- A Scrapeless account and API key — sign up at app.scrapeless.com
- Basic familiarity with CSS selectors
Install
```bash
npm install @scrapeless-ai/sdk puppeteer-core cheerio
```

```bash
export SCRAPELESS_API_KEY="your_api_token_here"
```
Connect
```javascript
// Uses top-level await: run as an ES module (.mjs or "type": "module").
import { Puppeteer } from '@scrapeless-ai/sdk';

// Open a cloud browser session; the result is a regular Puppeteer Browser.
const browser = await Puppeteer.connect({
  apiKey: process.env.SCRAPELESS_API_KEY,
  sessionName: 'js-node-scraping',
  proxyCountry: 'US',
  sessionTTL: 300,
});

const page = await browser.newPage();
await page.goto('https://books.toscrape.com/', {
  waitUntil: 'domcontentloaded',
  timeout: 60000, // milliseconds
});
```
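The connect snippet never closes the session. One way to make cleanup automatic is a small try/finally wrapper, sketched below. `withBrowser` is a hypothetical helper, not part of the Scrapeless SDK; it only assumes that `connect` resolves to an object with the standard Puppeteer `browser.close()` method.

```javascript
// Hypothetical wrapper: run `task` with a connected browser and always close it,
// even when the task throws. `connect` is any function that resolves to a
// Puppeteer Browser, for example () => Puppeteer.connect({ ... }).
async function withBrowser(connect, task) {
  const browser = await connect();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // standard Puppeteer Browser method
  }
}

// Usage sketch (connect options taken from the snippet above):
// const pageTitle = await withBrowser(
//   () => Puppeteer.connect({ apiKey: process.env.SCRAPELESS_API_KEY, proxyCountry: 'US' }),
//   async (browser) => {
//     const page = await browser.newPage();
//     await page.goto('https://books.toscrape.com/');
//     return page.title();
//   },
// );
```

Passing the connect call in as a function keeps the wrapper independent of any one SDK, so the same pattern should work with a locally launched browser.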
Path A — Cheerio (static parsing)
When the content is in the HTML, grab the markup with page.content() and parse it with Cheerio. The selector API is jQuery-style, so it reads naturally:
```javascript
import * as cheerio from 'cheerio';

// Serialize the current DOM and parse it outside the browser.
const html = await page.content();
const $ = cheerio.load(html);

const titles = $('.product_pod h3 a')
  .map((i, el) => $(el).attr('title'))
  .get();

console.log(titles.length, '—', titles[0]);
// 20 — A Light in the Attic
```
Cheerio doesn't render anything — it just parses the string you give it. That makes it fast and ideal once you already hold the HTML. You can also use it on HTML from any source, not only a browser.
Path B — Puppeteer (dynamic extraction)
When the content is built by JavaScript, extract it from the live DOM inside the rendered page. Same selectors, but evaluated in the browser after the page's scripts have run:
```javascript
// Runs inside the page, after its scripts have built the DOM.
const titles = await page.evaluate(() =>
  [...document.querySelectorAll('.product_pod h3 a')].map((a) => a.getAttribute('title')),
);

console.log(titles.length, '—', titles[0]);
// 20 — A Light in the Attic
```
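On the same catalog page, both paths return the same 20 titles: because the content is present in the HTML, either approach works. The difference shows on a client-rendered page. Cheerio run on the raw, unrendered HTML (what a plain HTTP request returns) would find nothing, while the Puppeteer path still returns the items because the page rendered first. Note that page.content() serializes the DOM as it stands when called, so Cheerio can also parse client-rendered markup once the browser has rendered it.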
For dynamic content that loads on interaction, drive the page before extracting — scroll for lazy content, click to reveal, then waitForSelector on the result:
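```javascript
// Scroll to the bottom to trigger lazy loading.
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
// Wait for the content itself rather than sleeping for a fixed time.
await page.waitForSelector('.product_pod', { timeout: 10000 });
// ...then extract as above
```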
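One caveat with the snippet above: if elements matching the selector are already present before the scroll, waitForSelector resolves immediately and says nothing about newly loaded items. For infinite-scroll feeds, a possible refinement is to wait until the match count grows, and to stop when it no longer does. The sketch below assumes a Puppeteer `page`; `scrollUntilStable` and its `maxRounds` and `timeout` options are hypothetical names introduced for illustration, while `page.$$eval`, `page.evaluate`, and `page.waitForFunction` are standard Puppeteer methods.

```javascript
// Hypothetical helper: keep scrolling until no new elements matching `selector`
// appear within `timeout` ms, or until `maxRounds` scrolls have been made.
// Returns the final number of matching elements.
async function scrollUntilStable(page, selector, { maxRounds = 10, timeout = 5000 } = {}) {
  let count = await page.$$eval(selector, (els) => els.length);
  for (let round = 0; round < maxRounds; round += 1) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    try {
      // Wait on content: resolves as soon as the match count grows past `count`.
      await page.waitForFunction(
        (sel, prev) => document.querySelectorAll(sel).length > prev,
        { timeout },
        selector,
        count,
      );
    } catch {
      // The wait failed (typically a timeout): nothing new loaded, so stop.
      break;
    }
    count = await page.$$eval(selector, (els) => els.length);
  }
  return count;
}

// Usage sketch:
// const total = await scrollUntilStable(page, '.product_pod', { maxRounds: 5 });
```

The `maxRounds` cap is a deliberate guard: a feed that never ends would otherwise keep the session scrolling until its TTL expires.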
Choosing between them
- Content in the raw HTML? Use Cheerio — it's faster and simpler.
- Content built by JavaScript, infinite scroll, or behind a click? Use Puppeteer to render, then extract.
- Both on one page? Common: render with Puppeteer, then hand page.content() to Cheerio if you prefer its selector ergonomics for the static parts.
What You Get Back
Either path produces the same flat list when the data is present:
```json
{
  "count": 20,
  "first": "A Light in the Attic"
}
```

Real capture: both Cheerio and Puppeteer returned 20 titles from the same page.
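The agreement between the two paths can be checked in code rather than by eye. The sketch below assumes the Cheerio and Puppeteer results were stored in two separately named arrays; `summarize` and `pathsAgree` are hypothetical helpers introduced for illustration.

```javascript
// Hypothetical helper: reduce a list of titles to the summary shape shown above.
function summarize(titles) {
  return { count: titles.length, first: titles[0] ?? null };
}

// Hypothetical helper: true when both paths returned the same titles in the
// same order.
function pathsAgree(cheerioTitles, puppeteerTitles) {
  return (
    cheerioTitles.length === puppeteerTitles.length &&
    cheerioTitles.every((title, i) => title === puppeteerTitles[i])
  );
}

// Usage sketch (array names are assumptions):
// console.log(JSON.stringify(summarize(cheerioTitles), null, 2));
// if (!pathsAgree(cheerioTitles, puppeteerTitles)) console.warn('paths disagree');
```

A mismatch on a page you believed was static is a useful signal that some of the content is being added or changed by JavaScript after load.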
A few honest observations:
- Don't render when you don't have to. If the HTML already has the data, Cheerio skips the rendering cost entirely.
- Render when the source is a shell. A near-empty raw HTML with a populated live DOM is the signal to use Puppeteer.
- Wait on content, not the clock. For dynamic pages, waitForSelector beats a fixed setTimeout.
- Selectors are shared knowledge. The same CSS selectors work in Cheerio and in querySelectorAll, so moving between paths is cheap.
Conclusion: one decision, two clean paths
Web scraping with JavaScript and Node.js comes down to a single early call: is the data in the HTML, or built by JavaScript? Cheerio handles the first case at parser speed; Puppeteer handles the second by rendering the page. Running both on Scrapeless Scraping Browser means the fetch succeeds either way, with residential egress and anti-detection underneath. For a deeper anti-bot workflow, see the Scrapling + Scrapeless guide; the Scraping Browser product page and docs cover the full SDK surface. Check the raw HTML first, reach for Cheerio when you can and Puppeteer when you must, and wait on content, not the clock.
Ready to Build Your AI-Powered Data Pipeline?
Join our community to claim a free plan and connect with developers building Node scrapers: Discord · Telegram.
Sign up at app.scrapeless.com for free Scraping Browser runtime and adapt the patterns above to the static and dynamic pages your workflow needs. See pricing for scale.
FAQ
Q: When should I use Cheerio instead of Puppeteer?
When the data is already in the page's initial HTML. Cheerio just parses markup, so it's faster and simpler — no rendering. Use Puppeteer when JavaScript builds the content.
Q: How do I know if a page is static or dynamic?
View the raw page source (not the inspector). If the data is in the source, it's static — Cheerio works. If the source is a near-empty shell and content only appears in the live DOM, it's dynamic — use Puppeteer.
Q: Can I use both on the same page?
Yes. Render with Puppeteer, then pass page.content() to Cheerio if you prefer its selector ergonomics for the static parts.
Q: Cheerio vs Playwright vs Puppeteer — which?
Cheerio for static parsing. Puppeteer or Playwright (both drive a full browser) for dynamic rendering. Pick whichever your stack already uses; the Scrapeless session works with both over CDP.
Q: Do I need a proxy?
For public static pages, often not — but pinning proxyCountry gives a consistent residential IP that real sites treat as a normal visitor, which matters more as you scale.
Q: Can I run this without an AI agent?
Yes. It's the Scrapeless SDK plus plain Puppeteer and Cheerio — no agent required.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.



