
How to Build Production-Grade Web Scrapers with Scrapling and Scrapeless in Python

Ava Wilson

Expert in Web Scraping Technologies

25-May-2026

Key Takeaways:

  • Scrapling ships three fetchers and adaptive selectors. The HTTP Fetcher (with browser TLS impersonation), the Playwright-backed DynamicFetcher, and the StealthyFetcher cover static pages, JavaScript-rendered pages, and medium anti-bot in one Python library, and adaptive selectors locate elements by structure, not just by a brittle CSS path.
  • The escalation is HTTP → stealth → cloud browser. Start with the cheapest fetcher that works; when local stealth hits IP reputation, advanced bot managers, or geo-locked content, you escalate to a cloud browser without rewriting your parsing code.
  • The integration is one line. Point Scrapling's Playwright fetcher at a Scrapeless session over CDP with DynamicFetcher.fetch(url, cdp_url=session.browser_ws_endpoint), and the rendering, proxy egress, and fingerprinting all move cloud-side.
  • Scrapeless handles the egress and the fingerprint. The Scrapeless Scraping Browser routes through residential proxies in 195+ countries and randomizes the browser fingerprint per session, so the cloud browser renders pages that a local stealth browser gets filtered on.
  • Adaptive selectors survive DOM drift. Scrapling can relocate an element after a layout change by matching its previous attributes and position, so a scraper keeps returning rows when the target site rotates its markup.
  • Free to start. New Scrapeless accounts include free Scraping Browser runtime; sign up at Scrapeless.

Introduction: When local stealth runs out of road

Dynamic, JavaScript-heavy, anti-bot-protected pages are where simple HTTP scrapers quietly fail. A requests + BeautifulSoup script returns HTTP 200 and an empty result set: the markup it parsed never contained the data, because the prices, listings, or reviews are injected by JavaScript after the initial response. The page looks fine in a browser and looks empty to your scraper.

Scrapling raises the floor. It is a fast Python scraping library with three fetchers (an HTTP Fetcher that impersonates a real browser's TLS handshake, a Playwright-backed DynamicFetcher for JavaScript rendering, and a StealthyFetcher that adds stealth patches and Cloudflare handling) plus adaptive selectors that survive DOM changes. That covers static pages and a good amount of medium anti-bot. But a browser running on your laptop still carries a datacenter or home IP with a known reputation, and advanced bot managers fingerprint the automation regardless of how clean the page-level stealth is. At that point the page renders for a human and gets challenged for you.

This tutorial builds a Python pipeline in two tiers. Tier 1 is Scrapling on its own, the right tool for static and medium-protected pages. Tier 2 routes Scrapling's DynamicFetcher through the Scrapeless Scraping Browser over CDP, so the rendering happens cloud-side behind residential proxies and per-session anti-detection fingerprinting while your Scrapling parsing code stays exactly the same. For the same Scrapeless Scraping Browser primitive driven through an agent framework instead of a fetcher, see the LangChain integration post.


What You Can Build

The two-tier pattern (Scrapling fetchers in front, Scrapeless Scraping Browser behind the escalation) covers most of the jobs that break a plain HTTP scraper:

  • Price and stock monitors on SPA storefronts. Render single-page-app product pages whose prices hydrate through a second XHR, then extract the numbers Scrapling parses out of the rendered DOM.
  • SERP-adjacent extraction. Pull organic result blocks and snippets from search-style result pages that ship as JavaScript, then page through them with adaptive selectors.
  • Lead lists from JavaScript directories. Walk business-listing and member-directory sites that render rows client-side, and collect contact fields into typed records.
  • Geo-specific snapshots via residential egress. Capture the listings, pricing, or availability a local user would see by pinning the Scrapeless proxy country, rather than whatever your office IP resolves to.
  • RAG ingestion of rendered pages. Render publisher and documentation pages to clean content for an embedding pipeline, so the retrieval layer indexes what the page actually shows, not an empty shell.
  • Resilient scrapers that survive layout changes. Lean on Scrapling's adaptive selectors so a scheduled scraper keeps returning rows after the target site reshuffles its DOM, instead of failing silently on the next run.
  • Hard-target extraction behind advanced anti-bot. Escalate the same Scrapling code to the Scrapeless Scraping Browser when a site front-loads an advanced bot manager that local stealth cannot clear.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this post is for demonstration purposes only.


Why pair Scrapling with Scrapeless

Scrapling handles parsing, fetcher ergonomics, and adaptive selectors; the Scrapeless Scraping Browser handles the evasion plumbing that a local browser cannot. The two slot together cleanly because the handoff is a single CDP endpoint.

  • Anti-detection cloud browser. The Scrapeless Scraping Browser runs a self-developed Chromium with full cloud-side JavaScript rendering, so SPAs, infinite-scroll feeds, and lazy-loaded panels hydrate before Scrapling parses them.
  • Residential proxies in 195+ countries. Set proxy_country when you mint a session and the cloud browser egresses from real residential IPs in the region you target, so geo-bound pages return what a local user sees.
  • Per-session fingerprint randomization. Each session gets a randomized fingerprint (user agent, timezone, WebGL, and canvas) so repeated runs do not collapse into a single detectable identity.
  • cdp_url drop-in. Pass the Scrapeless session endpoint as cdp_url to DynamicFetcher.fetch(...) and nothing else in Scrapling's API changes: same selectors, same result objects, same parsing code.
  • Session persistence via session_ttl. Hold a session open across multiple page loads by setting session_ttl at creation, so warm cookies and navigation state carry between requests in a single run.

Runtime is free to start and scales with usage; see Scrapeless pricing for the tiers, and get your API key on the free plan at Scrapeless.


How Scrapling compares to requests, BeautifulSoup, and Scrapy

If your current stack is requests + BeautifulSoup or Scrapy, here is where Scrapling fits and what changes once the Scrapeless cloud browser sits behind it.

| Tool | Renders JavaScript | Anti-bot / stealth | Selector resilience | Best for |
| --- | --- | --- | --- | --- |
| requests + BeautifulSoup | No | None (raw HTTP) | Manual; breaks on redesign | Small static pages and JSON APIs |
| Scrapy | Only via add-ons (e.g. a Playwright integration) | None built in | Manual; breaks on redesign | Large async crawls you build and host yourself |
| Scrapling: Fetcher | No | Browser-TLS impersonation | Adaptive selectors available | Fast static fetches with fingerprint-aware HTTP |
| Scrapling: DynamicFetcher / StealthyFetcher | Yes (local browser) | Stealth patches, Cloudflare handling | Adaptive selectors | JS pages and medium anti-bot, on your own machine |
| Scrapling + Scrapeless (cdp_url) | Yes (cloud browser) | Residential proxies in 195+ countries + per-session fingerprinting | Adaptive selectors | JS-heavy, geo-bound, or hard anti-bot pages at scale |

The progression is additive. Keep requests/BeautifulSoup where it already works, reach for Scrapling's fetchers when a page needs a browser or fingerprint-aware HTTP, and route Scrapling through the Scrapeless Scraping Browser over cdp_url when local rendering gets filtered. The parsing code, page.css(...) and page.xpath(...), stays identical across all three.


Prerequisites

Before you start, make sure you have:

  • Python 3.10+: Scrapling 0.4.8 requires it.
  • pip: to install the packages below.
  • A Scrapeless account and API key: sign up for the free plan at the Scrapeless website, then grab your key from Settings → API Key Management.
  • Basic familiarity with CSS/XPath selectors and the terminal: you will use both to fetch pages and pull values out of them.

Install

You only need two packages: Scrapling for fetching and parsing, and the official Scrapeless SDK for minting cloud-browser sessions.

1. Install Scrapling and the Scrapeless SDK

bash
pip install "scrapling[fetchers]" scrapeless
scrapling install   # fetches local browsers for DynamicFetcher / StealthyFetcher (skip if you only use the remote Scrapeless cloud browser via cdp_url)

scrapling[fetchers] gives you the fetch-and-parse layer (the Fetcher, DynamicFetcher, and StealthyFetcher classes plus a parsel-like selector API), while scrapeless is the official SDK that mints Scrapeless Scraping Browser sessions and hands you a CDP endpoint to connect to.

2. Set your Scrapeless API key

Export your key so the SDK can read it:

bash
export SCRAPELESS_API_KEY=your_api_token_here

On Windows, use setx SCRAPELESS_API_KEY "your_api_token_here" (persistent, new shell) or $env:SCRAPELESS_API_KEY="your_api_token_here" (current PowerShell session). The Scrapeless SDK reads this variable automatically, so you do not have to pass the key in code.
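
Because the SDK reads the variable implicitly, a missing key tends to surface later as an authentication failure. The sketch below shows one way to fail early with a clear message. The helper name require_api_key is hypothetical and is not part of the Scrapeless SDK; only the variable name SCRAPELESS_API_KEY comes from the text above.

```python
import os


def require_api_key(env=os.environ, name="SCRAPELESS_API_KEY"):
    """Return the API key from the environment mapping or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating a session")
    return key


# Demo with an explicit mapping so the example runs without a real key.
print(require_api_key({"SCRAPELESS_API_KEY": "demo-token"}))  # demo-token
```

Calling require_api_key() with no arguments at the top of a script checks the real environment before any session is minted.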

3. Smoke-test the install

Confirm the environment by importing the three fetchers and pulling a few values from a static page:

python
from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

page = Fetcher.get("https://quotes.toscrape.com/")
print(page.status, len(page.css("span.text::text")), "quotes")  # -> 200 10 quotes

If you see 200 10 quotes, Scrapling is installed and parsing correctly, and you are ready to wire in the Scrapeless Scraping Browser.


Step 1: The three Scrapling fetchers

Scrapling ships three fetchers that trade speed for evasion. The rule of thumb is to pick the lightest one that returns the data you need, then escalate only when a request comes back blocked or empty: start with Fetcher for plain HTTP, move to DynamicFetcher when the page needs JavaScript to render, and reach for StealthyFetcher when an anti-bot layer is in the way.

| Fetcher | Engine | Use when |
| --- | --- | --- |
| Fetcher | HTTP via curl_cffi (browser-TLS impersonation) | Static pages and JSON APIs; no JavaScript needed |
| DynamicFetcher | Playwright | JS-rendered pages, SPAs, lazy-loaded content |
| StealthyFetcher | Stealth Playwright | Anti-bot defenses such as Cloudflare interstitials |

All three import from the same module and return a parsel-like response, so the selector API (.css(...), .xpath(...)) is identical regardless of which fetcher produced the page.

python
from scrapling.fetchers import Fetcher

# HTTP fetch with browser-TLS impersonation; returns a parsel-like Response.
page = Fetcher.get("https://books.toscrape.com/", impersonate="chrome", stealthy_headers=True)
titles = page.css("article.product_pod h3 a::attr(title)")  # -> every book title on the page

python
from scrapling.fetchers import DynamicFetcher

# Playwright renders the JS, then returns the hydrated page as a parsel-like Response.
# Needs a local browser (run `scrapling install`) or a remote one via cdp_url (Step 3).
page = DynamicFetcher.fetch("https://quotes.toscrape.com/js/", network_idle=True)
quotes = page.css("span.text::text")  # -> the JS-rendered quotes, now visible

python
from scrapling.fetchers import StealthyFetcher

# Stealth Playwright attempts the anti-bot handshake, then returns the page.
page = StealthyFetcher.fetch("https://example.com/protected-page",
                             solve_cloudflare=True, block_webrtc=True, hide_canvas=True)
content = page.css("main ::text")  # -> page text once the challenge clears
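
The rule of thumb from the start of this step (lightest fetcher first, escalate only on an empty or blocked result) can be sketched as a small ladder. This is an illustration under stated assumptions, not part of Scrapling: fetch_with_escalation and the two stand-in tiers are hypothetical names, and each real tier would wrap one of the fetchers above together with its selector call.

```python
from typing import Callable, Iterable, Sequence, Tuple

# A tier takes a URL and returns the extracted values (possibly empty).
FetchStep = Callable[[str], Sequence[str]]


def fetch_with_escalation(url: str, ladder: Iterable[Tuple[str, FetchStep]]):
    """Try each tier in order; return (tier_name, values) for the first non-empty result."""
    for name, step in ladder:
        try:
            values = step(url)
        except Exception:
            # A blocked or failed tier is treated like an empty one: move up the ladder.
            continue
        if values:
            return name, list(values)
    return None, []


# Stand-in tiers so the sketch runs offline.
def http_tier(url):
    return []  # simulates a JS page that parses to zero rows over plain HTTP


def browser_tier(url):
    return ["quote 1", "quote 2"]  # simulates values parsed from a rendered DOM


tier, rows = fetch_with_escalation(
    "https://quotes.toscrape.com/js/",
    [("http", http_tier), ("browser", browser_tier)],
)
print(tier, len(rows))  # browser 2
```

Catching every exception keeps the sketch short; a production version would likely narrow that to the errors that actually signal a block.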

Step 2: Adding a proxy (and where local stealth stops)

A datacenter IP is one of the first things a bot manager flags, however clean its history. Routing requests through residential proxies cuts those IP-reputation blocks because the egress address looks like an ordinary home connection. Every Scrapling fetcher accepts a proxy through the same proxy= argument, so you can add one without changing the rest of your code.

python
from scrapling.fetchers import Fetcher, StealthyFetcher

# Pass a proxy string to any fetcher via proxy=.
page = Fetcher.get("https://books.toscrape.com/",
                   proxy="http://<user>:<pass>@<host>:<port>")

# StealthyFetcher takes the same argument while it works the anti-bot challenge.
page = StealthyFetcher.fetch("https://example.com/protected-page",
                            proxy="http://<user>:<pass>@<host>:<port>",
                            solve_cloudflare=True)

This is where Scrapeless fits the proxy story. Scrapeless provides residential proxies in 195+ countries, and there are two ways to put them in front of Scrapling. The simplest path applies them at the cloud-browser session level: when you connect DynamicFetcher to a Scrapeless Scraping Browser session, you pin the egress with proxy_country (shown in the next step) and never touch a proxy string at all. Scrapeless also offers a standalone proxy product whose gateway credentials drop straight into Scrapling's proxy= argument; see Scrapeless proxy solutions for the offering and docs.scrapeless.com for the exact gateway string to paste in place of the <user>:<pass>@<host>:<port> placeholder above.

A good proxy fixes the IP-reputation problem, but it does not fix everything. Local stealth browsers still get filtered by advanced bot managers and stall on heavy client-side JavaScript, no matter how clean the egress IP is. That is exactly where the Scrapeless cloud browser takes over, as covered in the next step.
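
One practical detail when filling in the <user>:<pass>@<host>:<port> placeholder by hand: credentials containing characters such as @, : or / must be percent-encoded, or the URL parses incorrectly. The sketch below assembles the string with the standard library. build_proxy_url is a hypothetical helper, and the host and credentials are made-up values, not a real gateway.

```python
from urllib.parse import quote


def build_proxy_url(user: str, password: str, host: str, port: int, scheme: str = "http") -> str:
    """Assemble scheme://user:pass@host:port, percent-encoding the credentials."""
    # safe="" also encodes "/" so every reserved character in the credentials is escaped.
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


print(build_proxy_url("acct-1", "p@ss:w/rd", "proxy.example.invalid", 8000))
# http://acct-1:p%40ss%3Aw%2Frd@proxy.example.invalid:8000
```

The returned string is what would be passed as the proxy= argument shown earlier in this step.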


Step 3: Route Scrapling through the Scrapeless cloud browser (cdp_url)

When local stealth runs out (heavy client-side rendering, an advanced bot manager, or a page that only resolves from a specific country), mint a Scrapeless session with the SDK and hand its CDP endpoint to Scrapling. The SDK mints the session; Scrapling drives it. The residential proxy (proxy_country) and browser fingerprinting are handled cloud-side by the Scrapeless Scraping Browser, so your Scrapling code stays the same except for one extra argument: cdp_url.

python
from scrapeless import Scrapeless
from scrapeless.types import ICreateBrowser
from scrapling.fetchers import DynamicFetcher

client = Scrapeless()  # reads SCRAPELESS_API_KEY
session = client.browser.create(ICreateBrowser(proxy_country="US", session_ttl=240))

page = DynamicFetcher.fetch(
    "https://quotes.toscrape.com/js/",
    cdp_url=session.browser_ws_endpoint,
    network_idle=True,
)

quotes = page.css("span.text::text")
authors = page.css("small.author::text")

session.browser_ws_endpoint is a WebSocket CDP URL (of the form wss://browser.scrapeless.com/browser?token=...&proxy...), and Scrapling connects to it exactly as it would to a local browser. The before/after is the whole point: a plain Fetcher.get on https://quotes.toscrape.com/js/ returns 0 quotes because HTTP can't execute the page's JavaScript, while the same page fetched through the Scrapeless cdp_url renders 10 quotes (plus their authors). That is platform behavior, not a tuning trick: the cloud browser runs the JS, then Scrapling parses the resulting DOM.

StealthyFetcher.fetch(...) accepts cdp_url with the identical pattern when you want Scrapling's stealth layer on top of the cloud browser.
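
Since the endpoint carries the API token in its query string, it is worth masking before it reaches logs or error reports. The following is a minimal sketch using only the standard library. redact_endpoint is a hypothetical helper, and the URL in the demo is invented for illustration (the real query parameters beyond token are not shown in this post).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_endpoint(ws_url: str, secret_params=("token",)) -> str:
    """Return the URL with the values of secret query parameters masked."""
    parts = urlsplit(ws_url)
    query = [
        (key, "***" if key in secret_params else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe="*" keeps the mask readable instead of percent-encoding it.
    return urlunsplit(parts._replace(query=urlencode(query, safe="*")))


# Made-up endpoint for illustration only; not a real host, token, or parameter set.
print(redact_endpoint("wss://cloud.example.invalid/browser?token=abc123&region=US"))
# wss://cloud.example.invalid/browser?token=***&region=US
```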


Step 4: Adaptive selectors that survive DOM drift

Selectors are the most brittle part of any scraper: a redesign renames a class or moves an element, and every css(...) call silently returns nothing. Scrapling's adaptive mode hedges against that. Pass adaptive=True (and optionally adaptive_domain=...) on the fetch, and Scrapling stores a fingerprint of each element you select. When the DOM shifts, it relocates the saved element by similarity instead of by an exact path, so your selectors keep resolving across layout changes.

python
from scrapling.fetchers import DynamicFetcher

# `session` is the Scrapeless session minted in Step 3.
# First run: select normally; Scrapling remembers what each element looked like.
page = DynamicFetcher.fetch(
    "https://quotes.toscrape.com/js/",
    cdp_url=session.browser_ws_endpoint,
    network_idle=True,
    adaptive=True,
    adaptive_domain="quotes.toscrape.com",
)
quotes = page.css("span.text::text")

# Later, after the site reshuffles its markup: re-select with the same call.
# Even if "span.text" no longer matches, adaptive mode relocates the saved element.
page = DynamicFetcher.fetch(
    "https://quotes.toscrape.com/js/",
    cdp_url=session.browser_ws_endpoint,
    network_idle=True,
    adaptive=True,
    adaptive_domain="quotes.toscrape.com",
)
quotes = page.css("span.text::text")

Reach for adaptive selectors on pages you scrape repeatedly and that ship frequent cosmetic changes: they absorb the small DOM churn so you only revisit selectors after a genuine redesign.
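
To make "relocates the saved element by similarity" concrete, here is a toy version of the idea. It is not Scrapling's algorithm: the fingerprint fields, the weights, and the threshold are all assumptions chosen for this sketch, and similarity and relocate are hypothetical names. It only shows how a saved description can still pick out the right element after a class is renamed.

```python
from difflib import SequenceMatcher


def similarity(saved: dict, candidate: dict) -> float:
    """Score two element descriptions between 0 and 1."""
    tag = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    classes = SequenceMatcher(None, saved["class"], candidate["class"]).ratio()
    text = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    # Weights are arbitrary choices for this sketch.
    return 0.2 * tag + 0.3 * classes + 0.5 * text


def relocate(saved: dict, candidates: list, threshold: float = 0.5):
    """Return the candidate most similar to the saved element, or None if none is close."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


saved = {"tag": "span", "class": "text", "text": "It is our choices, Harry"}
after_redesign = [
    {"tag": "div", "class": "nav", "text": "Home"},
    {"tag": "span", "class": "quote-text", "text": "It is our choices, Harry"},
]
print(relocate(saved, after_redesign)["class"])  # quote-text
```

The threshold is what separates minor churn from a real redesign: when nothing scores above it, the sketch returns None, which is the cue to re-run selector discovery by hand.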


Step 5: Crawl multiple pages in one cloud session

Most real jobs span more than one URL: paginated listings, search results, category trees. Minting a fresh cloud browser per page throws away the warm session and pays the connection handshake again every time. Scrapling's DynamicSession holds a single connection to the Scrapeless cloud browser open and fetches every page through it, so cookies, the residential identity, and navigation state carry across the whole crawl.

python
from scrapeless import Scrapeless
from scrapeless.types import ICreateBrowser
from scrapling.fetchers import DynamicSession

client = Scrapeless()
session = client.browser.create(ICreateBrowser(proxy_country="US", session_ttl=300))

rows = []
with DynamicSession(cdp_url=session.browser_ws_endpoint) as crawler:
    for n in range(1, 4):  # pages 1..3
        url = "https://quotes.toscrape.com/js/" if n == 1 else f"https://quotes.toscrape.com/js/page/{n}/"
        page = crawler.fetch(url, network_idle=True)
        quotes = page.css("span.text::text")
        authors = page.css("small.author::text")
        rows += [{"quote": str(q), "author": str(a)} for q, a in zip(quotes, authors)]

print(len(rows), "rows")  # ten quote rows per page accumulate here

Each page yields ten quote rows, accumulated into rows over one reused cloud-browser session. network_idle=True waits for each page to hydrate before extraction; treat a short or empty page as a retry signal rather than the end of the listing. For a list-to-detail crawl, fetch the listing first, collect the detail URLs with page.css("a::attr(href)"), then fetch each through the same crawler; the residential session and fingerprint stay constant across the entire walk.
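
For the list-to-detail case, the hrefs collected from a listing are usually relative, repeated, and mixed with off-site links, so they need a little cleanup before being fetched. A minimal sketch of that step follows. detail_urls is a hypothetical helper, and the sample hrefs are illustrative values rather than output captured from the site.

```python
from urllib.parse import urljoin, urlsplit


def detail_urls(listing_url: str, hrefs) -> list:
    """Resolve hrefs against the listing URL, keep same-host links, drop duplicates."""
    host = urlsplit(listing_url).netloc
    seen, ordered = set(), []
    for href in hrefs:
        absolute = urljoin(listing_url, str(href))
        if urlsplit(absolute).netloc != host or absolute in seen:
            continue
        seen.add(absolute)
        ordered.append(absolute)
    return ordered


hrefs = ["/author/Albert-Einstein", "/author/Albert-Einstein", "https://www.goodreads.com/quotes"]
print(detail_urls("https://quotes.toscrape.com/js/", hrefs))
# ['https://quotes.toscrape.com/author/Albert-Einstein']
```

Each returned URL can then go through the same crawler.fetch(...) call used in the loop above.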


Step 6: Production hardening

Moving from a working script to a dependable job is mostly about backpressure and recovery. A few rules carry most of the weight:

  • Cap concurrency. Hold at ≤3 cloud-browser sessions per host. Pushing past that invites rate limits and connection resets, and the marginal throughput rarely justifies the extra failures.
  • Retry transient connect errors with backoff. Tunnel and timeout errors such as ERR_TUNNEL_CONNECTION_FAILED are transient by nature. Catch them, wait with exponential backoff, and retry; don't treat a single failed connect as a dead page.
  • Rotate egress with ProxyRotator. Scrapling's ProxyRotator cycles proxies across requests so you don't hammer a target from one IP; combine it with the Scrapeless residential proxy for geo-bound work.
  • Reuse a session across steps. Mint once with session_ttl (e.g. 240 seconds) and pass the same browser_ws_endpoint through a multi-step flow (log in, navigate, extract) instead of paying to spin up a fresh browser per request.
  • Treat absent fields as nullable. Real pages omit ratings, authors, or prices on some rows. Default missing selectors to None rather than asserting they exist, so one sparse record doesn't crash the run.
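
The retry rule above can be sketched as a small wrapper. This is one possible shape, not a Scrapling or Scrapeless API: retry_with_backoff is a hypothetical helper, and which exception types a real fetch raises for tunnel or timeout failures is an assumption here (the sketch retries on ConnectionError and TimeoutError; adjust retry_on to whatever your fetch actually raises).

```python
import random
import time


def retry_with_backoff(fetch, retries=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fetch(); on a retryable error wait base_delay * 2**attempt (capped) and retry."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))  # small jitter


# Offline demo: a fetch that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("ERR_TUNNEL_CONNECTION_FAILED")
    return "page"


waits = []  # passing waits.append as sleep records delays instead of sleeping
print(retry_with_backoff(flaky_fetch, sleep=waits.append), len(waits))  # page 2
```

In real use, fetch would be a zero-argument callable such as a lambda wrapping the cdp_url fetch, and sleep would be left at its default.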



What You Get Back

json
[
  {
    "quote": "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",
    "author": "Albert Einstein"
  },
  {
    "quote": "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "author": "J.K. Rowling"
  },
  {
    "quote": "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.",
    "author": "Albert Einstein"
  }
]

The shape reflects the quote and author fields extracted in Step 3 and assembled into rows in Step 5; the values are illustrative samples.
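
The "treat absent fields as nullable" rule from Step 6 maps directly onto this output shape. The sketch below builds typed records in which a missing author becomes None (null in the JSON) instead of crashing the run. QuoteRow and first_or_none are hypothetical names, and the input lists are placeholder stand-ins for selector results, not scraped data.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class QuoteRow:
    quote: str
    author: Optional[str] = None  # absent on some rows, so nullable


def first_or_none(values) -> Optional[str]:
    """Return the first non-blank value as a stripped string, else None."""
    for value in values:
        text = str(value).strip()
        if text:
            return text
    return None


# Placeholder selector results; the second row has no author.
scraped = [
    {"quote": ["First placeholder quote."], "author": ["Author A"]},
    {"quote": ["Second placeholder quote."], "author": []},
]
rows = [
    QuoteRow(quote=first_or_none(r["quote"]) or "", author=first_or_none(r["author"]))
    for r in scraped
]
print(json.dumps([asdict(row) for row in rows], indent=2))
```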

A few honest observations from running this pipeline:

  • JS-rendered pages need a real browser. A plain HTTP fetch of https://quotes.toscrape.com/js/ returns 0 rows; only a browser, here the cloud browser reached via cdp_url, executes the JavaScript so Scrapling can parse the result.
  • network_idle=True waits for hydration. It holds until the network settles, which matters on pages that fetch their content after the first paint.
  • Adaptive selectors reduce breakage, not verification. They absorb minor DOM churn, but after a major redesign you should still re-run selector discovery and confirm the output.
  • Pin proxy_country for geo-bound pages. Prices, availability, and consent walls vary by region; setting proxy_country keeps results consistent with the locale you're targeting.
  • Static pages don't need any of this. If Fetcher.get already returns the data, use it; escalate to the cloud browser only when HTTP comes back empty or blocked.
  • Transient tunnel and 5xx errors are retryable. A one-off ERR_TUNNEL_CONNECTION_FAILED or 500 is usually network noise, not a fault in your code, so retry with backoff.

Conclusion: Lightest fetcher first, cloud browser when blocked

The pipeline reduces to three moves. Pick the lightest fetcher that works: Fetcher.get for static HTML. Escalate to the Scrapeless Scraping Browser by passing session.browser_ws_endpoint to Scrapling's cdp_url the moment a page comes back empty, blocked, or geo-gated. Then parse with adaptive selectors so the extraction survives the next layout change. You only pay for the cloud browser when you actually need it (see Scrapeless pricing for what the free tier covers), and the rest of your code stays plain Scrapling.

From here, the same cdp_url pattern plugs into larger systems. See the LangChain + Scrapeless guide for wiring cloud rendering into an agent, and the Etsy scraper walkthrough for a full site build. Before you ship: export SCRAPELESS_API_KEY, pin proxy_country for any geo-bound page, keep concurrency at ≤3 sessions per host, and treat absent fields as nullable.


FAQ

Is web scraping legal?

Scraping publicly available data is generally permissible in many jurisdictions, but the law is not uniform. Review each site's Terms of Service, avoid collecting personal or copyrighted data you have no right to, and remember that rules vary by jurisdiction. When in doubt, get legal advice for your specific use case.

Do you need a proxy?

For anything at scale, yes. Residential egress sharply cuts blocks compared with datacenter IPs, and it's required for pages that gate content by region. Scrapeless supplies residential proxies in 195+ countries (set proxy_country when you mint a session, or route through a proxy gateway), so you don't have to source and rotate IPs yourself.

When do you need the cloud browser versus local Scrapling?

Stay local when Fetcher.get returns the data; it's the fastest path. Escalate to the Scrapeless cloud browser via cdp_url when the page is heavy on client-side JavaScript, fronted by an advanced bot manager, or geo-restricted. The cloud browser runs the JS and applies anti-detection and residential egress that a local fetch can't match.

Why do you keep seeing ERR_TUNNEL_CONNECTION_FAILED or 5xx errors?

Retry. Those are transient connect errors, not bugs in your script. Wrap the fetch in a retry loop with exponential backoff and a sensible cap; most of these clear on the second or third attempt.

My selectors broke after a site redesign. How do I fix them?

Turn on adaptive selectors (adaptive=True with adaptive_domain=...) so Scrapling relocates saved elements through minor DOM churn. After a large redesign, re-run your selector discovery to confirm the new markup, then let adaptive mode hold the line again.

Are there concurrency limits you should respect?

Keep it to ≤3 cloud-browser sessions per host. Beyond that you gain a little throughput at the cost of a lot of rate-limiting and connection resets. Use bounded concurrency and a queue rather than firing every request at once.
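
Bounded concurrency can be sketched with a per-host semaphore. This swaps the queue mentioned above for asyncio.Semaphore, which gives the same cap with less code, and it limits in-flight fetches per host as a stand-in for sessions per host. crawl and fake_fetch are hypothetical names; a real fetch_one would be an async wrapper around your fetcher.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PER_HOST = 3  # the per-host cap suggested above


async def crawl(urls, fetch_one, max_per_host=MAX_PER_HOST):
    """Fetch all URLs concurrently, never running more than max_per_host per host."""
    limits = defaultdict(lambda: asyncio.Semaphore(max_per_host))

    async def guarded(url):
        async with limits[urlsplit(url).netloc]:
            return await fetch_one(url)

    return await asyncio.gather(*(guarded(u) for u in urls))


# Offline demo: a fake fetch that records peak concurrency.
state = {"active": 0, "peak": 0}


async def fake_fetch(url):
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)
    state["active"] -= 1
    return url


urls = [f"https://quotes.toscrape.com/js/page/{n}/" for n in range(1, 9)]
results = asyncio.run(crawl(urls, fake_fetch))
print(len(results), state["peak"])  # 8 3
```

asyncio.gather returns results in the order of the input URLs, so rows can still be matched back to their pages.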

Scrapling ships its own MCP (pip install "scrapling[ai]"). How does that relate to this?

You can layer both. Scrapling's MCP gives an AI agent direct control over the local library; the Scrapeless MCP server gives that agent cloud rendering with anti-detection and residential proxies. Use them together, or pick the one that fits your stack. This guide takes the library path, with Scrapling driving the Scrapeless cloud browser through cdp_url, which keeps the integration to a single argument and no extra server to run.


Ready to Build Your AI-Powered Data Pipeline?

Join our community to claim a free plan and connect with developers building Scrapling + Scrapeless data pipelines: Discord · Telegram.

Sign up at Scrapeless for free Scraping Browser runtime and adapt the patterns above to the pages and regions your pipeline needs. Full reference at docs.scrapeless.com.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.
