Async Python Web Scraping: Scale to 10,000+ URLs with aiohttp and Scrapeless

Ethan Brown

Advanced Bot Mitigation Engineer

28-May-2026

Key Takeaways:

  • Async beats sync by ~10–100× on I/O-bound scrapes. asyncio's event loop lets one Python thread juggle hundreds of in-flight HTTP requests; the sync equivalent blocks on every socket read and pays full latency per URL.
  • aiohttp is the canonical async HTTP client. A single aiohttp.ClientSession holds the connection pool, keep-alives, cookies, and timeouts; pair it with asyncio.gather for fan-out and an asyncio.Semaphore for the per-host cap.
  • Scrapeless residential proxies route the async fetches. One proxy URL plugs straight into session.get(..., proxy=...), gives every request a different residential IP in rotating mode, and pins egress geography with a country code embedded in the username.
  • The Scrapeless Scraping Browser handles the JS-rendered minority. Pages that aiohttp returns as a JS-app shell (Next.js, React, Vue) get escalated to a cloud browser session, connected from async Python via the Scrapeless Python SDK plus Playwright's async API.
  • Failures stay out-of-band. asyncio.gather(return_exceptions=True) keeps a single bad URL from raising out of the fan-out and discarding every other result; failed URLs go to a dead-letter list for separate review, not into an inline loop.
  • Free to start. New Scrapeless accounts include free Scraping Browser runtime; sign up at Scrapeless.

Introduction: Why async, and what serial scraping costs you

A synchronous Python scraper using requests blocks the thread on every socket read. Scrape 1,000 product pages at 500 ms per request and the wall-clock cost is roughly 500 seconds: latency paid in full, one URL at a time.

asyncio flips that. The event loop yields control while a socket is in-flight, lets the next coroutine start its own request, and weaves hundreds of fetches together on a single thread. The same 1,000 pages, capped at 10 concurrent requests per host, clear in roughly 50 seconds. Same hardware, same Python, same data.
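
The arithmetic behind those two figures fits in a few lines. This is a back-of-envelope model rather than a benchmark: it assumes every request takes the same latency and that the concurrency cap stays saturated, and the name estimate_wall_clock is made up for illustration.

```python
import math


def estimate_wall_clock(n_urls: int, latency_s: float, concurrency: int = 1) -> float:
    """Idealized wall-clock time: requests complete in waves of `concurrency`."""
    waves = math.ceil(n_urls / concurrency)
    return waves * latency_s


print(estimate_wall_clock(1000, 0.5))                  # serial: one request at a time
print(estimate_wall_clock(1000, 0.5, concurrency=10))  # ten requests in flight
```

Under those assumptions the serial run costs 500 seconds and the capped async run 50. Real crawls land above the idealized figure because latency varies from request to request.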

The catch: async scraping has two failure modes the sync version never sees: pipeline failures, where an exception raised inside one coroutine propagates out of the gather and the caller loses every other result, and target-side pressure, where a tight in-flight pool looks indistinguishable from an attack if the proxy layer can't spread egress across IPs.

This guide walks through both. The HTTP tier uses aiohttp plus Scrapeless residential proxies in 195+ countries. The JS-rendered tier escalates to Scrapeless Scraping Browser, connected from async Python with the Scrapeless Python SDK and the Playwright async API.


What You Can Do With It

  • Crawl static catalogues at scale. Books, articles, sitemaps, anything that ships rendered HTML: async fan-out turns hour-long crawls into minute-long crawls.
  • Run concurrent feed pulls. RSS, JSON APIs, sitemap indexes; fan out across hundreds of endpoints with bounded concurrency.
  • Price-monitor across regions. Pin Scrapeless egress to US, UK, DE, JP and pull the same product page from multiple geos in parallel.
  • Audit-crawl your own site. Async sweeps a 10k-URL sitemap in minutes instead of hours and reports back the dead links and slow paths.
  • Hydrate downstream pipelines. The async layer feeds rendered HTML or JSON straight into Postgres, Snowflake, or Kafka without a thread-pool bottleneck.
  • Escalate selectively. Keep aiohttp on cheap HTTP for the ~70% of pages that ship rendered markup; only spin up cloud browser sessions for the JS-heavy minority.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this post is for demonstration purposes only.


Why Scrapeless for Async Scraping

Scrapeless Scraping Browser is a customizable, anti-detection cloud browser designed for web crawlers and AI agents; Scrapeless residential proxies are the proxy layer underneath it. For async Python pipelines specifically, the combination brings:

  • Residential proxies in 195+ countries, exposed as a single HTTP proxy URL that drops straight into session.get(..., proxy=...).
  • Per-request geo pinning via a country code embedded in the proxy credentials: no per-request handshake cost, no per-coroutine session rebuild.
  • Sticky-session option for flows that need the same IP across a multi-step login or paginated traversal, and rotating IPs for everything else.
  • Cloud-side JS rendering when a page is React/Vue/Next.js heavy: the Python SDK mints a browser_ws_endpoint you connect to with Playwright's async API.
  • One API key for both tiers: proxies and Scraping Browser bill against the same Scrapeless account.

Get your API key on the free plan at Scrapeless.


Prerequisites

  • Python 3.10 or newer
  • A Scrapeless account and API key (sign up at app.scrapeless.com)
  • Comfort with async/await and the event-loop model
  • A terminal

Step 1: Install aiohttp, Playwright, and the Scrapeless SDK

asyncio ships with the Python standard library, and aiohttp builds on it directly. The scrapeless SDK mints cloud-browser sessions for the Step 6 escalation tier. Playwright's async API is the canonical async-Python way to drive the Scrapeless Scraping Browser:

bash Copy
pip install aiohttp scrapeless playwright
playwright install chromium

playwright install chromium downloads a local Chromium build one time. That build is never launched in this guide: the actual rendering runs in Scrapeless's cloud, and the local Playwright library only speaks CDP to the remote browser.
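
A quick way to confirm the install took is to ask Python whether each package is importable. A minimal sketch using only the standard library; the helper name installed is made up for illustration, and the package names are the import names used in the later steps.

```python
from importlib.util import find_spec


def installed(packages=("aiohttp", "scrapeless", "playwright")) -> dict:
    """Map each top-level package name to whether Python can import it."""
    return {name: find_spec(name) is not None for name in packages}


print(installed())
```

Any False in the printed dict points at a package that pip did not install into the active environment.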


Step 2: Configure your Scrapeless credentials

Export your Scrapeless API key, your channel ID, and your residential-proxy channel password as environment variables, along with the gateway host. All three credentials are visible in the Scrapeless dashboard under Proxies → Residential at app.scrapeless.com. Click Generate and the dashboard prints a colon-delimited string in the form <GATEWAY>:<PORT>:<CHANNEL_ID>-proxy-country_US-r_10m-s_<SESSION_ID>:<PASSWORD>. Export the pieces like this:

bash Copy
export SCRAPELESS_API_KEY="your_api_token_here"
export SCRAPELESS_CHANNEL_ID="your_channel_id"          # printed at the start of the username
export SCRAPELESS_PROXY_PASS="your_channel_password"
export SCRAPELESS_PROXY_GATEWAY="gw-us.scrapeless.io"   # see below for regional gateways

Regional gateways: gw-us.scrapeless.io (Americas), gw-eu.scrapeless.io (Europe), gw-ap.scrapeless.io (Asia-Pacific). Pick the gateway closest to your runtime to keep handshake latency low; the egress country is still controlled by the country_<CC> username param regardless of which gateway you connect through. Port is 8789 for all.
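
Since the later snippets read these variables with os.environ[...], a missing one surfaces as a bare KeyError. A small pre-flight check makes that failure easier to read. This is a sketch, and the helper name missing_env is an assumption for illustration, not part of the Scrapeless SDK.

```python
import os

REQUIRED_VARS = (
    "SCRAPELESS_API_KEY",
    "SCRAPELESS_CHANNEL_ID",
    "SCRAPELESS_PROXY_PASS",
    "SCRAPELESS_PROXY_GATEWAY",
)


def missing_env(env=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env() at the top of a script and exiting when the list is non-empty reports every missing variable at once instead of one KeyError at a time.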

The residential-proxy username is constructed from four parameters:

  • <CHANNEL_ID>: your channel identifier (printed at the start of the username on the dashboard).
  • country_<CC>: country pin in two-letter form. Scrapeless uses country_US, country_UK, country_DE, country_JP, etc. (note: UK, not the ISO GB).
  • r_<duration>: sticky-session rotation interval (e.g. r_10m keeps the same IP for 10 minutes before rotating).
  • s_<SESSION_ID>: sticky-session identifier; reuse the same s_<id> across requests to hold the same IP for the duration window.

Drop r_ and s_ to get rotating IPs (a fresh residential IP per request). Keep them for flows that need session continuity, like a paginated traversal after a login.
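
The username rules above can be folded into one helper so rotating and sticky variants come from the same code path. A minimal sketch: build_proxy_url and its parameter names are assumptions for illustration, and the percent-encoding of credentials is a general URL precaution rather than something the Scrapeless dashboard prescribes.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(
    channel_id: str,
    password: str,
    gateway: str,
    country: str = "US",
    rotation: Optional[str] = None,    # e.g. "10m"; None means a fresh IP per request
    session_id: Optional[str] = None,  # sticky-session identifier
    port: int = 8789,
) -> str:
    """Assemble an HTTP proxy URL from the username parts described above."""
    username = f"{channel_id}-proxy-country_{country}"
    # The sticky suffix is only added when both parts are supplied.
    if rotation and session_id:
        username += f"-r_{rotation}-s_{session_id}"
    # Percent-encode so characters such as '@' or ':' cannot break the URL.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{gateway}:{port}"
```

Calling build_proxy_url(channel, password, gateway) yields the rotating form used in Step 3; passing rotation="10m" and a session_id yields the sticky form.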


Step 3 (Basic): a single async fetch with aiohttp + Scrapeless proxies

The smallest functional async scraper. One ClientSession, one GET, one HTML payload returned through the residential proxy:

python Copy
import asyncio
import os
import aiohttp

PROXY = (
    f"http://{os.environ['SCRAPELESS_CHANNEL_ID']}-proxy-country_US"
    f":{os.environ['SCRAPELESS_PROXY_PASS']}"
    f"@{os.environ['SCRAPELESS_PROXY_GATEWAY']}:8789"
)

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    timeout = aiohttp.ClientTimeout(total=30)
    async with session.get(url, proxy=PROXY, timeout=timeout) as resp:
        resp.raise_for_status()
        return await resp.text()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, "https://books.toscrape.com/")
        print(f"Fetched {len(html):,} chars via US residential egress")

if __name__ == "__main__":
    asyncio.run(main())

Three things this snippet locks in early:

  • ClientSession is created once and reused. Every session.get(...) shares the same connection pool; recreating the session per request defeats the whole point of async.
  • The proxy URL is passed per-request, not per-session. That keeps the same ClientSession free to route different requests through different countries.
  • ClientTimeout(total=30) bounds each request. A single hung connection cannot block the rest of the gather.

Step 4 (Advanced): scale to concurrent fetches with asyncio.gather and a Semaphore cap

Fanning out to 100 URLs without a concurrency cap is how a scraper gets blocked in 10 seconds. The canonical pattern is asyncio.Semaphore to bound in-flight requests per host:

python Copy
import asyncio
import os
import aiohttp

PROXY = (
    f"http://{os.environ['SCRAPELESS_CHANNEL_ID']}-proxy-country_US"
    f":{os.environ['SCRAPELESS_PROXY_PASS']}"
    f"@{os.environ['SCRAPELESS_PROXY_GATEWAY']}:8789"
)

# Cap at 5 concurrent requests per host. Tune by target: public catalogues
# tolerate higher, anti-bot-protected origins want lower.
PER_HOST = asyncio.Semaphore(5)

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with PER_HOST:
        timeout = aiohttp.ClientTimeout(total=30)
        async with session.get(url, proxy=PROXY, timeout=timeout) as resp:
            resp.raise_for_status()
            return await resp.text()

async def main() -> None:
    urls = [
        f"https://books.toscrape.com/catalogue/page-{n}.html"
        for n in range(1, 51)  # 50 catalogue pages
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        pages = await asyncio.gather(*tasks)
    print(f"Got {len(pages)} pages, total {sum(len(p) for p in pages):,} chars")

if __name__ == "__main__":
    asyncio.run(main())

asyncio.Semaphore(5) is the load-bearing line. Without it, asyncio.gather starts all 50 coroutines at once and the gateway or the origin may rate-limit or refuse a share of them. With it, only 5 are in-flight at a time; the rest wait on the semaphore until a slot frees up.

For multi-host fan-out, create one Semaphore per origin and key it by hostname, so a stall on one origin does not choke fetches against the others.
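
One way to key semaphores by hostname is a small registry that creates each Semaphore on first use. The sketch below is network-free: HostLimiter and simulate are hypothetical names, and asyncio.sleep stands in for session.get(...) so the per-host peak can be observed directly.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse


class HostLimiter:
    """Hands out one asyncio.Semaphore per hostname, created on first use."""

    def __init__(self, cap: int = 5) -> None:
        self._cap = cap
        self._sems: dict = {}

    def for_url(self, url: str) -> asyncio.Semaphore:
        host = urlparse(url).hostname or ""
        if host not in self._sems:
            self._sems[host] = asyncio.Semaphore(self._cap)
        return self._sems[host]


async def simulate(urls, cap: int = 2) -> dict:
    """Run fake fetches and record the peak number in flight per host."""
    limiter = HostLimiter(cap)
    in_flight = defaultdict(int)
    peak = defaultdict(int)

    async def fake_fetch(url: str) -> None:
        host = urlparse(url).hostname
        async with limiter.for_url(url):
            in_flight[host] += 1
            peak[host] = max(peak[host], in_flight[host])
            await asyncio.sleep(0.01)  # stand-in for the real HTTP request
            in_flight[host] -= 1

    await asyncio.gather(*(fake_fetch(u) for u in urls))
    return dict(peak)
```

In a real pipeline, fetch would wrap its session.get(...) in async with limiter.for_url(url) instead of a single module-level semaphore.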


Step 5: Handle failures without blocking the pipeline

One raise_for_status() inside a coroutine will propagate out of the entire gather, and the caller loses every other result even though the remaining tasks keep running. Two defenses:

Defense 1: return_exceptions=True. Tell gather to capture exceptions as values instead of propagating them. The pipeline finishes either way; the caller decides afterward which URLs to act on.

Defense 2: a dead-letter list. Collect failed URLs in a separate structure for separate review. Failure handling stays out-of-band: async pipelines stay clean when success and failure paths don't interleave.
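
The difference return_exceptions makes can be seen with no network at all. A minimal standard-library sketch; job is a made-up stand-in for a fetch coroutine, and input 2 plays the role of the one bad URL.

```python
import asyncio


async def job(n: int) -> int:
    """Stand-in for a fetch; input 2 fails, everything else succeeds."""
    await asyncio.sleep(0)
    if n == 2:
        raise ValueError(f"bad input {n}")
    return n * 10


async def gather_demo() -> tuple:
    # Default behaviour: the first exception propagates to the awaiting
    # caller, so the results of the other jobs never reach it.
    try:
        await asyncio.gather(*(job(n) for n in range(4)))
        propagated = None
    except ValueError as exc:
        propagated = exc

    # return_exceptions=True: exceptions come back as values, in input order.
    results = await asyncio.gather(
        *(job(n) for n in range(4)), return_exceptions=True
    )
    return propagated, results
```

Running asyncio.run(gather_demo()) returns the propagated ValueError from the first call, and from the second call a list in which the failing slot holds the exception object while the other three slots hold their results.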

python Copy
import asyncio
import json
import aiohttp

async def fetch_safe(session, url):
    try:
        async with session.get(
            url, timeout=aiohttp.ClientTimeout(total=30)
        ) as resp:
            resp.raise_for_status()
            return {"url": url, "html": await resp.text()}
    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
        return {"url": url, "error": repr(exc)}

async def main(urls):
    async with aiohttp.ClientSession() as session:
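        # return_exceptions=True also captures anything fetch_safe did not catch
        results = await asyncio.gather(
            *(fetch_safe(session, u) for u in urls),
            return_exceptions=True,
        )

    # Fold stray exceptions into the same error envelope (urls must be a list)
    results = [
        {"url": u, "error": repr(r)} if isinstance(r, BaseException) else r
        for u, r in zip(urls, results)
    ]
    ok = [r for r in results if "html" in r]
    failed = [r for r in results if "error" in r]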
    print(f"Successful: {len(ok)}   Failed: {len(failed)}")

    # Dead-letter file for separate review; the pipeline never blocks on failures
    with open("dead_letter.jsonl", "w", encoding="utf-8") as f:
        for r in failed:
            f.write(json.dumps(r) + "\n")

Two things to notice:

  • The error envelope ({"url": ..., "error": ...}) has the same shape as the success envelope, just with a different key. Downstream consumers branch on which key is present without parsing exception text.
  • aiohttp.ClientError covers the common failure surface (connection drops, malformed responses, DNS issues); asyncio.TimeoutError is raised when ClientTimeout expires. Catching both covers the large majority of failures in real-world async scrapes.

What this code deliberately does not do: nothing in the success path re-issues a failed URL. Dead-letter processing belongs in a separate run, with a different proxy country, a different concurrency cap, or the cloud browser tier from Step 6. Mixing it inline turns one async scraper into two interleaved control flows, and the bugs land in the interleaving.
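
For that separate run, the dead-letter file has to be read back into a URL list. A minimal sketch assuming the one-JSON-object-per-line layout written above; failed_urls and load_dead_letters are hypothetical helper names.

```python
import json
from typing import Iterable


def failed_urls(lines: Iterable[str]) -> list:
    """Extract the URL from each JSON line, skipping blank lines."""
    urls = []
    for line in lines:
        line = line.strip()
        if line:
            urls.append(json.loads(line)["url"])
    return urls


def load_dead_letters(path: str = "dead_letter.jsonl") -> list:
    """Read a dead-letter file and return the URLs recorded in it."""
    with open(path, encoding="utf-8") as f:
        return failed_urls(f)
```

A second script could then pass load_dead_letters() to its own main(...) with whatever proxy country or concurrency cap the review calls for.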


Step 6: Escalate JS-rendered pages to Scrapeless Scraping Browser

aiohttp returns whatever bytes the origin sends. For client-rendered React and Vue apps, and for Next.js pages that skip server rendering, those bytes are often little more than an empty mount node such as <div id="root"> plus a script tag, and the actual content paints client-side. Plain HTTP cannot render that; a cloud browser can.

The cleanest escalation pattern: keep aiohttp on the ~70% of pages that ship rendered HTML, and escalate the JS-rendered minority to Scrapeless Scraping Browser. The Python SDK mints a cloud-browser session and exposes a browser_ws_endpoint; Playwright's async API connects to it via the Chrome DevTools Protocol:

python Copy
import asyncio
from scrapeless import Scrapeless
from scrapeless.types import ICreateBrowser
from playwright.async_api import async_playwright

async def render_via_cloud_browser(url: str, country: str = "US") -> str:
    client = Scrapeless()  # reads SCRAPELESS_API_KEY from env
    session = client.browser.create(
        ICreateBrowser(proxy_country=country, session_ttl=240)
    )

    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(session.browser_ws_endpoint)
        context = browser.contexts[0] if browser.contexts else await browser.new_context()
        page = context.pages[0] if context.pages else await context.new_page()
        await page.goto(url, wait_until="networkidle", timeout=60_000)
        html = await page.content()
        await browser.close()
        return html

async def main():
    # quotes.toscrape.com/js/ is the canonical "needs JS" sandbox.
    # Plain HTTP returns 0 quote elements; cloud-rendered returns 10.
    html = await render_via_cloud_browser("https://quotes.toscrape.com/js/")
    print(f"Rendered {len(html):,} chars including post-paint DOM")

if __name__ == "__main__":
    asyncio.run(main())

session.browser_ws_endpoint is a wss://browser.scrapeless.com/...?token=... URL. Playwright's connect_over_cdp speaks CDP to that endpoint; the rendering runs in Scrapeless's cloud, not on the local machine. The local Playwright install acts only as the protocol client.

session_ttl=240 keeps the session alive for 4 minutes, enough for a multi-step traversal on a single page. For long-running crawls, mint a fresh session per URL or per logical work-unit; cloud sessions are cheap to create.


Step 7: Put it all together: a tiered async scraper

The realistic shape of an async scraping pipeline is HTTP-first, browser-second: try aiohttp, escalate the empty or blocked responses to Scrapeless Scraping Browser. Each tier gets its own Semaphore and its own cap, because cloud-browser sessions are scarcer than HTTP requests.

python Copy
import asyncio
import os
import aiohttp
from scrapeless import Scrapeless
from scrapeless.types import ICreateBrowser
from playwright.async_api import async_playwright

PROXY = (
    f"http://{os.environ['SCRAPELESS_CHANNEL_ID']}-proxy-country_US"
    f":{os.environ['SCRAPELESS_PROXY_PASS']}"
    f"@{os.environ['SCRAPELESS_PROXY_GATEWAY']}:8789"
)
HTTP_LIMIT = asyncio.Semaphore(10)      # aiohttp tier
BROWSER_LIMIT = asyncio.Semaphore(3)    # cloud browser tier

async def http_fetch(session: aiohttp.ClientSession, url: str) -> str | None:
    async with HTTP_LIMIT:
        try:
            async with session.get(
                url, proxy=PROXY,
                timeout=aiohttp.ClientTimeout(total=30),
            ) as resp:
                resp.raise_for_status()
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None

async def browser_fetch(client: Scrapeless, url: str) -> str:
    async with BROWSER_LIMIT:
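        # create() is called synchronously in this guide, so run it in a worker
        # thread to keep the event loop free for the other coroutines
        session = await asyncio.to_thread(
            client.browser.create,
            ICreateBrowser(proxy_country="US", session_ttl=240),
        )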
        async with async_playwright() as p:
            browser = await p.chromium.connect_over_cdp(session.browser_ws_endpoint)
            context = (
                browser.contexts[0] if browser.contexts
                else await browser.new_context()
            )
            page = await context.new_page()
            await page.goto(url, wait_until="networkidle", timeout=60_000)
            html = await page.content()
            await browser.close()
            return html

from urllib.parse import urlparse

# (a) Known JS-heavy hosts always escalate; this is the most reliable signal.
JS_HEAVY_HOSTS = {"quotes.toscrape.com"}

def should_escalate(url: str, html: str | None) -> bool:
    # (a) Allowlist hit: explicit JS-heavy host.
    if urlparse(url).hostname in JS_HEAVY_HOSTS:
        return True
    # (b) Post-parse signal: empty body or recognisable app shell.
    if html is None or len(html) < 2000 or '<div id="root"></div>' in html:
        return True
    return False

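async def scrape_one(http_session, client, url):
    html = await http_fetch(http_session, url)
    tier = "http"
    if should_escalate(url, html):
        tier = "browser"
        try:
            html = await browser_fetch(client, url)
        except Exception as exc:
            # Keep one failed render from raising out of the gather
            return {"url": url, "tier": tier, "error": repr(exc)}
    return {"url": url, "tier": tier, "html_len": len(html) if html else 0}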

async def main(urls):
    client = Scrapeless()
    async with aiohttp.ClientSession() as http_session:
        results = await asyncio.gather(
            *(scrape_one(http_session, client, u) for u in urls)
        )
    return results

if __name__ == "__main__":
    urls = [
        "https://books.toscrape.com/",       # static: aiohttp tier
        "https://quotes.toscrape.com/js/",   # JS: escalates to the browser tier
    ]
    print(asyncio.run(main(urls)))

should_escalate combines two signals: (a) an explicit allowlist of known JS-heavy hosts, and (b) a post-parse signal (empty body or app shell). The allowlist is the more reliable lever: a Next.js or React shell often clears the 2,000-character threshold even when the body is empty, so the length check alone misses it, and the literal <div id="root"></div> match only catches shells that use that exact mount node. The hostname check fires before any byte counting.
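
If the literal string match proves too narrow, the shell check can be widened to a few common mount-node ids. This is a sketch under stated assumptions: the ids root, __next, and app are common framework defaults, not an exhaustive list, and looks_like_app_shell is a hypothetical helper rather than part of the pipeline above.

```python
import re

# Assumed mount-node ids: "root" (common React default), "__next" (Next.js),
# "app" (common Vue default). Extend the list for the targets you crawl.
EMPTY_MOUNT = re.compile(
    r'<div\b[^>]*\bid=["\'](?:root|__next|app)["\'][^>]*>\s*</div>',
    re.IGNORECASE,
)


def looks_like_app_shell(html: str) -> bool:
    """True when the markup contains an empty, well-known SPA mount node."""
    return bool(EMPTY_MOUNT.search(html))
```

The function could replace the '<div id="root"></div>' in html test inside should_escalate. It still only sees empty mount nodes, so the hostname allowlist remains the stronger signal.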


What You Get Back

The pipeline emits a list of dicts shaped like this:

json Copy
[
  {
    "url": "https://books.toscrape.com/",
    "tier": "http",
    "html_len": 51274
  },
  {
    "url": "https://quotes.toscrape.com/js/",
    "tier": "browser",
    "html_len": 9246
  }
]

Honest observations from running this pattern:

  • Cold connection cost is real. The first request on a fresh ClientSession pays TLS + DNS; subsequent requests on the same session reuse the connection. Do not recreate the session per request.
  • Concurrency caps depend on the target, not on aiohttp. Five per host is a safe starting point for public catalogues; three is safer for anti-bot-protected origins; ten is realistic for friendly APIs.
  • Cloud-browser sessions outlive single page loads. If a pipeline needs login plus traverse plus extract, mint one session per logical work-unit and reuse it across the pages within that unit; context.new_page() is cheap inside the same session.
  • DNS resolution stays inside aiohttp. The connector caches resolved IPs for a short TTL (ttl_dns_cache, 10 seconds by default), but pooled keep-alive connections stay pinned to the address they were opened against. For long-running crawlers, recycle the session every few hours so connections do not cling to stale addresses.
  • ClientTimeout(total=30) is per-request, not per-gather. A 1,000-URL fan-out does not time out at 30 s; each request gets its own 30-second budget.
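
The per-request budget in the last observation can be reproduced without a network. The sketch below swaps aiohttp's ClientTimeout for asyncio.wait_for and uses sleeps in place of real requests; slow_request, with_budget, and timeout_demo are made-up names.

```python
import asyncio


async def slow_request(delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for one HTTP round trip
    return "ok"


async def with_budget(delay: float, budget: float) -> str:
    """Give a single request its own time budget, independent of its siblings."""
    try:
        return await asyncio.wait_for(slow_request(delay), timeout=budget)
    except asyncio.TimeoutError:
        return "timeout"


async def timeout_demo() -> list:
    delays = [0.01, 0.01, 0.5, 0.01]
    return await asyncio.gather(*(with_budget(d, budget=0.1) for d in delays))
```

Only the slow request exhausts its budget; the three fast ones finish normally, and the gather as a whole has no deadline of its own.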

Conclusion: scale your async Python scrapers

The async pattern reduces to four moves: spin up one ClientSession, cap concurrency with a Semaphore, route fan-out through Scrapeless residential proxies, and escalate the JS-rendered minority to Scrapeless Scraping Browser via the Python SDK plus Playwright's async API.

To go deeper on the residential-proxy layer that routes every async fetch, see What Is an SSL Proxy?.

Pin egress with the country suffix on the proxy username, keep per-host Semaphores tight, branch on the success-vs-failure envelope shape rather than catching exceptions inline, and treat an empty HTTP response as the signal to escalate, not the answer.


Ready to Build Your AI-Powered Data Pipeline?

Join our community to claim a free plan and connect with developers building async scraping pipelines: Discord · Telegram.

Sign up at Scrapeless for free Scraping Browser runtime and adapt the patterns above to the catalogues, feeds, and regions the pipeline needs. Pricing details at scrapeless.com/en/pricing; residential proxies are documented at scrapeless.com/en/product/proxy-solutions; full SDK reference at docs.scrapeless.com.


FAQ

Q1: How many concurrent requests should I run per host?

For public catalogues with no anti-bot stack, 10 in-flight requests per host is a safe ceiling. For anti-bot-protected origins, 3 is more realistic. The Semaphore is the lever; start low, watch for 429 responses, and tune from there.
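
"Start low and tune" can be made mechanical with a small adjustment rule applied between batches. The policy below (raise the cap by one after a clean batch, halve it when 429s exceed a tolerance) is an illustrative assumption, not a Scrapeless or aiohttp recommendation, and next_cap is a hypothetical name.

```python
def next_cap(
    current: int,
    n_429: int,
    n_total: int,
    floor: int = 1,
    ceiling: int = 10,
    tolerance: float = 0.01,
) -> int:
    """Additive-increase, multiplicative-decrease step for the per-host cap."""
    if n_total == 0:
        return current  # nothing observed, nothing to change
    if n_429 / n_total > tolerance:
        return max(floor, current // 2)  # back off hard when 429s appear
    return min(ceiling, current + 1)     # otherwise probe one step higher
```

An asyncio.Semaphore cannot be resized in place, so the new cap takes effect when the next batch creates its semaphore.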

Q2: Do I need Scrapeless residential proxies if my target is unblocked from my datacenter?

For unblocked HTTP targets, no: aiohttp works without a proxy. The Scrapeless proxy layer earns its keep when the target geo-restricts (you need US/UK/JP egress), when your datacenter IP is rate-limited or blocked, or when you need a fresh residential IP per request to spread the in-flight pool across many IPs.

Q3: When should I escalate from aiohttp to Scrapeless Scraping Browser?

When the HTML aiohttp returns is a JS-app shell with no content. Heuristic: count the elements you care about after first-tier fetch; if the count is zero or far below expected, the page renders client-side. The cloud browser tier handles those.
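
That element-count heuristic needs nothing beyond the standard library. A minimal sketch using html.parser; ClassCounter and count_elements are hypothetical names, and the div with class quote used in the usage note is an assumption about the sandbox markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags of a given name that carry a given CSS class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html: str, tag: str, css_class: str) -> int:
    """Return how many <tag class="... css_class ..."> elements the HTML holds."""
    counter = ClassCounter(tag, css_class)
    counter.feed(html)
    return counter.count
```

A first-tier check could then escalate whenever count_elements(html, "div", "quote") comes back as zero for a page that should list quotes.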

Q4: Is async scraping legal?

Async is a transport pattern; legality depends on what you scrape, from where, and under what terms. Publicly visible data is generally accessible; jurisdictions vary; site terms of service apply; consult counsel for high-stakes use cases. Scrapeless accesses publicly available data only.

Q5: Can I use aiohttp without Scrapeless Scraping Browser?

Yes. The aiohttp tier (Steps 3–5) works as a complete async scraper for any target that ships rendered HTML. Scrapeless Scraping Browser is the escalation tier, invoked only when the HTTP tier comes back empty.

Q6: How do I pin egress to a specific country?

The country goes into the Scrapeless residential-proxy username as country_<CC> (uppercase two-letter code, underscore-separated): country_US, country_UK, country_DE, country_JP. Replace the segment in the username string and the gateway routes every request through residential IPs in that country. For browser-tier requests, pass proxy_country="US" to ICreateBrowser(...) when minting the cloud-browser session.
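
Swapping the segment by hand is error-prone once sticky-session suffixes are involved, so a small helper can do the replacement. A sketch assuming the username layout from Step 2; with_country is a hypothetical name.

```python
import re

COUNTRY_SEGMENT = re.compile(r"country_[A-Z]{2}")


def with_country(username: str, country: str) -> str:
    """Return the proxy username with its country_<CC> segment replaced."""
    if not re.fullmatch(r"[A-Z]{2}", country):
        raise ValueError(f"expected an uppercase two-letter code, got {country!r}")
    if not COUNTRY_SEGMENT.search(username):
        raise ValueError("username has no country_<CC> segment")
    return COUNTRY_SEGMENT.sub(f"country_{country}", username, count=1)
```

Any r_ and s_ suffixes after the country segment pass through untouched, so a sticky username stays sticky after the swap.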

Q7: Why use Playwright's async API instead of a sync browser client?

Sync browser clients block the event loop. The whole point of asyncio is to keep the loop free; calling a sync page.goto(...) from inside a coroutine stalls every other in-flight task. Playwright's async_playwright is the canonical Python option that keeps the cloud-browser tier coroutine-friendly. The Scrapeless SDK still mints the session; Playwright only speaks CDP to the browser_ws_endpoint.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.
