
Common Pitfalls and Solutions for Web Crawlers (With Code Examples)

Emily Chen

Advanced Data Extraction Specialist

16-Sep-2025

Web crawlers are about far more than sending HTTP requests: they must handle JavaScript rendering, anti-bot defenses, scalability, and error handling. In this article, we’ll look at common pitfalls developers face when building crawlers and pair each one with a practical solution and code snippet.


1. Ignoring Robots.txt and Crawl Policies

If your crawler ignores robots.txt, you risk legal issues or IP blocks.

Bad practice:

python
import requests

html = requests.get("https://example.com").text
# No check for robots.txt

Better approach:

python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/page"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")

✅ Always respect crawl policies and implement rate limits.
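
robots.txt can also declare a Crawl-delay directive. urllib.robotparser exposes it, so a polite crawler can honor the site's requested pacing. A minimal sketch:

python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("*")  # None if no Crawl-delay directive is present
print(f"Requested crawl delay: {delay if delay is not None else 'none specified'}")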


2. Crawling Too Aggressively

Sending thousands of requests per second is a fast way to get banned.

Solution:

  • Add randomized delays between requests
  • Use async crawling for efficiency (a bounded-concurrency sketch follows the example below)
python
import asyncio, aiohttp, random

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            print(len(html))
            await asyncio.sleep(random.uniform(1, 3))  # polite delay

asyncio.run(main())
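
The loop above fetches one page at a time. When throughput matters, a semaphore bounds how many requests run concurrently; here is a minimal sketch (the limit of 5 is illustrative, not a universal recommendation):

python
import asyncio, aiohttp

async def fetch(session, sem, url):
    async with sem:  # at most 5 requests in flight at once
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    urls = [f"https://example.com/page{i}" for i in range(20)]
    sem = asyncio.Semaphore(5)
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, sem, url) for url in urls))
        print([len(p) for p in pages])

asyncio.run(main())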

3. Handling JavaScript-Rendered Content

Static HTTP crawlers miss content that JS-heavy pages (React, Vue, Angular) render client-side.

Solution: Use a headless browser (e.g., Playwright, Puppeteer).

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    print(page.content())  # now includes JS-rendered content
    browser.close()
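
Note that page.content() can still run before the app finishes rendering. Waiting for an element that only exists after rendering is more reliable; a sketch against the same demo site (the .quote selector matches its markup):

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    page.wait_for_selector(".quote")  # block until the JS-rendered quotes exist
    print(page.locator(".quote .text").all_inner_texts()[:3])
    browser.close()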

4. Inefficient Data Extraction

Hardcoding fragile selectors leads to broken crawlers.

Better approach with BeautifulSoup + fallbacks:

python
from bs4 import BeautifulSoup

html = "<div><h1 class='title'>Hello</h1></div>"
soup = BeautifulSoup(html, "lxml")

# Primary selector
title = soup.select_one("h1.title")

# Fallback
if not title:
    title = soup.find("h1")

print(title.text if title else "No title found")  # guard: both selectors may miss

5. Duplicate Content Collection

URLs that differ only in session or tracking parameters, like /page?id=123&session=abc, point to the same content and fill your dataset with duplicates.

Solution: Normalize URLs by stripping tracking parameters while keeping meaningful ones

python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Parameters that vary without changing the page content
TRACKING_PARAMS = {"session", "utm_source", "utm_medium", "utm_campaign"}

def normalize(url):
    parsed = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parsed.query)
             if k not in TRACKING_PARAMS]
    return urlunparse(parsed._replace(query=urlencode(query)))

print(normalize("https://example.com/page?id=1&session=xyz"))
# -> https://example.com/page?id=1
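
A crawler can then deduplicate by normalized URL; a short usage sketch, reusing normalize() from above:

python
# Reuses normalize() from the snippet above
seen = set()
for url in ["https://example.com/page?id=1&session=a",
            "https://example.com/page?id=1&session=b"]:
    key = normalize(url)
    if key in seen:
        continue  # same page, different session token: skip
    seen.add(key)
    print("crawl", key)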

6. IP Blocking and Anti-Bot Mechanisms

Websites detect bots through request-rate anomalies, browser fingerprints, and CAPTCHAs.

Basic rotation with Scrapy:

python
import random

class RotateUserAgentMiddleware:
    # Scrapy downloader middleware that rotates the User-Agent on every request
    user_agents = [
        "Mozilla/5.0 ...",
        "Chrome/91.0 ...",
        "Safari/537.36 ..."
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
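
The middleware only takes effect once it is registered in settings.py; the module path below is an assumed example, so adjust it to your project layout:

python
# settings.py; "myproject.middlewares" is a placeholder module path
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}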

Solution stack:

  • Rotate proxies and user agents (a minimal rotation sketch follows this list)
  • Use residential/mobile proxies
  • Integrate CAPTCHA solvers when needed
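
As a starting point, here is a minimal proxy-rotation sketch with requests; the proxy URLs are placeholders, not working endpoints:

python
import random
import requests

# Placeholder endpoints; substitute your own proxy pool
PROXIES = ["http://proxy1.example:8000", "http://proxy2.example:8000"]

def get_with_proxy(url):
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

print(get_with_proxy("https://example.com").status_code)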

7. Error Handling

Network errors are inevitable. Without retries, crawlers fail silently.

Example with retries:

python
import requests, time

def fetch(url, retries=3):
    for i in range(retries):
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
            return resp
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}, retry {i + 1}/{retries}")
            time.sleep(2 ** i)  # exponential backoff: 1s, 2s, 4s
    return None

8. Scalability Challenges

A crawler that works for 1,000 pages may fail at 10M.

First, make long crawls pausable and resumable with Scrapy’s persistent job state (JOBDIR):

bash
scrapy runspider crawler.py -s JOBDIR=crawls/job1
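
For true multi-worker distribution, the scrapy-redis extension shares the request queue and duplicate filter through Redis. A minimal sketch, assuming pip install scrapy-redis and a Redis server at localhost:6379:

python
from scrapy_redis.spiders import RedisSpider

class DistributedSpider(RedisSpider):
    name = "distributed"
    redis_key = "distributed:start_urls"  # workers pop start URLs from this Redis list

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

# settings.py (required scrapy-redis settings)
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# REDIS_URL = "redis://localhost:6379"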

For large-scale crawls, use:

  • Redis/Kafka for distributed task queues
  • Scrapy Cluster / Nutch for scaling
  • Cloud storage for crawl results

9. Data Quality Issues

Crawled data may contain duplicates, empty fields, or invalid formats.

Solution: Schema validation

python
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: float

try:
    item = Product(name="Laptop", price="not a number")  # cannot be coerced to float
except ValidationError as e:
    print(e)
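
Pydantic also coerces compatible values, so a numeric string parses cleanly (continuing with the Product model above):

python
item = Product(name="Laptop", price="999.99")  # string coerced to float
print(item.price)  # 999.99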

10. Security and Compliance

Crawlers must avoid scraping PII or restricted data.
Always check GDPR/CCPA compliance before storing user data.


Conclusion

Building a robust crawler requires technical precision and ethical responsibility. By addressing pitfalls like aggressive crawling, JavaScript rendering, anti-bot defenses, and scalability, developers can design crawlers that are:

  • Efficient (optimized resource usage)
  • Resilient (error-tolerant)
  • Compliant (legal and ethical)

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.
