Common Pitfalls and Solutions for Web Crawlers (With Code Examples)

Web crawlers are not just about sending HTTP requests—they must deal with JavaScript rendering, anti-bot defenses, scalability, and error handling. In this article, we’ll look at common pitfalls developers face when building crawlers and provide practical solutions with code snippets.
1. Ignoring Robots.txt and Crawl Policies
If your crawler ignores robots.txt, you risk legal issues or IP blocks.
Bad practice:
python
import requests
html = requests.get("https://example.com").text
# No check for robots.txt
Better approach:
python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/page"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")
✅ Always respect crawl policies and implement rate limits.
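Rate limits can also come straight from robots.txt. Below is a minimal sketch that reads the Crawl-delay directive with the same urllib.robotparser setup as above; the one-second fallback is an arbitrary polite default, not a standard value.
python
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# crawl_delay() returns None when robots.txt sets no Crawl-delay;
# the 1-second fallback here is an assumed default, not part of the spec.
delay = rp.crawl_delay("*") or 1
time.sleep(delay)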
2. Crawling Too Aggressively
Sending thousands of requests per second is a fast way to get banned.
Solution:
- Add delays
- Use async crawling for efficiency
python
import asyncio, aiohttp, random

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            print(len(html))
            await asyncio.sleep(random.uniform(1, 3))  # polite delay

asyncio.run(main())
3. Handling JavaScript-Rendered Content
Static crawlers miss JS-heavy pages (React, Vue, Angular).
Solution: Use a headless browser (e.g., Playwright, Puppeteer).
python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    print(page.content())  # now includes JS-rendered content
    browser.close()
4. Inefficient Data Extraction
Hardcoding fragile selectors leads to broken crawlers.
Better approach with BeautifulSoup + fallbacks:
python
from bs4 import BeautifulSoup

html = "<div><h1 class='title'>Hello</h1></div>"
soup = BeautifulSoup(html, "lxml")

# Primary selector
title = soup.select_one("h1.title")
# Fallback
if not title:
    title = soup.find("h1")
print(title.text)
5. Duplicate Content Collection
URLs like /page?id=123&session=abc may cause duplicates.
Solution: Normalize URLs
python
from urllib.parse import urlparse, urlunparse

def normalize(url):
    parsed = urlparse(url)
    clean = parsed._replace(query="")  # drops the entire query string
    return urlunparse(clean)

print(normalize("https://example.com/page?id=1&session=xyz"))
# -> https://example.com/page
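Normalization pays off when paired with a deduplication check. A minimal in-memory sketch building on normalize() above; a production crawler would typically keep the seen set in Redis or a Bloom filter instead:
python
seen = set()

def should_crawl(url):
    key = normalize(url)  # normalize() defined in the previous snippet
    if key in seen:
        return False
    seen.add(key)
    return True

print(should_crawl("https://example.com/page?session=abc"))  # True
print(should_crawl("https://example.com/page?session=xyz"))  # False, same page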
6. IP Blocking and Anti-Bot Mechanisms
Websites detect bots with rate anomalies, fingerprints, and CAPTCHAs.
Basic rotation with Scrapy:
python
import random

class RotateUserAgentMiddleware:
    user_agents = [
        "Mozilla/5.0 ...",
        "Chrome/91.0 ...",
        "Safari/537.36 ..."
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
Solution stack:
- Rotate proxies and user agents (see the sketch after this list)
- Use residential/mobile proxies
- Integrate CAPTCHA solvers when needed
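A minimal requests-based sketch of proxy plus user-agent rotation; the proxy addresses and truncated user-agent strings below are placeholders, not working values:
python
import random
import requests

# Placeholder pools; substitute your own proxy endpoints and full UA strings.
PROXIES = ["http://proxy1.example:8080", "http://proxy2.example:8080"]
USER_AGENTS = ["Mozilla/5.0 ...", "Chrome/91.0 ..."]

def fetch_with_rotation(url):
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )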
7. Error Handling
Network errors are inevitable. Without retries, crawlers fail silently.
Example with retries:
python
import requests, time

def fetch(url, retries=3):
    for i in range(retries):
        try:
            return requests.get(url, timeout=5)
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}, retrying {i+1}")
            time.sleep(2**i)  # exponential backoff
    return None
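The retry above only catches network exceptions, but servers often signal throttling with HTTP 429 or 503 instead. A variant under the same assumptions that also retries on those status codes:
python
import requests, time

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_status_retry(url, retries=3):
    for i in range(retries):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code not in RETRYABLE:
                return resp
            print(f"Got {resp.status_code}, retrying {i+1}")
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}, retrying {i+1}")
        time.sleep(2**i)  # exponential backoff
    return None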
8. Scalability Challenges
A crawler that works for 1,000 pages may fail at 10M.
Distributed crawling example with Scrapy + Redis:
bash
scrapy runspider crawler.py -s JOBDIR=crawls/job1
Use:
- Redis/Kafka for distributed task queues (see the scrapy-redis sketch after this list)
- Scrapy Cluster / Nutch for scaling
- Cloud storage for crawl results
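A minimal configuration sketch for the Redis option, assuming the scrapy-redis package is installed and a Redis instance runs at the URL shown; this lets multiple workers share one request queue and dupefilter:
python
# settings.py (sketch, assuming the scrapy-redis extension)
# All workers pull requests from a shared Redis queue and share one dupefilter.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue between runs
REDIS_URL = "redis://localhost:6379"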
9. Data Quality Issues
Crawled data may contain duplicates, empty fields, or invalid formats.
Solution: Schema validation
python
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: float

try:
    item = Product(name="Laptop", price="not a number")
except ValidationError as e:
    print(e)
10. Security and Compliance
Crawlers must avoid scraping PII or restricted data.
Always check GDPR/CCPA compliance before storing user data.
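As one illustration, here is a minimal sketch that redacts email-like strings from a record before it is stored; the regex and field handling are simplified assumptions, not a complete PII policy:
python
import re

# Simplified email pattern; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_pii(record: dict) -> dict:
    return {k: EMAIL_RE.sub("[REDACTED]", v) if isinstance(v, str) else v
            for k, v in record.items()}

print(redact_pii({"name": "Laptop", "contact": "seller@example.com"}))
# -> {'name': 'Laptop', 'contact': '[REDACTED]'}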
Conclusion
Building a robust crawler requires technical precision and ethical responsibility. By addressing pitfalls like aggressive crawling, JavaScript rendering, anti-bot defenses, and scalability, developers can design crawlers that are:
- Efficient (optimized resource usage)
- Resilient (error-tolerant)
- Compliant (legal and ethical)
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.