Common Web Crawler Pitfalls and Solutions (with Code Examples)

Michael Lee

Expert Network Defense Engineer

16-Sep-2025

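Web crawlers do more than send HTTP requests: they must also handle JavaScript rendering, anti-bot mechanisms, scalability, and error handling. In this article, we look at common pitfalls developers face when building crawlers and offer practical solutions with code snippets.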

1. Ignoring robots.txt and Crawl Policies

A crawler that ignores robots.txt risks legal trouble or IP bans.

Bad practice:


import requests

html = requests.get("https://example.com").text
# robots.txt is never checked

Better practice:


import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/page"):
    print("Crawling allowed")
else:
    print("Disallowed by robots.txt")

✅ Always respect crawl policies and implement rate limiting.
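Some sites also publish a Crawl-delay directive, which can drive the rate limit directly. The sketch below parses a hypothetical robots.txt from a string (so it runs without network access) and reads both the permission and the delay; the agent name `my-crawler` and the one-second fallback are assumptions.

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed from a string so no network access is needed
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("my-crawler", "https://example.com/page")
blocked = rp.can_fetch("my-crawler", "https://example.com/private/data")

# crawl_delay() returns None when the file sets no Crawl-delay for this agent,
# so fall back to an assumed default of one second
delay = rp.crawl_delay("my-crawler") or 1.0

print(allowed, blocked, delay)
```

In a real crawler, `delay` would feed the sleep between requests to the same host.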

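2. Crawling Too Aggressively

Sending thousands of requests per second is a quick way to get banned.

Solution:

Add delays between requests
Use asynchronous crawling for efficiency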


import asyncio, aiohttp, random

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            print(len(html))
            await asyncio.sleep(random.uniform(1, 3))  # polite delay

asyncio.run(main())
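The loop above awaits each page in turn, so requests never overlap. One way to gain throughput while staying polite is to cap the number of in-flight requests with a semaphore. The sketch below illustrates the idea with a simulated fetch (`fake_fetch` is a stand-in that sleeps instead of calling the network), and the cap of 3 is an assumed value.

```python
import asyncio
import random

MAX_CONCURRENCY = 3  # assumed cap; tune to what the target site tolerates

async def fake_fetch(url, stats):
    # Stand-in for session.get(url): sleeps briefly instead of touching the network
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(random.uniform(0.01, 0.03))
    stats["in_flight"] -= 1
    return f"<html>{url}</html>"

async def polite_fetch(url, semaphore, stats):
    # The semaphore lets at most MAX_CONCURRENCY fetches run at once
    async with semaphore:
        return await fake_fetch(url, stats)

async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    stats = {"in_flight": 0, "peak": 0}
    # gather() returns results in the same order as the input URLs
    pages = await asyncio.gather(*(polite_fetch(u, semaphore, stats) for u in urls))
    return pages, stats["peak"]

urls = [f"https://example.com/page{i}" for i in range(10)]
pages, peak = asyncio.run(crawl(urls))
print(len(pages), "pages fetched, peak concurrency:", peak)
```

With aiohttp, the body of `fake_fetch` would be replaced by the real `session.get` call; the semaphore logic stays the same.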

3. Handling JavaScript-Rendered Content

Static crawlers miss content on JavaScript-heavy pages (React, Vue, Angular).

Solution: use a headless browser (for example, Playwright or Puppeteer).


from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    print(page.content())  # now includes the JS-rendered content
    browser.close()
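Headless browsers are much slower than plain HTTP requests, so it can help to check whether a page needs one at all. The sketch below is a rough heuristic, not a reliable detector: it counts the visible text in the static HTML and flags pages that are mostly an empty shell. The function name and the 200-character threshold are assumptions.

```python
from html.parser import HTMLParser

class VisibleTextCounter(HTMLParser):
    """Counts characters of text that appear outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())

def probably_needs_js(html, min_visible_chars=200):
    # Heuristic only: the threshold is an assumption and should be tuned per site
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()  # flush any buffered trailing text
    return counter.visible_chars < min_visible_chars

static_page = "<html><body><p>" + "Plain server-rendered text. " * 20 + "</p></body></html>"
js_shell = "<html><body><div id='root'></div><script>renderApp();</script></body></html>"

print(probably_needs_js(static_page))
print(probably_needs_js(js_shell))
```

A crawler could fetch statically first and fall back to the headless browser only when this check returns True.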

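4. Inefficient Data Extraction

Hard-coded, brittle selectors break the crawler whenever the page markup changes.

Better practice: use BeautifulSoup with a fallback selector: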


from bs4 import BeautifulSoup

html = "<div><h1 class='title'>Hello</h1></div>"
soup = BeautifulSoup(html, "lxml")

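# Primary selector
title = soup.select_one("h1.title")

# Fallback
if not title:
    title = soup.find("h1")

# Guard against pages where neither selector matches
print(title.text if title else "title not found")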

5. Collecting Duplicate Content

URLs such as /page?id=123&session=abc can lead to duplicates.

Solution: normalize URLs


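from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Parameters assumed not to affect page content; adjust this set per site
IGNORED_PARAMS = {"session", "sessionid", "utm_source", "utm_medium", "utm_campaign"}

def normalize(url):
    parsed = urlparse(url)
    # Keep meaningful parameters (such as id) and sort them so order does not matter
    kept = sorted((k, v) for k, v in parse_qsl(parsed.query) if k not in IGNORED_PARAMS)
    clean = parsed._replace(query=urlencode(kept), fragment="")
    return urlunparse(clean)

print(normalize("https://example.com/page?id=1&session=xyz"))
# -> https://example.com/page?id=1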
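URL normalization only catches duplicates that differ in their address. When the same body is served under unrelated URLs, a content fingerprint can help. The following is a minimal sketch using an exact hash of whitespace-normalized text; the helper names are illustrative, and near-duplicate detection (for example, pages that differ by a timestamp) would need a fuzzier technique.

```python
import hashlib
import re

def content_fingerprint(text):
    # Collapse whitespace and lowercase so trivial formatting differences do not matter
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(text):
    """Return True if an equivalent page body was already recorded."""
    fp = content_fingerprint(text)
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(is_duplicate("Hello   World\n"))  # first time this body is seen
print(is_duplicate("hello world"))      # same body after normalization
print(is_duplicate("Another page"))
```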

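6. IP Blocking and Anti-Bot Mechanisms

Websites detect bots through abnormal request rates, fingerprinting, and CAPTCHAs.

Basic User-Agent rotation with a Scrapy downloader middleware: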


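import random

class RotateUserAgentMiddleware:
    # Placeholder User-Agent strings; replace with complete, real values
    user_agents = [
        "Mozilla/5.0 ...",
        "Chrome/91.0 ...",
        "Safari/537.36 ..."
    ]

    def process_request(self, request, spider):
        # Pick a User-Agent at random for each outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)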

Solution stack:

Rotate proxies and user agents
Use residential or mobile proxies
Integrate a CAPTCHA solver when needed
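Proxy rotation can be sketched without any framework. The example below cycles round-robin through a proxy pool and builds keyword arguments in the shape that `requests.get()` accepts (`proxies`, `headers`, `timeout`). The proxy addresses and User-Agent strings are hypothetical placeholders, not working endpoints.

```python
import itertools
import random

# Hypothetical proxy endpoints and User-Agent strings, placeholders only
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
USER_AGENTS = ["example-agent/1.0", "example-agent/2.0"]

proxy_cycle = itertools.cycle(PROXIES)

def next_request_settings():
    """Return keyword arguments in the shape that requests.get() accepts."""
    proxy = next(proxy_cycle)  # round-robin over the proxy pool
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "timeout": 10,
    }

for _ in range(4):
    settings = next_request_settings()
    print(settings["proxies"]["http"], settings["headers"]["User-Agent"])
```

A call would then look like `requests.get(url, **next_request_settings())`. A production pool would likely also drop proxies that keep failing.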

7. Error Handling

Network errors are inevitable. Without retries, a crawler fails silently.

Example with retries:


import requests, time

def fetch(url, retries=3):
    for i in range(retries):
        try:
            return requests.get(url, timeout=5)
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}, retry {i+1}")
            time.sleep(2**i)  # exponential backoff: 1s, 2s, 4s
    return None
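When many workers hit the same failure at once, fixed backoff makes them all retry at the same moment. A common refinement is to add random jitter and to re-raise after the last attempt rather than return None. The helper below is a sketch under those assumptions; `retry` and its parameters are illustrative names, and the flaky function only simulates a network error.

```python
import random
import time

def retry(func, retries=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out, then re-raise."""
    for attempt in range(retries):
        try:
            return func()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error instead of returning None
            # Exponential backoff plus jitter so workers do not retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)

# Simulated flaky operation (hypothetical): fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"

delays = []
result = retry(flaky, sleep=delays.append)  # record the delays instead of sleeping
print(result, calls["count"], len(delays))
```

Passing `sleep` as a parameter keeps the helper easy to test; real code would leave the default `time.sleep` in place.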

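8. Scalability Challenges

A crawler that works for 1,000 pages may fail at 10 million.

Example of a resumable Scrapy crawl. The JOBDIR setting persists crawl state to disk so the job can be paused and resumed; spreading the work across machines additionally needs a shared queue such as Redis: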


scrapy runspider crawler.py -s JOBDIR=crawls/job1

Use:

Redis or Kafka for distributed task queues
Scrapy Cluster or Nutch for scaling out
Cloud storage for crawl results
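One question a distributed crawl has to answer is which worker handles which URL. A simple approach, sketched below, is to partition by hostname with a stable hash, so that all pages of one site land on the same worker and per-site rate limits stay easy to enforce. The function name and the worker count of 4 are assumptions; a real deployment would push these assignments into a queue such as Redis or Kafka.

```python
import hashlib
from urllib.parse import urlparse

def worker_for(url, num_workers):
    """Map a URL to a worker index based on its hostname."""
    host = urlparse(url).hostname or ""
    # A stable hash (unlike the built-in hash()) gives the same answer on every machine
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.org/about",
    "https://example.net/news",
]

assignments = {}
for url in urls:
    assignments.setdefault(worker_for(url, 4), []).append(url)

for worker, assigned in sorted(assignments.items()):
    print(worker, assigned)
```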

9. Data Quality Issues

Crawled data may contain duplicates, empty fields, or invalid formats.

Solution: schema validation


from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: float

try:
    item = Product(name="Laptop", price="not a number")
except ValidationError as e:
    print(e)
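Validation rejects bad records, but scraped values can often be repaired first. The sketch below uses only the standard library to cover the three problems named above: it strips currency formatting, drops records with empty or invalid fields, and removes duplicates through a set. The record layout and the `clean_record` helper are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen makes instances hashable, so a set can deduplicate them
class Product:
    name: str
    price: float

def clean_record(raw):
    """Return a Product, or None if the record cannot be repaired."""
    name = (raw.get("name") or "").strip()
    price_text = str(raw.get("price", "")).replace("$", "").replace(",", "").strip()
    if not name:
        return None  # empty field
    try:
        price = float(price_text)
    except ValueError:
        return None  # invalid format
    return Product(name=name, price=price)

raw_items = [
    {"name": " Laptop ", "price": "$1,299.00"},
    {"name": "Laptop", "price": "1299"},          # duplicate after cleaning
    {"name": "", "price": "10"},                  # empty field
    {"name": "Phone", "price": "not a number"},   # invalid format
]

cleaned = {p for p in map(clean_record, raw_items) if p is not None}
print(cleaned)
```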