# Crawler Anti-Detection
Covers proxies, fingerprint spoofing, CAPTCHA handling, rate limits, and cookie management.
## Detection mechanisms (what sites detect)
- IP rate limiting / reputation
- User-Agent fingerprint
- Browser fingerprint (canvas, WebGL, fonts, screen)
- Missing human behavior (mouse moves, scrolls, timing)
- Missing HTTP headers normal browsers send
- TLS fingerprint (JA3 hash)
- Headless browser signals (`navigator.webdriver = true`)
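The TLS fingerprint item can be made concrete. JA3 takes five fields from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats), writes each as decimal values joined by `-`, joins the fields with `,`, and MD5-hashes the result. The sketch below assumes the fields have already been extracted from a handshake; the function name `ja3_hash` and the sample values are illustrative, not taken from a real capture.

```python
import hashlib

def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string from ClientHello fields and return (string, MD5 hex digest)."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()

# Illustrative values only; a real ClientHello lists many more entries.
ja3_string, digest = ja3_hash(771, [4865, 4866, 49195], [0, 10, 11], [29, 23], [0])
print(ja3_string)
print(digest)
```

Because the hash depends on the client's TLS stack rather than on any HTTP header, changing the `User-Agent` string alone leaves it unchanged.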
## Proxy rotation
```python
import random
import requests


class ProxyPool:
    """Hands out random proxies and skips those marked as failed."""

    def __init__(self, proxies):
        self.proxies = proxies
        self.failed = set()

    def get(self):
        available = [p for p in self.proxies if p not in self.failed]
        return random.choice(available) if available else None

    def mark_failed(self, proxy):
        self.failed.add(proxy)


pool = ProxyPool([
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
])


def fetch_with_retry(url, max_retries=3):
    for _ in range(max_retries):
        proxy = pool.get()
        if proxy is None:  # every proxy has been marked as failed
            break
        try:
            resp = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
            if resp.status_code == 200:
                return resp
            # 403/407/429 usually mean this proxy is blocked, rejected, or throttled
            if resp.status_code in (403, 407, 429):
                pool.mark_failed(proxy)
        except requests.RequestException:
            pool.mark_failed(proxy)
    raise RuntimeError(f'Failed to fetch {url}: retries exhausted or no proxies left')
```
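The pool above removes a proxy permanently after one failure, which can drain the pool during a temporary block. One possible variation benches a failed proxy for a cooldown period instead. This is a sketch, not part of the original pool: the class name `CooldownProxyPool`, the `cooldown` parameter, and the injectable `clock` are all assumptions made for illustration.

```python
import random
import time


class CooldownProxyPool:
    """Proxy pool that benches a failed proxy for `cooldown` seconds instead of forever."""

    def __init__(self, proxies, cooldown=300.0, clock=time.monotonic):
        self.proxies = list(proxies)
        self.cooldown = cooldown
        self.clock = clock          # injectable so the pool can be tested without sleeping
        self.benched_until = {}     # proxy -> time at which it becomes usable again

    def get(self):
        now = self.clock()
        available = [p for p in self.proxies if self.benched_until.get(p, 0.0) <= now]
        return random.choice(available) if available else None

    def mark_failed(self, proxy):
        self.benched_until[proxy] = self.clock() + self.cooldown


# Hypothetical proxy URLs, mirroring the placeholders used above.
cooldown_pool = CooldownProxyPool(
    ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080'],
    cooldown=60,
)
proxy = cooldown_pool.get()
cooldown_pool.mark_failed(proxy)   # this proxy is skipped for the next 60 seconds
```

Using a monotonic clock avoids surprises when the system time is adjusted while the crawler is running.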
## Proxy providers
- **Residential**: Bright Data, Oxylabs, Smartproxy (real IPs, expensive)
- **Datacenter**: many providers (fast, cheap, easier to detect)
- **Mobile**: highest trust score, most expensive
## Request headers (mimic real browser)
```python
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    # requests only decodes 'br' responses if the brotli (or brotlicffi) package is installed
    'Accept-Encoding': 'gzip, deflate, br',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    # Keep the version here in sync with the Chrome version in User-Agent
    'Sec-CH-UA': '"Chromium";v="124", "Google Chrome";v="124"',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
}
```
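A `User-Agent` that claims one Chrome version while `Sec-CH-UA` claims another is an easy inconsistency to spot. A minimal sketch of generating both from a single version number follows; the helper name `chrome_headers` is hypothetical, and it only mirrors the two brands shown in the dict above (a real Chrome build sends additional brand entries and client-hint headers).

```python
def chrome_headers(major_version, platform="Windows NT 10.0; Win64; x64"):
    """Return a User-Agent and Sec-CH-UA pair that agree on the Chrome major version."""
    user_agent = (
        f"Mozilla/5.0 ({platform}) AppleWebKit/537.36 "
        f"(KHTML, like Gecko) Chrome/{major_version}.0.0.0 Safari/537.36"
    )
    sec_ch_ua = f'"Chromium";v="{major_version}", "Google Chrome";v="{major_version}"'
    return {"User-Agent": user_agent, "Sec-CH-UA": sec_ch_ua}


# Merge the result into a base header dict so both values always change together.
versioned = chrome_headers(124)
print(versioned["User-Agent"])
print(versioned["Sec-CH-UA"])
```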
## Playwright stealth
```ts
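import { chromium } from 'playwright-extra'
import StealthPlugin from 'puppeteer-extra-plugin-stealth'

chromium.use(StealthPlugin())

// Use a UA string that matches the browser build you actually launch
const realUA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'

const browser = await chromium.launch({ headless: true })
const context = await browser.newContext({
  userAgent: realUA,
  viewport: { width: 1366, height: 768 },
  locale: 'zh-CN',
  timezoneId: 'Asia/Shanghai',
  geolocation: { latitude: 31.23, longitude: 121.47 },
  permissions: ['geolocation'],
})
const page = await context.newPage()

// Override webdriver flag before any page script runs
// (a normal, non-automated Chrome reports false, so undefined would itself stand out)
await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false })
})
await page.goto('https://example.com') // placeholder URL

// Human-like behavior
await page.mouse.move(100 + Math.random() * 100, 200 + Math.random() * 100)
await page.waitForTimeout(500 + Math.random() * 1500)
await page.keyboard.type('search text', { delay: 80 + Math.random() * 60 })

await browser.close()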
```
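The single `page.mouse.move` call above jumps to its target in one step. A rough sketch of generating a curved path of intermediate points is shown below, written in Python to match the rest of this page. The function name `human_mouse_path`, the step count, and the jitter range are assumptions chosen for illustration; each resulting point could be passed to Playwright's `page.mouse.move(x, y)` with a short pause in between.

```python
import random


def human_mouse_path(start, end, steps=20, jitter=2.0, rng=random):
    """Return steps + 1 points from start to end along a quadratic Bezier curve."""
    (x0, y0), (x2, y2) = start, end
    # A random control point near the midpoint bends the path away from a straight line.
    cx = (x0 + x2) / 2 + rng.uniform(-100, 100)
    cy = (y0 + y2) / 2 + rng.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        if 0 < i < steps:  # keep the endpoints exact, wobble the points in between
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((x, y))
    return points


path = human_mouse_path((100, 200), (640, 360))
print(len(path), path[0], path[-1])
```

Whether a curve like this is enough to pass a given behavioral check is site-specific; it only removes the most obvious signal of a single instantaneous jump.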
## CAPTCHA handling
```python
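# 2captcha service
import time
import requests

API_KEY = 'YOUR_2CAPTCHA_API_KEY'  # placeholder; load from an environment variable in practice


def solve_recaptcha(site_key, page_url):
    # Submit
    resp = requests.post('https://2captcha.com/in.php', data={
        'key': API_KEY, 'method': 'userrecaptcha',
        'googlekey': site_key, 'pageurl': page_url,
    }, timeout=30)
    if not resp.text.startswith('OK|'):
        raise RuntimeError(f'2captcha submit failed: {resp.text}')
    task_id = resp.text.split('|')[1]
    # Poll for result
    time.sleep(15)
    for _ in range(10):
        result = requests.get('https://2captcha.com/res.php', params={
            'key': API_KEY, 'action': 'get', 'id': task_id,
        }, timeout=30)
        if result.text.startswith('OK|'):
            return result.text.split('|', 1)[1]
        time.sleep(5)
    raise TimeoutError('2captcha did not return a token in time')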
```
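The polling loop above treats every non-`OK` reply as "not ready yet", so a hard error (for example a rejected key) is retried until the loop runs out. A sketch of a stricter loop follows. It assumes the service replies `CAPCHA_NOT_READY` (the service's own spelling) while the task is pending; the helper name `poll_for_token` and the injectable `fetch` and `sleep` callables are hypothetical, added so the logic can be exercised without network access.

```python
import time


def poll_for_token(fetch, attempts=10, delay=5.0, sleep=time.sleep):
    """Call fetch() until it returns 'OK|<token>'; raise on errors or after too many tries."""
    for _ in range(attempts):
        text = fetch()
        if text.startswith("OK|"):
            return text.split("|", 1)[1]
        if text != "CAPCHA_NOT_READY":
            # Anything else is an error code; retrying will not help.
            raise RuntimeError(f"solver returned an error: {text}")
        sleep(delay)
    raise TimeoutError("no token after polling")


# Simulated replies stand in for the HTTP call to res.php.
replies = iter(["CAPCHA_NOT_READY", "OK|example-token"])
token = poll_for_token(lambda: next(replies), sleep=lambda seconds: None)
print(token)
```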
## Rate limiting & politeness
```python
import time, random
from functools import wraps

import requests

session = requests.Session()


def rate_limit(min_delay=1.0, max_delay=3.0):
    """Sleep a random interval after each call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            time.sleep(random.uniform(min_delay, max_delay))
            return result
        return wrapper
    return decorator


@rate_limit(1, 4)
def fetch_page(url):
    return session.get(url, timeout=10)
```
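Politeness also covers honoring `robots.txt`. The standard library's `urllib.robotparser` can answer both "may I fetch this URL?" and "what crawl delay is requested?". The sketch below parses an inline, made-up `robots.txt` so it runs offline; the user agent name `my-crawler` and the rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules; a live crawler would download the site's real robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_text):
    """Parse robots.txt content into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)
allowed = policy.can_fetch("my-crawler", "https://example.com/products")
blocked = policy.can_fetch("my-crawler", "https://example.com/private/data")
delay = policy.crawl_delay("my-crawler")  # None when no Crawl-delay applies
print(allowed, blocked, delay)
```

Against a live site, `set_url()` followed by `read()` fetches the file, and the returned crawl delay could be used as the lower bound for `rate_limit`.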
## Cookie & session persistence
```python
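import pickle
import requests

session = requests.Session()

# Save cookies
with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Load cookies (only unpickle files you wrote yourself; pickle can execute code on load)
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))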
```
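Because unpickling can run arbitrary code, a plain-text format may be preferable when cookie files are shared or stored long term. The following is one possible JSON-based sketch. The helper names `save_cookies` and `load_cookies` and the chosen field list are assumptions; the demo uses `SimpleNamespace` objects as stand-ins for real cookie objects, which expose the same `name`, `value`, `domain`, and `path` attributes.

```python
import json
import os
import tempfile
from types import SimpleNamespace

FIELDS = ("name", "value", "domain", "path")


def save_cookies(cookies, filename):
    """Write an iterable of cookie objects (e.g. session.cookies) to a JSON file."""
    records = [{field: getattr(cookie, field) for field in FIELDS} for cookie in cookies]
    with open(filename, "w", encoding="utf-8") as fh:
        json.dump(records, fh)


def load_cookies(filename):
    """Read cookie records back as a list of plain dicts."""
    with open(filename, encoding="utf-8") as fh:
        return json.load(fh)


# Stand-in cookie for the demo; with requests you would pass session.cookies instead.
demo_cookies = [SimpleNamespace(name="sid", value="abc123", domain="example.com", path="/")]
cookie_file = os.path.join(tempfile.gettempdir(), "cookies.json")
save_cookies(demo_cookies, cookie_file)
restored = load_cookies(cookie_file)
print(restored)
```

With `requests`, each restored record can be put back with `session.cookies.set(c["name"], c["value"], domain=c["domain"], path=c["path"])`. Expiry and the secure flag are not preserved in this sketch.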