AI Skill Library

Requests, BeautifulSoup, lxml, parsing HTML/JSON, headers, sessions, rate limiting.

Tags: python, scraping, crawler, backend
# Python Web Scraping

## Install
```bash
pip install requests beautifulsoup4 lxml httpx
```

## Basic fetch + parse
```python
import requests
from bs4 import BeautifulSoup
from lxml import etree

# Browser-like headers; some sites reject the default python-requests User-Agent
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'text/html,application/xhtml+xml,*/*',
}

# A Session reuses connections and keeps cookies across requests
session = requests.Session()
session.headers.update(HEADERS)

resp = session.get('https://example.com/articles', timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx

soup = BeautifulSoup(resp.text, 'lxml')

# CSS selectors
titles = [el.get_text(strip=True) for el in soup.select('h2.article-title')]

# XPath via lxml (@class="read-more" matches only when that is the whole class attribute)
tree = etree.HTML(resp.text)
links = tree.xpath('//a[@class="read-more"]/@href')
```
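
The `href` values extracted above are often relative paths, so they usually need to be resolved against the page URL before they can be fetched. A minimal sketch using only the standard library follows; the helper name `absolutize` and the sample hrefs are illustrative assumptions, not part of the original example.

```python
from urllib.parse import urljoin, urldefrag

def absolutize(base_url, hrefs):
    """Resolve hrefs against base_url, drop #fragments, non-HTTP links and duplicates."""
    seen = set()
    absolute_links = []
    for href in hrefs:
        # urljoin handles '/path', 'path' and already-absolute URLs alike
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(('http://', 'https://')) and absolute not in seen:
            seen.add(absolute)
            absolute_links.append(absolute)
    return absolute_links

# Hypothetical hrefs, standing in for the XPath result
sample_hrefs = ['/post/1', 'post/2', 'https://example.com/post/1#comments', 'mailto:a@example.com']
abs_links = absolutize('https://example.com/articles', sample_hrefs)
```

Deduplicating here keeps the same page from being fetched again when it is linked twice with different fragments.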

## Pagination loop
```python
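import random
import time

base_url = 'https://example.com/news?page={}'
results = []

# Reuses `session` and BeautifulSoup from the basic fetch example.
# range(1, 20) covers pages 1 through 19.
for page in range(1, 20):
    resp = session.get(base_url.format(page), timeout=10)
    resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, 'lxml')
    items = soup.select('.item')
    if not items:
        break  # an empty page means we ran past the last one
    for item in items:
        title, link, date = (item.select_one(sel) for sel in ('h3', 'a', '.date'))
        if not (title and link and date):
            continue  # skip items missing an expected element
        results.append({
            'title': title.get_text(strip=True),
            'link':  link.get('href'),
            'date':  date.get_text(strip=True),
        })
    time.sleep(random.uniform(1, 3))  # polite delay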
```
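
The polite delay above spaces requests out, but a long crawl will still hit occasional 429 or 5xx responses. One common way to handle this, sketched here under the assumption that `requests` with its bundled `urllib3` is installed, is to mount an adapter with a retry policy on the session. The function name `make_retrying_session` and the chosen status codes are illustrative, not from the original.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=3, backoff_factor=1.0):
    """Build a Session that retries rate-limit and server errors with backoff."""
    retry = Retry(
        total=total,                                  # maximum number of retries per request
        backoff_factor=backoff_factor,                # pauses grow exponentially between retries
        status_forcelist=[429, 500, 502, 503, 504],   # statuses that trigger a retry
    )
    adapter = HTTPAdapter(max_retries=retry)
    new_session = requests.Session()
    new_session.mount('http://', adapter)
    new_session.mount('https://', adapter)
    return new_session

retrying_session = make_retrying_session()
```

A session built this way can replace `session` in the loop; the explicit `time.sleep` is still worth keeping, since retries only cover failed requests.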

## JSON API scraping
```python
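# Reuses `session` from the basic fetch example
resp = session.get('https://api.example.com/data',
                   params={'limit': 100, 'offset': 0}, timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx
data = resp.json()       # raises ValueError if the body is not valid JSON
items = data['results']  # key name depends on the API's response schema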
```
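
The request above fetches only the first 100 records. A rough sketch of walking the whole collection with `limit`/`offset` follows. It assumes the API returns a short or empty page at the end, which not every API does; `iter_offset_pages`, `make_http_fetcher` and the fake data are hypothetical names introduced for illustration.

```python
def iter_offset_pages(fetch_page, limit=100, max_pages=50):
    """Yield items page by page until a short or empty page signals the end."""
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        batch = fetch_page(limit=limit, offset=offset)
        yield from batch
        if len(batch) < limit:
            break
        offset += limit

def make_http_fetcher(session, url):
    """Wrap a requests-style session; assumes the payload has a 'results' list."""
    def fetch_page(limit, offset):
        resp = session.get(url, params={'limit': limit, 'offset': offset}, timeout=10)
        resp.raise_for_status()
        return resp.json()['results']
    return fetch_page

# Offline stand-in for the API so the sketch runs without network access
FAKE_ROWS = [{'id': i} for i in range(250)]

def fake_fetch_page(limit, offset):
    return FAKE_ROWS[offset:offset + limit]

all_items = list(iter_offset_pages(fake_fetch_page, limit=100))
```

Against a real endpoint, `make_http_fetcher(session, 'https://api.example.com/data')` would take the place of `fake_fetch_page`.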

## Async scraping (httpx + asyncio)
```python
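import asyncio
import httpx

# Reuses HEADERS from the basic fetch example
async def fetch(client, url):
    resp = await client.get(url, timeout=10)
    resp.raise_for_status()  # raise on 4xx/5xx
    return resp.text

async def main(urls):
    async with httpx.AsyncClient(headers=HEADERS, follow_redirects=True) as client:
        tasks = [fetch(client, url) for url in urls]
        # return_exceptions=True returns failures in the list instead of raising
        fetched = await asyncio.gather(*tasks, return_exceptions=True)
    return fetched

url_list = ['https://example.com/page/1', 'https://example.com/page/2']  # placeholder URLs
fetched = asyncio.run(main(url_list))
# Drop failed requests; each remaining entry is the HTML of one page
pages = [page for page in fetched if not isinstance(page, Exception)]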
```
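
`asyncio.gather` as used above starts every request at once, which can overwhelm a server when the URL list is long. A minimal sketch of capping concurrency with `asyncio.Semaphore` follows. The helper `gather_bounded`, the fake fetcher and the demo URLs are assumptions made so the example runs offline.

```python
import asyncio

async def gather_bounded(fetch, urls, max_concurrency=5):
    """Run fetch(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:  # waits here once the limit is reached
            return await fetch(url)

    # Results come back in the same order as urls; failures appear as exception objects
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)

# Offline stand-in for an HTTP call that records how many calls overlap
stats = {'in_flight': 0, 'peak': 0}

async def fake_fetch(url):
    stats['in_flight'] += 1
    stats['peak'] = max(stats['peak'], stats['in_flight'])
    await asyncio.sleep(0.01)  # simulate network latency
    stats['in_flight'] -= 1
    return f'<html>{url}</html>'

demo_urls = [f'https://example.com/page/{i}' for i in range(20)]
demo_pages = asyncio.run(gather_bounded(fake_fetch, demo_urls, max_concurrency=5))
```

With the httpx client from the section above, `fetch` could be `lambda url: fetch(client, url)` inside the `async with` block.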

## Anti-block techniques
```python
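import random
from urllib.robotparser import RobotFileParser

# Placeholder target; `session` comes from the basic fetch example
url = 'https://example.com/articles'

# Rotate User-Agents (entries are truncated placeholders; use full strings)
UA_LIST = ['Mozilla/5.0 ...', 'Chrome/120 ...', 'Safari/17 ...']
session.headers['User-Agent'] = random.choice(UA_LIST)

# Rotate proxies (placeholder hosts; substitute real host:port values)
PROXIES = ['http://proxy1:port', 'http://proxy2:port']
proxy = random.choice(PROXIES)
resp = session.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)

# Respect robots.txt: check before fetching
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
if rp.can_fetch('*', url):
    resp = session.get(url, timeout=10)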
```
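
Besides allow and disallow rules, some sites publish a `Crawl-delay` in robots.txt, and `RobotFileParser.crawl_delay` exposes it. The sketch below parses an inline robots.txt so it runs offline. The file contents, the helper `polite_delay` and the one-second fallback are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, parsed from a string instead of fetched
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

demo_rp = RobotFileParser()
demo_rp.parse(ROBOTS_TXT.splitlines())

def polite_delay(parser, user_agent='*', default=1.0):
    """Return the site's Crawl-delay in seconds, or a fallback when none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

delay_seconds = polite_delay(demo_rp)
allowed = demo_rp.can_fetch('*', 'https://example.com/articles')
blocked = demo_rp.can_fetch('*', 'https://example.com/private/report')
```

The returned value could replace the random sleep in the pagination loop whenever the site declares a delay.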

## Save data
```python
import csv
import json

# `results` is the list of dicts built in the pagination loop

# JSON
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

# CSV (fieldnames must match the dict keys)
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link', 'date'])
    writer.writeheader()
    writer.writerows(results)
```
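
Writing everything at the end, as above, loses the whole crawl if the script crashes midway. One possible alternative is to append each page's records to a JSON Lines file as they arrive. This is a sketch, not part of the original recipe; `append_jsonl`, `read_jsonl` and the sample records are hypothetical.

```python
import json
import os
import tempfile

def append_jsonl(path, records):
    """Append records to a JSON Lines file, one JSON object per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

def read_jsonl(path):
    """Load every non-empty line of a JSON Lines file back into a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records written in two batches, as a crawl loop would do per page
demo_path = os.path.join(tempfile.mkdtemp(), 'data.jsonl')
append_jsonl(demo_path, [{'title': 'First', 'link': '/a', 'date': '2024-01-01'}])
append_jsonl(demo_path, [{'title': 'Zweite Seite', 'link': '/b', 'date': '2024-01-02'}])
loaded = read_jsonl(demo_path)
```

In the pagination loop, `append_jsonl` would be called once per page with that page's items.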
