# Python Web Scraping
Requests, BeautifulSoup, lxml, parsing HTML/JSON, headers, sessions, rate limiting.

Tags: python, scraping, crawler, backend
## Install
```bash
pip install requests beautifulsoup4 lxml httpx
```
## Basic fetch + parse
```python
import requests
from bs4 import BeautifulSoup
from lxml import etree

# Browser-like headers; some sites reject the default python-requests User-Agent
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'text/html,application/xhtml+xml,*/*',
}

# A Session reuses connections and keeps cookies between requests
session = requests.Session()
session.headers.update(HEADERS)

resp = session.get('https://example.com/articles', timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx
soup = BeautifulSoup(resp.text, 'lxml')

# CSS selectors
titles = [el.get_text(strip=True) for el in soup.select('h2.article-title')]

# XPath via lxml (@class="read-more" matches only when that is the whole class attribute)
tree = etree.HTML(resp.text)
links = tree.xpath('//a[@class="read-more"]/@href')
```
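The `href` values returned by the XPath query are often relative (for example `/articles/42`) and cannot be passed to `session.get` as they are. A minimal sketch of resolving them against the page URL with the standard library follows; the `absolutize` helper and the sample hrefs are hypothetical and only stand in for the `links` list.

```python
from urllib.parse import urldefrag, urljoin

def absolutize(page_url, hrefs):
    """Resolve hrefs against page_url, drop #fragments, de-duplicate in order."""
    seen = set()
    out = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        # Keep only fetchable web links (skips mailto:, javascript:, etc.)
        if absolute.startswith(('http://', 'https://')) and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out

# Hypothetical hrefs standing in for the `links` list from the XPath query
sample_links = [
    '/articles/42',
    'post.html#comments',
    'https://example.com/articles/42',
    'mailto:a@example.com',
]
print(absolutize('https://example.com/articles', sample_links))
```

De-duplicating here keeps the crawler from requesting the same page twice when a site links to it in both relative and absolute form.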
## Pagination loop
```python
import random
import time

base_url = 'https://example.com/news?page={}'
results = []

for page in range(1, 20):  # pages 1 through 19
    resp = session.get(base_url.format(page), timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'lxml')
    items = soup.select('.item')
    if not items:  # an empty page means we ran past the last one
        break
    for item in items:
        results.append({
            'title': item.select_one('h3').get_text(strip=True),
            'link': item.select_one('a')['href'],
            'date': item.select_one('.date').get_text(strip=True),
        })
    time.sleep(random.uniform(1, 3))  # polite delay
```
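A random sleep after every page works, but it always waits even when parsing already took longer than the delay. One possible alternative is a limiter that enforces a minimum interval between requests. The sketch below is stdlib-only; the `MinIntervalLimiter` class and its parameters are hypothetical names for illustration, not part of `requests`.

```python
import time

class MinIntervalLimiter:
    """Block until at least `interval` seconds have passed since the previous call."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

limiter = MinIntervalLimiter(interval=0.05)
start = time.monotonic()
for page in range(1, 4):
    limiter.wait()  # call this right before each session.get(...)
elapsed = time.monotonic() - start
print(f'{elapsed:.2f}s for 3 iterations')
```

In the pagination loop, `limiter.wait()` would go immediately before `session.get(...)` in place of the trailing `time.sleep(...)`.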
## JSON API scraping
```python
params = {'limit': 100, 'offset': 0}
resp = session.get('https://api.example.com/data', params=params, timeout=10)
resp.raise_for_status()
data = resp.json()  # decode the JSON body into Python objects
items = data['results']
```
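The request above returns only the first 100 records. A sketch of walking the remaining offsets is shown below. It assumes the API signals the last page by returning fewer than `limit` items under the `results` key, which is a guess about this hypothetical endpoint; `iter_offset_pages` and `fake_fetch` are illustrative names.

```python
def iter_offset_pages(fetch_page, limit=100, max_pages=50):
    """Yield items from an offset-paginated API until a short or empty page."""
    offset = 0
    for _ in range(max_pages):      # hard cap so a misbehaving API cannot loop forever
        data = fetch_page(limit=limit, offset=offset)
        items = data.get('results', [])
        yield from items
        if len(items) < limit:      # short page: assumed to be the last one
            break
        offset += limit

# Stand-in for the real request. With requests this could be:
#   lambda **p: session.get('https://api.example.com/data', params=p, timeout=10).json()
FAKE_ROWS = list(range(250))

def fake_fetch(limit, offset):
    return {'results': FAKE_ROWS[offset:offset + limit]}

rows = list(iter_offset_pages(fake_fetch, limit=100))
print(len(rows))
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.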
## Async scraping (httpx + asyncio)
```python
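import asyncio
import httpx

async def fetch(client, url):
    resp = await client.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

async def main(urls):
    async with httpx.AsyncClient(headers=HEADERS, follow_redirects=True) as client:
        tasks = [fetch(client, url) for url in urls]
        # return_exceptions=True puts failures in the result list instead of raising
        pages = await asyncio.gather(*tasks, return_exceptions=True)
    return pages

url_list = ['https://example.com/articles', 'https://example.com/news?page=1']  # example URLs
pages = asyncio.run(main(url_list))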
```
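`asyncio.gather` starts every request at once, which can overwhelm a site when the URL list is long, and with `return_exceptions=True` the result list mixes page text with exception objects. One way to handle both is sketched below: a semaphore caps the number of requests in flight and the results are split into successes and failures. `gather_limited` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for `fetch(client, url)` so the sketch runs without network access.

```python
import asyncio

async def gather_limited(coro_fn, urls, max_concurrency=5):
    """Run coro_fn(url) for every URL with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with sem:
            return await coro_fn(url)

    results = await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
    # Split successes from failures so callers never mistake an exception for page text
    ok = {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, Exception)}
    return ok, failed

# Hypothetical stand-in for fetch(client, url)
async def fake_fetch(url):
    await asyncio.sleep(0.01)
    if url.endswith('/bad'):
        raise RuntimeError('simulated failure')
    return f'<html>{url}</html>'

ok, failed = asyncio.run(
    gather_limited(fake_fetch, ['https://example.com/a', 'https://example.com/bad'], max_concurrency=2)
)
print(sorted(ok), sorted(failed))
```

With httpx, `coro_fn` could be `lambda url: fetch(client, url)` inside the `async with httpx.AsyncClient(...)` block.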
## Anti-block techniques
```python
import random
from urllib.robotparser import RobotFileParser

url = 'https://example.com/articles'  # page to fetch

# Rotate User-Agents (placeholders: use complete UA strings)
UA_LIST = ['Mozilla/5.0 ...', 'Chrome/120 ...', 'Safari/17 ...']
session.headers['User-Agent'] = random.choice(UA_LIST)

# Rotate proxies (placeholders: use real host:port values)
PROXIES = ['http://proxy1:port', 'http://proxy2:port']
proxy = random.choice(PROXIES)
resp = session.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)

# Respect robots.txt
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
if rp.can_fetch('*', url):
    resp = session.get(url, timeout=10)
```
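Besides allow/deny rules, a robots.txt file may declare a `Crawl-delay`, which `RobotFileParser` exposes through `crawl_delay()`. The sketch below parses an inline robots.txt so it runs offline; the file contents and the `my-scraper` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would use set_url() and read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch('my-scraper', 'https://example.com/articles'))   # allowed path
print(rp.can_fetch('my-scraper', 'https://example.com/private/x'))  # disallowed path

# crawl_delay() returns None when the file declares no delay for this agent
delay = rp.crawl_delay('my-scraper') or 1
print(delay)
```

The resulting `delay` could then replace the fixed range passed to `time.sleep` between requests.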
## Save data
```python
import csv
import json

# JSON
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

# CSV (newline='' prevents blank rows on Windows)
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link', 'date'])
    writer.writeheader()
    writer.writerows(results)
```