Spiders, selectors, pipelines, middlewares, crawl rules, distributed crawling.
# Scrapy Framework
## Install & create project
```bash
pip install scrapy
scrapy startproject myspider
cd myspider
scrapy genspider articles example.com
```
## Project structure
```
myspider/
    spiders/
        articles.py     # spider logic
    items.py            # data model
    pipelines.py        # data processing
    middlewares.py      # request/response hooks
    settings.py         # config
```
## Spider
```python
import scrapy


class ArticleItem(scrapy.Item):  # usually defined in items.py
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()


class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/news']

    def parse(self, response):
        for card in response.css('.article-card'):
            href = card.css('a::attr(href)').get()
            if href:  # skip cards without a link
                yield response.follow(href, callback=self.parse_article)

        # Pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        yield ArticleItem(
            # default='' avoids AttributeError when no <h1> is found
            title=response.css('h1::text').get(default='').strip(),
            url=response.url,
            date=response.css('.date::text').get(),
            content=' '.join(response.css('.content p::text').getall()),
        )
```
## CrawlSpider (link rules)
```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BlogSpider(CrawlSpider):
    name = 'blog'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = [
        # Post pages: scrape with parse_post (links on them are not followed)
        Rule(LinkExtractor(allow=r'/blog/\d+'), callback='parse_post'),
        # Listing pages: no callback, just keep following links
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    ]

    def parse_post(self, response):
        yield {'title': response.css('h1::text').get(), 'url': response.url}
```
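The `allow` patterns are regular expressions searched within the absolute URL, and a link is handled by the first rule that extracts it. The sketch below approximates that matching with the standard `re` module so the patterns can be checked before a crawl; `RULE_PATTERNS` and `matching_rule` are hypothetical names, and this is not how Scrapy itself is structured internally.

```python
import re

# Same patterns as the rules above; the labels exist only for this sketch.
RULE_PATTERNS = [
    ('parse_post', re.compile(r'/blog/\d+')),
    ('follow', re.compile(r'/page/\d+')),
]


def matching_rule(url):
    """Return the label of the first pattern found anywhere in the URL, or None."""
    for label, pattern in RULE_PATTERNS:
        if pattern.search(url):
            return label
    return None
```

Because the patterns are unanchored, `https://example.com/blog/42/comments` also matches the post rule; add `$` or a stricter pattern if only `/blog/42` should be scraped.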
## Pipeline (save to DB/file)
```python
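# pipelines.py
import json

import psycopg2  # assumed driver; the original does not name one


class JsonPipeline:
    """Write each item as one JSON object per line."""

    def open_spider(self, spider):
        self.file = open('output.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()


class PostgresPipeline:
    def open_spider(self, spider):
        # POSTGRES_DSN is a hypothetical setting, e.g. 'dbname=news user=scrapy'
        self.conn = psycopg2.connect(spider.settings.get('POSTGRES_DSN'))
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO articles (title, url, date) VALUES (%s, %s, %s)',
            (item['title'], item['url'], item['date'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()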
```
```python
# settings.py
ITEM_PIPELINES = {'myspider.pipelines.JsonPipeline': 300}
```
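Pipelines run in ascending order of their `ITEM_PIPELINES` number, so a cleaning step can sit in front of the storage pipelines. The following is a rough sketch of such a step; `CleanArticlePipeline` and its `date_format` are assumptions for illustration, since the source does not say how dates appear on the page.

```python
from datetime import datetime


class CleanArticlePipeline:
    """Hypothetical pipeline: tidies text fields and normalizes `date` to ISO format."""

    # Assumed input format for the scraped date, e.g. "March 5, 2024".
    date_format = '%B %d, %Y'

    def process_item(self, item, spider):
        for field in ('title', 'content'):
            value = item.get(field)
            if isinstance(value, str):
                # Collapse runs of whitespace left over from the HTML.
                item[field] = ' '.join(value.split())

        raw_date = item.get('date')
        if raw_date:
            try:
                parsed = datetime.strptime(raw_date.strip(), self.date_format)
                item['date'] = parsed.date().isoformat()
            except ValueError:
                # Leave unparseable dates untouched rather than guessing.
                pass
        return item
```

To use it, an entry such as `'myspider.pipelines.CleanArticlePipeline': 100` could be added to `ITEM_PIPELINES`, so it runs before `JsonPipeline` at 300.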
## Settings
```python
# settings.py
DOWNLOAD_DELAY = 1 # seconds between requests
CONCURRENT_REQUESTS = 8
ROBOTSTXT_OBEY = True
DEFAULT_REQUEST_HEADERS = {'Accept-Language': 'zh-CN,zh;q=0.9'}
USER_AGENT = 'Mozilla/5.0 ...'
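# Rotating proxies (third-party package: pip install scrapy-rotating-proxies)
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
ROTATING_PROXY_LIST = ['http://proxy1:8080', 'http://proxy2:8080']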
```
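`DOWNLOADER_MIDDLEWARES` also accepts project-specific classes from `middlewares.py`. As a minimal sketch of the request hook, the hypothetical `RandomUserAgentMiddleware` below sets a different `User-Agent` per request; the agent strings are placeholders, not real browser values.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: picks a User-Agent per request."""

    # Placeholder strings; substitute real browser User-Agent values.
    user_agents = [
        'example-agent/1.0',
        'example-agent/2.0',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
        # Returning None tells Scrapy to continue processing this request.
        return None
```

It would be registered by adding an entry such as `'myspider.middlewares.RandomUserAgentMiddleware': 400` to the existing `DOWNLOADER_MIDDLEWARES` dict. Any order below 500 runs it before the built-in `UserAgentMiddleware`, which only fills in the header when it is still unset.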
## Run
```bash
scrapy crawl articles # run spider
scrapy crawl articles -o output.json # save JSON (-o appends to an existing file; -O overwrites)
scrapy crawl articles -o output.csv # save CSV
scrapy shell 'https://example.com' # interactive debug
scrapy check articles # check contracts
```