Spiders, selectors, pipelines, middlewares, crawl rules, distributed crawling.

# Scrapy Framework

## Install & create project
```bash
pip install scrapy
scrapy startproject myspider
cd myspider
scrapy genspider articles example.com
```

## Project structure
```
myspider/
  scrapy.cfg           # project/deploy config
  myspider/            # Python package
    spiders/
      articles.py      # spider logic
    items.py           # data model
    pipelines.py       # data processing
    middlewares.py     # request/response hooks
    settings.py        # config
```

## Spider
```python
import scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/news']

    def parse(self, response):
        for card in response.css('.article-card'):
            href = card.css('a::attr(href)').get()
            if href:  # skip cards without a link; response.follow(None) raises ValueError
                yield response.follow(href, callback=self.parse_article)
        # Pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        yield ArticleItem(
            # default='' avoids AttributeError on .strip() when the page has no <h1>
            title=response.css('h1::text').get(default='').strip(),
            url=response.url,
            date=response.css('.date::text').get(),
            content=' '.join(response.css('.content p::text').getall()),
        )
```

## CrawlSpider (link rules)
```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = 'blog'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']
    rules = [
        Rule(LinkExtractor(allow=r'/blog/\d+'), callback='parse_post'),
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    ]

    def parse_post(self, response):
        yield {'title': response.css('h1::text').get(), 'url': response.url}
```
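
A quick way to sanity-check rule patterns before crawling is to run them against a few URLs by hand. The sketch below uses only the standard library and assumes, as `LinkExtractor` does, that `allow` patterns are searched anywhere in the URL rather than anchored, and that the first matching rule wins. The sample URLs and the `classify` helper are hypothetical.

```python
import re

# Same patterns as the two Rule definitions above.
POST_PATTERN = re.compile(r'/blog/\d+')
PAGE_PATTERN = re.compile(r'/page/\d+')


def classify(url: str) -> str:
    """Return which rule (if any) a URL would fall under, first match wins."""
    if POST_PATTERN.search(url):
        return 'parse_post'
    if PAGE_PATTERN.search(url):
        return 'follow'
    return 'ignored'


# Hypothetical URLs for illustration.
sample_urls = [
    'https://example.com/blog/42',
    'https://example.com/blog/42/comments',  # also matches: pattern is unanchored
    'https://example.com/page/3',
    'https://example.com/about',
]
results = {url: classify(url) for url in sample_urls}
```

If `/blog/42/comments` should not reach `parse_post`, anchoring the pattern (for example `r'/blog/\d+$'`) is one option.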

## Pipeline (save to DB/file)
```python
# pipelines.py
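import json

import psycopg2  # pip install psycopg2-binary

class JsonPipeline:
    def open_spider(self, spider):
        self.file = open('output.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

class PostgresPipeline:
    def open_spider(self, spider):
        # POSTGRES_DSN is a hypothetical custom setting, e.g. 'dbname=news user=scrapy'
        self.conn = psycopg2.connect(spider.settings.get('POSTGRES_DSN'))
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO articles (title, url, date) VALUES (%s, %s, %s)',
            (item['title'], item['url'], item['date'])
        )
        self.conn.commit()  # without a commit the insert is never persisted
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()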
```
```python
# settings.py
# Values set the order (0-1000); lower numbers run first
ITEM_PIPELINES = {'myspider.pipelines.JsonPipeline': 300}
```
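
`JsonPipeline` writes one JSON object per line, so the output can be read back with the standard library alone. A minimal sketch; the `load_items` helper is hypothetical and not part of Scrapy.

```python
import json
from pathlib import Path


def load_items(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    items = []
    with Path(path).open(encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items
```

After a crawl, `load_items('output.jsonl')` returns the scraped articles as plain dicts.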

## Settings
```python
# settings.py
DOWNLOAD_DELAY = 1          # seconds between requests to the same site
CONCURRENT_REQUESTS = 8
ROBOTSTXT_OBEY = True
DEFAULT_REQUEST_HEADERS = {'Accept-Language': 'zh-CN,zh;q=0.9'}
USER_AGENT = 'Mozilla/5.0 ...'
# Rotating proxies middleware (pip install scrapy-rotating-proxies)
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
ROTATING_PROXY_LIST = ['http://proxy1:8080', 'http://proxy2:8080']
```
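
Custom hooks in `middlewares.py` plug into the same `DOWNLOADER_MIDDLEWARES` setting. The sketch below is one possible downloader middleware that picks a random User-Agent per request. The class name and the `USER_AGENT_LIST` setting are assumptions made for illustration, not Scrapy built-ins.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical custom setting defined in settings.py.
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Leave the request untouched when no user agents are configured.
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing this request.
        return None
```

To enable it, add `'myspider.middlewares.RandomUserAgentMiddleware': 543` to `DOWNLOADER_MIDDLEWARES` and define `USER_AGENT_LIST` as a list of strings in `settings.py`.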

## Run
```bash
scrapy crawl articles                        # run spider
scrapy crawl articles -O output.json        # save JSON (-O overwrites; -o appends)
scrapy crawl articles -O output.csv         # save CSV
scrapy shell 'https://example.com'          # interactive debug
scrapy check articles                        # check contracts
```
