#56 Web scraping (part 1): Scrapy theory
Scrapy is a Python framework for web scraping. It provides a complete package, so developers can focus on extraction logic instead of writing and maintaining their own crawling code. Its main components are:
- Scrapy engine: controls the data flow between all components of the system and triggers events when certain actions occur.
- Scheduler: receives requests from the engine and enqueues them, then feeds them back to the engine when the engine asks for them.
- Spider: a custom class written by Scrapy users to parse responses and extract items (aka scraped items) from them, or additional URLs (requests) to follow. Each spider handles a specific domain (or group of domains).
- Pipeline: processes the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database).
- Middlewares (spider middlewares): specific hooks that sit between the Engine and the Spiders and are able to process spider input (responses) and output (items and requests). They provide a convenient mechanism for extending Scrapy functionality by plugging in custom code.
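To make the data flow between these components more concrete, here is a toy model in plain Python. It is not Scrapy's actual implementation and does not use the Scrapy API: `FAKE_WEB`, `run_engine` and the class bodies are hypothetical names invented for illustration, the "web" is a dictionary instead of real HTTP, and middlewares are left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
# This stands in for real downloads; no network access is involved.
FAKE_WEB = {
    "site/a": ("  Page A ", ["site/b", "site/c"]),
    "site/b": ("Page B", ["site/c"]),
    "site/c": ("Page C", []),
}


class Scheduler:
    """Queues requests from the engine and hands them back on demand."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url):
        if url not in self._seen:  # skip URLs that were already scheduled
            self._seen.add(url)
            self._queue.append(url)

    def next_request(self):
        return self._queue.popleft() if self._queue else None


class Spider:
    """User-written logic: turns a response into items and new requests."""

    start_urls = ["site/a"]

    def parse(self, url, response):
        title, links = response
        yield {"url": url, "title": title}  # a scraped item
        for link in links:
            yield link  # an additional URL (request) to follow


class Pipeline:
    """Cleans each item and persists it (here: an in-memory list)."""

    def __init__(self):
        self.stored = []

    def process_item(self, item):
        item["title"] = item["title"].strip().upper()  # cleansing step
        self.stored.append(item)  # persistence step


def run_engine(spider, scheduler, pipeline):
    """Engine: moves requests, responses and items between components."""
    for url in spider.start_urls:
        scheduler.enqueue(url)
    while True:
        url = scheduler.next_request()
        if url is None:
            break
        response = FAKE_WEB[url]  # stands in for downloading the page
        for result in spider.parse(url, response):
            if isinstance(result, dict):
                pipeline.process_item(result)  # items go to the pipeline
            else:
                scheduler.enqueue(result)  # requests go to the scheduler
    return pipeline.stored


items = run_engine(Spider(), Scheduler(), Pipeline())
for item in items:
    print(item)
```

In this sketch the engine is the only piece that talks to every other component, which mirrors the role described in the list above. In a real project you would only write the spider and the pipeline; the engine and scheduler are provided by Scrapy.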
Command line tool
Please refer to this document for more information.
The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider. It’s meant to be used for testing data extraction code, but you can actually use it for testing any kind of code as it is also a regular Python shell.
The shell is used for testing XPath or CSS expressions and seeing how they work and what data they extract from the web pages you’re trying to scrape. It allows you to interactively test your expressions while you’re writing your spider, without having to run the spider to test every change.
It is considered to be an invaluable tool for developing and debugging your spiders.
XPath and CSS expressions
XPath stands for XML Path Language. It uses a non-XML syntax to provide a flexible way of addressing (pointing to) different parts of an XML document. It can also be used to test addressed nodes within a document to determine whether they match a pattern or not. An XPath cheatsheet can be found here.
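As a small illustration of addressing nodes and testing them against a pattern, the sketch below uses Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath (Scrapy's own selectors support more). The `XML` sample document and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this example.
XML = """
<library>
  <book id="b1" lang="en"><title>Dune</title><year>1965</year></book>
  <book id="b2" lang="fr"><title>Candide</title><year>1759</year></book>
</library>
""".strip()

root = ET.fromstring(XML)

# Address nodes by path: every <title> that is a child of a <book>.
all_titles = [t.text for t in root.findall("./book/title")]

# Filter by attribute: books whose lang attribute equals "en".
english_ids = [b.get("id") for b in root.findall(".//book[@lang='en']")]

# Filter by position: the title of the second <book> (positions start at 1).
second_title = root.find("./book[2]/title").text

# Test against a pattern: is there a book whose <year> child is "1759"?
has_1759 = root.find(".//book[year='1759']") is not None

print(all_titles, english_ids, second_title, has_1759)
```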
A CSS selector selects the HTML element(s) to which a style applies. CSS selectors pick HTML elements according to their id, class, type, attributes, etc.
There are several basic types of selectors:
- Element Selector
- Id Selector
- Class Selector
- Universal Selector
- Group Selector
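To show what each of these five selector types matches, here is a toy matcher built on the standard library's `html.parser`. It is a deliberately simplified sketch, not a real CSS engine: it handles only the five basic forms listed above (no combinators, attribute selectors or pseudo-classes), and `TagCollector`, `matches_simple`, `select` and the `HTML` sample are hypothetical names invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this example.
HTML = """
<div id="main" class="box wide">
  <h1 class="title">Hello</h1>
  <p class="note">First</p>
  <p>Second</p>
</div>
"""


class TagCollector(HTMLParser):
    """Records (tag, attributes) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def matches_simple(selector, tag, attrs):
    """Match one basic selector: '*', '#id', '.class' or an element name."""
    if selector == "*":  # universal selector
        return True
    if selector.startswith("#"):  # id selector
        return attrs.get("id") == selector[1:]
    if selector.startswith("."):  # class selector
        return selector[1:] in (attrs.get("class") or "").split()
    return tag == selector  # element selector


def select(selector, elements):
    """Group selector: 'a, b' matches anything matched by a or by b."""
    parts = [p.strip() for p in selector.split(",")]
    return [
        tag
        for tag, attrs in elements
        if any(matches_simple(p, tag, attrs) for p in parts)
    ]


collector = TagCollector()
collector.feed(HTML)
elements = collector.elements

by_element = select("p", elements)            # element selector
by_id = select("#main", elements)             # id selector
by_class = select(".note", elements)          # class selector
by_universal = select("*", elements)          # universal selector
by_group = select("h1, .wide", elements)      # group selector

print(by_element, by_id, by_class, by_universal, by_group)
```

In an actual spider you would not write a matcher like this yourself; Scrapy's selectors accept CSS expressions directly.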
In the next part, let’s take a closer look at creating a spider :D