#58 Web scraping (part 2): Spider generation
In this part, let’s generate a simple dataset from a website by creating a spider.
Create a virtual environment
- Make sure Anaconda, Python, and Visual Studio Code are installed on your local machine.
- Create a virtual environment in Anaconda Navigator.
- In the terminal, type these commands:
```
conda install -c anaconda scrapy
cd <path where you want to save this project folder>
mkdir projects
cd projects
scrapy startproject worldometers
cd worldometers
scrapy genspider countries www.worldometers.info/world-population/population-by-country/
```
- Go to the “projects” folder you just created, open the “spiders” folder, and open the “countries.py” file that you just generated.
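At this point, “countries.py” contains only the skeleton that scrapy genspider produces, roughly like this (the exact template varies slightly between Scrapy versions):

```python
import scrapy


class CountriesSpider(scrapy.Spider):
    name = "countries"
    allowed_domains = ["www.worldometers.info"]
    start_urls = ["https://www.worldometers.info/world-population/population-by-country/"]

    def parse(self, response):
        pass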
Create a spider to get each country’s name, year, and population
Inside the “countries.py” file:
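A minimal sketch of what that spider might look like: it collects each country link from the main table, follows it, and yields the year and population rows from that country’s page. The XPath expressions are assumptions based on the page’s table layout and may need adjusting if worldometers.info changes its markup:

```python
import scrapy


class CountriesSpider(scrapy.Spider):
    name = "countries"
    allowed_domains = ["www.worldometers.info"]
    start_urls = ["https://www.worldometers.info/world-population/population-by-country/"]

    def parse(self, response):
        # Each country name in the main table is a link to its own page.
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            # Follow the relative link and pass the country name along.
            yield response.follow(
                url=link, callback=self.parse_country, meta={"country_name": name}
            )

    def parse_country(self, response):
        name = response.request.meta["country_name"]
        # The country page shows its population history as a table,
        # one row per year (assumed layout).
        rows = response.xpath("(//table)[1]/tbody/tr")
        for row in rows:
            yield {
                "country_name": name,
                "year": row.xpath(".//td[1]/text()").get(),
                "population": row.xpath(".//td[2]//text()").get(),
            }
```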
Run in the terminal, from the folder that contains the “countries.py” file:

```
scrapy crawl countries
```
To save the scraped results to a CSV file (JSON or XML also work, depending on your preference):

```
scrapy crawl countries -o population_by_countries.csv
```
Debug the spider
There are four main methods:
- Parse command: The most basic way to check a spider’s output is to use the parse command. It allows checking the behavior of different parts of the spider at the method level. It has the advantage of being flexible and easy to use, but it does not allow debugging code within a method (see the commands after this list).
- Scrapy shell: The Scrapy shell is a full-featured Python shell loaded with the same context you would get in your spider’s callback methods. You just have to provide a URL, and the shell lets you interact with the same objects your spider handles in its callbacks, including the response object. While the parse command is very useful for checking the behavior of a spider, it cannot show what is happening inside a callback beyond displaying the response and output received; this is where the Scrapy shell comes into action.
- Open in browser: Sometimes you just want to see how a response looks in the browser; for this you can use the open_in_browser function (see the Python sketch after this list).
- Logging: Logging is another useful option for obtaining information about how the spider is operating. Although less convenient, it has the advantage that if the logs are needed again in the future, they will be available in all future runs.
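For example, to test the countries spider’s parse callback against the page it scrapes, or to explore the same response interactively, you can use these standard Scrapy commands (the spider and callback names match the ones generated above):

```
# Check what the parse() callback yields for a given URL
scrapy parse --spider=countries -c parse "https://www.worldometers.info/world-population/population-by-country/"

# Drop into an interactive shell with the response already fetched
scrapy shell "https://www.worldometers.info/world-population/population-by-country/"
```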
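And here is a minimal sketch of how open_in_browser and logging could be used inside a callback; the spider and the log message below are hypothetical and only illustrate where the calls go:

```python
import scrapy
from scrapy.utils.response import open_in_browser


class DebugDemoSpider(scrapy.Spider):
    # Hypothetical spider, used only to illustrate the two debugging helpers.
    name = "debug_demo"
    start_urls = ["https://www.worldometers.info/world-population/population-by-country/"]

    def parse(self, response):
        # Opens the response Scrapy actually received in your default browser,
        # useful for spotting differences from what you see when browsing yourself.
        open_in_browser(response)

        # Every spider has a built-in logger; messages written here go to the
        # crawl log, so they stay available in all future runs.
        self.logger.info("Parsed %s with status %s", response.url, response.status)
```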
NOTE: The thing about spiders is that there is a high chance the spider you build today will stop working in the future, because websites change their structure over time.