#58 Web scraping (part 2): Spider generation

Hang Nguyen
3 min read · Jun 23, 2022


In this part, let’s generate a simple dataset from a website by creating a spider.

Create a virtual environment

  • Make sure Anaconda, Python, and Visual Studio Code are downloaded to your local machine.
  • Create a virtual environment in Anaconda Navigator (a terminal alternative is sketched after the commands below)
  • In the terminal, type these commands:

conda install -c anaconda scrapy

<go to the path where you want to save this project folder> mkdir projects

cd projects

scrapy startproject worldometers

cd worldometers

scrapy genspider countries www.worldometers.info/world-population/population-by-country/
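If you prefer creating the virtual environment from the terminal rather than Anaconda Navigator, a minimal sketch (the environment name scrapy_env is an assumption):

conda create -n scrapy_env python

conda activate scrapy_env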

  • Go to the folder where you saved “projects”, open the “spiders” folder, and open the “countries.py” file that you just generated above.

Create a spider to get each country’s name, year, and population

Inside the “countries.py” file:
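A minimal sketch of what the spider can look like, assuming the country names sit in td/a links on the main page and each country page leads with a historical-population table; the XPath expressions are assumptions and may need adjusting if the site’s markup changes:

import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    # genspider copies the full URL here; trim allowed_domains to just the domain
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # Each country in the main table is an <a> tag inside a <td>
        for country in response.xpath('//td/a'):
            name = country.xpath('.//text()').get()
            link = country.xpath('.//@href').get()
            # Follow the link to the country page, carrying the name along
            yield response.follow(url=link, callback=self.parse_country,
                                  cb_kwargs={'country_name': name})

    def parse_country(self, response, country_name):
        # Skip the header row of the first table (assumed to be population by year)
        rows = response.xpath('(//table)[1]//tr[position() > 1]')
        for row in rows:
            yield {
                'country_name': country_name,
                'year': row.xpath('.//td[1]/text()').get(),
                'population': row.xpath('.//td[2]/text()').get(),
            }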

Run in the terminal:

<from the folder that contains the “countries.py” file> scrapy crawl countries

Save the scraped results to a CSV file (JSON or XML also work, according to your own preference):

scrapy crawl countries -o population_by_countries.csv
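Scrapy infers the output format from the file extension, so switching to JSON, for example, is just a matter of renaming the output file:

scrapy crawl countries -o population_by_countries.json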

Debug the spider

There are four main methods (short examples follow the list):

  • Parse command: The most basic way to check spider output is to use the parse command. It lets you check the behavior of different parts of the spider at the method level. It has the advantage of being flexible and easy to use, but it does not allow debugging code inside a method.
  • Scrapy shell: The Scrapy shell is a full-featured Python shell loaded with the same context you would get in your spider callback methods. You just have to provide a URL, and the Scrapy shell lets you interact with the same objects your spider handles in its callbacks, including the response object. Although the parse command is very useful for checking the behavior of the spider, it cannot show what is happening inside a callback beyond displaying the response and the output received. This is where the Scrapy shell comes into action.
  • Open in browser: Sometimes you just want to see how a response looks in the browser; for that, you can use the open_in_browser function.
  • Logging: Logging is another useful option for obtaining information about how the spider operates. Although less convenient, it has the advantage that if the logs are needed again, they will be available in all future runs.
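Short illustrations of each option, reusing the spider and URL from above (the XPath below is an assumption):

scrapy parse https://www.worldometers.info/world-population/population-by-country/ -c parse --spider=countries

scrapy shell "https://www.worldometers.info/world-population/population-by-country/"

Inside the shell you can try selectors interactively, e.g. response.xpath('//td/a/text()').get(). For the last two options, a self-contained sketch (the spider name debug_demo is hypothetical):

import scrapy
from scrapy.utils.response import open_in_browser


class DebugSpider(scrapy.Spider):
    name = 'debug_demo'  # hypothetical spider, just to demonstrate the two helpers
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        open_in_browser(response)  # opens the downloaded response in your default browser
        self.logger.info('Crawled %s', response.url)  # every spider has a built-in logger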

NOTE: The thing about spiders is that there is a high chance that the spider you have built today will no longer work in the future…
