#59 Web scraping (part 3): Splash

Hang Nguyen
3 min read · Jun 23, 2022

Brief introduction

Splash is a lightweight, scriptable headless browser with an HTTP API. It renders JavaScript-heavy pages so that crawlers can scrape the resulting content. It is open source and integrates with Scrapy and Portia.

Some of its features include:

  • process multiple webpages in parallel;
  • get HTML results and/or take screenshots;
  • turn OFF images or use Adblock Plus rules to make rendering faster;
  • execute custom JavaScript in page context;
  • write Lua browsing scripts;
  • develop Splash Lua scripts in Splash-Jupyter Notebooks;
  • get detailed rendering info in HAR format.
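
To make the "get HTML results" and "turn OFF images" features concrete, here is a minimal sketch that builds a request URL for Splash's render.html endpoint. It assumes a Splash instance listening on localhost:8050 (the default port); the helper name splash_render_url and the target https://example.com are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      wait=0.5, load_images=False):
    """Build a render.html URL (hypothetical helper, not part of Splash)."""
    params = {
        "url": target_url,                  # page Splash should load and render
        "wait": wait,                       # seconds to wait after the page loads
        "images": 1 if load_images else 0,  # 0 skips images for faster rendering
    }
    return f"{splash_base}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed local Splash instance; fetching this URL returns the rendered HTML.
    print(splash_render_url("https://example.com"))
```

Fetching the resulting URL with any HTTP client should return the page's HTML after JavaScript has run, provided Splash is running at that address.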

Installation

Make sure that Docker Desktop is available on your local machine.

Please refer to the Splash installation documentation for more information.
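
Once the Splash container is running, a quick check like the sketch below can confirm that it is reachable before you start scraping. It assumes the container publishes Splash on localhost:8050, as the Splash documentation describes for the scrapinghub/splash image, and it uses Splash's _ping endpoint. The function name splash_is_up is hypothetical.

```python
import http.client
import urllib.request


def splash_is_up(splash_base="http://localhost:8050", timeout=2.0):
    """Return True if a Splash instance answers its _ping endpoint (hypothetical helper)."""
    try:
        with urllib.request.urlopen(f"{splash_base}/_ping", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply: treat as "not running".
        return False


if __name__ == "__main__":
    # Assumed address; change it if the container maps a different port.
    print("Splash reachable:", splash_is_up())
```

If this prints False, check that Docker Desktop is running and that the container's port is published to the host.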

A small and fun practical project

REQUIREMENT: Make sure you have Anaconda Navigator and Visual Studio Code installed on your local machine.

BIG STEP: Create a new virtual environment in Anaconda Navigator, then open a terminal. Create a new project directory and, inside it, create a new Scrapy project named “ABC” with startproject and…
