
#69 PySpark in action with MongoDB

Hang Nguyen
2 min read · Jul 24, 2022


After learning Spark and MongoDB separately, how about connecting the two to do some data analysis on a large amount of data (around 4 GB)?

It’s absolutely possible to do so :)

Prerequisites

In order to make this happen, you should have Spark installed on your local machine. Tutorials on setting up Spark are easy to find on the internet, whatever your operating system, so I will skip that part and let you explore it yourself. I myself use Spark 3.3.0, the newest version available at the time of writing.

Besides that, make sure to have MongoDB installed as well. This is crucial. I also use MongoDB Compass for a better UI experience, but it is perfectly fine to use just mongosh to query the database.

And last but not least, PySpark should also be installed, preferably the same version as Spark.
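
If you want to confirm that the PySpark and Spark versions actually line up, a quick check from Python is enough. This is just a minimal sketch; it only assumes the pyspark package is importable:

```python
# Print the installed PySpark version -- it should match your local Spark installation
import pyspark

print(pyspark.__version__)  # e.g. '3.3.0'
```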

I almost forgot Jupyter Notebook: make sure it is installed and running as well (it serves on port 8888 by default).

Let’s make it happen

You first need to import everything, literally everything, into your Jupyter Notebook.
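
The exact list depends on what you do later in the notebook, so treat the following as a sketch of a typical starting point rather than the definitive set of imports:

```python
# Typical imports for a PySpark notebook -- adjust to what your analysis needs
import os                                # for setting environment variables from Python
import findspark                         # helps the notebook locate the local Spark installation
from pyspark.sql import SparkSession     # entry point for DataFrame-based Spark
from pyspark.sql import functions as F   # column functions for transformations and aggregations
```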

Then, make sure to declare your SPARK_HOME, JAVA_HOME and HADOOP_HOME as they are set in your own local environment.
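
In a notebook cell this can be done with os.environ before Spark is started. The paths below are placeholders for illustration; replace them with wherever Spark, Java and Hadoop actually live on your machine:

```python
import os
import findspark

# Point the notebook at your local installations -- the paths are examples only
os.environ["SPARK_HOME"] = "/opt/spark-3.3.0"             # your Spark installation directory
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk"  # your Java installation directory
os.environ["HADOOP_HOME"] = "/opt/hadoop"                 # your Hadoop directory (winutils on Windows)

# With SPARK_HOME set, findspark makes the matching pyspark installation importable
findspark.init()
```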
