#65 How to connect PySpark to MongoDB & eliminate the bug of not finding source “mongo”

Hang Nguyen
2 min read · Jul 11, 2022


The thing with Spark is that it gets updated all the time: new versions, new features, and new configurations for connecting to other sources, because those sources keep getting updated too.

While trying to connect PySpark with MongoDB, I ran into the problem of not finding the source “mongo” several times, and so have several developers on Stack Overflow. Luckily, I got past it after tons of trial and error. That motivates me to share the knowledge with you, and it also acts as a reminder for my future self: technology changes so fast that what you know today may not apply seamlessly even two years from now. Being a developer means constantly updating your knowledge of both new technology and updates to the technology you already use.

So, what are we waiting for? Let’s get started!

Requirements (strict)

  • Latest version of MongoDB, plus MongoDB Compass for UI purposes.
  • Latest version of Spark; in this case I use PySpark.
  • Latest version of Python and Java on your local machine.
  • Remember to add SPARK_HOME, HADOOP_HOME, and JAVA_HOME to your environment variables and append them to the Path, following https://phoenixnap.com/kb/install-spark-on-windows-10 (skip steps 4, 5, and 8); a quick sanity check is sketched right after this list.
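
To confirm those variables are actually visible, here is a minimal sketch in Python; the paths in the comments are only hypothetical examples of what the values might look like on a Windows machine.

import os

# Check that the Spark-related environment variables are set.
# Hypothetical example values (yours will differ):
#   SPARK_HOME  -> C:\spark\spark-3.3.0-bin-hadoop3
#   HADOOP_HOME -> C:\hadoop
#   JAVA_HOME   -> C:\Program Files\Java\jdk-17
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var, "NOT SET"))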

Note: Open Anaconda Prompt and run the command “pyspark” to check whether PySpark was installed successfully.
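
You can also do the same check from a Python shell instead of the Anaconda Prompt; this short sketch just confirms that the package imports and prints its version.

import pyspark

# If this import fails, PySpark is not installed in the current environment.
print(pyspark.__version__)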

Let’s connect

Note: The answers on Stack Overflow are quite outdated. Not having “mongo” as a source could perhaps be solved by following those instructions two years ago, but at least for now they do not work for me. Lesson learnt: sometimes you have to update to the latest versions to avoid uncomfortable errors.

pip install pyspark

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext, functions as F
from pyspark.sql.functions import *
# create a spark session
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("ABC") \
    .config("spark.driver.memory", "15g") \
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017/cool") \
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017/cool") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.2') \
    .getOrCreate()

# read data from mongodb collection "questions" into a dataframe "df"
df = spark.read \
    .format("mongodb") \
    .option("uri", "mongodb://localhost:27017/cool") \
    .option("database", "cool") \
    .option("collection", "questions") \
    .load()

df.printSchema()
  • Make sure to use the latest MongoDB Spark connector (version 10.0 or above) so it works harmoniously with the latest MongoDB release; connector versions 3.0 and below are obsolete and can lead to a situation where the dataframe can sometimes be read and sometimes cannot. A quick way to double-check the registered package is sketched right after the snippet below.
.config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017/cool") \.config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017/cool") \.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.2') \
  • The source name is no longer “mongo” but “mongodb”. You also need to specify the database and collection options explicitly; without them, Spark raises an error about the missing database name. (Writing works through the same source; see the sketch after the snippet below.)
df = spark.read \
    .format("mongodb") \
    .option("uri", "mongodb://localhost:27017/cool") \
    .option("database", "cool") \
    .option("collection", "questions") \
    .load()
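
Since the session also configures spark.mongodb.write.connection.uri, writing a dataframe back goes through the same "mongodb" source. The sketch below reuses the spark session and df from above; the target collection name "questions_copy" is only a made-up example.

# Write the dataframe back to MongoDB through the same "mongodb" source.
# The connection URI comes from spark.mongodb.write.connection.uri on the session;
# "questions_copy" is only a hypothetical target collection.
df.write \
    .format("mongodb") \
    .option("database", "cool") \
    .option("collection", "questions_copy") \
    .mode("append") \
    .save()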


Written by Hang Nguyen

Just sharing (data) knowledge
