#106 Databricks notebook with pip install

3 min read · Oct 18, 2024

I could not believe that a command as simple as pip install could cause this much trauma in my data engineering career. Period.

We all know that in order to use a Python package in a Databricks notebook, we are often prompted to use %pip install. This is true in theory, but in practice we can get errors from placing it in the wrong place. Also, with the Serverless feature in Databricks, it can be a different story. Let's dive right into the cautions for using pip install in a Databricks notebook, so it won't cause you the kind of trauma I experienced.

Difference between !pip install and %pip install

You may be confused about which symbol to put before pip install, since a single character does make a difference. Luckily, we usually opt for just one.

The preferred option in a Databricks notebook is %pip install, since it is guaranteed to install packages into the Python environment of the running notebook kernel. Meanwhile, !pip install runs pip as a shell command, which does not always interact with the kernel's interpreter, depending on the configuration.

An example of use is as follows:

%pip install pandas
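To see why the distinction matters, you can compare the interpreter the notebook kernel is actually running against whatever `pip` resolves to on the shell's PATH. This is a minimal sketch for illustration; on some clusters the two may coincide, on others they may not:

```python
import shutil
import sys

# %pip always installs into the environment of this interpreter:
kernel_python = sys.executable

# A bare `!pip` resolves pip on the shell's PATH, which is not
# guaranteed to belong to the same interpreter as the kernel.
shell_pip = shutil.which("pip")

print("kernel interpreter:", kernel_python)
print("shell pip on PATH: ", shell_pip)
```

If the two paths point to different environments, a package installed with `!pip` can be invisible to your notebook's `import` statements.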

What can go wrong with %pip install

It is okay to place this %pip install command in the same Python code cell as other commands when you simply run the notebook yourself in Databricks. This won't cause any harm or errors when you want to use a specific Python package within your notebook. However, this is not the case when you run the notebook as an activity in an ADF (Azure Data Factory) pipeline. It will reliably throw an error when the command shares a code cell with other commands.

When you receive this kind of error, your intuition may lead you to check whether the package has been installed on the cluster, or whether something is wrong with the package itself. You won't immediately suspect that it has to do with where you placed the %pip install command, will you?

You must place this %pip install command in a separate code cell in your Databricks notebook. Why? In Databricks, a code cell runs in the current kernel's state. If you try to install a package and use it in the same execution context, the kernel may need to…
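The safe pattern this advice points to can be sketched as separate notebook cells. This is an illustrative layout, not the author's exact code; `dbutils.library.restartPython()` is Databricks' documented utility for restarting the Python process so the freshly installed packages are picked up by later cells:

```python
# Cell 1: only the install magic, nothing else in this cell
%pip install pandas

# Cell 2: restart the Python process so subsequent cells
# see the newly installed package versions
dbutils.library.restartPython()

# Cell 3: now import and use the package as usual
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
```

Keeping the install magic alone in its own cell is what makes the notebook behave the same whether it is run interactively or triggered as an ADF pipeline activity.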


Written by Hang Nguyen

Just sharing (data) knowledge

No responses yet