#106 Databricks notebook with pip install
I could not believe my eyes that a command as simple as pip install could cause this much trauma in my data engineering career. Period.
We all know that in order to use a Python package in a Databricks notebook, we are often prompted to use %pip install. This is true in theory, but in practice we can get errors just from placing it in the wrong place. And with the Serverless feature in Databricks, it can be a whole different story. Let's dive right into the cautions for using pip install in a Databricks notebook, so it won't cause you the kind of trauma I experienced.
Difference between !pip install and %pip install
You may be unsure which symbol to put before pip install, since that single character does make a difference. Lucky for you, we usually opt for just one.
The preferred option in a Databricks notebook is %pip install, since it is guaranteed to install packages into the Python environment of the running notebook kernel. Meanwhile, !pip install offers no such guarantee: depending on the configuration, it does not always interact with the kernel's environment.
Here is an example:
%pip install pandas
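As an aside, if you ever need to trigger an install from plain Python code rather than a magic command, one common workaround (a minimal sketch of my own, not something Databricks-specific, with pandas as a stand-in package) is to run pip against the kernel's own interpreter, which closes exactly the ambiguity !pip leaves open:

import subprocess
import sys

# sys.executable is the Python binary backing this notebook kernel, so the
# package lands in the environment the kernel actually imports from. !pip,
# by contrast, resolves pip from the shell PATH, which may belong to a
# different environment entirely.
subprocess.check_call([sys.executable, "-m", "pip", "install", "pandas"])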
What can go wrong with this %pip install
It is okay to place the %pip install command together with other commands in the same Python code cell when you run the notebook yourself, interactively in Databricks. That won't cause any harm or errors when you want to use a specific Python package within your notebook. However, it is a different story once the notebook runs as an activity in an Azure Data Factory (ADF) pipeline: there, placing the command in the same code cell as other commands will throw an error for sure.
When you receive this kind of error, your intuition may guide you to check whether the package has been installed on the cluster, or whether something is wrong with the package itself. You won't immediately figure out that it has to do with where you placed the %pip install command, will you?
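To make the failure concrete, here is a sketch of the problematic cell layout (pandas is only an example package):

# --- One cell containing everything: the anti-pattern ---
%pip install pandas
import pandas as pd       # same cell as the magic command
print(pd.__version__)     # fine interactively, errors under an ADF-triggered run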
It is a must to place the %pip install command in a separate code cell in your Databricks notebook. Why? In Databricks, a code cell runs in the current kernel's state. If you try to install a package and use it in the same execution context, the kernel may need to restart to pick up the newly installed package, and that restart resets the state the rest of the cell depends on.
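The safe layout, then, is simply this (again with pandas as a stand-in):

# --- Cell 1: nothing but the install, so the interpreter can restart cleanly ---
%pip install pandas

# --- Cell 2: imports and the rest of your logic, in their own cell ---
import pandas as pd
print(pd.__version__)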