Clearing Output Data in Jupyter Notebooks using Pre-commit Framework

Recently, I participated in a project at AI Superior that involved analyzing a dataset with sensitive data. Since the data had to remain private, we shared the dataset through a secure channel and took measures to prevent its accidental distribution (we put the dataset in a separate directory and configured git to ignore this folder and other directories containing intermediate processing results). However, while working on this project I noticed that Jupyter Notebook, a de facto standard tool for data analysis, can itself be a source of sensitive data leakage.

Update 27/10/2020: I have developed a git hook that clears the output data of Jupyter cells and does not rely on the pre-commit framework. You can find the description of the approach in the article “Clearing Output Data in Jupyter Notebooks using a Bash Script Hook”.

Update 31/10/2020: I have found a better approach to clearing the output data of Jupyter cells that relies on git attributes. You can find my latest findings in the article “Clearing Output Data in Jupyter Notebooks using Git Attributes”.

Problem

The issue is that a Jupyter notebook stores not only the code but also the output produced when the code cells are executed. This output may contain pieces of sensitive information, so if the notebook is shared publicly, anyone can read this private data.

To illustrate this issue, let us consider the following artificial project (you can find it here). Imagine that the dataset containing private data, private_dataset.csv, is stored in the dataset/ directory. This directory is listed in the .gitignore file, so git ignores the files in it and they do not appear in the list of files available for staging. Hence, the dataset remains local.

private_dataset.csv

first_column,second_column
0,10
1,11
2,12
3,13
4,14
5,15
6,16
7,17
8,18
9,19

.gitignore

# ignoring auxiliary directories and files
.ipynb_checkpoints
.venv
.directory

# ignoring data in the dataset directory
dataset/

Now, let us create a simple Jupyter notebook to analyze this dataset. Typically, the first actions during the analysis are the following:

Data Analysis Notebook

import pandas as pd
df = pd.read_csv('dataset/private_dataset.csv')
df.head()

   first_column  second_column
0             0             10
1             1             11
2             2             12
3             3             13
4             4             14

We load the dataset into a dataframe using the pandas.read_csv() function and then check that it has been loaded properly, e.g., by calling the pandas.DataFrame.head() method. As a result, Jupyter outputs the first 5 rows of the dataset. When you save the notebook, Jupyter stores this output in the notebook file. Therefore, anyone who has access to the notebook may get access to portions of the sensitive data, even though the dataset itself is not shared.
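
To see where this output lives, recall that an .ipynb file is a JSON document in which every code cell carries an outputs list. The following is a minimal sketch that uses only the Python standard library; the notebook dictionary is a hand-written, simplified stand-in for a saved notebook (its contents are assumptions made for illustration), and cells_with_output is a hypothetical helper name.

```python
import json

# A hand-written, simplified stand-in for a saved notebook (nbformat 4 layout).
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["df.head()"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    # The rendered dataframe is stored as text inside the file.
                    "data": {"text/plain": ["   first_column  second_column\n",
                                            "0             0             10"]},
                }
            ],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["Notes"]},
    ],
})


def cells_with_output(nb):
    """Return indices of code cells whose 'outputs' list is not empty."""
    return [
        index
        for index, cell in enumerate(nb.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]


nb = json.loads(notebook_json)
leaky = cells_with_output(nb)
print(leaky)  # [0]
```

The values of the private dataset appear verbatim in the serialized notebook, which is exactly what ends up in the repository when the file is committed.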

The obvious solution to this issue is to clear all outputs before committing the notebook to a version control system. When the notebook is open in Jupyter, you can do this by selecting the Cell -> All Output -> Clear menu item. However, in my experience it is easy to forget this step, and the resulting data leak is hard to plug: once a commit is pushed to a public repository, it is very hard to remove. Therefore, we must make sure that notebooks containing output data are never committed.

An additional benefit of such a prevention system is that it reduces the number of polluting commits in the repository history. For instance, instead of pandas.DataFrame.head() I often use the pandas.DataFrame.sample() method, which outputs n randomly chosen rows rather than the top n records of a dataset. Each execution of such a notebook therefore produces a different output, and git sees the notebook as modified, so the change has to be either committed or discarded.

Solution

To address this issue, let us build a pipeline that automatically checks every commit and removes the output data from Jupyter notebooks. To develop such a system, we will make use of git hooks. Git can execute custom scripts before or after certain git actions are carried out. Both this facility and the scripts themselves are called hooks.

You can develop these hook scripts using your system’s scripting facilities (see the sample scripts in the .git/hooks directory). However, I will use the pre-commit framework, which simplifies this process considerably.
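
For comparison, here is a rough sketch of the logic a hand-written hook could implement in Python. It is not the approach used in this article and all function names are hypothetical. It relies on the fact that git aborts a commit when the pre-commit hook exits with a non-zero status. In a real hook the list of paths would come from git (for example, from the output of git diff --cached --name-only); here it is demonstrated on throwaway files so that the example is self-contained.

```python
import json
import os
import tempfile


def notebooks_with_output(paths):
    """Return the notebook paths that still contain code-cell output."""
    offenders = []
    for path in paths:
        if not path.endswith(".ipynb"):
            continue  # only Jupyter notebooks are of interest
        with open(path, encoding="utf-8") as fh:
            nb = json.load(fh)
        if any(
            cell.get("cell_type") == "code" and cell.get("outputs")
            for cell in nb.get("cells", [])
        ):
            offenders.append(path)
    return offenders


def hook_exit_code(paths):
    """Non-zero status tells git to abort the commit."""
    return 1 if notebooks_with_output(paths) else 0


# Demonstration on two throwaway notebooks (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    dirty = os.path.join(tmp, "dirty.ipynb")
    clean = os.path.join(tmp, "clean.ipynb")
    stream = {"output_type": "stream", "name": "stdout", "text": ["secret\n"]}
    with open(dirty, "w", encoding="utf-8") as fh:
        json.dump({"cells": [{"cell_type": "code", "source": [], "outputs": [stream]}]}, fh)
    with open(clean, "w", encoding="utf-8") as fh:
        json.dump({"cells": [{"cell_type": "code", "source": [], "outputs": []}]}, fh)
    dirty_code = hook_exit_code([dirty])
    clean_code = hook_exit_code([clean])

print(dirty_code, clean_code)  # 1 0
```

Writing, installing, and sharing such a script by hand in every clone is the kind of work the pre-commit framework takes care of.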

Before using this tool, we have to install it. The pre-commit framework is written in Python, so we install it using Python's tooling. In the article, I have described how I configure my Python environment, and here I will work within the same setup; you may need to adapt the instructions to yours. First, let us install pre-commit into the tools3 pyenv environment:

$ pyenv activate tools3
$ pip install pre-commit
$ pyenv deactivate

Since the tools3 environment is activated globally, the path to the pre-commit executable is added to the PATH variable, and we can run the tool using only its name:

$ pyenv which pre-commit
/home/yury/.pyenv/versions/tools3/bin/pre-commit
$ pre-commit --version
pre-commit 2.4.0

Now, let us create a pre-commit hook to address our problem. In order to do this, in the root of your git repository create the .pre-commit-config.yaml file with the following content:

repos:
  - repo: local
    hooks:
      - id: jupyter-nb-clear-output
        name: jupyter-nb-clear-output
        files: \.ipynb$
        stages: [commit]
        language: system
        entry: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace

This YAML code describes a new pre-commit hook. Let us consider what this configuration means. Since we are developing a new pre-commit hook, the repo option should be equal to local (the repo option specifies the repository from which pre-commit hooks are downloaded). The hooks section describes which hooks should be used. In this case, we create a new hook with the id jupyter-nb-clear-output (the id uniquely identifies a hook) and with the same name (the name is shown in the output). The \.ipynb$ file pattern tells pre-commit to apply the hook only to files with the .ipynb extension (Jupyter notebook files). Additionally, the stages option restricts the hook to the git commit stage.
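
The files option is a regular expression rather than a glob. The short sketch below shows which paths the \.ipynb$ pattern selects, assuming that the pattern is searched for anywhere in the path relative to the repository root (to my knowledge this is how pre-commit applies it). The candidate file names are made up for illustration.

```python
import re

# The same pattern as in the 'files' option of the hook configuration.
pattern = re.compile(r"\.ipynb$")

# Hypothetical paths relative to the repository root.
candidates = [
    "PoC.ipynb",
    "notebooks/eda.ipynb",
    "PoC.ipynb.bak",   # the extension is not at the end of the name
    "pyproject.toml",
    "ipynb",           # no dot before the extension
]

selected = [name for name in candidates if pattern.search(name)]
print(selected)  # ['PoC.ipynb', 'notebooks/eda.ipynb']
```

The backslash before the dot matters: without it, the dot would match any character, and the $ anchor ensures that only names ending in .ipynb are picked up.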

The last two options, language and entry, specify what language is used to install the hook and what executable to run, respectively. As the entry point I use jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace. This command clears all output data of a Jupyter notebook by applying nbconvert's ClearOutputPreprocessor to the file and writing the result back to the same file (the --inplace option). You can test this command on a Jupyter notebook:

$ jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace PoC.ipynb

Since the jupyter command is available globally in my setup, I do not need to provide the full path to the executable. Moreover, I run this command without the poetry run prefix because I do not need the packages installed in the project's virtual environment.
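
Conceptually, clearing the output amounts to emptying the outputs list and resetting the execution counter of every code cell. The following is a rough approximation of that effect on a notebook loaded as a plain dictionary; it is not the nbconvert implementation, clear_outputs is a hypothetical helper, and the sample notebook content is made up.

```python
import copy


def clear_outputs(nb):
    """Return a copy of the notebook dict with all code-cell output removed."""
    cleaned = copy.deepcopy(nb)  # leave the caller's notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop stored results
            cell["execution_count"] = None  # reset the In [n] counter
    return cleaned


# A hand-written, simplified notebook used only for the demonstration.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["df.head()"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["0,10\n"]}],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    ]
}

cleaned = clear_outputs(notebook)
print(cleaned["cells"][0]["outputs"], cleaned["cells"][0]["execution_count"])  # [] None
```

The source of every cell is preserved, so the cleaned notebook can be re-executed at any time to regenerate the output locally.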

After the .pre-commit-config.yaml file is created, we have to activate the configuration by installing the hook described in this file. The simplest way to do this is to run the following command in the root directory of the repository:

$ pre-commit install
pre-commit installed at .git/hooks/pre-commit

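After executing this command, a new pre-commit file should appear in the .git/hooks directory. Note that this command has to be run once in every clone of the repository, because the contents of .git/hooks are not versioned. Later changes to .pre-commit-config.yaml are picked up the next time the hook runs, since the installed script reads the configuration file on each invocation.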

Testing the Solution

Now, let us test our solution. After all these steps, my test repository has the following structure:

$ tree -a -I '.venv|.directory|.ipynb_checkpoints|.git'
.
├── dataset
│   └── private_dataset.csv
├── .gitignore
├── PoC.ipynb
├── poetry.lock
├── .pre-commit-config.yaml
└── pyproject.toml

Let us stage our PoC.ipynb notebook containing some output data and commit it to git:

$ git add PoC.ipynb 
$ git commit -m "Update notebook"
jupyter-nb-clear-output..................................................Failed
- hook id: jupyter-nb-clear-output
- duration: 0.62s
- files were modified by this hook

[NbConvertApp] Converting notebook PoC.ipynb to notebook
[NbConvertApp] Writing 1029 bytes to PoC.ipynb

As you can see, when you try to commit, git calls the jupyter-nb-clear-output pre-commit hook. The commit fails because the hook modifies the PoC.ipynb file: the staged version still contains the output, while the working copy has been cleaned. We can see this using the git status command:

$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   PoC.ipynb

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   PoC.ipynb

Therefore, we have to add the modified files once again and commit:

$ git add PoC.ipynb 
$ git commit -m "Update notebook"
jupyter-nb-clear-output..................................................Passed
- hook id: jupyter-nb-clear-output
- duration: 0.51s

[NbConvertApp] Converting notebook PoC.ipynb to notebook
[NbConvertApp] Writing 1029 bytes to PoC.ipynb

[master (root-commit) a6d61bd] Update notebook
 1 file changed, 61 insertions(+)
 create mode 100644 PoC.ipynb

Now, the commit process passes, and if you open the notebook you will see that it does not have any output.
