A git pre-commit hook for Jupyter notebooks

In my (limited) workflow with Jupyter notebooks, I typically track notebooks in git. One drawback of naively tracking notebook files is that the cell outputs (and other metadata) get committed along with the input cells. Committing the outputs and metadata contributes a significant amount of “diff noise”.

A popular solution is to implement a filter to automagically strip the output and metadata cells before the commit. For reference, I like Tim Staley’s post on the topic.
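To make the filter idea concrete, here is a minimal sketch of the stripping step. It is not taken from Tim Staley’s post: the function name strip_outputs is made up for illustration, and it assumes the nbformat 4 JSON layout, where each code cell carries an "outputs" list and an "execution_count". It leaves notebook-level metadata alone.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (assumed nbformat 4 JSON) without code-cell outputs."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's dict untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the In[n] counter
    return cleaned
```

A real filter would read the notebook JSON from stdin, apply something like this, and write the result to stdout.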

I would prefer to be presented with a simple warning before I commit a notebook with output cells. If I don’t want to commit the “unclean” notebooks, it’s not much of a burden to Cell -> All Output -> Clear and save. Indeed, there are cases where I want to commit all the output cells (e.g., GitHub renders notebooks). This can also be done with the filter approach mentioned above.

So, I wrote a quick git pre-commit hook that searches for possible output cells in the .ipynb files staged for commit and asks the user whether they really do want to commit them. To use the script, place it in {repo_root}/.git/hooks/pre-commit, where {repo_root} is the root directory of your git repo, and make sure it is executable.
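For illustration, the check at the heart of such a hook might look like the sketch below. Note the swap: the hook described here greps the notebook text, whereas this sketch parses the JSON instead. The helper name has_output is hypothetical, and the nbformat 4 layout is assumed.

```python
import json


def has_output(notebook_text: str) -> bool:
    """Return True if any code cell in the notebook JSON has non-empty outputs."""
    try:
        notebook = json.loads(notebook_text)
    except json.JSONDecodeError:
        return False  # not valid JSON; nothing we can say about outputs
    if not isinstance(notebook, dict):
        return False
    return any(
        cell.get("cell_type") == "code" and bool(cell.get("outputs"))
        for cell in notebook.get("cells", [])
    )
```

A hook built on this would print a warning and ask for confirmation whenever has_output returns True for a staged notebook.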

Update 1 June 2018: fixed handling of filenames with spaces (what kind of maniac puts spaces in filenames anyway?!)
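One way to be robust to spaces (not necessarily how the hook itself does it) is to ask git for NUL-separated paths with -z and split on the NUL byte instead of on whitespace. The sketch below assumes it is run from inside the repo; the function names are made up for illustration.

```python
import subprocess


def split_nul_paths(raw: bytes) -> list:
    """Split NUL-separated output from git into a list of path strings."""
    return [p.decode("utf-8", errors="replace") for p in raw.split(b"\0") if p]


def staged_notebooks() -> list:
    """List staged (added/copied/modified) .ipynb paths, safe for odd filenames."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "-z", "--diff-filter=ACM"],
        capture_output=True,
        check=True,
    )
    return [p for p in split_nul_paths(result.stdout) if p.endswith(".ipynb")]
```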

Update 7 Mar 2019: changed the file grepping to use the file as it stands in git’s index instead of the file on the filesystem, since the two are not necessarily the same. This fixes the following issue. Suppose you save a notebook with output, stage it for commit, try to commit, and get the warning message. You then clear the output and save the notebook, but do not stage it again. When you try to commit a second time, the (old version of the) hook looks at the updated file on the filesystem, sees no output, and gives no warning, even though the index still contains the notebook output. With this update, the hook checks the staged version, so the warning appears in that case too.
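As a rough sketch of the difference, the staged version of a file can be read with git show :path, where the leading colon refers to the index rather than the working tree. The helper name read_from_index and its cwd parameter are assumptions made for this example.

```python
import subprocess


def read_from_index(path: str, cwd: str = ".") -> str:
    """Return the contents of `path` as staged in git's index (not the working tree)."""
    result = subprocess.run(
        ["git", "show", ":" + path],  # ":<path>" names the index entry for that path
        cwd=cwd,
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Scanning the text returned here, rather than the file on disk, is what keeps the check in step with what will actually be committed.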