Find your Jupyter notebooks with ElasticSearch

Aneesh Karve · Published in Quilt · Feb 5, 2019

Have you ever wished to reuse a code snippet, but been unable to find that code? Often, that magical snippet is hidden away in a Jupyter notebook. macOS Spotlight is weak at searching Jupyter notebooks. Ditto for GitHub search. On UNIX systems, find and grep are better, but they lack the simplicity and features of a modern search engine.

Wouldn’t it be useful if you could privately “Google” for your Jupyter notebooks by typing a few keywords? In this article, we’ll use ElasticSearch, Lambda, and nbformat to create a private, searchable library of Jupyter notebooks.

Although we’ll focus on indexing Jupyter notebooks (.ipynb), you can use the techniques in this article to index any text file, including Python (.py), Scala, markdown (.md), and plaintext (.txt) files.

All of the source code that you’ll need to index and search your own notebooks is available as part of T4, on GitHub.

Pre-requisites

The present article assumes a basic familiarity with Python and Amazon Web Services.

System architecture

Our notebook search system has three components:

  • B — An S3 Bucket
  • SC — An ElasticSearch cluster
  • indexer — A Lambda function that parses notebooks and sends the results to SC

Data flows through the system as follows:

  1. A notebook lands in B
  2. indexer extracts relevant cells from B, adds metadata, then sends the result to SC

Once step 2 is complete, your notebook is searchable. As long as you can remember an interesting word from a notebook, ElasticSearch will find it for you (see the Appendix for proof).

Try it on the web

You can try notebook search for yourself, below.

Figure 0 — Searching Jupyter notebooks for “random forest”.

Here are a few search terms, selected at random, that return one or more results:

You can also browse the six thousand notebooks in S3.

Overview

In the following sections we’ll examine the Python code that powers our search system and offer tips for developers who’d like to roll their own solution. In the Appendix, we’ll validate the design by showing that it achieves an F1 score of 99.5% for a wide variety of search terms.

Code walkthrough, tips for developers

Listen to S3 with a Lambda function

indexer is a Lambda function that listens for create and delete events on a bucket, B. At Quilt, we use a CloudFormation template to create indexer and attach it to B. Alternatively, you can click Add notification in the AWS S3 Console (Fig. 1).

Figure 1 — AWS Console: S3 > Your_bucket > Properties > Advanced settings > Events

Attaching Lambda functions to S3 events can be flaky. First, you’ll need to avoid overlapping triggers. The path “prefix” setting can be used to keep triggers on a bucket disjoint. Our team has encountered multiple bugs with clashing triggers and hidden listeners in the Triggers section of the Lambda console. When in doubt, treat the S3 console (Fig. 1) as the definitive list of notifications.
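If you’d rather script the notification than click through the console, a minimal boto3 sketch follows. The bucket name and Lambda ARN are placeholders; note that this call replaces the bucket’s entire notification configuration, which is exactly where overlapping-trigger bugs tend to come from.

import boto3

s3 = boto3.client('s3')

# Placeholder bucket and Lambda ARN; substitute your own
s3.put_bucket_notification_configuration(
    Bucket='my-notebook-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:indexer',
            'Events': ['s3:ObjectCreated:*', 's3:ObjectRemoved:*'],
            # A suffix filter scopes the trigger to notebooks; prefix filters
            # can likewise keep multiple triggers on one bucket disjoint
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.ipynb'}]}},
        }]
    },
)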

Below is a simplified version of the handler function that runs indexer. The key points are as follows:

  • look for the creation or deletion of .ipynb files
  • extract relevant cells from those files
  • post the results to SC
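A minimal sketch of such a handler appears below. It assumes helper functions extract_text and post_document (covered in the next two sections) and a hypothetical delete_document for removals; the real indexer in T4 is more involved.

import urllib.parse

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # React to S3 create/delete notifications
    for record in event['Records']:
        event_name = record['eventName']  # e.g. 'ObjectCreated:Put'
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        if not key.endswith('.ipynb'):
            continue

        if event_name.startswith('ObjectRemoved'):
            delete_document(bucket, key)  # hypothetical: drop the notebook from the index
        elif event_name.startswith('ObjectCreated'):
            raw = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
            post_document(bucket, key, extract_text(raw))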

Extract code and markdown cells

After reviewing the Jupyter Notebook Format, we decided that the source field of code and markdown cells contained everything worth indexing. Fields like outputs seemed noisy, and less likely to contain human-friendly search strings. In the future, we might consider indexing raw cells, but at present we don’t use them.

Below, we call nbformat to normalize notebooks to version 4. We then access the resulting dict in a way that avoids exceptions (i.e. use dict.get, check for the presence of keys before accessing them).
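A sketch of that extraction step, assuming the raw notebook JSON arrives as a string (the production code lives in the T4 repository):

import nbformat

def extract_text(notebook_str):
    # Normalize any notebook version to the version 4 schema
    notebook = nbformat.reads(notebook_str, as_version=4)

    chunks = []
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') in ('code', 'markdown'):
            source = cell.get('source') or ''
            # Some notebooks store source as a list of lines
            if isinstance(source, list):
                source = ''.join(source)
            chunks.append(source)
    return '\n'.join(chunks)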

You might be tempted to parse Jupyter Notebooks as JSON. Resist this temptation. There are too many inconsistencies across too many notebook versions for JSON parsing to be worth your time. nbformat is your friend.

Send documents to ElasticSearch

With the results of extract_text() in hand, we can post a document to SC as follows:
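Below is a simplified sketch using the elasticsearch-py client with SigV4-signed requests; the domain endpoint, region, index name, and document fields are illustrative placeholders rather than the exact T4 configuration.

import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

ES_HOST = 'search-my-domain.us-east-1.es.amazonaws.com'  # placeholder endpoint

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   'us-east-1', 'es', session_token=credentials.token)

es = Elasticsearch(
    hosts=[{'host': ES_HOST, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

def post_document(bucket, key, text):
    # Index (or re-index) one notebook; the S3 location doubles as the document id
    es.index(
        index='notebooks',
        doc_type='_doc',  # required on 6.x clients; omit on newer ones
        id=f'{bucket}/{key}',
        body={'bucket': bucket, 'key': key, 'text': text},
    )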

Set permissions on ElasticSearch (be careful!)

ElasticSearch permissions require care, so as not to leak data. The first thing to understand is that search results are “all or nothing”: if users can search, they can see everything in the index, regardless of whether they can read the underlying S3 objects.

At Quilt, we permission ElasticSearch clusters by granting permissions to specific IAM users or specific roles. Permission by role is useful if, for example, you want to call ElasticSearch from SageMaker. You can read more about permissions and ElasticSearch in Amazon’s developer docs.
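If you’d rather script the policy than edit it in the console, one option is to attach a resource-based access policy to the domain with boto3. The account id, role, and domain name below are placeholders.

import json

import boto3

policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'AWS': 'arn:aws:iam::123456789012:role/SageMakerExecutionRole'},
        'Action': 'es:ESHttp*',
        'Resource': 'arn:aws:es:us-east-1:123456789012:domain/notebook-search/*',
    }],
}

boto3.client('es').update_elasticsearch_domain_config(
    DomainName='notebook-search',
    AccessPolicies=json.dumps(policy),
)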

Deploying the solution

The easiest way to deploy ElasticSearch for Jupyter notebooks in your own AWS account is as a T4 CloudFormation install (includes T4's data catalog for S3). In the near future, we will offer hosted and VPC versions of notebook search with sophisticated features for role-based access and previewing data in S3.

Conclusion

We’ve demonstrated how to use ElasticSearch to find any Jupyter notebook you’ve ever created, as long as you can remember a bit of nontrivial markdown or code from that notebook. I trust that you’ll find this feature useful in making notebooks discoverable and reusable.

There’s a long way to go to make notebooks first-class citizens in S3, Azure, and GCP. We welcome your contributions to T4 on GitHub.

PS: If you’re curious just how accurate ElasticSearch is for Jupyter notebooks, check out the Appendix. TL;DR: precision and recall are both over 99%.

Appendix: Precision and recall

Above we explained how notebook search works. Now let’s test it.

I sampled 6,531 notebooks (2.59 GB) from the one-million notebook archive for Exploration and Explanation in Computational Notebooks. I then copied the notebooks to S3, where they were automatically captured by indexer and ES.

You can find complete Jupyter notebooks for the TFIDF analysis at quiltdata/examples/JupyterSearch.

In the remaining few paragraphs, I’ll demonstrate how to select interesting search terms from the document corpus, and how to measure precision and recall versus grep.

Note: We’ve tested single search terms for simplicity, but ElasticSearch is far more powerful and flexible than grep since ElasticSearch handles multiple search terms in any order, stemming, and relevance ranking.

TL;DR: for single search terms, the system works almost perfectly, with near-100% precision and recall. Nevertheless, there are some important limitations:

  • As configured, the indexer does not index integers (48885), but it does index hashes (a53458h)
  • sklearn, ElasticSearch, and grep all tokenize words slightly differently, leading to minor inconsistencies in search results

TFIDF to reveal candidate search terms

Let’s use sklearn to get a sense of how our document corpus is structured.

from sklearn.feature_extraction.text import TfidfVectorizer

# good_nbs is a list of notebook paths as strings
vectorizer = TfidfVectorizer(input='filename', preprocessor=extract_text)
X = vectorizer.fit_transform(good_nbs)

The vectorizer now contains a vocabulary and an IDF vector that reveals the inverse document frequency of each word.
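One way to inspect those extremes, assuming the fitted vectorizer from the snippet above:

import numpy as np

idf = vectorizer.idf_
terms = np.array(vectorizer.get_feature_names())  # get_feature_names_out() on newer sklearn

order = np.argsort(idf)
print('lowest IDF:', terms[order[:10]])    # common terms
print('highest IDF:', terms[order[-10:]])  # idiosyncratic terms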

Below are the ten terms with lowest IDF:

import
as
in
for
from
print
and
the
data
of

Terms with low IDF are common and occur in almost every notebook.

On the other end of the spectrum, we have the ten terms with the highest IDF:

Gleicher
gebrauch
Integrierender
nsm
Linksorthokomplemet
aq2
aq1
Funktionen
Ansazu
VeraCrio13

Terms with high IDF are idiosyncratic, and tend to occur in a single notebook. For the curious, if we histogram the IDF values into deciles, we get the following distribution.

This shows a long tail of infrequent (high IDF) terms, and a small number of common terms (low IDF) with relatively low information gain.

Nevertheless, we’d only need to remember a slightly uncommon term (i.e. from decile 4 or higher) to cut a six-thousand notebook corpus down to two dozen notebooks, as shown below. I’ll take those odds to find a precious code snippet.
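Here is a sketch of how the deciles and the per-decile samples described next might be drawn; it builds on the idf and terms arrays from the previous snippet, and the seed is arbitrary.

import numpy as np

edges = np.percentile(idf, np.arange(0, 101, 10))  # 11 edges bound 10 deciles
rng = np.random.RandomState(42)

samples = {}
for decile in range(10):
    in_bin = (idf >= edges[decile]) & (idf <= edges[decile + 1])
    candidates = terms[in_bin]
    samples[decile + 1] = rng.choice(candidates, size=min(10, len(candidates)), replace=False)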

I randomly sampled ten terms from each decile of the IDF histogram (a total of one hundred terms), with the following results:

Precision, recall, and number of true positives (“docs”) for ten random words from each decile of the IDF histogram.

Decile 5 is a bit of an anomaly: six of its ten words turned out to be Chinese, with a small addressable corpus of just 9 documents. For each decile, I computed a weighted precision and recall score that accounts for the number of true positives, to more accurately reflect aggregate accuracy.

The weighted precision and recall across all documents are 99.3% and 99.8%, respectively. The F1 score is 99.5%.
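For reference, F1 is the harmonic mean of precision and recall:

precision, recall = 0.993, 0.998
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.995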

That’s pretty good. As noted above, our precision/recall analysis is a quick confidence check, and not really a thorough analysis of ElasticSearch. A more complete analysis would include multiple search terms.

A pretty picture that vaguely suggests finding notebooks in a huge corpus.
