Versioning data and models for rapid experimentation in machine learning

Aneesh Karve
Published in PyTorch · 8 min read · Jan 16, 2020


In this article you’ll learn how to create and use versioned datasets as part of a reproducible machine learning pipeline. To illustrate, we’ll use Git, Docker, and Quilt to build a deep neural network for object detection with Detectron2, a software system powered by PyTorch that implements state-of-the-art object detection algorithms.

Reproducible models defined

As modeling projects grow, so grow the costs of debugging, scaling, and modifying the model pipeline. One method to minimize the costs of model maintenance is to train models in reproducible iterations. In the context of machine learning, we define a reproducible model iteration as the output of an executable script that is a pure function of three variables: code, environment, and data.

model := script(code, environment, data)

Given the same code, environment, and data as inputs, a reproducible training script will always produce the same model. In practice, the heart of such a script is a command that pins all three variables, similar to the following:

docker run \
-e GIT_HASH=${GIT_HASH} \
-e QUILT_HASH=${QUILT_HASH} \
YOUR/IMAGE@${DOCKER_HASH}

Reproducibility increases agility

Reproducible models are not an end, but a means to faster, more correct iterations. A reproducible model history implies that developers can confidently reconstruct any past model iteration. As a result, reproducibility makes it easier for developers to experiment with modifications, isolate bugs, and revert to known good iterations when problems arise.

Versioned data is a missing ingredient

You’re likely familiar with systems like Git and Docker, which encapsulate code and environment respectively. These systems provide immutable hashes that represent snapshots in time. In a moment, we’ll introduce Quilt as a source of immutable hashes for data. A Quilt hash represents a snapshot of data in time.
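For a preview of what that looks like in practice, here's a minimal Python sketch that reads the immutable hash of a package used later in this article:

import quilt3

# browse a package manifest and read its immutable top hash
pkg = quilt3.Package.browse("cv/coco2017", registry="s3://quilt-ml")
print(pkg.top_hash)  # SHA-256 digest pinning this snapshot of the data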

Systems for versioning code are suboptimal for versioning data because data differ fundamentally from code: data are commonly thousands of times larger than code, require specialized APIs for transport and serialization, and call for exploration through search, browsing, and visualization.

Quilt source code and project status

Quilt is an open core platform designed to manage data like code: with versions, packages, repositories, and collaborative workflows. The source code for Quilt is available on GitHub. You can use the quilt3 Python client as a standalone library for creating, documenting, and sharing datasets. Optionally, you can run a Quilt stack, which adds a web data catalog with search, file preview, and role-based access control. The development of Quilt is funded by Quilt Data.

Overview and prerequisites

This tutorial has three sections, each with its own requirements:

  1. Installing versioned datasets — requires Python 3.6 or higher
  2. Reproducibly training a deep neural network — requires a GPU-equipped machine with nvidia-docker installed and at least 100GB of free disk space. For rapid training times, consider instances with eight V100 GPUs.
  3. Sharing and documenting custom datasets — requires Python 3.6 or higher, quilt3 (pip install quilt3), and an Amazon S3 bucket with a configured AWS CLI. To experiment with S3, you can sign up for the AWS free tier.

Caution: the machine instance types referenced in this article can incur significant cloud provider costs. Downsize the examples as needed.

Installing versioned datasets with quilt3

Downloading large datasets often depends on fragile scripts and slow or unreliable data stores. quilt3 offers a simplified interface for building, installing, and interacting with datasets. For example, you can use quilt3 to install the Common Objects in Context (COCO) dataset. Start by installing quilt3:

pip install quilt3

You’ll need at least 22GB of free disk space to install COCO. If you’re planning to train Detectron2 in the next section, you can skip this install, since the training steps below install COCO inside the container. Here’s how to install COCO with quilt3:

# Note: installing COCO requires at least 22GB of disk
quilt3 install cv/coco2017 \
--top-hash 3722a4 \
--dest ./detectron2/datasets/coco/ \
--registry s3://quilt-ml

  • cv/coco2017 is the dataset name
  • --top-hash is a SHA-256 digest, automatically calculated by quilt3, for the revision of interest
  • --dest is the local directory to which the data are copied
  • --registry specifies where the data package resides, typically an S3 bucket
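If you prefer the Python API to the CLI, the same install can be expressed with quilt3.Package.install; a minimal sketch with the same arguments:

import quilt3

# Python-API equivalent of the CLI install above
quilt3.Package.install(
    "cv/coco2017",
    registry="s3://quilt-ml",
    top_hash="3722a4",
    dest="./detectron2/datasets/coco/"
)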

If you’d like to experiment with a smaller dataset (1.1MB), try quilt/altair:

quilt3 install quilt/altair \
--registry s3://quilt-example \
--dest ./YOUR/DIR/

For details on building your own datasets with quilt3, refer to Sharing and documenting custom datasets, below.

Optional: verifying a dataset

Before training, you may wish to ensure that your local copy of a dataset is free from changes:

quilt3 verify cv/coco2017 \
--top-hash 3722a4 \
--dir ./datasets/coco2017 \
--registry s3://quilt-ml

quilt3 verify computes the SHA-256 hash of every file in the dataset. Verifying cv/coco2017 takes about two-and-a-half minutes on an instance with 4 vCPUs and 16 GiB of memory.
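The same check is available from Python via the Package object; a minimal sketch, assuming the pinned revision above:

import quilt3

# browse the pinned revision, then compare local files against its manifest
pkg = quilt3.Package.browse(
    "cv/coco2017",
    registry="s3://quilt-ml",
    top_hash="3722a4"
)
print(pkg.verify("./datasets/coco2017"))  # True if every file hash matches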

Reproducibly training a deep neural network: Detectron2 on COCO

There are a variety of ways to combine Git, Docker, and Quilt for reproducible training. The following is one example. We define our code, environment, and data hashes, pull a Docker image (quiltdata/pytorch-detectron2-demo), then run the training job:

# define hashes
GIT_HASH=0a7a9d10
QUILT_HASH=3722a498
DOCKER_HASH=sha256:8d12a8997c6f65923f7e7788f70f70134b5f845ddcba5570beb5182c18d2526e
# pull image
DOCKER_IMAGE=quiltdata/pytorch-detectron2-demo@${DOCKER_HASH}
docker pull ${DOCKER_IMAGE}
# run image (interactively for illustration); pass all three hashes so
# scripts inside the container can record them. Optionally mount a volume
# with --volume /YOUR/DIR/:/io (be sure to clone detectron2 there).
nvidia-docker run -it \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --shm-size=8gb \
  -e GIT_HASH=${GIT_HASH} \
  -e QUILT_HASH=${QUILT_HASH} \
  -e DOCKER_HASH=${DOCKER_HASH} \
  ${DOCKER_IMAGE}
## clone and install detectron2
git clone https://github.com/facebookresearch/detectron2
cd detectron2
git checkout ${GIT_HASH}
### "Running setup.py develop for detectron2" takes several minutes
pip install -e .
## install data
quilt3 install cv/coco2017 \
--registry=s3://quilt-ml \
--dest=./datasets/coco/ \
--top-hash=${QUILT_HASH}
## train
python tools/train_net.py \
--num-gpus 8 \
--config-file \
configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml

Versioning and loading trained models

For complete reproducibility, we’ll save the trained model by running a script inside our container (note that GIT_HASH, QUILT_HASH, and DOCKER_HASH were passed into the container as environment variables above):

import os
import quilt3

model_pkg = quilt3.Package()
# capture all logs and checkpoints from /output
model_pkg.set_dir(".", "./detectron2/output/")
model_pkg.push(
    "detectron2-trained-models/mask_rcnn_R_50_FPN_1x",
    registry="s3://YOUR-S3-BUCKET",
    message=(
        f"detectron2@{os.environ.get('GIT_HASH')}, "
        f"trained in quiltdata/pytorch-detectron2-demo@{os.environ.get('DOCKER_HASH')}, "
        f"on cv/coco2017@{os.environ.get('QUILT_HASH')}"
    )
)

Now that we’ve saved the model to Quilt, collaborators can load past models — along with their checkpoints and logs — for inference, auditing, and debugging:

quilt3 install detectron2-trained-models/mask_rcnn_R_50_FPN_1x \
--registry=s3://quilt-ml \
--dest=/detectron2/models/mask_rcnn_R_50_FPN_1x/ \
--top-hash=6e830aa5

cd /detectron2

python tools/train_net.py \
--config-file ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
--eval-only MODEL.WEIGHTS \
models/mask_rcnn_R_50_FPN_1x/model_final.pth

python demo/demo.py \
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
--input YOUR_INPUT_1.jpg YOUR_INPUT_2.jpg \
--opts MODEL.WEIGHTS \
models/mask_rcnn_R_50_FPN_1x/model_final.pth

Loading data from quilt3 into PyTorch

Detectron2 provides its own code paths to load the COCO dataset. For custom datasets, you can use torchvision’s DatasetFolder or subclass torch’s Dataset:

import quilt3
from torch.utils.data import Dataset

class ExamplePyTorchDataset(Dataset):
    def __init__(self, quilt_package_name, registry, pkg_hash=None):
        # browse the manifest of the package you wish to train on
        pkg = quilt3.Package.browse(
            quilt_package_name,
            registry=registry,
            top_hash=pkg_hash
        )
        # only return files in the train/ directory
        self.img_entries = [
            e for l, e in pkg.walk()
            if l.startswith("train/")
        ]

    def __len__(self):
        return len(self.img_entries)

    def __getitem__(self, idx):
        entry = self.img_entries[idx]
        img_annotations = entry.meta["annotations"]
        return {
            "image": entry.get_bytes(),
            "annotations": img_annotations
        }

Using the quilt3 APIs in your Dataset subclass has several advantages over raw Python, including caching, versioning, and abstraction of network and storage — all of which lead to more concise and reliable training scripts.

For higher performance, you can leverage torch’s DataLoader. See Writing custom Datasets, DataLoaders, and Transforms for further details.
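As a sketch of what that can look like, the following wraps the Dataset above in a DataLoader; the collate function is an assumption that keeps each batch as a plain list, since annotation counts vary per image:

from torch.utils.data import DataLoader

def collate_as_list(batch):
    # annotation counts vary per image, so skip default tensor collation
    return batch

dataset = ExamplePyTorchDataset(
    "quilt/coco2017",
    registry="s3://quilt-ml",
    pkg_hash="3722a4"
)
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    collate_fn=collate_as_list
)
for batch in loader:
    pass  # each item is a dict of raw image bytes and annotations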

Sharing and documenting custom datasets

The simplest way to construct your own versioned dataset is to snapshot the contents of a local directory into a Quilt package:

p = quilt3.Package()
# snapshot a directory into package root
p.set_dir(".", "./your/local/dataset/")

For a more concrete example, let’s build, document, and share a custom subset of COCO that contains only images of animals. We start by using quilt3 browse to load a lightweight representation of an existing COCO package:

coco = quilt3.Package.browse(
    "quilt/coco2017",
    registry="s3://quilt-ml",
    top_hash="3722a4"
)

browse loads only the package manifest into memory. Manifests are miniature key-value stores with a directory-like structure:

(remote Package)
└─annotations/
  └─captions_train2017.json
  └─captions_val2017.json
  └─instances_train2017.json
  └─instances_val2017.json
  └─person_keypoints_train2017.json
  └─person_keypoints_val2017.json
└─train2017/
...

Each entry in the package manifest has a logical key (a user-friendly path), a physical key (a URI to bytes), and optional metadata (eliminating the need to work with external annotation files). We can access package entries with bracket notation. For example, coco["train2017"]["000000000009.jpg"].meta yields a dict:

{
    'image_info': {
        'license': 3,
        'file_name': '000000000009.jpg',
        'coco_url': 'http://images.cocodataset.org/train2017/000000000009.jpg',
        'height': 480,
        'width': 640,
        'date_captured': '2013-11-19 20:40:11',
        'flickr_url': 'http://farm5.staticflickr.com/4026/4622125393_84c1fdb8d6_z.jpg',
        'id': 9
    },
    'annotations': [...]
}

We can use the .meta property of Quilt package entries to filter COCO down to images of animals:

coco_animals = coco.filter(
    lambda l, e: l.endswith(".jpg") and
    "animal" in (a["supercategory"] for a in e.meta["annotations"])
)

filter takes a lambda function with two arguments, logical_key and entry.
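As a quick sanity check, a minimal sketch that counts the images that survived the filter:

# count the images that survived the filter
n_images = sum(1 for l, _ in coco_animals.walk() if l.endswith(".jpg"))
print(f"{n_images} animal images retained")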

As a best practice, we’ll add a README file to coco_animals so that other developers have sufficient context to work with our new dataset:

coco_animals.set("README.md", "./YOUR/DIR/README.md")

We can now push coco_animals to S3 so that there’s a cloud-hosted copy of the dataset, available to anyone who has permissions to read the parent S3 bucket:

coco_animals.push(
    # fill in your details below
    "USERNAME/DATASET_NAME",
    registry="s3://YOUR_S3_BUCKET",
    message="Experimental subset of COCO 2017"
)

Once you’ve pushed a dataset to S3, your colleagues can use quilt3.list_packages() to discover datasets. Use quilt3 catalog to browse the contents of a dataset:

quilt3 catalog s3://quilt-example/akarve/coco_animals/

For instance, we can click on the train2017 directory to verify that coco_animals does in fact contain pictures of animals.

The Quilt catalog generates lightweight, browser-compatible previews of large files such as Jupyter notebooks, Parquet tables, Vega-Lite visualizations, images, and more.

The quilt3 catalog command is only recommended for open data. For sensitive data, run a private Quilt stack, which restricts data to your virtual private cloud.
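For programmatic discovery, quilt3.list_packages enumerates the packages in a registry; a minimal sketch (the bucket name is a placeholder):

import quilt3

# enumerate every package in a registry (placeholder bucket name)
for pkg_name in quilt3.list_packages("s3://YOUR_S3_BUCKET"):
    print(pkg_name)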

Earlier, we noted that quilt3.Package.browse() loads a lightweight manifest instead of the physical data. When we’re ready to work with the physical data, we can lazily fetch it with quilt3’s get methods. For instance, to render the contents of a README file in Jupyter, call get_as_string:

from IPython.display import display, Markdown

display(
    Markdown(
        coco_animals["README.md"].get_as_string()
    )
)

get_as_string() is a convenience wrapper for get_bytes(), which we used in Loading data from quilt3 into PyTorch to fetch binary image data.
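When you need a file on local disk rather than in memory, a package entry can also be materialized with fetch; a minimal sketch (the destination path is illustrative):

# download a single entry to a local path
coco_animals["README.md"].fetch("./README.md")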

Conclusion and how to contribute

We’ve demonstrated how to improve the speed and correctness of model iteration through reproducible training scripts. We’ve also introduced Quilt as a system for versioning datasets.

Contribute datasets to the community

If you have an interesting public dataset to share, we encourage you to apply to curate data on open.quiltdata.com where large, public datasets are hosted free of charge.

Contribute code to Quilt

The Quilt team is actively working on scale, performance, and cloud-native tools for managing data like code. Contributions are welcome. Visit the Quilt roadmap, documentation, and Slack channel to learn more. The Quilt roadmap emphasizes technical areas like Python, Parquet, MinIO, and serverless functions.
