Reproducible Data Science
I recently gave a talk entitled “Data Pipelines A La Mode”, with the following premise:
We can use techniques from functional programming and distributed build systems in our (big) data (science) pipelines to allow us to know what code was used for each step, to avoid losing any previous results as we improve our algorithms, and to avoid repeating work that has already been done.
What are data pipelines?
I will take a pretty broad definition: a set of tasks that have dependencies between them (i.e. one must occur after another).
We can represent a pipeline as a Directed Acyclic Graph (DAG). It is common to refer to a pipeline simply as a DAG, and I'll do so in the rest of this article.
The simple pipeline we will be looking at in this article has four steps and the following DAG. Once download_images has completed, both edge_enhance and blur can run, and then we make a collage of the results.
We can see the DAG as an opportunity for parallelism.
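A DAG like this is straightforward to write down in code. The sketch below is my own illustration (not from the talk): it represents the example pipeline as a mapping from each task to its dependencies, with a small helper that tells us which tasks are ready to run.

# A minimal sketch: the example pipeline as a mapping from each task to the
# tasks it depends on.
pipeline = {
    'download_images': [],
    'edge_enhance': ['download_images'],
    'blur': ['download_images'],
    'collage': ['edge_enhance', 'blur'],
}

def runnable(dag, completed):
    """Tasks that have not run yet but whose dependencies have all completed."""
    return [task for task, deps in dag.items()
            if task not in completed and all(dep in completed for dep in deps)]

# Once download_images has completed, edge_enhance and blur can run in parallel.
print(runnable(pipeline, completed={'download_images'}))  # ['edge_enhance', 'blur']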
Three Things We Care About
Provenance: Do we know what data was used in a particular analysis?
Reproducibility: Can we run the analysis again?
Incrementality: If we change the DAG or the tasks, can we avoid re-running things we already did?
How does being “pure” help us?
We are using the word pure in the same sense functional programming does:
A pure function is a function where the return value is only determined by its input values.
In functional programming, this means that a given invocation of a function can be replaced by its result; this permits memoization.
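As a quick sketch of memoization in Python (the function here is hypothetical, not part of our pipeline): because the result is determined only by the arguments, we can cache it and serve repeated calls from the cache.

import functools

# A pure (and deterministic) function: its result depends only on its inputs,
# so it is safe to memoize.
@functools.lru_cache(maxsize=None)
def enhancement_settings(radius, strength):
    # imagine an expensive but deterministic computation here
    return {'radius': radius, 'strength': strength * 2}

enhancement_settings(3, 5)  # computed
enhancement_settings(3, 5)  # served from the cache, no recomputation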
In general, a task might take arbitrary actions: call external APIs, fetch data from somewhere, do something at random, and so on. This makes it harder to reproduce the analysis later if we need to.
We will constrain our tasks so they all write their output to a single location.
If a task depends on previous tasks, it should be pure in the sense that it only depends on the data output by its dependencies. It should also be deterministic, since we are going to get incrementality by not re-running the same task on the same data.
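One way to get that incrementality is to derive each task's output location from the task definition plus the outputs of its dependencies, and skip the task whenever that location already exists. The helpers below are a sketch of the general idea; the names and hashing scheme are my own, not the talk's.

import hashlib
import json
import os

def output_path(task_definition, input_paths):
    # Hash the task definition together with its input data locations; the
    # same task run on the same data always maps to the same output location.
    key = json.dumps({'task': task_definition, 'inputs': sorted(input_paths)},
                     sort_keys=True)
    digest = hashlib.sha256(key.encode()).hexdigest()[:12]
    return os.path.join('results', digest)

def maybe_run(task_definition, input_paths, run):
    out = output_path(task_definition, input_paths)
    if os.path.exists(out):
        return out   # same task, same data: reuse the previous result
    os.makedirs(out)
    run(out)         # the task writes all of its output under `out`
    return out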
It can’t all be pure
Sometimes our initial inputs will be some files we already have, and we can just point at them. If not, we can have non-pure tasks at the edge of the pipeline “snapshotting” the world and saving it for us. We can then re-run any downstream analysis on the saved data.
Examples might be syncing from another S3 bucket (unless you have good reason to think it will remain accessible), hitting some external APIs or querying a database.
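A snapshotting task might look something like the sketch below (the URL and function names are placeholders): it fetches an API response and saves it with a timestamp, so that every downstream task depends on the saved file rather than on the live API.

import datetime
import os
import urllib.request

def snapshot_api(url, out_dir):
    # Not pure: the API can return different data every time we call it, so we
    # record exactly what we saw and never need to hit the API again.
    fetched_at = datetime.datetime.now(datetime.timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    with urllib.request.urlopen(url) as response:
        body = response.read()
    path = os.path.join(out_dir, f'snapshot-{fetched_at}.json')
    with open(path, 'wb') as f:
        f.write(body)
    return path  # downstream tasks read this file, not the live API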
For help in understanding how to make each step pure, you should check out this blog on Functional Data Engineering or a talk by the author.
How can we represent tasks?
To keep things as generic as possible, we just use an associative data structure, as found in most programming languages (i.e. a hash/dictionary/object).
For our example DAG above, assume we have a project called collage with a few Python scripts in it, and that we build a Docker image for it. The download_images task might look like this:
{
    'image': 'collage',
    'sha': '1qe34',
    'command': 'python dl_images.py'
}
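To make that concrete, here is one way such a description could be executed. This is a sketch rather than the talk's implementation, and it assumes the sha field is used as the Docker image tag.

import subprocess

def run_task(task):
    # Assumes the image is tagged with the code's sha, e.g. collage:1qe34.
    image = f"{task['image']}:{task['sha']}"
    subprocess.run(['docker', 'run', '--rm', image] + task['command'].split(),
                   check=True)

run_task({
    'image': 'collage',
    'sha': '1qe34',
    'command': 'python dl_images.py'
})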