Richard Bergmair's Blog





==> Several people are asking me to go into detail here.

My response ended up quite lengthy. Sorry about that, folks. Editing things down is hard work, and I don’t have the time for it right now, so I’ll just post the long version here.

First, despite all the attention that big data is getting, most data science use cases aren’t actually big data use cases. There are three reasons for this: (a) You will often work with datasets so small that the sheer volume of data and processing efficiency just won’t present a difficulty. (b) You will often be in a situation where it is workable to randomly sample your real dataset very early in the data processing pipeline, which lets you obtain valid statistical estimates while reducing the size of the datasets that need processing. (c) You will often be able to preaggregate. (For example: instead of each observation being one record, a record might represent a combination of properties plus a count of how many observations share that combination.)
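
To make (c) concrete, here is a minimal sketch of what preaggregation might look like in Python. The file and column names are made up for illustration; the point is just that each distinct combination of properties becomes one output record with a count.

    # Hypothetical sketch of preaggregation: collapse raw observations into
    # (property combination, count) records.  File/column names are made up.
    import csv
    from collections import Counter

    counts = Counter()
    with open("observations.csv", newline="") as infile:
        for row in csv.DictReader(infile):
            counts[(row["country"], row["device"], row["outcome"])] += 1

    with open("observations_agg.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["country", "device", "outcome", "n_observations"])
        for (country, device, outcome), n in counts.items():
            writer.writerow([country, device, outcome, n])

A large observation file with only a few thousand distinct property combinations can collapse into a few thousand rows this way.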

My strategy for dealing with data that doesn’t fall into the big data category is this: A database object that’s tabular in nature is, by default, a CSV file. An object that’s a collection of structured documents is, by default, a YAML file. The data analysis is split into processing steps, each turning one or more input files into one output file. Each processing step is a Python script or a Perl script or whatever. You can get pretty far with just Python, but say one processing step needs a computation for which there’s a great library in Java but not in Python; feel free to drop a Java program into a data analysis pipeline that otherwise consists of Python. Then you tie the data processing pipeline together with a Makefile. An early part of the pipeline is often to convert the human-readable CSV and YAML files into KyotoCabinet files with the proper key/value structure to support how they will be accessed in later parts of the pipeline.
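
To give a flavour of what “tying the pipeline together with a Makefile” can look like, here is a hypothetical sketch; all the file and script names are invented, and each rule turns its prerequisites into exactly one output file.

    # Hypothetical Makefile sketch; file and script names are made up.
    # (Recipe lines must be indented with a tab.)
    all: report.csv

    # convert the human-readable master data into a key/value store up front
    customers.kch: customers.csv
    	python csv_to_kch.py customers.csv customers.kch

    # sample the big fact table down to something manageable
    orders_sample.csv: orders.csv
    	python sample_rows.py orders.csv orders_sample.csv

    # the actual analysis step, depending on both of the above
    report.csv: orders_sample.csv customers.kch
    	python join_and_aggregate.py orders_sample.csv customers.kch report.csv

With the dependencies declared this way, “make -j8” and “make -k” (discussed under point 3 below) come essentially for free.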

This general design pattern has several things to recommend it:

  1. Everything is files, which is great, especially if you work in a big, dysfunctional corporate environment. My experience has been that it’s comparatively easy to get a corporation to allocate you a server for your exclusive use. Run out of capacity and need more servers? An admin will usually be able to find an underutilized server for you and give you an account, as long as you don’t have any special requirements. (And being able to process files is not a special requirement.) They usually also have databases, like enterprise-scale Oracle servers. But, usually, they won’t just give you the keys to the kingdom. You’ll have to put in provisioning requests with DB admins, which require signoff from managers, etc., which can take a long time. And planning and enforcement of capacity quotas for the people using such shared resources are often inadequate, so the tragedy of the commons unfolds. “Hey admin, why are my workloads on the Oracle server suddenly taking so long?” “Because of all the other people whose workloads have been added. A request for more DB servers has already been put in, but until they can be provisioned, we’ll all have to be team players and make do.” Meanwhile, your manager goes: “Hey, why is it taking longer and longer for you to complete your assignments?” “Because Oracle is getting slower and slower!” …which comes across as the data scientist’s equivalent of “The dog ate my homework.” The situation sucks. So it’s better to strive for independence, and using files where other people might use databases can help with that.

  2. I like to equip my scripts with the ability to do progress reports and extrapolate an ETA for the computation to finish (see the sketch after this list). If, 1% into the computation, it becomes apparent that it will take too long, I cancel the job and think about ways to optimize. I’m not saying it’s impossible to do that with a SQL database, but your SQL-fu needs to be pretty damned good to make sense of query plans, track the progress of their execution, etc. If you have a CSV input, progress reporting might be as simple as saying “if rownum % 10000 == 0: print( rownum/total_rows )”, then hitting Ctrl-C if you lose patience. In practice, doing things with a database often means sending off a SQL query without knowing how long it will take. If it’s still running after a few hours, you start to investigate. But that’s a few hours of lost productivity. Things get particularly painful when the scenarios described under (1) and (2) combine. You might be used to a particular query taking, say, 4 hours. Today, for some reason, it’s been running for 8 hours and still isn’t finished. You suspect the database is busy with other people’s payloads and give it a few more hours, but it’s still unfinished. Only now do you start investigating what’s happening. This sort of lost productivity is often the difference between making a reporting deadline and missing it. (Think about the scenario where it’s a “daily batch”, and you need to go into that all-important meeting reporting on TODAY’s numbers, not yesterday’s.)

  3. “Make” is an excellent tool for doing data science. Still, for it to work its magic, you mostly have to stick to the “one data object equals one file” equation. Do you want parallel processing? No problem. You’ve already told Make about the dependencies in your data processing, so wherever there’s no dependency, there’s an opportunity for parallelization. Just go “make -j8”, and make will do its best to keep 8 CPUs busy. Do you want robust error handling? No problem. Just make sure your scripts exit with code zero on success and nonzero on failure. “make -k” will “keep going”: when there’s an error, it will still work off the error-free parts of the dependency graph. Say you run something over the weekend. You can come back Monday, inspect the errors, fix them, hit “make” again, and it will pick up where it left off without redoing the parts of the work that were error-free.
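
Here is roughly what the progress/ETA reporting from point 2 might look like for a CSV input. This is only a sketch; the file name and row count are hypothetical, and the ETA is a straight-line extrapolation from the rows processed so far.

    # Sketch of progress reporting with a straight-line ETA; names are made up.
    import csv
    import sys
    import time

    TOTAL_ROWS = 1_000_000          # assumed known, e.g. from an earlier `wc -l`
    start = time.time()

    with open("big_input.csv", newline="") as infile:
        for rownum, row in enumerate(csv.DictReader(infile), start=1):
            # ... the actual per-row work would go here ...
            if rownum % 10000 == 0:
                elapsed = time.time() - start
                eta_sec = elapsed / rownum * (TOTAL_ROWS - rownum)
                print("%d/%d rows done, ETA %.1f min"
                      % (rownum, TOTAL_ROWS, eta_sec / 60.0), file=sys.stderr)

If the ETA looks hopeless one percent in, Ctrl-C and go optimize.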

Now, after this whole prelude around my philosophy of building data crunching pipelines in a data science context, we finally get to the point about KyotoCabinet. Even though CSV or YAML is usually fine for most of the data objects in your pipeline, there will almost always be some where you can’t be so laissez-faire about the computational side of things. Say you have one CSV file which you’ve already brought down to a manageable size (1M rows, let’s say) through random sampling. But it contains an ID you need to use as a join criterion, and the table that the ID resolves to is biggish (100M rows, let’s say). You can’t apply any of the above “tricks” to manage down the size of that second file: if you randomly sampled it, you’d throw away most of its rows, and the join would then fail to produce a hit for most of the rows on the left-hand side. So, for that file, you can’t get around having it sit around in its entirety to be able to do the join, and CSV is not the way to go: you can’t have each of the 1M rows on the left-hand side of the join trigger a linear search through 100M rows of CSV. Your two options for the right-hand side are to load it into memory and join against the in-memory data structure, or to use something like KyotoCabinet (see the sketch after the list below). The latter is preferable for several reasons.

  1. Scalability. In most data science projects, data objects tend to grow over time as feature after feature gets implemented. If a computation initially implemented in memory grows past the point where that is feasible, you’re in trouble. You might go to a pointy-haired boss and tell them, “I can’t accommodate this feature request without first refactoring the in-memory architecture to an on-disk architecture. So I’ll do the refactoring this week, then start working on your feature request next week.” This will sound in their ears like, “I’m going to do NOTHING this week.” So I have a strong bias against doing things in memory.
  2. Making all data structures persistent means that the computational effort of producing them doesn’t have to be expended repeatedly as you go through development cycles.
  3. It may not even be slower in terms of performance, thanks to the memory-mapped I/O that KyotoCabinet and other key-value stores do.
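
For the join scenario described above, a rough sketch with the kyotocabinet Python binding might look like this. The file names, column names, and the “customer_id” key are all hypothetical; the point is that the 100M-row side lives on disk as a hash database, and each lookup from the 1M-row side is a cheap key/value get rather than a scan.

    # Hypothetical sketch of the join; assumes the kyotocabinet Python binding.
    import csv
    from kyotocabinet import DB

    # Step 1 (done once, early in the pipeline): load the big table into an
    # on-disk hash database keyed by the join ID.
    db = DB()
    if not db.open("customers.kch", DB.OWRITER | DB.OCREATE):
        raise IOError(str(db.error()))
    with open("customers.csv", newline="") as infile:
        for row in csv.DictReader(infile):
            db.set(row["customer_id"], row["country"])
    db.close()

    # Step 2: stream the sampled 1M-row file and look up each ID directly.
    db = DB()
    if not db.open("customers.kch", DB.OREADER):
        raise IOError(str(db.error()))
    with open("orders_sample.csv", newline="") as infile, \
         open("orders_joined.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["order_id", "customer_id", "country"])
        for row in csv.DictReader(infile):
            country = db.get(row["customer_id"])
            if country is not None:
                if isinstance(country, bytes):   # the binding may return bytes
                    country = country.decode("utf-8")
                writer.writerow([row["order_id"], row["customer_id"], country])
    db.close()

Because the .kch file persists on disk, step 1 only ever reruns when its input changes, which is exactly what the Makefile dependency tracking buys you, and point 2 above comes for free.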

…so this is roughly where I’m coming from as a KyotoCabinet frequent flyer.

#computers   |   Jan-12 2019

