Richard Bergmair's Blog


==> Noah Smith of Bloomberg writes that American Employers Are Hung Up on Hiring Ph.D.s. But I’ve also seen some social media echo from Ph.D.s saying things like “well, actually, my Ph.D. has mostly felt like a liability.”

This resonated with me a great deal: Whenever I join a new workplace, I feel like building credibility is an uphill battle where I fight against the ghost of someone who worked there before who had a Ph.D. and who fit the stereotype of the dysfunctional academic.

I wish that, just because I can do one type of thing well, people wouldn’t jump to the conclusion that I can’t also do another type of thing well. As in: Just because I understand math, please don’t jump to the conclusion that I don’t understand databases and can’t write clean code. Just because I can see the “right way” to solve a problem doesn’t mean I can’t also appreciate a cost-efficient way of solving 20% of the problem for 80% of the benefit.

#business   |   Mar-27 2019

==> Eike Kühl of Zeit Online reports: “Wer das Darknet ermöglicht, könnte bald Straftäter sein”. (Translation: “Whoever facilitates the Darknet could soon be considered a criminal”).

A link based on Google translate is going around the English-speaking internet, accompanied by speculation that Germany is trying to make Tor illegal or make it illegal to run exit nodes.

I thought I’d comment on that for the benefit of my English-speaking friends, having read the article in full in its German original.

The article details that the proposal is to make it a crime to run sites whose access is restricted by specific technical measures and whose purpose or activity is oriented towards aiding crime. The notion that it would become a crime to run a Tor exit node is speculation on the part of the article's author.

But it is not apparent to me how that would follow from such a law: since anyone can access Tor, and, by implication, any Tor exit node, access to an exit node is not restricted in any way, so in my opinion it would not seem to fall under this definition.

As for running a Tor hidden service, this would seem to apply, but still: How do you draw the line between running a service where it just so happens that criminal business is conducted over that service versus running a service whose purpose or activity is actually directed towards aiding crime?

#politics#computers   |   Mar-15 2019

==> Philip Brasor and Masako Tsubuku of the Japan Times write: “Japan’s tax laws get in way of more women working full time”.

I suspect this is the case not just in Japan but in most countries, since it is really just a side effect of the normal workings of income tax.

Take, for example, a two-person household consisting of persons A and B.

First configuration: Person A creates value by running the household, raising the children, and so on. That value creation is tax-free, because it is unpaid. If there is time left over, A can take a side job that doesn’t require full-time status. The hourly rate won’t be as good as in a full-time job, but the pay is very tax-efficient: since the total over the year will be comparatively small, a large share of it falls into the cheaper tax bands. This frees up B to earn money.

Second configuration: Persons A and B both work full time for pay. But now both incomes are tax-inefficient. Out of the money that is left over, help with the household and with looking after and raising the children is bought in.

For a financial break-even, A’s hourly rate in the second configuration has to be much higher than in the first, to make up for the tax inefficiency. In practice, you will rarely find households where both people are in a position to earn that well.
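To see why the break-even point sits so high, here is a toy calculation. All figures are invented for illustration: hypothetical progressive tax bands, a hypothetical side job, and a hypothetical cost for bought-in household help.

```python
# Hypothetical progressive tax bands, purely for illustration:
# 0% up to 10,000, 20% from 10,000 to 50,000, 40% above 50,000.
BANDS = [(10_000, 0.0), (50_000, 0.2), (float("inf"), 0.4)]

def tax(income):
    """Each band's rate applies only to the income falling inside it."""
    owed, lower = 0.0, 0.0
    for upper, rate in BANDS:
        if income > lower:
            owed += (min(income, upper) - lower) * rate
        lower = upper
    return owed

# First configuration: A takes a 15,000 side job, taxed almost
# entirely in the cheap bands, and runs the household unpaid.
a_part_time = 15_000 - tax(15_000)

# Second configuration: A works full time for 40,000, but pays more
# tax and must buy in 20,000 worth of household help from net income.
a_full_time = 40_000 - tax(40_000) - 20_000

print(a_part_time, a_full_time)  # both come out at 14,000: break-even
```

Under these made-up numbers, A's full-time gross has to reach roughly 2.7 times the part-time gross before the second configuration even breaks even, let alone pulls ahead.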

#politics   |   Mar-08 2019

==> I just came across Augustin G. Walters’ blog post “The Last Free Generation”. It is about surveillance capitalism. His response: we would have to switch to pseudonymous or anonymous media.

That might still have worked 20 years ago, when anonymity was the default on the internet. Back then, people generally placed little trust in anything to do with the internet. It was the golden age of IRC, and in some niche media, such as HN or Freenode, things are still that way today.

But for the majority of users and occasions, anonymous communication is simply no longer an option today, and on this point the wheel of time can hardly be turned back: the presence of media like Facebook, which prohibit anonymity, has drained “innocent” mass communication away from media like IRC, where anonymity would still be allowed, leaving behind only questionable characters pursuing shady purposes there. Anyone trying to track down a source for drugs will hardly do it on Facebook. Anyone wanting to share a photo of their lunch with friends will hardly choose IRC as the medium. (By “IRC” I mean, of course, not Freenode, but the remnants of the good old IRCnet and its ilk.)

A coexistence of anonymous and non-anonymous media will always make it hard for the anonymous ones to rid themselves of everything that is unwanted in social media. So as long as pseudonymity or anonymity isn’t the default, for sensitive and non-sensitive communication alike, we won’t have gained much.

#computers   |   Feb-03 2019

==> Several people are asking me to go into detail here.

My response ended up quite lengthy. Sorry about that, folks. Editing things down is hard work, and I don’t have the time for it right now, so I’ll just post the long version here.

First, despite all the attention that big data is getting, most data science use cases aren’t actually use cases of big data. There are three reasons for that (a) You will often work with datasets so small that the sheer amount of data and processing efficiency just won’t present a difficulty. (b) You will often be in a situation where it is workable to do random sampling of your real dataset very early on in the data processing pipeline, which will allow you to still obtain valid statistics estimates while reducing the size of the datasets that need processing. (c) You will often be able to do preaggregation. (For example: instead of each observation being one record, a record might represent a combination of properties plus a count of how many observations have that combination of properties).
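Point (c) can be sketched in a few lines; the records and property names below are invented for illustration:

```python
from collections import Counter

# Raw observations: one record per event.
observations = [
    {"country": "JP", "device": "mobile"},
    {"country": "JP", "device": "mobile"},
    {"country": "DE", "device": "desktop"},
    {"country": "JP", "device": "desktop"},
]

# Preaggregate: one record per distinct combination of properties,
# plus a count of how many observations share that combination.
counts = Counter((o["country"], o["device"]) for o in observations)

for (country, device), n in sorted(counts.items()):
    print(country, device, n)
```

Downstream steps then work on the (much smaller) combination-plus-count records instead of the raw observations.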

My strategy for dealing with data that doesn’t fall into the big data category is this: A database object that’s tabular in nature is, by default, a CSV file. An object that’s a collection of structured documents is, by default, a YAML file. The data analysis is split into processing steps, each turning one or more input files into one output file. Each processing step is a Python script or Perl script or whatever. You can get pretty far with just Python, but, say there’s one processing step where you need to make a computation where there’s a great library to do it in Java, but not in Python; feel free to drop a Java programme into the data analysis pipeline that otherwise consists of Python. Then you tie the data processing pipeline together with a Makefile. An early part of the processing pipeline is often to convert the human-readable CSV and YAML files into KyotoCabinet files with the proper key/value structure to support how they will be accessed in later parts of the pipeline.
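As a sketch of what one such processing step can look like (the file names and the aggregation are made up for illustration), a script like this slots straight into a Makefile rule of the form `counts_by_country.csv: events.csv count_by_country.py`:

```python
#!/usr/bin/env python3
"""One pipeline step: read a CSV of events, write a CSV of counts.

Intended to be invoked from a Makefile rule, e.g.
    python3 count_by_country.py events.csv counts_by_country.csv
(file and column names are hypothetical).
"""
import csv
import sys
from collections import Counter

def main(src, dst):
    # Input file -> aggregate in memory -> output file.
    with open(src, newline="") as f:
        counts = Counter(row["country"] for row in csv.DictReader(f))
    with open(dst, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["country", "n"])
        for country, n in sorted(counts.items()):
            w.writerow([country, n])

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Each step reads files and writes files, so make can track the dependencies and rebuild only what is stale.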

This general design pattern has several things to recommend it:

  1. Everything is files, which is great, especially if you work in a big dysfunctional corporate environment. My experience has been that it’s comparatively easy to get a corporation to allocate you a server for your exclusive use. Run out of capacity and need more servers? An admin will usually be able to find an underutilized server for you and give you an account as long as you don’t have any special requirements. (And being able to process files is not a special requirement). They usually also have databases, like enterprise-scale Oracle servers. But, usually, they won’t just give you the keys to the kingdom. You’ll have to put in provisioning requests with DB admins, which require signoff from managers, etc., which can take a long time. And planning and enforcement of capacity quotas for the people using such shared resources are often inadequate, and you’ll get the tragedy of the commons unfolding. Hey admin, why are my workloads on the Oracle server suddenly taking so long? Because of all the other people whose workloads have been added. A request for more DB servers has already been put in, but until they can be provisioned, we’ll all have to be team players and make do. Meanwhile, your manager goes: Hey, why is it taking longer and longer for you to complete your assignments? Because Oracle is getting slower and slower! …this comes across as the data scientist’s equivalent of “The dog ate my homework.” The situation sucks. So it’s better to strive for independence, and using files where other people might use databases can help with that.

  2. I like to equip my scripts with the ability to do progress reports and extrapolate an ETA for the computation to finish. Suppose, 1% into the computation, it becomes apparent that it takes too long. In that case, I cancel the job and think about ways to optimize. I’m not saying it’s impossible to do that with a SQL database, but your SQL-fu needs to be pretty damned good to make sense of query plans and track the progress of their execution etc. If you have a CSV input, progress reporting might be as simple as saying, “if rownum % 10000 == 0: print( rownum/total_rows )”, then control-c if you lose patience. In practice, doing things with a database often means sending off a SQL query without knowing how long it will take. If it’s still running after a few hours, you start to investigate. But that’s a few hours of lost productivity. – Things are particularly painful when the scenarios described under (1) and (2) combine. You might be used to a particular query taking, let’s say, 4 hours. Today, for some reason, it’s been running for 8 hours and is still not finished. You suspect the database is busy with other people’s payloads and give it a few more hours, but it’s still unfinished. Only now do you start investigating what’s happening. But this sort of lost productivity is often the difference between making a deadline on reporting some numbers or something or missing it. (Think about the scenario where it’s a “daily batch”, and you need to go into that all-important meeting, reporting on TODAY’s numbers, not yesterday’s.)

  3. “Make” is an excellent tool for doing data science. Still, for it to work its magic, you mostly have to stick to the “one data object equals one file” equation. Do you want parallel processing? No problem. You’ve already told Make about the dependencies in your data processing. So wherever there isn’t dependency, there’s an opportunity for parallelization. Just go “make -j8”, and make will do its best to keep 8 CPUs busy. Do you want robust error handling? No problem. Just make sure your scripts have exit code zero on success and nonzero on failure. “make -k” will “keep going”. So when there’s an error, it will still work off the error-free parts of the dependency graph. Say you run something over the weekend. You can come back Monday, inspect the errors, fix them, hit “make” again, and it will continue and not need to redo the parts of the work that were error-free.
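The progress-and-ETA idea from point 2 can be sketched roughly like this (the `work` callback stands in for whatever per-row processing the step actually does):

```python
import sys
import time

def process_with_eta(rows, work, step=10_000):
    """Apply `work` to each row, printing progress and an extrapolated
    ETA every `step` rows; Ctrl-C abandons the run if the ETA is hopeless."""
    start = time.time()
    total = len(rows)
    for rownum, row in enumerate(rows, 1):
        work(row)
        if rownum % step == 0:
            elapsed = time.time() - start
            # Linear extrapolation from the rows processed so far.
            eta = elapsed / rownum * (total - rownum)
            print(f"{rownum}/{total} ({rownum / total:.1%}), "
                  f"ETA {eta:.0f}s", file=sys.stderr)
```

A few percent into the run you already know, to within a rough factor, whether the job will finish in minutes or in days.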
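The exit-code contract from point 3 is worth making explicit in every step script. A minimal sketch (the `str.upper` transformation is a placeholder): return zero on success, nonzero on failure, and never leave a half-written output file behind, since make would mistake it for an up-to-date target.

```python
import os
import sys

def run_step(process, src, dst):
    """Run one line-by-line processing step with make-friendly semantics:
    0 on success, 1 on failure, and no partial output file on failure."""
    tmp = dst + ".tmp"
    try:
        with open(src) as fin, open(tmp, "w") as fout:
            for line in fin:
                fout.write(process(line))
        os.replace(tmp, dst)  # atomic rename: dst appears only when complete
        return 0
    except Exception as exc:
        print(f"step failed: {exc}", file=sys.stderr)
        if os.path.exists(tmp):
            os.remove(tmp)
        return 1

if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. "python3 step.py input.txt output.txt" from a Makefile rule
    sys.exit(run_step(str.upper, sys.argv[1], sys.argv[2]))
```

With that contract in place, "make -k" can work off the error-free parts of the dependency graph and a rerun picks up exactly where things went wrong.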

Now, after this whole prelude around my philosophy of doing data-crunching pipelines in a data science context, we finally get to the point about KyotoCabinet. Even though CSV or YAML is usually fine for most of the data objects in your pipeline, there will almost always be some where you can’t be so laissez-faire about the computational side of things. Say you have one CSV file which you’ve already whittled down to a manageable size (1M rows, let’s say) through random sampling. But it contains an ID you need to use as a join criterion, and the table that the ID resolves to is biggish (100M rows, let’s say). You can’t apply any of the above “tricks” to reduce the size of the second file: by random sampling, you’d end up throwing away most of the rows, as your join would, for most rows, not produce a hit. So that file has to sit around in its entirety for the join to work, and CSV is not the way to go, since you can’t have each of the 1M rows on the left-hand side of the join trigger a linear search through 100M rows. Your two options for the right-hand side are to load it into memory and join against the in-memory data structure, or to use something like KyotoCabinet. The latter is preferable for several reasons.
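The shape of that join looks roughly like this. To keep the sketch self-contained I'm using the stdlib `dbm` module as a stand-in for KyotoCabinet: the real pipeline would use the kyotocabinet Python binding, but the get/set shape is the same, and the keys and values here are invented for illustration.

```python
import dbm

def build_index(rows, path):
    """One-off step: load the big right-hand table (id -> payload) into
    an on-disk key-value file. After this, each lookup is a hash/tree
    probe, not a linear scan through 100M CSV rows."""
    with dbm.open(path, "n") as db:
        for key, value in rows:
            db[key.encode()] = value.encode()

def join(left_rows, path):
    """Join step: stream the (sampled, small) left-hand rows and
    resolve each ID against the on-disk index."""
    with dbm.open(path, "r") as db:
        for key, left_value in left_rows:
            k = key.encode()
            if k in db:
                yield key, left_value, db[k].decode()
```

The index file persists between runs, so rebuilding it is a make dependency like any other, not something paid on every development cycle.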

  1. Scalability. Most data science projects tend to increase the size of data objects over time as they implement each feature. If you get to a point where a computation initially implemented in memory exceeds the size where this is no longer feasible, you’re in trouble. You might go to a pointy-haired boss and tell them, “I can’t accommodate this feature request without first refactoring the in-memory architecture to an on-disk architecture. So I’ll do the refactoring this week, then start working on your feature request next week”. This will sound in their ears like, “I’m going to do NOTHING this week”. So, I have a strong bias against doing things in memory.
  2. Making all data structures persistent means that the computational effort of producing them doesn’t have to be expended repeatedly as you go through development cycles.
  3. It may not even be slower in terms of performance, thanks to the MMAPed IO that KyotoCabinet and other key-value stores do.

…so this is roughly where I’m coming from as a KyotoCabinet frequent flyer.

#computers   |   Jan-12 2019

==> Charles Leifer writes on Kyoto Tycoon in 2019.

KyotoCabinet, the storage engine beneath KyotoTycoon, is one of the most important weapons in my holster as a data scientist.

For a data scientist, there are essentially two different kinds of job environments. One is where it’s all about infrastructure and implementing the database to end all databases (and finding stuff out is a mere afterthought). The other is where 100% of the focus is on finding stuff out by next week with zero mandate for any infrastructure investment. When I find myself in the latter kind of work environment, and I need to quickly get sortable and/or indexable data structures of any kind, then a key-value store is the way to go, and KyotoCabinet is a really good one that I’ve used a lot and that has never let me down.

Just keep your boss from finding out about it if yours is of the pointy-haired variety. He will be less than pleased if he finds out it’s an open-source project that saw its last commit some 6 years ago. – As for myself, this doesn’t bother me all too much. It’s feature-complete w.r.t. anything I’ve ever wanted to do with it, and its feature set is way richer than most of the younger alternatives that are still being actively developed (because they still need active development and aren’t nearly as well battle-tested). Plus, what’s the worst that could happen? Get & set is basically all there is to it, and if you should ever be in a position where you need to replace it with something else that has get & set, there should be tons of options. The migration should be easy to do.

#computers   |   Jan-12 2019

==> I just discovered the AI2 Reasoning Challenge.

It reminds me a lot of the Recognizing Textual Entailment Challenge, which was in its fourth iteration when I submitted results from the system I built for my Ph.D.

#computers   |   Jan-05 2019