Richard Bergmair's Blog


==> I discovered a nice write-up by Elena Grandi on how to run a modern XMPP server.

XMPP fills the very important “simple and gets the job done” niche in this space. I was looking into options for a self-hosted chat service for team development about a year ago and landed on XMPP. It seemed like the best option, given that I don’t want any cloud service and that I find Mattermost to be a bit too bloated for my taste.

#computer   |   Dez-02 2023

==> Unixsheikh says, “We have used too many levels of abstractions and now the future looks bleak”.

The article anticipates a “Yes, let’s all go back to coding in assembly!” critique and responds.

For a really, really long time after high-level languages had become mainstream, you did still have to know assembly to be a programmer, even if you did most of your work in, say, C, or Pascal. That’s because compilers for high-level languages and their debugging tools were initially a “leaky” abstraction. When your programme failed, you had to know assembly to figure out what went wrong and work your way back to what you could do in your high-level language to fix the problem. Now, compilers and debugging tools have become so good that those days are mostly gone, and you don’t need to know assembly any more (for most practical intents and purposes).

The lesson here is that when a new abstraction hits the scene, you can anticipate that it will take a long time until the technology is so reliable that you really don’t need to understand the lower-level stuff. Meanwhile, you’re dealing with leaky abstraction.

Today, we pile on layer upon layer upon layer of leaky abstraction without ever giving it the time it needs to mature. We’re designing for shortening the amount of time a developer spends on getting something done, under the misguided assumption that the developer will never leave the “happy path” where everything works. This neglects that developers spend most of their time debugging situations that don’t work. Usually, if you make the “happy path” time more productive with a side effect of making the “unhappy path” time less productive, that amounts to a net negative, and that’s the big problem.

#computer   |   Okt-21 2023

==> I just discovered Wouter Groeneveld’s post on overlooked reasons to still buy physical media.

The thing I find noteworthy about physical media is that they yield a marketplace that is in many ways more “democratic” and more efficient than streaming media and SaaS subscription models.

For example, if you’re a hobbyist musician, it’s pretty favourable economics to put your songs on a CD, go out busking on the street, and sell the CD there. The CD is a perpetual licence to play the music as many times as you want, whenever you want, including the right to lend it to other people, pass it on to your children as part of an inheritance or otherwise, etc. etc.

So that’s just about the broadest possible legal right that the artist can give the listener, and, consequently, it should also command the highest possible price. You can charge, say, $10 for the CD, as opposed to $0.01 (I don’t know what the actual number would be) for each streaming play of one of your songs.

Dealing in the broader right is much more advantageous to the artist who is low on capital: They’ve had an up-front cost in creating the music and recording the CD, and selling CDs will amortize that investment quicker than selling streaming plays. Streaming plays may pay more dividends later if the music gets listened to a lot for a long time. However, redistributing that cash flow towards the earliest possible point is much more advantageous financially.

So, if you have listeners paying for streaming plays and artists requiring the largest possible payment at the earliest possible time, you require some moneybags intermediary to convert one kind of cash flow into another, and that opens the door to everyone getting ripped off in the process.

One could interject that this is about the buy vs. rent distinction rather than the physical vs. download/streaming distinction. You could imagine the artist selling MP3s from their website through a paywall and charging for it as if it were a CD. But that opens problems on the listener’s side. Looking after an MP3 collection creates work and comes at a cost.

So those two things are highly interdependent: Physical media is pretty much the technology that makes it most frictionless, on both sides of the transaction, for customers to collect and use media over long time horizons. Once you walk down the path of download/streaming, you get into territory where the thing you need to provide to the customer starts looking more like a service and less like a product, opening a whole can of worms and changing the economics fundamentally.

#computer#wirtschaft   |   Okt-14 2023

==> Kraken Technologies writes about how they organise their very large Python monolith.

I’m a big fan of layered (or even linear) dependency structures, too. Here is a simple trick I use in my Lua codebases: The layout of my source code files usually looks like this:


It has always bothered me that codebases don’t have a beginning and an end and that, therefore, it’s difficult for a reader to know where to start when they just want to read the codebase.

So, my approach is to impose this linear order and only allow code that comes later to depend on code that comes earlier, never the other way around. This way, you can read through the code rather easily. As you look at any piece of code, it will only depend on code you’ve already looked at, so you don’t have to constantly jump through the codebase as you read.

I’ve also found that failure to correspond to a linearisable dependency structure is a “code smell” and that, as I try to eradicate that code smell, I frequently end up with code that’s better in all kinds of ways.
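
The ordering rule described above can be checked mechanically. Here is a minimal sketch in Python rather than Lua; the module names, the dependency map, and the function `order_violations` are all hypothetical and only serve to illustrate the idea.

```python
# Hypothetical reading order: each entry may only depend on earlier entries.
READING_ORDER = ["util", "model", "storage", "api"]

# Hypothetical dependency map: module -> modules it depends on.
DEPENDS_ON = {
    "util": [],
    "model": ["util"],
    "storage": ["util", "model"],
    "api": ["model", "storage"],
}


def order_violations(order, depends_on):
    """Return (module, dependency) pairs that break the linear order."""
    position = {name: index for index, name in enumerate(order)}
    violations = []
    for module in order:
        for dependency in depends_on.get(module, []):
            # A dependency must be listed and must come strictly earlier.
            if dependency not in position or position[dependency] >= position[module]:
                violations.append((module, dependency))
    return violations


if __name__ == "__main__":
    print(order_violations(READING_ORDER, DEPENDS_ON))
```

An empty result means the codebase can be read top to bottom in the given order; any pair that comes back names a piece of code that reaches "forward" to something the reader hasn't seen yet.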

#computer   |   Jul-19 2023

==> Ted Neward is in favour of Embracing “Old” Tech, and I agree.

In one of his books, Nassim Nicholas Taleb mentions the “Lindy effect”, and I’ve frequently used it as a mental model ever since coming across it: For a book that has continuously been read for the last 100 years, you can expect it will continue to be read for the next 100 years. For a book that only came out last year and has been continuously read since then, you should have an expectation that one year from now, people will no longer be reading it.

So, when you’re picking a tech stack for a software investment and want a reasonable expectation that it will still be actively maintained 10 years from now, you need to go back in time by 10 years. I’ve found this 10-year time horizon a useful one to work with, based on version histories of various bits of software I build upon. For example, for Python, this takes me back to version 3.4, before type hints and many other things hit the scene that I disapprove of. So now, I test all my code with both version 3.4 and the newest version (3.11). I won’t use language features from 3.11 that weren’t already there in 3.4, and I won’t use code in 3.4 that breaks or throws deprecation warnings in 3.11. I also apply this test to dependencies, paying close attention to what happens if I try to get the newest version of some Python library running on version 3.4 of the interpreter or a 10-year-old version of the library on version 3.11 of the interpreter.

This means my code is engineered so that it could have been in continuous operation for the last 10 years while running continuous updates on what’s underneath it. And this gives me a reasonable expectation that my code will require only minimal code changes over the next 10 years to keep up with whatever might arise.

With the Python interpreter itself, it’s remarkably easy to do that. I feel it doesn’t limit my coding in any meaningful way. With other bits of software, including many Python libraries that one might depend on, it would be absolutely unworkable. In such a case, I take that as a clear signal not to use the software dependency at all. This is a lot of work, but also a good forcing function that prevents me from becoming a dependency hog myself.

#wirtschaft#computer   |   Mär-02 2023

==> Gustav Westling uses Extremely Linear Git History.

That’s some rare reassurance right there that I’m not completely crazy for preferring subversion to git.

#computer   |   Nov-22 2022

==> Just came across Benji Weber’s blog article “Why I Strive to be a 0.1x Engineer”.

I’d like to make a point to the contrary here, based on a thought I initially came across in Tom DeMarco’s book “Peopleware”.

Often, from a strictly economic standpoint, you might be in a situation where you’d say, “quality level x is the quality level that the market is willing to pay for, while any higher quality level is uneconomical”. You might then be tempted to ask your employees to dial down the quality level of their work and produce worse work than they’re capable of. DeMarco argues that this is almost always a bad idea because of the demotivating effect of doing such a thing. You’ll get lower productivity and not realize the cost advantage you hoped for in your purely economic analysis. Or, to put it differently: Dialling up the quality level of your product from the level you economically need/want to the level your employees are capable of will typically pay for itself through the increased productivity that comes with increased motivation.

So, if you ask your employees to be 0.1x employees, you will get 0.1x output. You won’t accomplish more with less. You’ll accomplish less due to lower productivity.

#computer#wirtschaft   |   Nov-10 2022

==> Just came across Pablo Guevara’s Manifesto for Minimalist Software Engineers.

The thing is: Pareto’s law really isn’t a law.

You think that the world is full of situations where 80% of the payoff comes from 20% of the work?

I tell you that, equally, the world is full of situations where you get 0% of the payoff unless you’ve done 100% of the work.

That latter observation is just as true as the former, but it won’t make anyone into a best-selling business book author or motivational speaker. It doesn’t help reduce cognitive dissonance when reflecting on laziness and ineptitude.

#computer#wirtschaft   |   Nov-07 2022

==> I just came across FreeShow, an open-source alternative to the commercial software ProPresenter.

That brings back memories. I wrote a piece of software like that around the year 2000, when I was 16 years old. My church had previously used overhead transparencies to project song lyrics onto the wall. From then on, this was to be done by computer, and that required software.

Of all the programs I have written since then, and that covers 22 years, none has been used as much as this one; at least as far as I’m aware.

On the one hand, that doesn’t paint a particularly rosy picture of how my career has gone.

On the other hand, it shows how easy it still was back then, compared to today, to identify and fill a profitable niche application for computers. If I had dropped out of school at the time and focused on developing and marketing the program, my career would have taken a very different course. Mine was not the only church in the country that was retiring its overhead projector, so this was probably a genuine market opportunity that nobody else had seized.

By the time I finished my doctorate 10 years later, the world was already a very different place. All the market opportunities, the low-hanging fruit, were gone, and I entered working life during a recession.

Today, I would give just about anything to be an indie software developer of the kind that existed in the 90s. Maybe I’m romanticising that, but it certainly reflects how I feel.

#computer#wirtschaft   |   Aug-15 2022

==> The French presidency of the Council of the European Union has issued a “Declaration on the Common values and challenges of European Public Administrations”. It’s receiving accolades in software development circles for making direct mention of open-source software.

To me, it sounds very wishy-washy. The action verbs are “recognize the […] role played”, “promoting the sharing of […] solutions created/used”, and “promoting a fair redistribution of the value created […]”. This stuff is so vague that I have zero idea what it’s supposed to mean in practice.

The language also seems to be quite deliberately referring to stuff that’s already there and definitely not saying anything like putting more of it into place or systematically favouring open source over closed source or anything like that.

When software is developed in-house in public administration, then open-sourcing it, in my mind, is a no-brainer. But what sort of circumstances drive a government official to develop something in-house in the first place or adopt an open-source solution, when that competes with giving the government contract to industry cronies?

It would be really nice if we could get a clear policy framework that says: Whenever open-source is an option, the government must go with open-source. But we are far away from anyone in power even calling for that.

And we’ve seen some real setbacks where that is concerned: For example, the city administration of the city of Munich had that policy in the late 00s / early 10s, to the point where they had migrated 15000 workstations to Linux by 2013, shaving €10M/yr off Microsoft’s bill to the taxpayer. By the end of 2020, the city had migrated everything back to Windows. – See here, and here.

#politik#computer   |   Mär-17 2022

==> Dan Primack of Axios reports: Search engine startup from former Google Ads boss raises $40 million, and many people on social media are commenting that it’s crazy due to the sheer size of today’s web.

But the thing is: You’re dealing with insanely long-tailed distributions. The “meat” of the search engine business is in the fat heads of those distributions.

  1. A small number of queries constitutes a huge proportion of query events you’ll see throughout the day.

  2. For any given query, a small proportion of users will make up a huge proportion of the opportunity to monetise (researching a planned purchase, looking for a job etc.)

  3. For any given query, an infinitesimally tiny proportion of the documents on the web is where the value to the user actually is.

I think there are potentially many ways of making selections on each of those three axes and ending up with a viable business based on a manageably small search index. Just think of Indeed as a job search engine or Amazon as a product search engine. They have manageably small document collections, great value to users, a stable user base, and monetisation opportunities.
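
To get a feel for how fat such a head can be, here is a rough sketch of the first point. It assumes that query frequencies follow Zipf’s law, which is a common modelling assumption and not something measured here; the function name `head_share` and all numbers are made up for illustration.

```python
def head_share(num_queries, head_size, exponent=1.0):
    """Fraction of all query events covered by the head_size most frequent
    queries, assuming the query at rank r occurs with weight 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_queries + 1)]
    return sum(weights[:head_size]) / sum(weights)


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 distinct queries, of which we only
    # serve the 10,000 most frequent ones (the top 1%).
    share = head_share(num_queries=1000000, head_size=10000)
    print("top 1% of queries cover {:.0%} of query events".format(share))
```

Under that assumption, a small slice of the distinct queries accounts for a disproportionately large share of what users actually type, which is the property a small, focused index would rely on.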

From that standpoint, I find it surprising that there aren’t many more search engine businesses.

Case in point: I really like, the 90s nostalgia search engine. I don’t think they are a profitable business, but I certainly think they could & should be.

And even Google is making deliberate choices along those three dimensions rather than naively indexing the web and naively executing keyword searches against that index.

  1. Given a query, reinterpret it as follows. Catch eyeballs by delivering entertainment value (“pizza” -> “entertaining videos related to pizza”). Monetise those eyeballs by reinterpreting the query as a local query (“pizza” -> “restaurants near me wanting to sell me pizza”) or a product query (“pizza” -> “online shops trying to sell me pizza-related items”).

  2. Given a query and document, always make relevance decisions for the audience with more disposable income. For example, “farming” shows a lot of stuff you’d want to read if you’re paying £10 for a potato in Borough Market and nothing you’d want to read if you’re a subsistence farmer in Namibia.

  3. When comparing two documents for inclusion in the pool of documents that stand any chance of coming up on page #1, prefer recency to authoritativeness. For “programming languages”, apparently “The 9 Best Programming Languages to Learn in 2021” is considered relevant, while “Go To Statement Considered Harmful” or the Mozilla Developer Network are considered irrelevant.

There are huge audiences who disagree with Google’s choices. They are just waiting to switch or use another product alongside Google if someone comes along making different choices.

So, I really don’t think that any venture in the web search space is doomed, given the size of the engineering effort and Google’s dominance. I’m baffled that there isn’t much more activity in that space.

#computer#wirtschaft   |   Mär-11 2021

==> Jon Aasenden of Embarcadero claims that the history of Delphi amounts to 25 years of excellence.

I used to love Delphi in the late 90s. If someone could do a late-90s-Delphi-like experience for programming UI-heavy software today and do it right and in a way that properly integrates with the anno 2020 tech landscape, then I would love to use such a thing. Unfortunately, no such thing exists today, as far as I can tell, including Delphi itself.

#computer   |   Feb-14 2020

==> I’m getting some feedback insisting that a 70% commission really isn’t all that bad.

What I called a “70% commission” is what industry insiders call a “30% royalty”, and author royalties for books have historically been closer to 10 to 15%. That would amount to a commission of 85 to 90%.

My response: The economics behind that market structure simply cannot be carried over to e-books. One thing that was novel about e-books was the low unit cost, which would push the price down. The other was the oligopoly-like position of the sales platforms in the channel economics, which would push the price up. The latter ultimately trumped the former and ensured that the economic surplus created by the e-book enriches only Amazon and its ilk; not the authors and not the consumers, either.

Besides, as I said, my argument is mainly about indie authors with niche audiences, which can’t be compared to mass-market publishing. As an example of the mass market, think of a cookbook. Here, 99% of the value is created through marketing. The publisher will rent advertising space and place the author in cooking shows on television to promote the book. The publisher will make sure that bookshops give the book good shelf space. In a situation like that, I completely agree: The publisher should easily earn 70% or even more, since it is, after all, the publisher who creates the value. The author who wrote the cookbook is interchangeable. But Amazon does nothing of the sort for its e-book authors.

As an example of what I mean, now think of an academic. Let’s imagine that his life’s work consisted of writing a textbook that, owing to its quality, became the standard textbook in his field. Without this author, there would be no product for the publisher to sell, and it is then rather the publisher who is interchangeable. In such a constellation, it seems absurd to me if the largest share of the proceeds doesn’t end up with the author.

#wirtschaft#computer   |   Dez-29 2019

==> The author “hoakley” of the Eclectic Light Company writes: “Publishers determined to kill electronic books”.

I couldn’t agree more. The e-book sector would have enormous potential to advance the economy, and none of that has reached the individual, neither the authors nor the consumers.

Anyone who wants to sell an e-book on Amazon has to hand over 70% of the proceeds to Amazon. There is a 30% option, but it is only available if you’re willing to offer your work at a very modest price, and only in mainstream markets. But an indie author with a niche audience will often depend on charging a higher price for the book to pay off.

That seems very high to me. Compare that with Bandcamp, a very successful marketing platform for music, which takes a commission of just 15%.

There are a few competitors trying to grab at least a small bite of Amazon’s lunch: Worth mentioning are Rakuten’s Kobo, as well as retail chains like Barnes & Noble with their Nook in the U.S. and Thalia with their Tolino in Germany. But they follow Amazon in their pricing. Better deals for authors are available from Google & Apple, whose platforms, however, are not nearly as interesting for e-books.

You could also sell e-books by offering direct downloads from your own web shop, but the overhead would be considerable, and many customers can’t be expected to connect their e-book reader via USB and upload EPUBs.

#wirtschaft#computer   |   Dez-29 2019

==> Ungleich GmbH says, “Turn off DoH, Firefox. Now.”, and I wholeheartedly agree.

Having worked for a major European telco, I get the impression that the amount of regulation they face around data protection and privacy is tremendous. My experience has been that this stuff is by no means taken lightly, either.

It would never in a million years occur to me to re-route my traffic away from the legal protections it enjoys with a European ISP’s network and instead entrust it to a nearly unregulated entity in the U.S.

In a German telco, data pertaining to individuals is stored for a limited period of time so that it can be requested on a case-by-case basis by law enforcement (we are talking Police, not all of the government). I can live with that.

In the U.S., there’s a highly developed and well-resourced mass surveillance system on both the business side (surveillance capitalism) and the government side (NSA et al.). Privacy laws there are almost non-existent and, to the extent that they do exist, they protect only U.S.-based persons and declare data pertaining to foreign persons as being up for grabs.

#computer   |   Sep-11 2019

==> Bloomberg’s Natalia Drozdiak reports that Huawei Eyes ProtonMail as It Searches for Gmail Alternative. People react with dismay. Proton Mail responds, Clarifying Proton Mail and Huawei.

The mainstream interpretation here is “Bloomberg messed up and got the story wrong”. Here is my two pennies’ worth, providing, purely speculatively, an alternative interpretation of what might have happened.

Bloomberg is a source that investors and traders trust with getting them some level of access to the rumour mill (in the spirit of the saying among traders that goes “buy the rumour, sell the news”). The problem here is that, fact or fiction, rumours affect the financial markets, and not knowing about them puts a market participant at a disadvantage.

The article starts by saying in indicative mood, “ProtonMail is in talks with Huawei Technologies Co. about including its encrypted email service in future mobile devices […].” I don’t see a problem with that part of the statement, since they were indeed in talks of some kind, and there’s a certain bandwidth of what “including” could mean. It could just mean “making available through Huawei AppGallery”, so there is nothing wrong with using indicative mood here.

In the second paragraph, the article switches the modality and says, “The Swiss company’s service could come preloaded …” Now, it could, of course, be the case, as people are alleging, that they just entirely made that shit up and manufactured a rumour. But it could also be the case that they were reflecting a rumour already out there and sufficiently widespread that they thought investors and traders should know about it. They used the subjunctive mood, with the auxiliary verb “could”, to signal that something was going on here about the modality of the statement.

ProtonMail speculated that a misunderstanding of their earlier announcement must have been the basis of Bloomberg’s article. But I guess we’ll never find out if that was indeed so.

ProtonMail clarified their earlier announcement and took issue with the word “partnership” being used to describe their relationship with Huawei. Interestingly, they did not respond to these assertions head-on. For example, they did not say that preloading was not a topic that was discussed.

Now, it stands to reason that preloading would amount to Huawei handing a huge chunk of market share to ProtonMail. Then it would be up to users to make up their minds about the likelihood of Huawei asking for quid-pro-quo and ProtonMail’s response.

Rather than there being no basis at all for the Bloomberg article, another scenario could be that ProtonMail saw that making-up-of-minds play out on social media in response to the Bloomberg article and decided to do a one-eighty as a result.

… I guess we’ll never know.

#wirtschaft#computer   |   Sep-09 2019

==> Ecosia explains “Why we’re saying no to Google”.

This auction doesn’t address the problem it was actually meant to address, which is to stop an anticompetitive practice. The spirit of the law concerning antitrust is that you can’t abuse a monopoly in one market to gain a monopoly in another. That’s why it wasn’t acceptable that Google’s Android would set Google to be the default search engine and offer no other options.

Now Google says to those other search engines: Hey, you can be the default. But you’re going to have to give us all your profits. How is that any less anticompetitive than what they were doing before?

Footnote: Why am I saying all of their profits? Well, it’s four slots. Google will be one of them. Microsoft and Yahoo will bid whatever it takes to be on the list. – Now there’s one slot left for everyone who isn’t part of the existing search oligopoly, like Ecosia, Qwant, or DuckDuckGo.

Now imagine if this was open outcry: Ecosia bids X dollars. Qwant outbids them by offering X+1 dollars for that fourth slot. Well: If Ecosia knows they would still be profitable even if they had to pay X+2 dollars, that’s what they’re going to bid, right? They hit a limit only at the point where they know that the deal would turn unprofitable. The guy who gets the slot would, in open outcry, end up paying the next guy’s profit plus one dollar. But that’s not the model. They’re making sealed bids, and you’ll have to actually pay what you bid, so that’s why I’m saying all their profit.
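
The difference between the two auction formats can be sketched in a few lines. The bidder names, the amounts, and both function names are hypothetical. The sketch also takes over the assumption made above that, in the sealed format, every bidder bids right up to the point where the deal would turn unprofitable.

```python
def open_outcry_price(limits, increment=1):
    """Ascending open-outcry auction.

    limits maps each bidder to the highest amount they could pay and still
    be profitable. Bidding stops once the runner-up drops out, so the winner
    pays the runner-up's limit plus one increment (capped at their own limit).
    """
    ranked = sorted(limits.items(), key=lambda item: item[1], reverse=True)
    (winner, winner_limit), (_, runner_up_limit) = ranked[0], ranked[1]
    return winner, min(winner_limit, runner_up_limit + increment)


def sealed_bid_price(bids):
    """Pay-as-bid sealed auction: the highest bid wins and is paid in full."""
    winner = max(bids, key=bids.get)
    return winner, bids[winner]


if __name__ == "__main__":
    # Hypothetical profitability limits in dollars.
    limits = {"bidder_a": 100, "bidder_b": 80, "bidder_c": 60}
    print(open_outcry_price(limits))  # winner keeps part of their margin
    print(sealed_bid_price(limits))   # winner hands over their whole margin
```

In the open-outcry case, the winner’s payment is pinned to the runner-up’s limit; in the pay-as-bid case, it is pinned to whatever the winner felt forced to write on the sealed bid.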

#wirtschaft#computer   |   Aug-13 2019

==> The debate is still in full swing, and I’m still getting feedback. So here’s another round.

Behind the design of Python, there is a philosophy that goes roughly like this: For every thought, the language provides at least one and at most one obvious way to express it.

The first part of this statement (“at least one obvious way”) means that you’ve already attained the full expressive power of the language once you’ve truly mastered, out of all the constructs underlying the language, only the few that are the obvious ones, judging by the code you read. So, even when you’re only at the beginning of the learning curve and have invested little in learning the language, you already get the full return of being able to express whatever comes to mind.

The second part of this statement (“at most one obvious way”) means that the amount of code you have to read to reach this point in the learning curve is kept as small as possible, so that this necessary initial investment is kept as small as possible. To see why, consider an example: “An iteration over strings, where each one is to be printed.” If this thought is expressed every, really every, time as for line in lst: print( line ), then, while learning the language, you’ll be confronted with the underlying language constructs early and often. This makes learning tremendously easier.

Perl is an example of how it shouldn’t be. Here, there are perhaps ten different ways to express this thought in one or two lines of code, based on different language constructs. None of these ways enjoys the privilege of being the obvious one. Every programmer will prefer a different one. Anyone who wants to learn to read arbitrary Perl code written by another programmer therefore first has to get to know all of these language constructs. But there will be fewer opportunities to do so, since each one now occurs correspondingly less often in code.

Many of the language constructs recently introduced in Python, however, go against this principle and introduce superfluous additional ways of expressing, with little code, thoughts that could already be expressed with little code before. That’s why I say that Python is now repeating the mistakes that were made with Perl.

#computer   |   Jul-17 2019

==> I’m getting some feedback on this and watching how the discussion is unfolding on social media. The topic seems to be very polarising.

One faction sees the other as a bunch of uncritical, naive youngsters. They see change for the sake of change, with no real purpose and with the risk of making everything needlessly complicated. The other faction sees its counterpart as a collection of cynical old farts who would oppose any innovation whatsoever, simply on principle, regardless of what it’s actually about. Well, I guess that makes me one of the old farts. It’s not that far-fetched, really, because experience does have a lot to do with it. There aren’t that many people left today who still know Perl and can see how history is repeating itself right now.

#computer   |   Jul-17 2019

==> Jake Edge of LWN: “What’s coming in Python 3.8”.

As someone who has written Python code almost every day for the last 16 years of his life, let me say: I’m not happy about this. Some of these new features open the door to antipatterns that consistently frustrate me when I have to deal with Perl code that I didn’t write myself. For example, I’ve always been very pleased that Python has no language constructs that blur the lines between code and string literals or between statements and expressions.
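
For illustration, here is a sketch of what such constructs look like. It assumes that the features in question are f-strings (in the language since 3.6, extended in 3.8 with the "=" specifier) and assignment expressions (new in 3.8); the text above doesn’t name them, and the variable names and data are made up. It needs Python 3.8 or newer to run.

```python
import re

name = "world"

# Code inside a string literal: the expression in braces is evaluated.
greeting = f"hello {name.upper()}"

# Python 3.8 adds the "=" specifier, which echoes the expression text too.
debug = f"{len(name)=}"

# An assignment used as an expression (3.8): binds match inside the condition.
if (match := re.search(r"\d+", "order 42")) is not None:
    number_from_expression = int(match.group())

# The same logic with statements and expressions kept apart.
match = re.search(r"\d+", "order 42")
if match is not None:
    number_from_statement = int(match.group())
```

Both variants compute the same thing; the second one simply keeps assignment as a statement of its own, which is how it always had to be written before 3.8.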

#computer   |   Jul-17 2019

==> Mark Christian says: “You should have a personal website”.

Lately, there have been many comments going around online whose basic idea is essentially: “Wouldn’t it be great if we did the web again as if it were 1997?”

But from an economic point of view, there is a big difference between a personal website in 2019 and a personal website in 1997: Back then, you could expect to get the top spot on search engines straight away when someone searched for your name. As for the content of the website, you could also realistically hope to land a good spot for a few keywords. The chances of the latter are poor today. If you have the misfortune of having a common name, or a “name double” with a strong online footprint, then your personal website won’t even make it to the top of the search engine rankings when someone searches for your full name.

As the number of users on the net has grown, the incentives to post content that attracts attention have grown ever larger. Commercial interests can throw money and resources at the problem of capturing attention online. A personal website run by an individual can hardly keep up with that.

So the cost-benefit calculation has changed considerably, particularly regarding the question of how often you need to post new content. A lack of activity on a website is punished harshly by search engines, even if it mainly hosts content that doesn’t really lose currency and relevance over time. It is precisely this kind of content that we’d need more of, and less of the ever-same content that the content mills regularly regurgitate so that it doesn’t drop off the net.

So, if putting personal websites online is to be worthwhile again, the problem of making such content discoverable would have to be solved first. I think that would take a specialised search engine. One could also contribute by building the personal website on a “fediverse” technology stack, so that, as a social medium, it can also be discovered through social mechanisms. And it takes a U-turn away from the current legislative trend of placing ever more burdens on website operators. A single private individual can hardly bear them any more.

So yes, I’m all for it: I’d like to have the web of 1997 back. But for that to become reality, many challenges have yet to be tackled.

#computer   |   Apr-30 2019

==> Muzayen Al-Youssef of Der Standard reports: “Government Seeks to Eliminate Internet Anonymity – With Severe Penalties”.

The fact that they’re calling it “digitales Vermummungsverbot” already tells you everything you need to know: There is no real rationale here besides a political stunt of the right-wing government to curry favour with their base. The original “Vermummungsverbot” is a law prohibiting people from wearing a veil in public. The pretence here is that people hiding their identity are a threat. The political effect was that xenophobes liked the idea of a law opposed to Islam. The reality is that the law really has no effect. There are almost no people in Austria who would want to wear a veil in public in the first place, apart from maybe the odd female tourist visiting from Saudi Arabia. The idea now is that the same should apply to the digital sphere.

My guess would be that they know full well that it’s never going to pass into law and make it past Brussels. But to them, it’s a win-win. Either they get a law that appeases the right-wing populace, or Brussels stops them, playing into their anti-European narrative.

#computer#politik   |   Apr-19 2019

==> Eike Kühl of Zeit Online reports: “Wer das Darknet ermöglicht, könnte bald Straftäter sein”. (Translation: “Whoever facilitates the Darknet could soon be considered a criminal”).

A link based on Google Translate is going around the English-speaking internet, accompanied by speculation that Germany is trying to make Tor illegal or make it illegal to run exit nodes.

I thought I’d comment on that for the benefit of my English-speaking friends, having read the article in full in its German original.

The article details that the proposal is to make it a crime to run sites whose access is restricted by specific technical measures and whose purpose or activity is oriented towards aiding crime. The notion that it would become a crime to run a Tor exit node is speculation on the part of the authors of this article.

But it is not apparent to me how that would follow from such a law: For example, since anyone can access Tor, and, by implication, any Tor exit node, access to a Tor exit node is not restricted in any way, so it would not seem to fall under this definition in my opinion.

As for running a Tor hidden service, this would seem to apply, but still: How do you draw the line between running a service where it just so happens that criminal business is conducted over that service versus running a service whose purpose or activity is actually directed towards aiding crime?

#politik#computer   |   Mär-15 2019

==> I just came across Augustin G. Walters’ blog post “The Last Free Generation”. It is about surveillance capitalism. Response: One would have to switch to pseudonymous or anonymous media.

That might have worked 20 years ago, when anonymity was still the default on the internet. Back then, people generally placed little trust in anything to do with the internet. That was the golden age of IRC, and in some niche media it is still like that today, such as HN or Freenode.

But for a majority of users and occasions, anonymous communication is simply no longer an option today, and the clock can hardly be turned back on this point: The presence of media like Facebook, which prohibit anonymity, has drawn the “innocent” mass communication away from media like IRC, where anonymity would still be allowed, and all that remains there are dubious characters pursuing shady purposes. Anyone looking for a source of narcotics will hardly do so on Facebook. Anyone wanting to share a photo of their lunch with friends will hardly choose IRC as the medium. (By “IRC”, I of course don’t mean Freenode, but the remnants of good old IRCnet and the like.)

A coexistence of anonymous and non-anonymous media will always make it hard for the anonymous media to rid themselves of everything that is unwanted in social media. So as long as pseudonymity or anonymity is not the default, for sensitive and non-sensitive communication alike, we have not gained much.

#computer   |   Feb-03 2019

==> Several people are asking me to go into detail here.

My response ended up quite lengthy. Sorry about that, folks. Editing things down is hard work, and I don’t have the time for it right now, so I’ll just post the long version here.

First, despite all the attention that big data is getting, most data science use cases aren’t actually use cases of big data. There are three reasons for that: (a) You will often work with datasets so small that the sheer amount of data and processing efficiency just won’t present a difficulty. (b) You will often be in a situation where it is workable to do random sampling of your real dataset very early on in the data processing pipeline, which will allow you to still obtain valid statistical estimates while reducing the size of the datasets that need processing. (c) You will often be able to do preaggregation. (For example: instead of each observation being one record, a record might represent a combination of properties plus a count of how many observations have that combination of properties.)
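To make (b) and (c) concrete, here is a minimal sketch in Python. The column names `country` and `device`, the inline dataset, and the helper names `sample_rows` and `preaggregate` are all made up for illustration; a real pipeline would read from a file instead.

```python
import csv
import io
import random
from collections import Counter


def sample_rows(rows, rate, seed=0):
    """Keep each row independently with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)  # fixed seed so reruns select the same rows
    for row in rows:
        if rng.random() < rate:
            yield row


def preaggregate(rows, keys):
    """Collapse rows into one count per distinct combination of the `keys` columns."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


# Tiny inline dataset standing in for a real CSV file (hypothetical columns).
RAW = """country,device
AT,mobile
AT,mobile
AT,desktop
DE,mobile
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
counts = preaggregate(rows, ["country", "device"])
for combination, n in sorted(counts.items()):
    print(combination, n)
```

Both helpers stream over their input, so they can sit at the very front of a pipeline. Note that counts computed on a sample need to be scaled by `1 / rate` to estimate totals for the full dataset.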

My strategy for dealing with data that doesn’t fall into the big data category is this: A database object that’s tabular in nature is, by default, a CSV file. An object that’s a collection of structured documents is, by default, a YAML file. The data analysis is split into processing steps, each turning one or more input files into one output file. Each processing step is a Python script or Perl script or whatever. You can get pretty far with just Python, but, say there’s one processing step where you need to make a computation where there’s a great library to do it in Java, but not in Python; feel free to drop a Java programme into the data analysis pipeline that otherwise consists of Python. Then you tie the data processing pipeline together with a Makefile. An early part of the processing pipeline is often to convert the human-readable CSV and YAML files into KyotoCabinet files with the proper key/value structure to support how they will be accessed in later parts of the pipeline.
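A single processing step in this style might look roughly like the sketch below. Everything named here is an assumption for illustration: the `id` and `email` columns, the `normalise` transform, and the choice to write to a temporary file first (so that an interrupted run does not leave a half-written output that looks like a finished target).

```python
import csv
import os
import sys
import tempfile


def normalise(row):
    """Hypothetical transform: drop rows without an id, lowercase the email."""
    if not row["id"]:
        return None
    return {"id": row["id"], "email": row["email"].lower()}


def run_step(input_path, output_path, transform):
    """Read one CSV, apply `transform` to each row, write one CSV atomically."""
    out_dir = os.path.dirname(os.path.abspath(output_path))
    with open(input_path, newline="") as src:
        reader = csv.DictReader(src)
        # Write next to the final output, then rename over it at the end.
        fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", newline="") as dst:
                writer = None
                for row in reader:
                    out = transform(row)
                    if out is None:
                        continue  # the transform asked to drop this row
                    if writer is None:
                        # Header comes from the first row that is kept.
                        writer = csv.DictWriter(dst, fieldnames=list(out))
                        writer.writeheader()
                    writer.writerow(out)
            os.replace(tmp_path, output_path)
        except BaseException:
            os.unlink(tmp_path)  # clean up the partial file, then re-raise
            raise


def main(argv):
    """Return 0 on success and 1 on failure, so the caller can react."""
    try:
        run_step(argv[1], argv[2], normalise)
    except Exception as exc:
        print(f"step failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

Ending the script with `sys.exit(main(sys.argv))` turns the return value into the process exit code, which is all a Makefile rule needs in order to tell success from failure.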

This general design pattern has several things to recommend it:

  1. Everything is files, which is great, especially if you work in a big dysfunctional corporate environment. My experience has been that it’s comparatively easy to get a corporation to allocate you a server for your exclusive use. Run out of capacity and need more servers? An admin will usually be able to find an underutilized server for you and give you an account as long as you don’t have any special requirements. (And being able to process files is not a special requirement). They usually also have databases, like enterprise-scale Oracle servers. But, usually, they won’t just give you the keys to the kingdom. You’ll have to put in provisioning requests with DB admins, which require signoff from managers, etc., which can take a long time. And planning and enforcement of capacity quotas for the people using such shared resources are often inadequate, and you’ll get the tragedy of the commons unfolding. Hey admin, why are my workloads on the Oracle server suddenly taking so long? Because of all the other people whose workloads have been added. A request for more DB servers has already been put in, but until they can be provisioned, we’ll all have to be team players and make do. Meanwhile, your manager goes: Hey, why is it taking longer and longer for you to complete your assignments? Because Oracle is getting slower and slower! …this comes across as the data scientist’s equivalent of “The dog ate my homework.” The situation sucks. So it’s better to strive for independence, and using files where other people might use databases can help with that.

  2. I like to equip my scripts with the ability to do progress reports and extrapolate an ETA for the computation to finish. Suppose, 1% into the computation, it becomes apparent that it takes too long. In that case, I cancel the job and think about ways to optimize. I’m not saying it’s impossible to do that with a SQL database, but your SQL-fu needs to be pretty damned good to make sense of query plans and track the progress of their execution etc. If you have a CSV input, progress reporting might be as simple as saying, “if rownum % 10000 == 0: print( rownum/total_rows )”, then hitting Ctrl-C if you lose patience. In practice, doing things with a database often means sending off a SQL query without knowing how long it will take. If it’s still running after a few hours, you start to investigate. But that’s a few hours of lost productivity. Things are particularly painful when the scenarios described under (1) and (2) combine. You might be used to a particular query taking, let’s say, 4 hours. Today, for some reason, it’s been running for 8 hours and is still not finished. You suspect the database is busy with other people’s workloads and give it a few more hours, but it’s still unfinished. Only now do you start investigating what’s happening. But this sort of lost productivity is often the difference between making and missing a deadline for reporting some numbers. (Think about the scenario where it’s a “daily batch”, and you need to go into that all-important meeting, reporting on TODAY’s numbers, not yesterday’s.)

  3. “Make” is an excellent tool for doing data science. Still, for it to work its magic, you mostly have to stick to the “one data object equals one file” equation. Do you want parallel processing? No problem. You’ve already told Make about the dependencies in your data processing. So wherever there isn’t dependency, there’s an opportunity for parallelization. Just go “make -j8”, and make will do its best to keep 8 CPUs busy. Do you want robust error handling? No problem. Just make sure your scripts have exit code zero on success and nonzero on failure. “make -k” will “keep going”. So when there’s an error, it will still work off the error-free parts of the dependency graph. Say you run something over the weekend. You can come back Monday, inspect the errors, fix them, hit “make” again, and it will continue and not need to redo the parts of the work that were error-free.
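The progress reporting from point 2 can be fleshed out a little, as in the following sketch. The helper name `with_progress` and its parameters are invented for this example, and the ETA is a plain linear extrapolation, which assumes that every row costs about the same. The total has to be known up front, for instance from a quick first pass that counts the lines of the input.

```python
import sys
import time


def with_progress(items, total, every=10000, out=sys.stderr, clock=time.monotonic):
    """Yield items unchanged; report percent done and a linear ETA every `every` items."""
    start = clock()
    for done, item in enumerate(items, start=1):
        yield item
        # We only get here once the consumer has finished with `item`.
        if done % every == 0 or done == total:
            elapsed = clock() - start
            remaining = elapsed / done * (total - done)
            print(f"{done}/{total} ({done / total:.1%}), ETA {remaining:.0f}s", file=out)
```

Wrapping the main loop is then a one-line change, for example `for row in with_progress(reader, total=total_rows): ...`. Reports go to stderr by default so that they don’t end up mixed into a step’s actual output.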

Now, after this whole prelude around my philosophy of doing data crunching pipelines in a data science context, we finally get to the point about KyotoCabinet. Even though you usually find that CSV or YAML is fine for most of the data objects in your pipeline, there will almost always be some where you can’t be so laissez-faire about the computational side of things. Say you have one CSV file which you’ve already brought down to a manageable size (1M rows, let’s say) through random sampling. But it contains an ID you need to use as a join criterion. Let’s say the table that the ID resolves to is biggish (100M rows, let’s say). You can’t apply any of the above “tricks” to reduce the size of the second file. If you sampled it randomly as well, you’d throw away most of the rows you need, and the join would fail to produce a hit for most of the left-hand rows. So, for that file, you can’t get around having it sitting around in its entirety to be able to do the join, and CSV is not the way to go. You can’t have each of the 1M rows on the left-hand side of the join trigger a linear search through 100M rows in the CSV. Your two options for the right-hand side would be to load it into memory and join against the in-memory data structure, or to use something like KyotoCabinet. The latter is preferable for several reasons.

  1. Scalability. Most data science projects tend to increase the size of data objects over time as they implement each feature. If you get to a point where a computation initially implemented in memory grows beyond the size where this is still feasible, you’re in trouble. You might go to a pointy-haired boss and tell them, “I can’t accommodate this feature request without first refactoring the in-memory architecture to an on-disk architecture. So I’ll do the refactoring this week, then start working on your feature request next week”. This will sound in their ears like, “I’m going to do NOTHING this week”. So, I have a strong bias against doing things in memory.
  2. Making all data structures persistent means that the computational effort of producing them doesn’t have to be expended repeatedly as you go through development cycles.
  3. It may not even be slower in terms of performance, thanks to the memory-mapped I/O that KyotoCabinet and other key-value stores do.
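The join described above can be sketched as follows. To keep the example runnable without installing anything, it uses Python’s standard-library `dbm` module as a stand-in for KyotoCabinet; that substitution, the column names, and the helper names `build_index` and `join` are assumptions made for this illustration only. The pattern is the same with any store that offers get and set.

```python
import csv
import dbm
import io
import os
import tempfile


def build_index(rows, key_column, value_column, db_path):
    """Write key -> value pairs from the big table into an on-disk store."""
    with dbm.open(db_path, "n") as db:  # "n": always create a new, empty store
        for row in rows:
            db[row[key_column].encode("utf-8")] = row[value_column].encode("utf-8")


def join(left_rows, key_column, db_path, new_column):
    """Inner join: one keyed lookup per left-hand row instead of a linear scan."""
    with dbm.open(db_path, "r") as db:
        for row in left_rows:
            value = db.get(row[key_column].encode("utf-8"))
            if value is not None:  # rows without a partner are dropped
                yield {**row, new_column: value.decode("utf-8")}


# Tiny stand-ins for the 100M-row table and the 1M-row sample (hypothetical data).
BIG = "customer_id,segment\nc1,retail\nc2,wholesale\nc3,retail\n"
SAMPLE = "order_id,customer_id\no1,c2\no2,c9\n"

workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "customers")
build_index(csv.DictReader(io.StringIO(BIG)), "customer_id", "segment", db_path)
joined = list(join(csv.DictReader(io.StringIO(SAMPLE)), "customer_id", db_path, "segment"))
print(joined)
```

Because the index lives in its own file, building it can be a separate pipeline step that only needs to be rerun when the big table changes.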

…so this is roughly where I’m coming from as a KyotoCabinet frequent-flyer.

#computer   |   Jan-12 2019

==> Charles Leifer writes on Kyoto Tycoon in 2019.

KyotoCabinet, the storage engine beneath KyotoTycoon, is one of the most important weapons in my holster as a data scientist.

For a data scientist, there are essentially two different kinds of job environments. One is where it’s all about infrastructure and implementing the database to end all databases (and finding stuff out is a mere afterthought). The other is where 100% of the focus is on finding stuff out by next week with zero mandate for any infrastructure investment. When I find myself in the latter kind of work environment, and I need to quickly get sortable and/or indexable data structures of any kind, then a key-value store is the way to go, and KyotoCabinet is a really good one that I’ve used lots and has never let me down.

Just keep your boss from finding out about it if yours is of the pointy-haired variety. He will be less than pleased if he finds out it’s an open-source project that saw its last commit some 6 years ago. – As for myself, this doesn’t bother me all too much. It’s feature-complete w.r.t. anything I’ve ever wanted to do with it, and its feature set is way richer than most of the younger alternatives that are still being actively developed (because they still need active development and aren’t nearly as well battle-tested). Plus, what’s the worst that could happen? Get & set is basically all there is to it, and if you should ever be in a position where you need to replace it with something else that has get & set, there should be tons of options. The migration should be easy to do.

#computer   |   Jan-12 2019

==> I just discovered the AI2 Reasoning Challenge.

It reminds me a lot of the Recognizing Textual Entailment Challenge that was in its fourth iteration when I submitted results from the system I did for my Ph.D.

#computer   |   Jan-05 2019

==> Tovia Smith and NPR report: “More States Opting To ‘Robo-Grade’ Student Essays By Computer”.

When I applied to study at Cambridge back in the day, they actually took the time to interview me. That was the Cambridge in England, not the one near Boston, and I was interested in two courses: the M.Phil. in Computer Speech, Text, and Internet Technology and the bachelor’s degree in mathematics as a “mature student” at St. Edmunds College. Both interviews were conducted by professors, that is, really by the holders of the relevant chairs, not by assistant lecturers. In the U.S.A., it was already common at the time to have standardized tests graded by poorly paid assistants; by “Mechanical Turk”, in essence. And now they don’t even take the time to entrust a human with the task at all? I am speechless.

I hope Cambridge remains a stronghold of reason and humanity in this regard. After all, this is about deciding the professional future of young people, and such decisions really should not be left to algorithms.

#politik#wirtschaft#computer   |   Jul-21 2018