The Enterprise

Problems of Reproducibility in Computer Science

Michael Pankov

There is a word "science" in "Computer Science". But is there real science in our field of study?

In this post, I will go over the foundations of the scientific method and describe what goes wrong with it in studies of computers. I'll also try to identify means of fixing the current, quite sad, state of affairs.

The scientific method, at its core, is a universal practice consisting of the following steps:
  1. Form a theory.
  2. Predict what should happen if the theory is true.
  3. Gain experimental data.
  4. Reason about data and correct theory (if necessary).
The steps are very simple, and yet they have underpinned the critical exploration of the world since the beginning of mankind.

Nevertheless, there are many cases when the approach doesn't immediately work, frequently because of a lack of supporting evidence or misinterpretation of the gathered data. The geocentric model of the Solar System only staggered when the telescope was invented. Luminiferous aether plagued the minds of scientists for a long time, because the dual nature of light was hard to explain.

In Computer Science, we collect data on programs executed on some computer system: time of completion, power consumption, latency of request handling. The factors that determine the number we finally get include CPU cache size, memory clock speed, and whether loop unrolling is enabled in the compiler, among hundreds of others.

This is real. Running two loops one after another may yield better performance than doing both actions in the same loop body. Or it may not, depending on the size of the array.
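To illustrate, here is a minimal micro-benchmark sketch in Python. The array size, repetition count, and function names are illustrative; the effect is usually discussed for compiled code, and in an interpreter the outcome may well differ, which rather proves the point of this post.

```python
import timeit

N = 100_000  # illustrative size; the winner may change with N

def fused(a, b):
    # One loop doing both updates per iteration.
    for i in range(len(a)):
        a[i] += 1
        b[i] += 1

def fissioned(a, b):
    # Two separate loops, one update each.
    for i in range(len(a)):
        a[i] += 1
    for i in range(len(b)):
        b[i] += 1

a = [0] * N
b = [0] * N
t_fused = timeit.timeit(lambda: fused(a, b), number=20)
t_fissioned = timeit.timeit(lambda: fissioned(a, b), number=20)
print(f"fused: {t_fused:.3f}s  fissioned: {t_fissioned:.3f}s")
```

Which version wins depends on the cache hierarchy, the interpreter or compiler, and the data size, so no expected ratio is stated here.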

Printing the letter "B" to the terminal may be 30 times slower than printing a "#" character. Sorting an array may miraculously make its processing 6 times faster.
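The sorted-array effect is commonly attributed to CPU branch prediction: a sorted input makes the branch outcome predictable. The sketch below times the same branchy loop over unsorted and sorted copies of the data; the sizes are illustrative, and in CPython the interpreter overhead may drown out the hardware effect that is dramatic in compiled languages.

```python
import random
import timeit

random.seed(0)
data = [random.randrange(256) for _ in range(20_000)]
sorted_data = sorted(data)

def count_large(xs):
    # A branchy loop: the comparison outcome is what a
    # CPU branch predictor would have to guess.
    total = 0
    for x in xs:
        if x >= 128:
            total += x
    return total

t_unsorted = timeit.timeit(lambda: count_large(data), number=20)
t_sorted = timeit.timeit(lambda: count_large(sorted_data), number=20)
print(f"unsorted: {t_unsorted:.3f}s  sorted: {t_sorted:.3f}s")
```

The two calls compute the same sum, so any timing difference comes purely from the order of the data, exactly the kind of hidden variable the rest of this post is about.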

There are countless examples of such counter-intuitive behavior of computers. These effects are barely predictable, although explainable post factum. But such an explanation is unscientific: it is postdiction, the "knew-it-all-along" effect.

To carefully perform an experiment in Computer Science, one has to control for many variables, most of which are almost never controlled for. Our caches are state machines (and their state is not observable). Our CPUs can arbitrarily change their frequencies and voltages to save power. When a scientific paper is published, it contains only a vague specification of the system the academics used, e.g., "The CPU used is an Intel Xeon E5420", as if that were all that matters.
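A first step toward better system specifications is simply recording machine metadata alongside every result. The sketch below is a minimal, assumed schema (not any published standard); a real framework would also capture compiler flags, library versions, and the CPU frequency governor.

```python
import json
import platform
import sys

def capture_environment():
    # Record basic machine state next to experimental results.
    # Note: platform.processor() may return an empty string on
    # some systems, one more reason to capture several fields.
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "system": platform.platform(),
        "python": sys.version,
    }

print(json.dumps(capture_environment(), indent=2))
```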

Speaking of scientific papers: a great many of them are barely reproducible at all (1). They often refer to source code that was never released, and use ad-hoc methods for optimizing programs and measuring the effects of the optimizations. Papers sitting behind the paywalls (2) of big publishers don't help the matter either.

We explain the faster fall of a weight, compared to a feather, by the greater mass of the former. (Well, of course we don't, anymore!) That's the equivalent of what we're doing now in computational experimentation. It may even work in the most primitive cases and give seemingly accurate predictions. But once you try to build a catapult or launch a rocket, you need to account for air friction and the more general law of gravitation. And when you want to understand events at the scale of the Universe, you have to introduce relativity.

But we're walking in the dark, like the ancients explaining a thunderbolt by the wrath of Zeus.

There are many people who are concerned with the current state of affairs. We need a systematic approach to the collection and analysis of data. We need APIs that make experiments reproducible. We should store everything in a universal format, so that later scientists have access to the many experiments we performed.
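The "universal format" idea can start very small: serialize each run with its raw measurements, not just a summary. The record schema below is purely illustrative, not the actual format of any existing framework.

```python
import json
import statistics
import time

def run_experiment(workload, repetitions=5):
    # Run the workload several times and keep the raw timings,
    # so later analyses can apply their own statistics.
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return {
        "workload": workload.__name__,
        "timings_s": timings,
        "median_s": statistics.median(timings),
    }

record = run_experiment(lambda: sum(range(100_000)))
print(json.dumps(record, indent=2))
```

Storing raw timings rather than a single mean is a deliberate choice: it lets a later reader check the variance, which is where many irreproducible results hide.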

There are the Collective Tuning and Collective Mind (3) initiatives, with Grigori Fursin behind them, and I'm glad I participated in one of them (although briefly). There's also my humble attempt in the area, Adaptor. It's in no way general or complete enough, but it was an attempt to reuse a lot of existing tools, whereas Collective Mind built everything from the ground up.

Frameworks such as those presented above try to eliminate as many variables as possible. Nearly all software dependencies are managed by the tool, and the only things left are the OS and hardware. They also provide unified statistical and machine learning methods for analyzing the collected experimental data.

Computational science being science again means a lot (4).

And we can help it (5).

(1) There are also critics of those skeptics.
(2) A rather controversial article, but there's a message.
(3) This is an actual working framework for computational data collection & analysis. In case you plan to do something in the area, you absolutely should give it a try!
(4) A presentation in PDF, slides 39-50 are particularly interesting.
(5) A Coursera course on reproducible research.