Motivating Basic Statistics - for Programmers

Tuesday, 07 April, 2020

The Non-Statistical Thinking Fallacy

It was a fine Saturday morning! Oh, I know this is a very clichéd way of beginning any piece of literature but what can I do? It was a Saturday! It was a morning, and a fine one at that! I was attending a Meetup aimed at explaining Redis, an in-memory database typically used for speeding up your programs. A code example was being walked through and executed, and the time it took to run was being measured.

A suspicious man in the audience studies the code keenly. After a while he says, "I think you should swap these two lines. The code will run even faster." Not everyone agrees, the audience gets divided, and the debate takes us all astray from the main topic. Hoping to settle the debate once and for all, the person in charge of the demo promptly swaps the two lines and reruns the program. It takes less time to execute!

Our suspicious man grins and gives the audience an emphatic look. Another person in the audience is still not convinced: "There should not be any difference." The speaker politely requests that the discussion be taken up later.

What do you think? What is the problem with this scenario? Well, it is a meaningless way of settling a debate - that's the problem. It suffers from the 'non-statistical thinking fallacy'. This fallacy arises when a person ignores an important fact - almost all measurements are noisy.

Enter Random Variables

A random variable is a quantity whose value cannot be predicted exactly. We also call it 'noise' in many contexts. Almost all measurements have a source of noise. So, even if the true value of some quantity is deterministic, the measured value is a random variable.

$$ t_{measured} = t_{actual} + noise $$

In the example scenario above, the sources of noise could be plenty: other processes competing for the CPU, operating system scheduling decisions, caches warming up, network round trips to the Redis server, and so on.

So, a single measurement of the time taken by version A of the code and a single measurement of the time taken by version B is simply not enough to conclude whether the two versions (differing only by a line swap) truly differ in execution time.
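To see the noise for yourself, here is a minimal sketch in Python (my own toy snippet, not the Meetup code): it times the same piece of work ten times and prints all ten numbers.

```python
# A toy sketch (not the Meetup code): time the same piece of work
# several times and notice that every run gives a different number.
import time

def work():
    # Stand-in for the Redis demo; any fixed computation will do.
    return sum(i * i for i in range(100_000))

timings = []
for _ in range(10):
    start = time.perf_counter()
    work()
    timings.append(time.perf_counter() - start)

print([f"{t:.4f} s" for t in timings])  # same code, ten different numbers
```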

So, how should we draw a conclusion in such a case? These are exactly the kinds of questions that statistics is designed to answer.

Central Limit Theorem

Say we measure the time taken to execute version A many times. Each time the number will be different - this confirms that we are dealing with a random variable. There are two important quantities we can compute from our measurements: the average and the variance (whose square root is the standard deviation).

$$ \mu = \sum_i \frac{t_i}{N} $$

$$ \sigma^2 = \sum_i \frac{(t_i - \mu)^2}{N-1} $$
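As a rough sketch, and reusing the `timings` list from the earlier snippet, both quantities can be computed with Python's statistics module.

```python
# Rough sketch, reusing the `timings` list from the snippet above:
# Python's statistics module computes both quantities directly.
import statistics

mu = statistics.mean(timings)      # the average
sigma = statistics.stdev(timings)  # sample standard deviation (N - 1 in the denominator)
print(f"mean = {mu:.4f} s, standard deviation = {sigma:.4f} s")
```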

The average tells us the 'central value' around which our measurements dance. The standard deviation tells us how much or how far they dance. So, if we decide to take any one measurement and quote it as the time taken for the program to execute, our estimate has an uncertainty around it quantified by the standard deviation. But let us say that we decide to take the average of all the measurements and quote the average instead? Are we still equally uncertain about our estimate?

The central limit theorem, among other things, tells us that our uncertainty about the average (not about any individual measurement) is actually smaller than the uncertainty of an individual measurement. This uncertainty is called the 'standard error' and is given by

$$ SE = \frac{\sigma}{\sqrt{N}} $$

So, if you obtained the average using 4 measurements, your uncertainty goes down by a factor of 2; if you obtained it using 100 measurements, it goes down by a factor of 10. Hence the instruction from your school and college teachers to repeat a measurement many times during an experiment and take the average.
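Continuing the same sketch, the standard error is a small step beyond the standard deviation.

```python
# Continuing the sketch: the standard error is a one-liner on top of
# the sample standard deviation, and it shrinks as sqrt(N) grows.
import math
import statistics

def standard_error(samples):
    return statistics.stdev(samples) / math.sqrt(len(samples))

print(f"standard error with N = {len(timings)}: {standard_error(timings):.5f} s")
```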

Going back to our debating programmers, the first problem there is that they are relying on a single measurement, where the uncertainty is considerable.

Settling The Dispute

So, how shall we settle the question of whether version A and version B of the code indeed differ in execution time? Here is a simple prescription: measure the execution time of each version many times, compute the average and the standard deviation for each, and then look at how far apart the two averages are relative to the spread.

For convenience, let us assume the two standard deviations are equal; the distance can then be defined as

$$ d = \frac{\mu_A - \mu_B}{\sigma} $$

The larger the value of this distance, the more adamant you can be that versions A and B differ in execution time. Typically, if the distance 'd' described above is less than 3, you should be skeptical. Or you could take more measurements: the uncertainty on each average goes down as the square root of N, so your estimate of 'd' becomes more trustworthy.
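Here is a sketch of that computation, assuming we have collected repeated timings for both versions (the two lists below are hypothetical placeholders).

```python
# Sketch of the distance 'd', assuming repeated timings for both versions
# (`timings_a` and `timings_b` are hypothetical lists of seconds).
import statistics

def distance(timings_a, timings_b):
    mu_a, mu_b = statistics.mean(timings_a), statistics.mean(timings_b)
    # Mirror the equal-spread assumption in the text by pooling the two
    # sample standard deviations into one sigma.
    sigma = (statistics.stdev(timings_a) + statistics.stdev(timings_b)) / 2
    return abs(mu_a - mu_b) / sigma

# d = distance(timings_a, timings_b)
# If d < 3, stay skeptical and collect more measurements before declaring a winner.
```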

Can you quantify the 'adamancy' allowed? Yes! We will, however, need more advanced concepts such as confidence intervals, Student's t-distribution, p-values and more. Perhaps another time!

Beyond Execution Times

I took the example of execution time as the measured quantity because I witnessed two programmers caught in this futile discussion, suffering from the non-statistical thinking fallacy. But the world of computer science is filled with examples where such a fallacy can be a problem (thankfully, where it matters, experts don't succumb to it): benchmarking two algorithms or libraries against each other, A/B testing a change on a website, comparing the accuracy of two machine learning models, or tracking the latency of a service after a deployment.

Conclusions

I hope I have been able to demonstrate the necessity of keeping some basic statistical principles at the forefront of your thought process, especially when measured quantities are involved. In some cases, people simply are not aware of them. In other cases, people forget to think this way. Statistics is an important subject. Sure, we can leave the super advanced concepts to the statisticians, but the basics are an absolute must.



