 08 May 2018

# Statistical Comparison & Analysis of Performance Metrics

## 1. Introduction

Say you have a computer program / routine / function / test / loop, basically anything that runs and does something specific. And you want to improve that code to make it run faster, or increase the throughput or use less power. Whatever it is, ultimately you want different numbers. Those numbers could be anything from time taken in nanoseconds, CPU cycles, cache hits to coarse grained metrics such as “objects” processed, requests handled etc.

At the end of the day, you want to show that your code is better. Or that you’ve actually succeeded in improving it. For example, you’re timing how long it takes for a function to parse a string and output a hash table. Assume this function takes 100ms to run your initial test case. After spending some time profiling and tinkering you get this time down to 70ms. Which seems like a big improvement since it’s 14.3 times faster now.

Getting accurate measurements also requires running the code several times and let’s say that you’re collecting and collating all that data. Now to present your results, you make a line chart from the numbers and calculate the average of the two techniques and (possibly) conclude that the newer improved technique is better because on average it runs faster. Well…….not quite! This is missing some statistical analysis to pinpoint a few things. Just using an average as a measurement can be potentially misleading. At the same time you also need a definitive way to say how much better your newer and faster technique is.

In this post, I’ll be talking about using statistically significant measures and analysis to ensure that you understand, to some extent, the performance numbers you’ve procured and how to make more accurate comparisons when comparing the differences between them. Overall, while this may not be “relevant in the real world” or be “too academic”, these are important concepts that should be carefully evaluated. Often, they can reveal underlying meanings that you might have missed or overlooked or it can even reveal how you might be misinterpreting your data altogether.

While we will be dealing with a few mathematical formulae, I’ll not go into the details or proofs of the formulae. Instead it will be more focussed towards which one to use and how to use them so potentially you can have it as part of an automated performance testing framework.

## 2. Before You Measure

Before we start sampling our data, let’s ask ourselves this question: What exactly are we trying to show? A simple answer could be that we are trying to show Method B performs better than Method A. But what if we find out that Method B is actually worse or maybe even the same as Method A? We need to draw such conclusions and comparisons in an accurate and definitive manner, instead of just saying that one Method has the same or more average time taken than the other. While the actual comparisons and analysis require mathematical formulae, a null hypothesis gives a definitive structure to the way those analysis and comparisons are performed. The null hypothesis is the starting point of this whole process and is used in scientific query and research.

Let’s say we have Method A, which is the original method / implementation / program / technique and we have another Method B which we want to show as an improvement to the original one. One way we can define our null hypothesis is

$H_{0} =$ There is no difference between Method A and Method B

Essentially, a null hypothesis is an indifferent (and possibly cynical) statement about the relationship between the two methods. It’s at this point considered common knowledge or the “default”. You are meant to hate it; and you must try your darndest to disprove it. To do this, we propose an alternate hypothesis ($H_{1}$). This is a statement that you wish to be true. In this case the alternative hypothesis would be something like

$H_{1} =$ Method A and Method B are different

Although, we want to be able to prove that Method B is an improvement, I’m assuming that we either don’t have any data from Method A or we haven’t analysed it yet. If we have such data then both our hypotheses would be slightly different. I’ll talk about this case in later sections.

We’ll be using a null hypothesis and will be talking about confidence levels, and in the end use some measure of varibility for the comparisions. And while you may not use a null hypothesis for your day to day performance comparisons, it can serve as a guide when doing your comparisons.

## 3. Choosing the Appropriate Metrics

Knowing what to measure is quite important. Sometimes just timing something isn’t good enough, and perhaps using hardware counters give a better measurement. Sometimes measuring total process memory may not be accurate and one should probably only measure heap allocated pages. Regardless of whatever you are measuring, make sure you’re using the most appropriate metric for that measurement. Since that is a different topic altogether, I will not be covering it in this blog post. When you are finally taking data from sample runs, make sure you’ve done a few warm up runs before you actually start collecting and also account for cooldown if necessary. This presentation covers a few of the variables in a bit more detail.

Additionally, you need to know the number of samples / measurements required for your analysis to be accurate. There are quite a few formulae that help you calculate such a number. Most of these formulae calculate a sample size if we’re given certain parameters related to a finite population (categorical data). But in our case, there really isn’t a finite population or number of times we know the code is going to be run. The population size doesn’t really matter that much and there are methods for taking that into account and correcting your test statistics based on your population and sample size. However sample size does matter, and to calculate the minimum number of samples that would be required you need to provide some information beforehand. To do this, we shall use Cochran’s sample size formula for continuous data:

$n = {Z^2 \sigma ^2 \over E^2}$

$n$    minimum sample size required
$Z$    z-value based on the confidence level
$\sigma$    estimated standard deviation
$E$    required margin of error for estimated mean

Before I explain what each of the variables mean, I’ll briefly mention what we mean by Confidence Level (CL) here. The confidence level you choose is in indication of how reliable you think your data should be. Generally speaking, a confidence interval of 95% or 0.95 is acceptable. In certain cases when you need to be absolutely sure, you can choose your CL to be 99% or anything else for that matter. It is rather subjective however it does affect the minimum number of samples you should be taking as we’ll see in the subsequent paragraphs.

$Z$ can usually be obtained from a table such as this one which is based on the confidence level. For $\sigma$ you can either provide an estimate or do some pre-test calculations or if you have done this before use that value. And the last missing value you’ll have to provide is the margin of error for estimated mean. You can then try varying the parameters as you feel fit until you get a sample size and confidence level of your choosing.

For example, let’s say I’m choosing a confidence level of 0.90 (90%), with a margin of error of $\pm$0.03 ms (very low indeed!) and a standard deviation of 2ms. Plugging these values into the formula gives us:

$n = {(1.645)^2 (2) ^2 \over (0.03)^2}$
$n = 12,026.\overline{7}$

So at minimum we need 12,027 samples i.e. run each method 12,000 times, and record your metric for each run to generate statistically significant data. Let’s say we want to change this and increase our confidence level to 0.95 (95%), let’s recalculate the minimum number of samples required:

$n = {(1.96)^2 (2) ^2 \over (0.03)^2}$
$n = 17,073.\overline{7}$

Which is at least 17,074 samples, approximately 5,000 more than the previous one! So if your method takes more than a couple of seconds, you’ll probably need to invest a bit of time. In this case, the margin of error used is pretty low, but more accurate, if for example you use 1ms for margin of error you get only 16 samples! Play around with these values until you feel confident enough (no pun intended) with the value of $n$ for your use case, generally speaking aim for at least a few hundred samples.

The main takeaway from this is that for more accurate data, you need more samples. However, mean and standard deviation are NOT good metrics when it comes to sampling data in computer science. Try and avoid them in favour of median and quartiles. I’ll mention the reason behind this in the later sections as it’ll be easier to understand with a bit more context. So however many samples you decide to take, make sure they’re enough to produce decent data.

## 4. Sampling Data

Once you have sorted these out, it’s now time to run your code and collect your data for both methods A and B. We’ll be collecting 17,000 samples for both Method A and Method B. For this post these values will be artificially generated although it will look very similar to previous such measurements (I’ve) taken. This also allows us to talk about a few more things in this post.

It’s up to you how you want to collect your data. Personally, I’ve mostly just output numbers to the standard error stream and piped them to a csv or text file. This also makes things easier when processing them later on. Be careful if you’re running several instances or if your method has some mechanisms reliant on multi-threaded behaviour as you might need a different approach.

## 5. Viewing Raw Data

Some example numbers for each method are listed below (in ms):

Method A: [99.597, 93.712, 94.110, 90.315, 91.351, ...]
Method B: [70.246, 59.425, 65.229, 68.998, 61.983, ...]


And the full data is 17,000 timings similar to those. First of all, let’s plot the data as is. To do this, we’ll create a histogram with the timing on the x-axis and the sample count on the y-axis. And to achieve this goal, let’s use python and matplotlib 2.2. The code to generate the histograms is below (Note: I apologise in advance, my python skills are not up to the mark):

import matplotlib.pyplot as plt

# up to you how you want to generate this array
# from a file / csv etc
data_arr = [float(99.597) float(93.712), ....so on]

plt.xlabel("Time Taken (ms)", fontsize=20)
plt.ylabel("Number of Samples", fontsize=20)

plt.title("Method A", fontsize=30)

# somewhat arbitrarily chosen 100 bins
plt.hist(data_arr, bins=100, edgecolor='k')

# Approx 1920x1080
figure = plt.gcf()
figure.set_size_inches(20, 11.25)

plt.show()


And doing this gives us the following two graphs:  Right off the bat there are a few things we can explore in these graphs. Firstly, the timings are quite varied although you can see a clear peak where we have the most sample counts. Your graph may look like this, it can be more spread out or even less, both are fine.

Another thing you might have noticed is that both the graphs have a few stray values at the right end of each graph. Method A has a few values around 120ms which is way off it’s ~100ms peak and Method B has some similar values as well around the 85 and 90 mark. It’s perfectly fine and normal to have these but let’s not discard them immediately. You can either write these off as noise or you can try and investigate them further, it’s really up to but do at least consider them before waving them off.

On the other hand, if you have values that are significantly off the central peaks towards the left of the graph i.e. have taken a lot less time than they should, there is a possibility that there is a bug in your code, and potentially same could be true for values that have taken a lot longer.

## 6. Analysing the Distribution

While the two graphs in the previous section are useful at a glance they do not tell us much about the data. Looking at them again, it looks like they are pretty much a normal distribution, ignoring the stray values (or outliers). But actually, we need to be sure. There are different tests and test statistics that operate on parametric data (normal distribution) vs non-parametric (non normal distribution). This is because parametric data is expected to conform to a Gaussian distribution. In most cases there is a very high probability (I’ve not calculated it) that the data you’re dealing with is non-parametric. But just in case, let’s run a basic test to make sure.

If you’re dealing with a much smaller sample size (< 200ish) you can use the Shapiro-Wilk test or the Anderson-Darling test or even the Kolmogorov-Smirnov for continuous data but be careful as they each have a few weaknesses; here is a good guide explaining the differences. For our use case we’re going to plot a Quantile-Quantile Plot which is generally a good test when dealing with large sample sizes. A quantile, basically divides a range into intervals with each interval sharing a similar probability. This plots the quantiles of two distributions against each other. To put it simply, a QQ plot will plot the default normal distribution as a straight line ($y = x$) and our data as a bunch of points where we can try and see how well it fits to the Gaussian line. The closer it is to this line, the more likely it is a normal distribution.

Adding SciPy & NumPy to the mix of python modules we’re going to use, gives us the code below to create a QQ-Plot:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

z = (data_arr - np.mean(data_arr)) / np.std(data_arr)

stats.probplot(z, dist="norm", plot=plt)

plt.title("Method A", fontsize=30)

figure = plt.gcf()
figure.set_size_inches(20, 11.25)

plt.show()


This gives us the following two graphs:  The red line is the gaussian distribution while the blue dots are from our data. Although Method A is much more close fitting to the normal distribution, it tails off towards the end. Method B is much more clear that it is not a normal distribution. If either of them were, the blue dots would be a straight line following the red line without curves or bends. You can probably skip this step in the future since most probably you’ll be dealing with non-parametric data. I’ll briefly mention further up in this post what to do if both your data sets are indeed parametric.

## 7. Rejecting a Hypothesis

Now that we have established that we’re dealing with non-parametric data, we can start comparing the two distributions to hopefully be able to prove our alternate hypothesis. The data we are dealing with can be categorized as independent samples since the values from Method A are not dependent in any way on Method B or vice versa as compared to dependent or paired samples. This distinction is important for choosing the right test. If you’ve actually ended up with parametric data, use the t-test to reject your null hypothesis.

For our non-parametric data we shall be using the Mann-Whitney U-test, also called the Wilcoxon Rank-Sum test. Not to be confused with the Wilcoxon Signed Ranked test which is similar but used for dependent samples. The Mann-Whitney test computes a test statistic U and a pvalue. For our distributions to be different we need the pvalue returned to be less than 0.05 for our 95% confidence level. This value is known as $\alpha$ which is $(1 - 0.95)$ (the CL).

The code below shows results from running the Mann-Whitney test on our data:

import scipy.stats as stats

u_stat ,p_value = stats.mannwhitneyu(data_a, data_b, alternative='two-sided')

print(u_stat)
print(p_value)

# Output:
# 292409854.0
# 0.0


A few things to note here. Firstly, we’re doing a two-sided test here, which means that our alternate hypothesis checks for values that are either greater or less than the original value i.e. as long as they are not equal, that means they are different. Secondly, the pvalue obtained here is actually not zero. It’s a lot less than zero, which I confimed using R’s Wilcoxon Rank Sum. If you want to experiment a bit more, you can also try the KS test as mentioned in an earlier section.

Since the pvalue we’ve obtained is less than 0.05, this now means we have rejected the null hypothesis, the distributions are different!! If you’re unfortunate enough that you were unable to reject the null hypothesis this does not automatically accept the null hypothesis either. But the journey doesn’t quite end here for us, even if you’ve failed to reject it.

Before we start calculating our central tendency (mean / median / mode etc) let’s revisit the null hypothesis briefly. If we want to further improve upon Method B from this point onwards, we can use a different hypotheses

$H_{0} : \mu \ge x$
$H_{1} : \mu \lt x$

$\mu$    central tendency or test statistic
$x$    current value of the central tendency or test statistic

This is called a one-sided test because you are only testing for significance on one side of the graph (< $x$), thus changing the above code to use alternative='less'. We shall look at choosing our central tendency and calculating it in the next section.

## 8. Comparing Numbers

We now need to show which method is better and by how much. Let’s start calculating our central tendency and spread. Generally speaking, we should use median and quartiles (a type of quantile) here instead of mean and standard deviation. Non-parametric distributions are affected by the outliers and the “tails” (values at the extremes) which can make the standard deviation misleading, in some cases it can even be outwith the range! In case of parametric data, mean and median are pretty much the same.

Use the following code to calculate the median for both the data sets:

import numpy as np

data_median = np.median(data_arr)
print(data_median)

# Results
# Method A: 99.616 ms
# Method B: 69.388 ms


Using the values mentioned in the code snippet above, we can conclude that $\mu$ can be the median time taken in milliseconds while $x$ can be 69.388 ms. So to reject our null hypothesis next time around we would require our median for the new set of data to be less than 69.388 ms.

Moving on, it’s now time to visualise all this in one graph. We’ll achieve this using a Box-Whisker plot. This is pretty much the main graph that can be used to present your data. The following code will create a Box-Whisker plot with the medians and the lower & upper quartiles for both methods:

fig = plt.figure()
fig.set_size_inches(20, 11.25)

# Need to do this to add separate xtick labels

axis.set_ylabel("Time Taken (ms)", fontsize=20)
axis.set_title("Comparison of Method A vs Method B", fontsize=30)

# Float arrays for method a & b
data = [data_arr_a, data_arr_b]

axis.boxplot(data, sym="+")

# has to come after boxplot function
axis.set_xticklabels(["Method A", "Method B"], fontsize=20)

plt.show()


Which gives the following graph. If you want to takeaway one thing from this post, it should be this - using a boxplot to show your data. The medians are shown with a yellow line and the spread of the data is shown by the box around the median which is also called the interquartile range (IQR) i.e. is the range where 50% of the data is. The notched lines emerging from either end of the box are called the whiskers and represent around 24.65% of the data on each side (up until it encounters a point) while any points outside any of these ranges are outliers. You can try to play around with the arguments for the axes.boxplot function to change the various features of the boxplot. Another advantage of the box plots is that you can clearly differentiate the distrubtions.

Lastly, and perhaps the most important calculation is measuring the effect size. Effect size comparing the two sets of samples and quantifying the differences. Effectively, it divides those differences into categories of effects: small, medium and large, with large being the most significant one. To compute the effect, we’ll be using the Vargha-Delaney A measure. It does not depend on either the type of distribution of the samples nor the ordinality of the data (continuous or not). Doing this is quite simple and the following function calculates the A-measure:

import numpy as np
import matplotlib.pyplot as plt

# Note: A and B are not interchangeable
# Here you are using input_a as the original technique
def a_measure(input_a, input_b):

data_a = np.asarray(input_a, dtype=np.float64)
data_b = np.asarray(input_b, dtype=np.float64)

len_a = len(data_a)
len_b = len(data_b)

combined_arr = np.concatenate((data_a, data_b))

# Rank the data in order
ranked_combined_arr = stats.rankdata(combined_arr)

# Get only the ranks of array A
first_a_ranks = ranked_combined_arr[0:len_a]

sum_ranks = np.sum(first_a_ranks)

numerator = (sum_ranks / len_a) - ((len_b + 1) / 2)
a_measure = numerator / len_b

# Returns a value between 0.0 and 1.0
return a_measure



Running this function on the data returns 0.9999995007010705 as the A measure. Interpreting the value returned by the A measure is as follows:

 < 0.5 Technique B is worse ~= 0.5 No effect, both techniques are pretty much the same  >= 0.56 && < 0.64 Small Effect >= 0.64 && < 0.71 Medium Effect >= 0.71 Large Effect

Thus the given the A measure obtained from the data, we can conclude that Method B has a large effect, which potentially means a significant improvement. Although how you further interpret these effect sizes is up to you. There are cases where even saving a couple of milliseconds here and there can save millions of dollars which means that even a small effect can be very significant.

## 9. Summary

We’ve been over quite a few topics here but only just scratched the surface and skipped quite a few things here and there. While a lot was covered, it wasn’t covered in much depth, but hopefully it serves as a helpful platform to start researching and exploring more along these lines. While Comparing Numbers might be the most relevant section, it is quite important to know how we arrived at that point. In short, next time you want to do some performance comparison or statistical analysis on some data use the following steps as a guideline:

1. Choose a good metric
2. Choose an appropriate sample size
3. Do a few warm up runs
4. Sample the data
5. Plot the raw data
6. Check for normality (optional)
7. Perform a statistical hypothesis test to compare distributions (optional)
8. Choose and calculate your central tendency (mean or median)
9. Create a Box-Whisker plot to visualise that statistic
10. Calculate effect size

TL;DR - All this is a very long winding way of saying use a box plot and calculate effect size, because realistically, these are the two main things that will make the biggest difference in your day to day performance comparisons.

Lastly, while all the code is in Python, you can accomplish all of this in R as well.