The Illusion of Early Wins in Design Experimentation Data
How Small Sample Sizes in A/B Testing Can Fool the Uninitiated.
I’m putting my foot down.
Well… My fingers, really… as I begin typing into the Slack thread:
“We’re absolutely not bringing this up with the client today.”
I wait for the response, distracting myself on my phone for a moment.
…Charlie is typing…
“But it’s performing great. Why wouldn’t we let them know?” she responds.
She doesn’t get it. Why would she? This is only the second round of A/B tests she’s launched in her career.
I’m looking at the same dashboard she is, yet we’re seeing totally different things.
In front of me, the screen reads:
Variant A: 20.53% increase in conversion rate.
Variant B: 24.72% increase in conversion rate.
It seems like a huge positive result, right?
Wrong.
The truth is, by tomorrow those numbers could be saying the complete opposite.
This is one of the most important things to get right when A/B testing: identifying whether your results really mean anything.
How do you tell that the numbers you’re looking at aren’t just fluctuations caused by chance? We have a phrase for this in the biz. It’s called “Statistical Significance”.
We use it to describe how confident we can be that what we’re observing is genuine, and not just random variation in the data.
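(If you like seeing the maths behind that phrase, here’s a rough Python sketch of a two-proportion z-test, the kind of calculation most A/B testing tools run behind the scenes. The visitor and conversion counts below are made up purely for illustration.)

```python
from math import sqrt, erf

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Pooled rate: what we'd expect if the variants were actually identical
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

    z = (rate_b - rate_a) / std_error
    # Two-sided p-value: how likely is a gap this large if nothing real is going on?
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return rate_a, rate_b, p_value

# Hypothetical early numbers: 800 visitors per variant
rate_a, rate_b, p_value = two_proportion_z_test(40, 800, 49, 800)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  p-value: {p_value:.2f}")
```

In that made-up example, variant B shows a relative lift of more than 20%, yet the p-value comes out around 0.3, which is another way of saying the gap could easily be noise.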
I leave Charlie hanging for a moment while I empty the contents of my wallet onto my desk. Among the debris, I pick up one of the coins. It has a gold centre, wrapped in a ring of silver. Minted in the year 2000.
I flip it into the air, watching it glint in the afternoon sun, before catching it on the back of my hand.
Heads.
I smile, and repeat my little trick.
Tails.
After five flips, the tally is 4:1. Only a single Tails.
Looking at this data, you might be tempted to assume the coin is heavily biased towards landing on Heads.
But it’s not. The data is unreliable. We haven’t collected enough data points yet.
I keep flipping.
It’s now at 5:8 in favour of Tails.
Now it seems like the coin is heavily biased in the other direction.
Assuming this is a normal coin (and not one created to hustle fools in dark alleyways or perform magic tricks), the likelihood of it landing on either side is 50:50 for any one flip. But individual flips don’t have to respect the long-run odds, so you won’t always get equal numbers of each over the short term.
With only a small number of flips, the results can swing wildly from one side to the other.
From Heads. To Tails.
The numbers mean almost nothing.
But as I keep flipping, over time, the coin’s true nature is revealed.
I get lost in the moment, completely forgetting the team I’m meant to be responding to. Instead, I continue flipping the coin. After 1,000 flips the final tally is 503 Heads to 497 Tails.
Almost 50:50.
Just what I expected.
The more I flip the coin, the more accurate the picture becomes.
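You don’t have to flip a real coin to see this. A few lines of Python simulating a fair coin make the same point; this is just a sketch, not anything from a real test:

```python
import random

# Simulate 1,000 flips of a perfectly fair coin
flips = [random.choice(["Heads", "Tails"]) for _ in range(1000)]

# Proportion of Heads after different numbers of flips
for n in (5, 13, 100, 1000):
    heads = flips[:n].count("Heads")
    print(f"After {n:>4} flips: {heads} Heads ({heads / n:.0%})")

# Run it a few times: the 5- and 13-flip figures jump around wildly,
# while the 1,000-flip figure rarely strays far from 50%.
```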
When I’m running A/B tests for my clients, I’m playing a similar game.
Testing different versions of a design to see whether one lands more on Heads, or more on Tails.
Only this time, instead of Heads or Tails, I’m measuring valuable business metrics such as conversion rate.
Only this time, I’m expecting my coins to be biased.
My goal is to figure out which one is biased in my client’s favour. Which one will land on Heads more often.
Just like with the coin flips, these numbers can vary a lot early on, giving an inaccurate picture of how the test is going. Swinging from one side to the other.
So we keep the tests running.
Sometimes the differences turn out to be tiny, requiring hundreds of thousands of flips to detect.
Sometimes the differences aren’t noticeable at all. No matter how many times we flip, the results keep swinging ever so slightly from one side to the other.
Heads. Tails. Heads. Tails. Over and over.
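To put a rough number on how many flips a tiny difference demands, here’s a standard back-of-the-envelope sample size calculation for comparing two conversion rates. The baseline rate, the lift, and the confidence and power levels are assumptions picked purely for illustration:

```python
from math import sqrt, ceil

def visitors_per_variant(baseline_rate, relative_lift, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant to detect a given relative lift,
    using the standard two-proportion formula (~95% confidence, ~80% power)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2

    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical example: a 3% baseline conversion rate and a 5% relative lift
print(visitors_per_variant(0.03, 0.05))
```

Under those assumptions it works out to over 200,000 visitors per variant, which is why small differences can take an uncomfortably long time to confirm.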
The truth is that even the most successful brands in the world, like Airbnb and Netflix, report positive results from their experiments only around 10% of the time. They even wrote a paper about it.
But every now and again you get that one design that wipes the floor with the others.
Heads. Heads. Heads.
I love those days.
But even then, reliable results take time. Trusting the data too soon can lead you down the wrong path.
…
I put the coin back in my wallet and turn my attention back to Slack.
“We’re not sharing the data because, right now, these numbers mean nothing.”
Right now, they’re simply the random fluctuations of a few coin tosses.