Or, How We Learned to Love Multi-Armed Banditry

Personalization is everywhere these days. From Netflix movie recommendations to optimizing call-to-action font size, never before in recorded history has so much existential grief and mental energy worthy of a Faulkner novel gone into making people click on flashing icons. Everyone’s doing A/B tests now, on everything from their e-commerce frontends to their email marketing campaigns; you (or your UX designer) are probably not an exception. The real question is: do you really need to know what “family-wise error rate” means (you should), or is there another way?

There is! If traditional A/B testing can be described as a coin flip, with each group having its own probability of conversion, then Multi-Armed Bandits (MAB) would be like shooting dice. What’s more, if we were really playing for fun and profit, we could swap out the normal kind for loaded dice, with certain sides more likely to be rolled than others.

In this scenario, like the itinerant scoundrels we are, we gamble on which option to select at any given moment; like the intelligent itinerant scoundrels we are, we try to maximize our gain over time by selecting options with higher expected values more often than lower-valued ones, adjusting the relative frequencies as we learn new information.

The nice thing about MAB is that it is designed to balance the need for information gathering against the cost of gathering that information. The KPI of interest (click-through rate, for example) can be directly optimized for from the start. But we need to determine how to assign weights to each side of the die. There are a number of ways to accomplish this, such as epsilon-greedy or Upper Confidence Bound, but Thompson Sampling is a particularly elegant solution (and, incidentally, the preferred MAB strategy at WeWork). Put simply, it’s an efficient method for selecting each option in proportion to the probability that it is the best one.

Suppose we have three landing pages, each with different copy and a different millennial aesthetic, and thus a different but unknown probability of inciting a conversion event. We define a probability distribution (typically a Beta distribution) over the possible conversion rates of each page. Each page’s distribution neatly encodes both its expected conversion rate and the uncertainty of our estimate.
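As a rough sketch (the page names and prior parameters here are invented purely for illustration), the two Beta parameters act like pseudo-counts of successes and failures:

```python
# Hypothetical Beta(a, b) priors for each landing page:
# a ~ prior "successes", b ~ prior "failures"; the mean is a / (a + b),
# and a larger a + b means a tighter, more confident distribution.
priors = {
    "A": (2, 38),   # expect roughly 5% conversion, loosely held
    "B": (5, 95),   # same 5% mean, but held with more confidence
    "C": (2, 38),
}

for page, (a, b) in priors.items():
    print(f"{page}: mean={a / (a + b):.3f}, pseudo-observations={a + b}")
```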

With Thompson Sampling, we take a random draw from each landing page’s distribution, serve the user the page with the largest sampled value, and record the outcome. Repeat until your product manager is satisfied or the heat death of the universe, whichever comes first.
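Here’s a minimal sketch of that selection step in Python (numpy only; the per-page counts are made up):

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical running (a, b) counts for each page's Beta distribution.
params = {"A": (12, 188), "B": (9, 141), "C": (20, 230)}

def choose_page(params):
    """One round of Thompson Sampling: draw from each page's Beta
    distribution and return the page with the largest sample."""
    samples = {page: rng.beta(a, b) for page, (a, b) in params.items()}
    return max(samples, key=samples.get)

winner = choose_page(params)  # serve this page to the next user
```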

This process implicitly favors serving the landing page with both the greatest expected conversion rate and the least uncertainty around that estimate (known as “exploitation” in the literature), while still allowing options with lower expected conversion and higher uncertainty to be selected (“exploration”). Over time, the frequency with which each page is selected converges to the probability that it is the best option. And over time, we maximize the number of conversions, thus maximizing the number of widgets your organization eventually sells (desks and hipster vibes, in our case).

The really powerful thing here is that these distributions aren’t static. As we observe how users respond to the landing pages, we can update each option’s underlying distribution to reflect the new data, as shown below.
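With a Beta distribution over a binary (converted / didn’t convert) outcome, that update is a one-liner of conjugate bookkeeping; a sketch, assuming the same per-page (a, b) counts as in the snippet above:

```python
def update(params, page, converted):
    """Conjugate Beta-Bernoulli update: a conversion bumps a by one,
    a non-conversion bumps b by one."""
    a, b = params[page]
    params[page] = (a + 1, b) if converted else (a, b + 1)
```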

In fact, if you have no preconceived notions of what the expected conversion rate should be for each landing page, you can start with the same distribution for every page and simply “allow the data to speak for itself.” The selections will be random at first, then increasingly favor the highest-performing option as more data accumulates to update the distributions. In this example, landing page “C” gradually becomes the clear winner.
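To make that concrete, here is a toy end-to-end simulation; the “true” conversion rates below are invented purely so that “C” comes out on top, and every page starts from the uniform Beta(1, 1):

```python
import numpy as np

rng = np.random.default_rng()

true_rates = {"A": 0.04, "B": 0.05, "C": 0.06}  # unknown in real life
params = {page: [1, 1] for page in true_rates}  # uniform Beta(1, 1) priors
serves = {page: 0 for page in true_rates}

for _ in range(10_000):
    # Thompson step: sample each distribution, serve the winner.
    samples = {p: rng.beta(a, b) for p, (a, b) in params.items()}
    page = max(samples, key=samples.get)
    serves[page] += 1
    # Simulate the user's response and update that page's distribution.
    if rng.random() < true_rates[page]:
        params[page][0] += 1  # conversion
    else:
        params[page][1] += 1  # no conversion

print(serves)  # the bulk of the traffic usually drifts toward "C"
```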

Bandits allow for a smooth transition from exploration to exploitation, and can be faster and more efficient than fixed-horizon A/B tests because they shift traffic towards winning variations in real time instead of forcing you to wait until a testing period concludes. They’re useful tools if automated, continuous optimization is your goal.

That said, there’s still a place for A/B testing, particularly when statistical rigor and uncertainty estimation are important. The real trick is to identify the right tool for a given situation and to understand the trade-offs.

A parting shot: We recommend reserving a small portion of your traffic for purely random selection. This tweak combats bias in the data collection process and gives you an unbiased random baseline against which to evaluate the efficacy of your bandit. It can also blunt the effects of an unforeseen external or seasonal change that might drastically alter the parameters of the distributions.
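One way to implement that reserve, assuming a hypothetical 10% holdout:

```python
import numpy as np

rng = np.random.default_rng()
HOLDOUT_FRACTION = 0.10  # hypothetical; tune for your traffic volume

def choose_page_with_holdout(params):
    """Serve a uniform-random page for a reserved slice of traffic,
    and let Thompson Sampling pick for the rest."""
    if rng.random() < HOLDOUT_FRACTION:
        return rng.choice(sorted(params))  # unbiased random baseline
    samples = {p: rng.beta(a, b) for p, (a, b) in params.items()}
    return max(samples, key=samples.get)
```

Outcomes from the holdout slice can then be logged separately and used as the baseline when judging the bandit’s lift.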