In this lecture, we are going to discuss some applications of multi-armed bandit problems and how they tackle the explore-exploit dilemma, which is our topic for the next few weeks. Honestly, the math involved in solving the explore-exploit dilemma can look scary. But among all the topics I have taught so far, I confidently consider this one of the most practical sets of algorithms you'll ever learn. Later in this course, we'll learn how to write programs that automate Bayesian experimentation.
The concept of comparing things to see which one is better can be applied to almost any business. Imagine you just created a brand new iPhone and you ask your designers to make a couple of advertisements. Your designers show you 3 ads, and you love all of them because the designers are ingenious. So, which should you choose? Naturally, you should choose the advertisement that you believe is most likely to lead a user to click and purchase.
One way to quantify this decision is to measure the click-through rate of each advertisement: the number of user clicks divided by the total number of impressions. For example, if I show an advertisement to 10,000 users and 1,000 of those users click and purchase the new iPhone, then the click-through rate is 10%.
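Just to make that arithmetic explicit, here it is as a tiny Python function (the function name is my own, for illustration only):

```python
def click_through_rate(clicks, impressions):
    """Click-through rate: fraction of impressions that led to a click."""
    return clicks / impressions

print(click_through_rate(1000, 10000))  # 0.1, i.e. 10%
```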
Now the question is: how do we measure click-through rates in practice? We can run experiments! We can show the first advertisement to 10,000 users, the second to another 10,000 users, and the third to yet another 10,000 users. Then we compare the click-through rates of the three advertisements and find out which one attracts the most users. But there is a problem.
How do I know whether 10,000 users are enough, or too many? In fact, I don't! Why not show these advertisements to 1 million users? Why not show them just 100 times? Theoretically, we would need an infinite number of samples to get an absolutely precise estimate of the click-through rate. As we collect more samples, the distribution of our estimate shrinks toward the true value, so the estimate becomes more precise.
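To see that shrinking effect concretely, here is a small simulation sketch (assuming NumPy; the "true" rate of 10% and the sample sizes are invented for illustration). We estimate a known click-through rate many times from different numbers of impressions and watch the spread of the estimates narrow:

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = 0.10  # assumed "true" click-through rate for the simulation

for n in [100, 10_000, 1_000_000]:
    # Estimate the CTR 1,000 times, each time from n simulated impressions.
    estimates = rng.binomial(n, true_ctr, size=1000) / n
    print(f"n={n:>9,}: spread of estimates (std) = {estimates.std():.5f}")
```

The standard deviation of the estimate drops by a factor of roughly 10 each time we multiply the sample size by 100, but it never reaches exactly zero for any finite sample.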
If one advertisement is better, another advertisement is necessarily worse. The better advertisement generates more profit, while the worse advertisement yields less; if I show the worst advertisement 10,000 times, I've wasted 10,000 impressions on a suboptimal click-through rate. Remember from the beginning that our goal is to show only the best ad, to exploit the highest click-through rate.
But there is a problem. If we don't know which ad yields the best profit in the first place, we can't confidently bet on a single advertisement. In other words, we need to keep exploring all the options until we're sure which one is best. And to find the best option, we must learn the consequences of our actions through trial and error, rather than being explicitly told the correct action.
This is the explore-exploit dilemma.
Over the next week or two, we'll look at how to use the Gaussian distribution and the Bernoulli distribution to optimize our strategy for making choices. If you recall, the Bernoulli distribution models binary outcomes, such as whether a user clicks an ad or not, while the Gaussian distribution models continuous outcomes.
Building on these concepts, and with years of computational advancement, people have developed and tested strategies such as the epsilon-greedy method, the upper confidence bound method, and the Thompson sampling method, all of which are useful and all of which we'll look into in the following tutorials.
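As a preview, here is a minimal epsilon-greedy sketch for our three-ad setting (the 10% exploration rate and the per-ad click probabilities are assumptions for illustration; we'll treat each method properly in the coming tutorials). Most of the time it shows the ad with the best observed click-through rate, but occasionally it picks an ad at random to keep exploring:

```python
import numpy as np

rng = np.random.default_rng(1)
true_ctrs = [0.05, 0.08, 0.12]   # hidden click probabilities, one per ad
clicks = np.zeros(3)             # clicks observed for each ad
shows = np.zeros(3)              # impressions served to each ad
epsilon = 0.1                    # fraction of the time we explore at random

for _ in range(10_000):
    if rng.random() < epsilon or shows.min() == 0:
        ad = rng.integers(3)                 # explore: pick a random ad
    else:
        ad = int(np.argmax(clicks / shows))  # exploit: best ad so far
    shows[ad] += 1
    clicks[ad] += rng.random() < true_ctrs[ad]  # simulate a click

print("impressions per ad:", shows)  # most typically go to the best ad
```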
Now, obviously, Apple is not the only company in the world selling things; online advertising alone is a billion-dollar industry. So the techniques you're learning to solve the explore-exploit dilemma transfer to almost all modern businesses. However, you don't have to own a successful business to use them.
You can simply be someone who creates and owns a website. If you have two or more service options to choose from for the website, you can apply these techniques to pick the best one. In educational settings, if you're offering a new course as I do, and you're not sure whether multiple-choice questions or coding problems better help interested novice programmers learn programming, you can test the learning process and learning outcomes across the different types of assessment using some of these techniques.
In Course 2, we explored Bayesian A/B testing, where we could determine whether a new web design works better by performing a controlled experiment. In A/B testing, we randomly assign each incoming web user to one of two groups: Group A or Group B. This random assignment continues until the web developer becomes convinced that either Design A is more successful than Design B, or Design B is more successful than Design A. After that, the web developer assigns all future users to the more successful design and eliminates the inferior one. The Bayesian A/B approach has been successful in the past and will continue to thrive in many contexts. So why should we teach the bandit algorithms?
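As a refresher, here is one common way to run that Bayesian comparison (a minimal sketch assuming NumPy and Beta(1, 1) priors over each design's success rate; the observed counts are invented for illustration). We sample from each posterior and estimate the probability that Design B beats Design A:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: (successes, failures) observed for each design.
a_success, a_failure = 120, 880
b_success, b_failure = 150, 850

# Under a uniform Beta(1, 1) prior, the posterior for each design's
# success rate is Beta(1 + successes, 1 + failures).
samples_a = rng.beta(1 + a_success, 1 + a_failure, size=100_000)
samples_b = rng.beta(1 + b_success, 1 + b_failure, size=100_000)

print("P(B beats A):", (samples_b > samples_a).mean())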
Let's look back at the A/B testing framework. A/B testing consists of two phases: a pure exploration phase, in which incoming users are randomly assigned to Design A or Design B with equal probability, followed by a pure exploitation phase, in which every future user is sent to the winning design.
So what are the limitations of A/B testing? During the exploration phase, we assign users to each design with equal probability, so we knowingly spend a large share of impressions on the inferior design; and the jump from pure exploration to pure exploitation is abrupt, with no middle ground in between.
Given these drawbacks, bandit algorithms focus our resources during exploration on the more successful options, instead of over-exploring the inferior ones. By recording all past performance over time, these algorithms gradually concentrate on the best available option, and a good bandit algorithm will ultimately converge to it.
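To make that gradual concentration concrete, here is a minimal Thompson sampling sketch, one of the bandit algorithms we'll cover (assuming NumPy; the click probabilities are again invented). Each round we draw a plausible rate for every ad from its Beta posterior and show the ad with the highest draw, so impressions drift toward the best option as evidence accumulates:

```python
import numpy as np

rng = np.random.default_rng(3)
true_ctrs = np.array([0.05, 0.08, 0.12])  # hidden click probabilities
successes = np.ones(3)  # Beta posterior parameters, starting at Beta(1, 1)
failures = np.ones(3)
shows = np.zeros(3)

for _ in range(10_000):
    draws = rng.beta(successes, failures)  # one plausible CTR per ad
    ad = int(np.argmax(draws))             # show the ad that looks best now
    clicked = rng.random() < true_ctrs[ad]
    successes[ad] += clicked
    failures[ad] += 1 - clicked
    shows[ad] += 1

print("impressions per ad:", shows)  # concentrates on the best ad over time
```

Notice there is no separate "experiment over" moment: exploration fades out on its own as the posteriors sharpen.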
Taking A/B testing and bandit problems together, the framework for maneuvering between exploration and exploitation will be useful whichever method you choose, because bandit algorithms subsume A/B testing as a special case (namely, the extreme case in which you jump from pure exploration to pure exploitation). I hope you now have a sense that understanding bandit algorithms will empower you to operate in the much larger and more flexible space between those two extremes, and will widen your lens for addressing the explore-exploit tradeoff in real-life problems. To see how bandit algorithms achieve that balance, let's get started!