Bandit Algorithm for Testing

Conventional software testing may not scale, so many companies rely on A/B testing instead. But A/B testing can cause losses when an erroneous version with a lower conversion rate keeps receiving traffic. Reinforcement learning helps in such cases, and one such approach is the Bandit algorithm.

In a typical A/B testing arrangement, incoming traffic is split evenly between variant A and variant B for the entire duration of the experiment.

1. The issue: suppose this A/B test runs for one week. Version B receives 50 % of the traffic, but it is erroneous and its conversion rate is lower, so we lose conversions on that entire 50 % of traffic. For example, if variant A converts at 5 % and the buggy variant B at 2 %, then out of 10,000 visitors the 5,000 sent to B yield roughly 150 fewer conversions than they would have on A.
2. To deal with this, we need to handle the explore-exploit dilemma: exploring means giving each variant enough traffic to learn how well it performs, and exploiting means sending most of the traffic to the variant that currently looks best.
3. A Bandit algorithm setup handles this dilemma by allocating traffic adaptively, as described next.

Unlike A/B testing, where traffic is split the same way for the full six weeks, Bandit selection notices that variant A is performing better and selects it for exploitation: over the next six weeks, more and more of the traffic is diverted to variant A. We still learn which variant is better, and at the same time we incur fewer losses, because a decent conversion rate is maintained while traffic shifts to the best variant. This is a reinforcement learning approach.

There are various flavours of Bandit algorithms, such as Epsilon Greedy and Upper Confidence Bound (UCB). In Epsilon Greedy, the variant that currently looks best receives most of the traffic, while a small random fraction of visitors, controlled by the epsilon parameter, keeps exploring the other variants. Metrics are captured for every variant and fed back into the system; based on that feedback, the epsilon value can vary, so the share of traffic sent to each variant changes over time. A minimal sketch of this idea follows.
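
Here is a minimal, self-contained Python sketch of Epsilon Greedy with a simple decay on epsilon. The variant names, conversion rates, traffic volume, and decay schedule are all illustrative assumptions for the simulation, not values from any real experiment:

```python
import random

# Illustrative (assumed) true conversion rates; the algorithm does not know these.
TRUE_RATES = {"A": 0.05, "B": 0.02}
EPSILON_START = 0.30   # initial exploration fraction (assumed)
DECAY = 0.999          # per-visitor decay applied to epsilon (assumed)
VISITORS = 10_000

def epsilon_greedy_test():
    epsilon = EPSILON_START
    shown = {v: 0 for v in TRUE_RATES}       # times each variant was served
    converted = {v: 0 for v in TRUE_RATES}   # conversions observed per variant

    for _ in range(VISITORS):
        if random.random() < epsilon:
            # Explore: pick a variant uniformly at random.
            variant = random.choice(list(TRUE_RATES))
        else:
            # Exploit: pick the variant with the best observed conversion rate.
            variant = max(TRUE_RATES,
                          key=lambda v: converted[v] / shown[v] if shown[v] else 0.0)

        shown[variant] += 1
        # Simulate the visitor's conversion; this is the feedback fed to the system.
        if random.random() < TRUE_RATES[variant]:
            converted[variant] += 1

        # Vary epsilon over time: explore less as confidence grows.
        epsilon *= DECAY

    for v in TRUE_RATES:
        rate = converted[v] / shown[v] if shown[v] else 0.0
        print(f"variant {v}: served {shown[v]} visitors, observed rate {rate:.3f}")

epsilon_greedy_test()
```

Because exploitation dominates once the data favours one variant, the overall conversion rate during the test stays close to the winner's rate, which is exactly the "earn while you learn" behaviour listed among the benefits below.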

Benefits of the Bandit algorithm over A/B testing:

1. Earn while you learn. Here is one quote about Bandit selection:

They’re more efficient because they move traffic towards winning variations gradually, instead of forcing you to wait for a “final answer” at the end of an experiment. They’re faster because samples that would have gone to obviously inferior variations can be assigned to potential winners. The extra data collected on the high-performing variations can help separate the “good” arms from the “best” ones more quickly.

2. Automation is easy, because a machine learning algorithm switches traffic to the better option based on feedback.

3. There are scenarios where running an A/B test for a long duration is not possible; Bandit selection helps in those cases.

4. Bandit algorithms such as Upper Confidence Bound can be implemented easily in Python, with libraries or by hand; see the sketch after this list.
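
As an illustration, below is a minimal hand-rolled sketch of UCB1, a common variant of the Upper Confidence Bound algorithm, in plain Python, so it does not depend on any particular library. As before, the variant names and conversion rates are assumed purely for the simulation:

```python
import math
import random

# Illustrative (assumed) true conversion rates; unknown to the algorithm.
TRUE_RATES = {"A": 0.05, "B": 0.02}
VISITORS = 10_000

def ucb1_test():
    shown = {v: 0 for v in TRUE_RATES}       # times each variant was served
    converted = {v: 0 for v in TRUE_RATES}   # conversions observed per variant

    for t in range(1, VISITORS + 1):
        def score(v):
            # Any variant that has never been served is tried first.
            if shown[v] == 0:
                return float("inf")
            mean = converted[v] / shown[v]
            # Confidence bonus: large for under-sampled variants,
            # shrinking as a variant accumulates traffic.
            return mean + math.sqrt(2 * math.log(t) / shown[v])

        variant = max(TRUE_RATES, key=score)
        shown[variant] += 1
        if random.random() < TRUE_RATES[variant]:
            converted[variant] += 1

    for v in TRUE_RATES:
        rate = converted[v] / shown[v] if shown[v] else 0.0
        print(f"variant {v}: served {shown[v]} visitors, observed rate {rate:.3f}")

ucb1_test()
```

Unlike Epsilon Greedy, UCB needs no epsilon parameter: the confidence bonus itself forces under-sampled variants to be tried, and it fades away as evidence accumulates.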

Thus, the use of machine learning algorithms in the field of software testing helps achieve greater scalability and faster release cycles.
