We just did an A/B split test. 9 things went wrong (and got fixed!)

http://blog.vendoservices.com/vendo-blog/2016/02/15/we-just-did-an-ab-split-test-here-are-the-9-things-that-went-wrong-and-what-we-learned-when-we-fixed-them


How do you know if you’re better than the next guy?

Ever since we were kids we’ve needed to compare ourselves with others. From having the best baseball card collection to getting the best grades, comparing ourselves to others is in our nature. It’s not our most attractive trait. We know. But we still do it. We just can’t help ourselves.

Vendo is an IPSP, a biller. We’re just as curious as our clients are to know how we compare with the competition. To find out, we run A/B tests. We want to know if our combination of IPSP billing services powered by human and artificial intelligence (dynamic pricing, localization, currencies, payment methods, risk, customer service, etc.) is better than the competition. It’s a battle of “secret sauces” with thousands of different ingredients.

Here’s how we do it: measure net revenue. The client randomly sends half his traffic to Vendo, half to the other guy. Together we add up the revenue, subtract the fees and see who made the client more money. We have a winner! It sounds simple.

Well, it can be. For us it wasn’t. Here are the 9 things that went wrong in a recent A/B split test and what we learned as we fixed them. These are real stories. The company is mid-sized. If you’ve been around the industry for a while then chances are you know them.

If you’d rather skip to the end…you’ll find a summary and a step-by-step guide. It takes a maximum of 3 hours of tech time to set up the test.

#1: Traffic wasn’t split evenly.

Sending 25% here, 45% there and 30% somewhere else makes things hard to compare. Is the biller that made $100,000 better than the biller that made $90,000? Maybe not, if the traffic is uneven. You can’t make an easy comparison between the two billers. Funky traffic splits introduce unnecessary complexity.

Solution: Split traffic 50/50.

We need to be able to take a quick look at the numbers and see which biller is performing better. Once traffic is split evenly you can say, “Yes, the biller that made me $100,000 is doing better than the one that made me $90,000.”

We worked with our client to write the code to split traffic randomly, 50/50, between billers. Here it is:

For NATS: http://pastebin.com/G8MVGsjH

This is the document with the technical instructions: https://docs.google.com/document/d/1M-e_5aVPEcr-JPJdIvHtDeMkHcgF2SUmoLIS-4iC5Ps/edit#
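
The pastebin above has the full NATS version. If you just want the gist, here’s a minimal sketch of the idea (both tour URLs are placeholders, not real endpoints):

```php
<?php
// Minimal sketch of a random 50/50 split. The pastebin script above
// is the real NATS implementation; the URLs here are placeholders.
$tours = [
    'vendo' => 'http://your-nats-domain.com/track/TOUR_A_CODE',
    'other' => 'http://your-nats-domain.com/track/TOUR_B_CODE',
];

// mt_rand(0, 1) gives each biller an equal chance on every visit.
$choice = mt_rand(0, 1) === 0 ? 'vendo' : 'other';

header('Location: ' . $tours[$choice], true, 302); // send the visitor on their way
exit;
```

In practice you’d also want to pin a returning visitor to the same tour (a cookie works) so rebills and upgrades attribute to the right biller.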

#2: The A/B split test didn’t run long enough…at first.

It’s tempting to call a test in favor of one biller or the other after a few days. But there are two problems with doing that.

First, small numbers vary a lot, so you can expect plenty of fluctuation at the beginning. It’s just natural.

Second, in a subscription based business model, we can’t measure a biller’s results by just looking at initial sales. We also need to measure the effect on the rebills. That takes time. We need at least a month or two after the initial sale. Rebills have a huge effect on the bottom line.

Solution: Run the A/B split test for a minimum of three months.

An A/B split test can’t be rushed. You can’t rush a pregnant woman into giving birth before her nine months are done. It’s a process, and it requires patience. In the test we ultimately want to see who creates the most revenue for your traffic. Lifetime value, which emerges over time, is key. Concentrating solely on conversions doesn’t give you the whole picture.

#3: Too few sales per day.

The reason we trust a poll of 10,000 people more than one of 300 is that it is more likely to accurately reflect reality. There is a minimum number of sales we need to compare billers effectively.

Solution: At least 100 daily sales per biller.

Why 100 per day per biller? We need to know that the results are real and repeatable, not random.
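
A back-of-the-envelope way to see this (our illustration here, not part of the original test tooling): if daily sales behave like counts with Poisson-style noise, the relative wobble on a total of n sales is roughly 1/sqrt(n).

```php
<?php
// Back-of-the-envelope sampling noise: with Poisson-like counts,
// the relative noise on a total of n sales is roughly 1 / sqrt(n).
function relativeNoise(int $salesPerDay, int $days): float
{
    return 1 / sqrt($salesPerDay * $days);
}

printf("10/day for 7 days:   +/- %.1f%%\n", 100 * relativeNoise(10, 7));   // ~12%
printf("100/day for 90 days: +/- %.1f%%\n", 100 * relativeNoise(100, 90)); // ~1%
```

At 100 sales per day over a three-month test, the noise shrinks to around 1%, small enough that a few percentage points of difference between billers actually means something.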

#4: It’s not easy to get the comparison data.

We need an API to get access to the other biller’s data to report on traffic and revenue. Other important data include sales, upgrades, rebills, refunds and chargebacks.

We use this data to compare performance for the client. Without the data we wouldn’t actually be able to say which biller makes the client more money.

Solution: We wrote queries with the client to pull the data from NATS (the queries would work with any other system, too).

We created a specific tool to share the A/B split test data. It basically pulled this information from NATS:

  • Biller name
  • Biller transaction ID (Vendo’s or the other biller’s)
  • Transaction Type (NATS transaction type)
  • Site ID
  • Amount
  • Offer ID (Biller’s Join Option ID)
  • Transaction Date/time
  • Buyer’s IP Address
  • Subscription ID (Biller’s subscription ID)
  • Join Date

We’re creating a generic script that will allow any client running an A/B test to provide a data API. We’ll update this post when it’s ready. Stay tuned!
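
To give a feel for the kind of export we mean, here’s an illustrative sketch (the table and column names are hypothetical, and your NATS schema will differ):

```php
<?php
// Illustrative export of the fields listed above as a JSON feed.
// Table and column names are hypothetical -- adjust to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=nats', 'user', 'pass');

$sql = "SELECT biller_name, biller_transaction_id, transaction_type,
               site_id, amount, offer_id, transaction_date,
               buyer_ip, subscription_id, join_date
        FROM   transactions
        WHERE  transaction_date >= :start";

$stmt = $pdo->prepare($sql);
$stmt->execute([':start' => '2015-11-01']); // placeholder test start date

// Emit the rows as JSON so both sides of the test can consume them.
header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));
```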

#5: Some transaction types were missing from the dataset.

We discovered early in the test that we were missing some rebills. Together they accounted for 10% of revenue.

Solution: We worked with the client to add the transaction types.

It was a minor error in the script that exported the data. Once it was fixed we finally had the full dataset.

#6: VAT was missing…then it was there.

VAT was another issue (if you are unfamiliar with VAT, check out our Knowledge Base). Historically, NATS revenue included VAT. During the test we realised that the data we were downloading had VAT removed.

Solution: We removed VAT from our data.

This made the comparison fair. Depending on the client’s traffic sources, VAT can impact revenue from 0% all the way up to 20%. That would have seriously distorted the comparison between billers. In our case the difference before the change was 8%.
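
One detail worth spelling out: VAT is charged on top of the net price, so stripping it means dividing the gross amount by (1 + rate), not subtracting the rate. A quick sketch, using 21% as an example rate:

```php
<?php
// VAT sits on top of the net price, so divide by (1 + rate) to remove it.
function netOfVat(float $gross, float $vatRate): float
{
    return $gross / (1 + $vatRate);
}

echo netOfVat(12.10, 0.21); // a 12.10 gross sale at 21% VAT nets 10.00
```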

#7: Lifetime value figures were way off.

When we first checked the lifetime figures they didn’t make sense. After a little investigation we realised that the client had been using the other biller for more than 6 months before the test started. On the first day of the test they had subscriptions in the dataset that were over 6 months old! Clearly revenue from sales made before the start of the A/B split test had to be excluded from the data.

Solution: Only measure sales that originate after the start date.

We set a new ‘start date’ for the other biller to match the start of the A/B split test so we could compare figures correctly. Only revenue from initial sales created after the start of the test could be included. The lifetime value numbers fell in line. Phew!
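
The key is to filter on the date the subscription originated, not the date of the transaction: that way rebills from pre-test subscriptions are excluded even when the rebill itself lands during the test. An illustrative filter (field names and dates are placeholders):

```php
<?php
// Keep a transaction only if its subscription originated after the
// test started. Filtering on join_date rather than transaction_date
// drops rebills that belong to pre-test subscriptions.
$testStart = new DateTimeImmutable('2015-11-01'); // placeholder start date

$transactions = [
    ['subscription_id' => 'A1', 'join_date' => '2015-04-02', 'amount' => 29.95], // pre-test: out
    ['subscription_id' => 'B2', 'join_date' => '2015-11-15', 'amount' => 29.95], // in-test: in
];

$inTest = array_filter($transactions, function (array $tx) use ($testStart) {
    return new DateTimeImmutable($tx['join_date']) >= $testStart;
});
```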

#8: Poor reporting.

While talking to another of our clients we asked him how he compared billers.  He said, “I just look at my bank statement, and whoever makes me more money is the one I tell my tech guy to send more traffic to.”

Some clients don’t even take traffic volume differences into account, don’t look at rebills and don’t measure lifetime value. Why is this? Are they completely ignorant? Nope. They are running successful businesses after all. The reason for bad comparisons is that they lack good reporting tools.

It’s hard to access good data from multiple sources and put it in a format that helps you make a smart business decision.

Solution: Build good reports.

The first part of this story was about accessing the data. The next part is about helping our clients become confident in the results so that they can make good business decisions.

Here at Vendo we have a multitude of tools to track information and display it. One of our favourite Business Intelligence tools is QlikView.

Building a report where we could follow the results and see our performance on a daily basis was key. We also started experimenting with Qlik Sense, which allowed us to quickly update the data and review performance.

Vendo’s data is in blue, the other biller’s is in grey. The comparison is in green if it’s positive (red if it’s negative).

Here you can see net revenue per day and cumulative net revenue (after deductions and fees). Net revenue tells the client: “If I send traffic to this biller I make X and if I send it to that biller I make Y.”
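
For clarity, this is the arithmetic behind the net revenue figure (a sketch on our part; exact fee structures vary by biller):

```php
<?php
// Net revenue as we report it: gross sales minus refunds,
// chargebacks and the biller's fees (a sketch; fee models vary).
function netRevenue(float $gross, float $refunds, float $chargebacks, float $fees): float
{
    return $gross - $refunds - $chargebacks - $fees;
}

// e.g. $100,000 gross, $3,000 refunds, $1,500 chargebacks, $9,000 fees:
echo netRevenue(100000, 3000, 1500, 9000); // 86500
```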

Note: The actual figures have been blurred out for confidentiality reasons.

[Screenshot: daily and cumulative net revenue, Vendo (blue) vs. the other biller (grey)]

It’s important that the client can verify this data. We made the per-transaction data for each biller available for verification.

#9: Understanding why one biller outperforms another.

Unfortunately, this can never be answered clearly. Each biller makes thousands of different decisions about how they run their business. Those decisions affect conversion ratios, lifetime values, upgrade ratios, acceptance ratios, rebill ratios, etc. All these things can add up to make big differences from one biller to another.

That can be frustrating when a client wants to reverse engineer the biller. We get it. It reminds us of the Lewis Carroll story of the man who, frustrated with the lack of detail in his map, decided to build the most accurate and detailed map the world had ever seen. As he added more and more detail he ended up with a map the exact size of the country itself: “We now use the country itself, as its own map, and I assure you it does nearly as well.”

Solution: Focus on the result.

In the end, what matters is net revenue. By comparing net revenue our clients are able to see which biller performs best. Understanding exactly why one performs better than another will always be somewhat elusive.

We hope this story helped you see how you can run your own comparison between billers and make the critical decision of which biller to send your traffic to.

Here’s a summary of the A/B split test and a step-by-step guide.
There are three main conditions for the A/B test:

  1. Traffic has to be split 50/50.
  2. The test has to run for a minimum of 3 months.
  3. There have to be at least 100 daily sales per biller.

Accessing the data is crucial, so building an API is a handy option.
Using the correct data is vital to making an accurate comparison.

Step by step guide for setting up an A/B split test in NATS
(requires max. 3 hours of technical time)

  1. Get the Vendo A/B Test script.
    1. You can download it from here.
    2. Upload it to a directory on your NATS server, for example http://your-nats-domain.com/ab/
  2. Create two tours in NATS.
    1. Tour A must take users to Vendo
    2. Tour B must take users to the other biller.
    3. Follow these steps:
      1. Duplicate the tour of an existing site in NATS Sites Admin, give it a name, for example: Vendo Tour B.
      2. Create an auto-cascade for Vendo in NATS Billers Admin that follows the new Vendo Tour B tour.
      3. Go to http://your-nats-domain.com/internal.php?page=codes, click on Details, and copy the SiteID and TourID (the numbers inside the round brackets).
      4. Configure the Vendo A/B Script vendo-ab-test.php: open it in a text editor.
        1. Configure $configuration in line 7 by mapping the IDs you got in the previous step (see the hypothetical sketch after this guide).
        2. Configure $natsTrackUrl and $natsSignupUrl parameters with your NATS url.
  3. Configure Apache
    1. Get the Apache configuration from here.
    2. Update the configuration with the path to vendo-ab-test.php.
      1. Update lines 9, 14 and 18
    3. Restart Apache and the A/B test will start.
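
We can’t reproduce the real vendo-ab-test.php here, but to give a feel for what step 2.4 asks for, the settings might look something like the sketch below. The array shape and IDs are assumptions on our part, so follow the comments in the script itself:

```php
<?php
// Hypothetical sketch of the settings near the top of vendo-ab-test.php.
// The real script defines its own shape -- treat this as illustration only.

// Map each live site to the SiteID/TourID pairs copied from
// internal.php?page=codes: Tour A sends users to Vendo, Tour B to
// the other biller.
$configuration = [
    1 => [
        'tour_a' => ['site' => 1, 'tour' => 1],
        'tour_b' => ['site' => 1, 'tour' => 2],
    ],
];

// Point these at your own NATS installation.
$natsTrackUrl  = 'http://your-nats-domain.com/track/';
$natsSignupUrl = 'http://your-nats-domain.com/signup/signup.php';
```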

Author’s note:

Vendo is an IPSP payment processor that uses data to sell better. We did well in this A/B split test. Why? Well, we believe it’s because of how we express our values in our work. At Vendo we care about data and we value discovery. We seek to make new discoveries every day in the work we do for our clients. Our platform grows revenue with tools like dynamic pricing (here’s an introduction, and here’s a blog post that goes into more detail), advanced fraud detection, localization (23 local languages, local currencies, different payment methods, customer support in 25 different languages) and device-optimised templates.
