Use Machine Learning To Increase Sales From Your Predictable Customers

I work in an industry where customers with dependable month-to-month buying patterns are the norm. In most cases, we can look at a customer's buying history and, for certain months, guess that customer's sales pretty accurately. For example, this chart shows the sales of one of our customers.

The lack of y-axis values is intentional to protect this customer's privacy. Sales in December to February are pretty weak, but we usually get huge sales in June and August while having a low month in July.

This gives us some advantages, the most obvious being much easier inventory planning. Another advantage is that we can reshuffle the list of customers that our customer service team calls each week to include those who, based on their pattern, should have bought a lot but for some reason didn't.

Unless they had a larger than expected previous month based on their pattern, we can assume that at least one of three things is true,

  • Their inventory of our products is low
  • They're trying someone else's products
  • They're having a bad month

In all three cases we want our customer service people to call them. Why? You may have seen numbers that show that targeted emails to specific people are way more effective than emails broadcast to a large number of people. We are achieving the exact same effect, but with phone calls because it's more personal, and our customers are not tech savvy (we still get orders via fax by the way).

How We'll Do It

The problem is, we have thousands of customers. There's no conceivable way that our customer service team could monitor much more than 100 of them every month on top of the existing things they have to do.

This is where machine learning comes in. We're going to create a Python script which will

  1. Examine the buying pattern of each of our customers
  2. Then come up with a prediction of how much the customer should have bought taking into account their pattern
  3. Use a rough heuristic to determine if the customer is out of the norm, and therefore someone in the company should be notified

To achieve this, we will use the popular Python machine learning library, sklearn, to do regression analysis. Regression is a statistical technique for predicting a numerical value rather than a category. Put another way, you use regression when you need to predict a number for new inputs based on data you already have.

Our existing data is each customer's sales history, and the new input is the current month and year, for which we want a prediction of what the customer will buy.
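As a rough illustration of what regression means here, the sketch below fits sklearn's RandomForestRegressor to a made-up sales history and asks it for a number for a given month and year. The data, the seasonal pattern, and the variable names are all invented for this example, and `random_state` is set only so the result is repeatable.

```python
from sklearn.ensemble import RandomForestRegressor

# Made-up history: strong sales every June, weak sales every other month.
# Each row of X is [month, year]; y holds the sales figure for that row.
X = [[month, year] for year in range(2009, 2015) for month in range(1, 13)]
y = [50000.0 if month == 6 else 1000.0 for month, year in X]

regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X, y)

# Ask for predictions for two months that are not in the history
june_2015, january_2015 = regressor.predict([[6, 2015], [1, 2015]])
print(june_2015, january_2015)
```

With a pattern this clean, the June prediction should come out far higher than the January one, which is the kind of "expected sales" figure the rest of the script relies on.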

The Code

This will not act as a tutorial for sklearn. If you're interested, see their Getting Started Guide.

So, we need something which, given our sales history, will give us a prediction for this month. The most accurate regression model I've found for this purpose in sklearn is the RandomForestRegressor.

Our script is going to need the following:

  • A function to get our data as a Pandas DataFrame
  • A function to act as our heuristic
  • A function to return the results

The specifics of getting the pandas DataFrame depend on your data setup. At the end, you should have a DataFrame with the following structure:

Customer Name   Sales   Month   Year
Customer 1      1000    1       2009
Customer 1      50000   2       2009
Customer 1      ...     ...     ...
Customer 2      80000   1       2009
Customer 2      9000    2       2009
Customer 2      12000   3       2009

Because the rows we load will be in the format [name, sales, month, year], the DataFrame's columns need to be listed in the same order. Here's the code for the DataFrame,

import pandas

def get_dataframe():
    # Placeholder: fill this list with (name, sales, month, year) rows
    # from your database
    rows = []

    df = pandas.DataFrame.from_records(
        rows,
        columns=['CustomerName', 'Sales', 'Month', 'Year']
    )
    df["CustomerName"] = df["CustomerName"].astype('category')

    return df
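If you want to try the rest of the script before wiring it up to a real database, a sketch like the following builds a DataFrame with the same four columns from a hard-coded list. The function name `get_sample_dataframe` and the rows themselves are placeholders made up for illustration.

```python
import pandas

def get_sample_dataframe():
    # Hypothetical rows in (name, sales, month, year) order; real rows
    # would come from your own database query
    rows = [
        ('Customer 1', 1000, 1, 2009),
        ('Customer 1', 50000, 2, 2009),
        ('Customer 2', 80000, 1, 2009),
        ('Customer 2', 9000, 2, 2009),
        ('Customer 2', 12000, 3, 2009),
    ]
    df = pandas.DataFrame.from_records(
        rows,
        columns=['CustomerName', 'Sales', 'Month', 'Year']
    )
    # Store the names as a categorical column, as in get_dataframe
    df['CustomerName'] = df['CustomerName'].astype('category')
    return df

sample = get_sample_dataframe()
print(sample)
```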

Next, we're going to feed our regression analysis this data, and get predictions for every customer this month. It's then up to us to decide what counts as out of the ordinary. The easiest way is to use a simple heuristic with the following rules:

  1. If the sales for the month being checked are less than were predicted
  2. And the difference from the prediction is more than 30%
  3. And the sales for the month before weren't more than 30% above their prediction
  4. Then it's out of the ordinary

The following code implements these rules,

def predict_heuristic(previous_predict, month_predict, actual_previous_value, actual_value):
    """ Heuristic that tries to mark the deviance from real and prediction as
        suspicious or not.
    """
    if (
         actual_value < month_predict and
         abs((actual_value - month_predict) / month_predict) > .3):
        # A much bigger than predicted previous month explains the shortfall
        if (
             actual_previous_value > previous_predict and
             abs((previous_predict - actual_previous_value) / actual_previous_value) > .3):
            return False
        else:
            return True
    else:
        return False

The numbers and method here can (and probably should) be adjusted for your specific situation. These are just the numbers I found to be the best for reducing false positives in our business.
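One way to make those numbers easier to tune is to pass the threshold in as an argument. The variant below is only a sketch: the name `is_suspicious` and the `threshold` parameter are made up for illustration, and unlike the function above it returns False when the prediction is zero or negative, so that it never divides by zero.

```python
def is_suspicious(previous_predict, month_predict,
                  actual_previous_value, actual_value, threshold=0.3):
    """ Same rules as predict_heuristic, with an adjustable threshold. """
    # Nothing to compare against if the model predicted no sales
    if month_predict <= 0:
        return False

    # Rules 1 and 2: sales fell short of the prediction by more than
    # the threshold
    shortfall = (month_predict - actual_value) / month_predict
    if shortfall <= threshold:
        return False

    # Rule 3: a much larger than predicted previous month explains a slow
    # month (measured relative to the actual value, as in the original)
    if actual_previous_value > max(previous_predict, 0):
        overshoot = (actual_previous_value - previous_predict) / actual_previous_value
        if overshoot > threshold:
            return False

    return True

# Predicted 10,000 but sold 5,000, after a normal previous month
flagged = is_suspicious(8000.0, 10000.0, 8000.0, 5000.0)
# Same shortfall, but the previous month was far above its prediction
explained = is_suspicious(8000.0, 10000.0, 20000.0, 5000.0)
print(flagged, explained)
```

Lowering `threshold` flags more customers and raising it flags fewer, so it is worth trying a few values against months where you already know who should have been called.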

Finally, here's where we will create our RandomForestRegressor, grab our predictions from it, and use our heuristic to filter the results,

import datetime

from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer

def missed_customers():
    """ Returns a list of tuples of the customer name, the prediction, and
        the actual amount that the customer has bought.
    """
    raw = get_dataframe()
    vec = DictVectorizer()

Because sklearn only understands numerical inputs, we use a "vectorizer". DictVectorizer takes a list of dictionaries and turns it into a numerical array, converting any categorical (string) values, such as customer names, to numerical inputs automatically.

    # setup
    today = datetime.date.today()
    currentMonth = today.month
    currentYear = today.year
    lastMonth = (today.replace(day=1) - datetime.timedelta(days=1)).month
    lastMonthYear = (today.replace(day=1) - datetime.timedelta(days=1)).year
    results = []

    # Exclude this month's value (only the current month of the current year)
    df = raw.loc[~((raw['Month'] == currentMonth) & (raw['Year'] == currentYear))]

    for customer in df['CustomerName'].unique().tolist():
        # compare this month's real value to the prediction
        actual_value = 0.0
        actual_previous_value = 0.0

        # Get the actual_value and actual_previous_value
        try:
            actual_previous_value = float(
                raw.loc[
                    (raw['CustomerName'] == customer) &
                    (raw['Year'] == lastMonthYear) &
                    (raw['Month'] == lastMonth),
                    'Sales'
                ].iloc[0]
            )
            actual_value = float(
                raw.loc[
                    (raw['CustomerName'] == customer) &
                    (raw['Year'] == currentYear) &
                    (raw['Month'] == currentMonth),
                    'Sales'
                ].iloc[0]
            )
        except (IndexError, TypeError):
            # If the customer had no sales in the target month, then move on
            continue
        # Transforming Data
        temp = df.loc[df['CustomerName'] == customer]
        targets = temp['Sales']
        del temp['CustomerName']
        del temp['Sales']
        records = temp.to_dict(orient="records")
        vec_data = vec.fit_transform(records).toarray()

        # Fitting the regressor, and use all available cores
        regressor = ensemble.RandomForestRegressor(n_jobs=-1)
        regressor.fit(vec_data, targets)

        # Predict last month and this month using the regressor
        previous_predict = regressor.predict(vec.transform({
            'Year': lastMonthYear,
            'Month': lastMonth
        }).toarray())[0]
        month_predict = regressor.predict(vec.transform({
            'Year': currentYear,
            'Month': currentMonth
        }).toarray())[0]

        if (predict_heuristic(previous_predict, month_predict, actual_previous_value, actual_value)):
            results.append((customer, month_predict, actual_value))

    return results
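As an aside, to see the vectorizer step on its own, the sketch below runs DictVectorizer on a few made-up records shaped like the ones built inside the loop. The records and the second vectorizer, `name_vec`, exist only for this illustration; the second one keeps a string field to show how categorical values get their own columns.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records shaped like the output of temp.to_dict(orient="records")
records = [
    {'Month': 1, 'Year': 2009},
    {'Month': 2, 'Year': 2009},
    {'Month': 3, 'Year': 2009},
]

vec = DictVectorizer()
vec_data = vec.fit_transform(records).toarray()

# Feature names are sorted, so column 0 is Month and column 1 is Year
print(vec.feature_names_)
print(vec_data)

# A single dictionary can be transformed with the already fitted vectorizer
new_row = vec.transform({'Month': 6, 'Year': 2016}).toarray()
print(new_row)

# String values are one-hot encoded: each distinct name becomes a column
name_vec = DictVectorizer()
name_vec.fit([
    {'CustomerName': 'Customer 1', 'Month': 1},
    {'CustomerName': 'Customer 2', 'Month': 1},
])
print(name_vec.feature_names_)
```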

Now that we have our results, let's print them out for the user,

import locale

if __name__ == '__main__':
    locale.setlocale(locale.LC_ALL, '')
    customers = missed_customers()
    for customer in customers:
        print("{} was predicted to buy around {}, they bought only {}".format(
            customer[0],
            locale.currency(customer[1], grouping=True),
            locale.currency(customer[2], grouping=True)
        ))
And now you have a script which will let you know which customers you should be talking to every month.

Where To Go From Here

This script is pretty specific. It only looks for month-to-month patterns, and wouldn't be able to detect a customer who buys something, say, every other week. Also, the heuristic is pretty dumb, as it doesn't take into account any details about the customer other than the difference between the prediction and the result. Both of these problems can be fixed with a bit of work with sklearn.
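As one possible starting point for finer-grained patterns, the sketch below groups individual orders by ISO week instead of by month, which would give a regressor a weekly sales history to learn from. The function name `weekly_totals`, the input format of (date, amount) pairs, and the sample orders are all assumptions made for this illustration; none of it is part of the script above.

```python
import datetime
from collections import defaultdict

def weekly_totals(orders):
    """ Sums order amounts per (ISO year, ISO week).

        orders is an iterable of (datetime.date, amount) pairs.
    """
    totals = defaultdict(float)
    for day, amount in orders:
        # isocalendar() returns (ISO year, ISO week number, ISO weekday)
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += amount
    return dict(totals)

# Hypothetical orders: two in the first ISO week of 2016, one in the second
orders = [
    (datetime.date(2016, 1, 4), 100.0),
    (datetime.date(2016, 1, 6), 50.0),
    (datetime.date(2016, 1, 11), 25.0),
]
totals = weekly_totals(orders)
print(totals)
```

The resulting week numbers could replace the Month column as a feature, at the cost of needing more history per customer before the predictions become useful.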

Overall though, having this tool as a jumping-off point can be very helpful.

Questions or comments? Feel free to contact me.

PY File icon by Arthur Shlain from the Noun Project