I work in an industry where customers with dependable month to month patterns are the norm. In the majority of cases, we can physically look at a customer's buying history and for certain months guess the customer's sales pretty accurately. For example, this chart shows the sales of one of our customers.
The lack of y-axis values is intentional to protect this customer's privacy. Sales in December to February are pretty weak, but we usually get huge sales in June and August while having a low month in July.
This gives us some advantages, the most obvious being much easier inventory planning. Another advantage is that we can reshuffle the list of customers that our customer service team calls each week to include those who were supposed to buy a lot based on their pattern, but didn't for some reason.
Unless they had a larger than expected previous month based on their pattern, we can assume that at least one of three things is true,
In all three cases we want our customer service people to call them. Why? You may have seen numbers that show that targeted emails to specific people are way more effective than emails broadcast to a large number of people. We are achieving the exact same effect, but with phone calls because it's more personal, and our customers are not tech savvy (we still get orders via fax by the way).
The problem is, we have thousands of customers. There's no conceivable way that our customer service team could monitor much more than 100 every month on top of the existing things they have to do.
This is where machine learning comes in. We're going to create a Python script which will
To achieve this, we will use the popular Python machine learning library, sklearn, to do regression analysis. Regression analysis is a statistical technique for predicting an outcome that is numerical rather than categorical. Put another way, you use regression when you need to predict a number for new inputs based on existing data.
Our existing data is the sales histories of our customers, and the input we give the model is the current month and year, to get a prediction of what the customer will buy.
This will not act as a tutorial for sklearn. If you're interested, see their Getting Started Guide.
So, we need something which, given our sales history, will give us a prediction for this month. The most accurate regression I've found for this purpose in sklearn is the RandomForestRegressor.
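As a rough illustration of what that looks like in isolation, here is a minimal sketch that fits a RandomForestRegressor on made-up monthly sales for a single customer and asks for predictions for two months of the following year. The sales figures, the fixed random_state, and the variable names are assumptions chosen for the example, not values from our data.

```python
from sklearn.ensemble import RandomForestRegressor

# Made-up history: weak Januaries, strong Junes, middling other months.
history = []  # rows of (month, year, sales)
for year in range(2009, 2016):
    for month in range(1, 13):
        if month == 1:
            sales = 1000
        elif month == 6:
            sales = 50000
        else:
            sales = 10000
        history.append((month, year, sales))

features = [[month, year] for month, year, _ in history]
targets = [sales for _, _, sales in history]

# random_state is fixed only so the example is repeatable.
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(features, targets)

# Ask for predictions for a year the model has not seen.
january_predict, june_predict = regressor.predict([[1, 2016], [6, 2016]])
print("January 2016: {:.0f}".format(january_predict))
print("June 2016: {:.0f}".format(june_predict))
```

With a seasonal pattern this strong, the June prediction should come out far above the January one, which is the kind of expectation the rest of the script compares against real sales.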
Our script is going to need the following:
The specifics of getting the pandas DataFrame rely on your data setup. At the end, you should have a data frame with the following structure:
Customer Name | Sales | Month | Year |
---|---|---|---|
Customer 1 | 1000 | 1 | 2009 |
Customer 1 | 50000 | 2 | 2009 |
Customer 1 | ... | ... | ... |
Customer 2 | 80000 | 1 | 2009 |
Customer 2 | 9000 | 2 | 2009 |
Customer 2 | 12000 | 3 | 2009 |
Because the data we're going to be giving the model will be in the format of [name, sales, month, year], our given data needs to be in the same format.
Here's the code for the DataFrame,

    import pandas

    def get_dataframe():
        # Placeholder: replace with (name, sales, month, year) rows
        # fetched from your database
        rows = []
        df = pandas.DataFrame.from_records(
            rows,
            columns=['CustomerName', 'Sales', 'Month', 'Year']
        )
        df["CustomerName"] = df["CustomerName"].astype('category')
        return df

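To make the expected shape concrete, here is a small sketch that builds the frame from the sample rows in the table above instead of a database, then looks up one customer's sales for a single month. The hard-coded rows and the names mask and february_sales are placeholders for illustration only.

```python
import pandas

# Sample rows copied from the table above; a real script would query a database.
rows = [
    ('Customer 1', 1000, 1, 2009),
    ('Customer 1', 50000, 2, 2009),
    ('Customer 2', 80000, 1, 2009),
    ('Customer 2', 9000, 2, 2009),
    ('Customer 2', 12000, 3, 2009),
]
df = pandas.DataFrame.from_records(
    rows,
    columns=['CustomerName', 'Sales', 'Month', 'Year']
)
df['CustomerName'] = df['CustomerName'].astype('category')

# Select one customer's sales for February 2009 with a boolean mask.
mask = (
    (df['CustomerName'] == 'Customer 2') &
    (df['Year'] == 2009) &
    (df['Month'] == 2)
)
february_sales = df.loc[mask, 'Sales']
print(february_sales.iloc[0])
```

The same kind of mask is what the main script uses to pull a customer's actual sales for a given month.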
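Next, we're going to feed our regression analysis this data and get predictions for every customer this month. It's then up to us to decide what counts as out of the ordinary. The easiest way is to use a simple heuristic with the following rules: flag a customer when this month's actual sales fall more than 30% below the prediction, unless last month's actual sales beat that month's prediction by more than 30% (measured against last month's actual sales), in which case we assume they simply bought early.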
The following code implements these rules,

    def predict_heuristic(previous_predict, month_predict, actual_previous_value, actual_value):
        """ Heuristic that tries to mark the deviance from real and prediction as
        suspicious or not.
        """
        if (
                actual_value < month_predict and
                abs((actual_value - month_predict) / month_predict) > .3):
            if (
                    actual_previous_value > previous_predict and
                    abs((previous_predict - actual_previous_value) / actual_previous_value) > .3):
                return False
            else:
                return True
        else:
            return False

The numbers and method here can (and probably should) be adjusted for your specific situation. These are just the numbers I found to be the best for reducing false positives in our business.
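If you want to experiment with different cutoffs, one option is to turn the 30% figure into a parameter. The sketch below is a hypothetical variant, named is_suspicious here, that applies the same rules with an adjustable threshold, followed by a few made-up cases showing how it behaves.

```python
def is_suspicious(previous_predict, month_predict,
                  actual_previous_value, actual_value, threshold=0.3):
    """Same rules as predict_heuristic, with the cutoff as a parameter."""
    # Did the customer buy noticeably less than predicted this month?
    shortfall = (
        actual_value < month_predict and
        abs((actual_value - month_predict) / month_predict) > threshold
    )
    if not shortfall:
        return False
    # Did they buy noticeably more than predicted last month?
    bought_early = (
        actual_previous_value > previous_predict and
        abs((previous_predict - actual_previous_value) / actual_previous_value) > threshold
    )
    return not bought_early


# Predicted 10,000 this month but bought 5,000, after a normal previous month.
print(is_suspicious(8000, 10000, 8000, 5000))    # True
# Same shortfall, but last month came in at 16,000 against a predicted 8,000.
print(is_suspicious(8000, 10000, 16000, 5000))   # False
# Only 10% under the prediction, so not worth a call.
print(is_suspicious(8000, 10000, 8000, 9000))    # False
```

Raising the threshold makes the script quieter, and lowering it produces more calls, so it is worth trying a few values against months where you already know which customers slipped.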
Finally, here's where we will create our RandomForestRegressor, grab our predictions from it, and use our heuristic to filter the results,

    import datetime

    from sklearn import ensemble
    from sklearn.feature_extraction import DictVectorizer

    def missed_customers():
        """ Returns a list of tuples of the customer name, the prediction, and
        the actual amount that the customer has bought.
        """
        raw = get_dataframe()
        vec = DictVectorizer()

Because sklearn doesn't understand anything but numerical inputs, we use a "vectorizer". DictVectorizer takes a list of dictionaries and transforms our categorical customer names to numerical inputs automatically.

        # setup
        today = datetime.date.today()
        currentMonth = today.month
        currentYear = today.year
        # The day before the first of this month always falls in last month
        lastDayOfLastMonth = today.replace(day=1) - datetime.timedelta(days=1)
        lastMonth = lastDayOfLastMonth.month
        lastMonthYear = lastDayOfLastMonth.year
        results = []

        # Exclude only this month's rows from the training data
        df = raw.loc[~((raw['Month'] == currentMonth) & (raw['Year'] == currentYear))]

        for customer in df['CustomerName'].unique().tolist():
            # Get this month's and last month's actual sales
            current_sales = raw.loc[
                (raw['CustomerName'] == customer) &
                (raw['Year'] == currentYear) &
                (raw['Month'] == currentMonth),
                'Sales'
            ]
            previous_sales = raw.loc[
                (raw['CustomerName'] == customer) &
                (raw['Year'] == lastMonthYear) &
                (raw['Month'] == lastMonth),
                'Sales'
            ]

            # If the customer has no row for either month, then move on
            if current_sales.empty or previous_sales.empty:
                continue

            actual_value = float(current_sales.iloc[0])
            actual_previous_value = float(previous_sales.iloc[0])

            # Transforming Data: month and year are the inputs, sales the target
            temp = df.loc[df['CustomerName'] == customer]
            targets = temp['Sales']
            records = temp[['Month', 'Year']].to_dict(orient="records")
            vec_data = vec.fit_transform(records).toarray()

            # Fit the regressor, using all available cores
            regressor = ensemble.RandomForestRegressor(n_jobs=-1)
            regressor.fit(vec_data, targets)

            # Predict last month and this month using the regressor
            previous_predict = regressor.predict(vec.transform([{
                'Year': lastMonthYear,
                'Month': lastMonth
            }]).toarray())[0]
            month_predict = regressor.predict(vec.transform([{
                'Year': currentYear,
                'Month': currentMonth
            }]).toarray())[0]

            if predict_heuristic(previous_predict, month_predict, actual_previous_value, actual_value):
                results.append((
                    customer,
                    month_predict,
                    actual_value
                ))

        return results

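To see what the vectorizer mentioned earlier does on its own, here is a small standalone sketch. The two records are made up, and unlike the function above they keep the customer name so that the handling of a categorical value is visible.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records mixing a string value with numeric ones.
records = [
    {'CustomerName': 'Customer 1', 'Month': 1, 'Year': 2009},
    {'CustomerName': 'Customer 2', 'Month': 2, 'Year': 2009},
]
vec = DictVectorizer()
encoded = vec.fit_transform(records).toarray()

# String values become one 0/1 column per distinct value;
# numeric values pass through unchanged.
print(vec.feature_names_)
print(encoded)
```

Each distinct customer name gets its own column, while Month and Year stay as plain numbers. Since the function above trains one model per customer and only passes Month and Year to the vectorizer, its output there is just those two numeric columns.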
Now that we have our results, let's print them out for the user,

    import locale

    if __name__ == '__main__':
        # Use the system locale so amounts print in the local currency format
        locale.setlocale(locale.LC_ALL, '')
        customers = missed_customers()
        for customer in customers:
            print("{} was predicted to buy around {}, they bought only {}".format(
                customer[0],
                locale.currency(customer[1], grouping=True),
                locale.currency(customer[2], grouping=True)
            ))

And now you have a script which will let you know which customers you should be talking to every month.
This script is pretty specific. It only looks for month-to-month patterns, and wouldn't be able to detect a customer who buys something, say, every other week. Also, the heuristic is pretty dumb, as it doesn't take into account any details about the customer other than the difference between the prediction and the result. Both of these problems can be fixed with a bit of work with sklearn.
Overall though, having this tool as a jumping-off point is very helpful.
Questions or comments? Feel free to contact me.
PY File icon by Arthur Shlain from the Noun Project