Executive Summary

The Data

I examine online retail sales data found in the Online Retail dataset that contains all transactions for about a year around 2011 for a UK based online retail. The products are unique gift items and many of the customers are wholesalers.

The dataset contains 25900 invoices for 4373 customers. It needed only little cleaning - I deleted invoices with no customer, I removed a few invoice rows with non-product items like handling and postage, and I removed two cancelled orders with exceptionally large quantities.

After cleaning, the dataset contains 21784 invoices for 4362 customers.

The Goals

Part 1: Customer Segmentation (Clustering)

RFM, Recency, Frequency, and Monetary Value, is an often used way to measure the value of a customer. I introduce a new Customer Segmentation based on the RFM data computed by the Kmeans clustering algorithm and compare it to a traditional fixed Customer Segmentation defined by PUTLER also based on RFM.

Part 2: Predicting Customer Segment for new Customers (Classification)

I examine if it is possible to predict which Customer Segment a new user will end up in at the last date of the dataset based on the size of the customers first invoice in the dataset and the customers demographics, i.e. the customers country.

I classify to both of the two kinds of customer segment sets computed in the first part, and I use and compare the K-nearest neighbor (knn) algorithm, the Random Forest algorithm, and an approach with random segment selection. I chose the knn and Random Forest algorithms because they are base on very different principles.

The Results

The Result Part 1: Customer Segmentation

Each dot in the following two plots is a customer. The position represents the R, F, and M values and the colors represent the customer segment in each of the two systems:

The Putler segmentation has eleven segments and is shown for comparison in the left plot. The new Customer Segmentation computed by Kmeans clustering has seven segments and is shown in the right plot. The percentages of customers in each segment and the names I gave the segments are:

	Name	Cust. pct	Desciption
1	Small and Occasional	9.2%	Recently active but spends low and not very often
2	New	57.8%	Fairly recently active but spended low and have not been active earlier
3	Good Average	0.9%	Active and buys from seldom to often and spends from low to average
4	Big Spenders	0.1%	Spend most money of all and they recently did it again
5	Inactive	14.1%	Not active for a long time and spended low
6	Fading Away	17.7%	Not active for some time and spended low
7	Good and Loyal	0.1%	Spend average or more money and buy very often

The Result Part 2: Predicting Customer Segment

The accuracy of predicting the customer segment from the customers first invoice is:

Method	Putler Segments Accuracy	Kmeans Segments Accurary
Random	0.127	0.355
K-nearest neighbor	0.137	0.306
Random Forest	0.178	0.391

Conclusion

Customer Segmentation

My Machine Learning generated customer segments ends up with seven customer segments compared to eleven for the PUTLER approach. Other traditional RFM approaches, however, splits the the customers into between about five and twenty segments, and so the my Machine Learning generated customer segments are within what can be considered normal. The analysis, though, shows that the two sets of customer segments splits the customers in very different ways, and the segments of the two approaches do not look like each other. It is impossible to say which set of the segment sets that are best because it depends on the intended use.

Further work could be done to make an analysis of the development over time of customers. To find dynamic trends or customer lifecycles would however require a dataset that covers a longer period of at least several years for a business like the one, I analysed in this report.

Predicting Customer Segment

The K-nearest neigbor algorithm predicted the future customer segment worse than randomly for the new Kmeans segments! The Random Forest algorithm predicted the future customer segment of both customer segment sets a little bit better than random and also better the K-nearest neighbor algorithm. The accuracy is still low for both segment sets, though. Predicting using Random Forest might be a beginning of helping the business to a finer segmentation of new customers.

It is clearly easier to predict the Kmeans segments - this is quite naturally because of the fewer segments and the very high number of customers in a single segment.

In this report I did not use the information of the content of the first order and it would be interesting to investigate if using this data could help increase the accuracy of the predicting. Also the Random Forest analyses had a problem finding the best value of mtry and this should be fixed.

The Data and the Goal

I have chosen to examine online retail sales data found in the Online Retail dataset from https://archive.ics.uci.edu/ml/datasets/Online+Retail. The documentation describes the dataset as:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

In the first part of this report I compare a traditional RFM Customer Segmentation of this data to a new Customer Segmentation of this data computed by the Kmeans clustering algorithm.

In the second part of this report I examine if it is possible to predict which Customer Segment a new user will end up in after a year, i.e classification of new customers. I classify to both of the two kinds of customer segment sets computed in the first part, and I use and compare the K-nearest neighbor (knn) algorithm and the Random Forest algorithm for the classifications. I chose knn and Random Forest because they work on different principles.

Data Exploration and Cleaning

Basic Information

The original data Online Retail.xlsx looks like:

## Observations: 541,909
## Variables: 8
## $ InvoiceNo   <chr> "536365", "536365", "536365", "536365", "536365", "5…
## $ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22…
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL L…
## $ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, …
## $ InvoiceDate <dttm> 2010-12-01 07:26:00, 2010-12-01 07:26:00, 2010-12-0…
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85…
## $ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 178…
## $ Country     <chr> "United Kingdom", "United Kingdom", "United Kingdom"…

Each of the 541909 rows is one line of an invoice. Rows with the same InvoiceNo make up an invoice for a customer.

The documentation of the dataset describes the fields:

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, Product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.

The data contains 25900 different invoices and 4373 different customers. The temporal extension of the data is from 2010-12-01 07:26:00 to 2011-12-09 11:50:00, i.e. a couple days more than one year of data.

The first few lines of the data and a summary of the data is:

##   InvoiceNo          StockCode         Description
##  Length:541909      Length:541909      Length:541909
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##
##
##
##     Quantity          InvoiceDate                    UnitPrice
##  Min.   :-80995.00   Min.   :2010-12-01 07:26:00   Min.   :-11062.06
##  1st Qu.:     1.00   1st Qu.:2011-03-28 10:34:00   1st Qu.:     1.25
##  Median :     3.00   Median :2011-07-19 16:17:00   Median :     2.08
##  Mean   :     9.55   Mean   :2011-07-04 13:02:44   Mean   :     4.61
##  3rd Qu.:    10.00   3rd Qu.:2011-10-19 10:27:00   3rd Qu.:     4.13
##  Max.   : 80995.00   Max.   :2011-12-09 11:50:00   Max.   : 38970.00
##
##    CustomerID       Country
##  Min.   :12346    Length:541909
##  1st Qu.:13953    Class :character
##  Median :15152    Mode  :character
##  Mean   :15288
##  3rd Qu.:16791
##  Max.   :18287
##  NA's   :135080

From the summary we make two suspicious observations that need to be investigated further later in this report:

The lowest quantity is negative and the highest quantity is the positive of the same number.
The lowest unit-price is negative and the highest unit-price is nearly UK £ 40,000.

Explore and Clean the Data

Missing data

The number of missing data in each field is:

## Rows with missing InvoiceNo = 0

## Rows with missing StockCode = 0

## Rows with missing Description = 1454

## Rows with missing Quantity = 0

## Rows with missing InvoiceDate = 0

## Rows with missing UnitPrice = 0

## Rows with missing CustomerID = 135080

## Rows with missing Country = 0

I want to analyse the customers and I cannot use invoice rows with no customers. But before I remove the rows with no CustomerID I check if all invoices have exactly one customer. The following prints the number of invoices that have rows with different custumerID’s or have a mix of NA’s and different customerID’s:

onlineRetail %>%
    group_by(InvoiceNo) %>%
    summarize(numberOfCustomerIDs = length(unique(CustomerID))) %>%
    filter(numberOfCustomerIDs != 1) %>%
    nrow()

## [1] 0

This means that no invoices have rows with a mix of customers, or no customer, and I delete all rows without CustomerID:

onlineRetail = onlineRetail[!is.na(onlineRetail$CustomerID), ]
nrow(onlineRetail)

## [1] 406829

And now all the empty descriptions are gone too:

## Description rows with missing data = 0

InvoiceNo

The data contains 22190 different invoice numbers, and all invoice numbers have the right format of 6 digits with or without a ‘C’ prefixed, as the resulting 0 in the following code shows. Nothing needs to be cleaned here:

nrow(onlineRetail[!grepl("^C\\d{6}$", onlineRetail$InvoiceNo) & !grepl("^\\d{6}$", onlineRetail$InvoiceNo), ])

## [1] 0

StockCode

The data contains 3684 different stock codes. The documentation says that StockCode consist of 5 digits. I try to print up to ten rows where StockCode does not consist of exactly 5 digits:

head(onlineRetail[!grepl("^\\d{5}$", onlineRetail$StockCode), ], 10)

So it is seen that not all StockCode follow the documentation. Analysing a little deeper we see that no StockCode contains a lowercase letter:

nrow(onlineRetail[onlineRetail$StockCode != toupper(onlineRetail$StockCode), ])

## [1] 0

We see, though, that the StockCode has a much broader format than told in the documentation. One other format is to append one or two letters to the StockCode which occurs in 33889 rows. Apparently this is often used to indicate the product color as in this example:

nrow(onlineRetail[grepl("^\\d{5}[A-Z]{1,2}$", onlineRetail$StockCode), ])

## [1] 33889

All these products seem to be genuine and valid products and I leave them in the data.

Product codes that differ from five digits with possible one or two letters appended, looks like:

head(onlineRetail[!grepl("^\\d{5}[A-Z]{0,2}$", onlineRetail$StockCode), ], 10)

When collected to a full list of all the different ones we see that they all are about non product items as handling and postage:

onlineRetail[!grepl("^\\d{5}[A-Z]{0,2}$", onlineRetail$StockCode), ] %>%
    group_by(StockCode) %>%
    count(Description) %>%
    top_n(1, wt=n) %>%
    ungroup() %>%
    select(-n)

I remove them from the data because they only are non product items. It will probably not matter much because there only are 1920 invoice rows with one of these product codes:

onlineRetail = onlineRetail[grepl("^\\d{5}[A-Z]{0,2}$", onlineRetail$StockCode),, ]
nrow(onlineRetail)

## [1] 404909

Cancelled Invoices

Invoices with InvoiceNo starting with a ‘c’ is a cancellation invoice that cancels one or more rows in other invoices. For example:

head(onlineRetail[startsWith(tolower(onlineRetail$InvoiceNo), "c"), ], 10)

In these examples all the quantities are negative, and it can be shown that negative quantities correspond to exactly the cancellation invoices because there are no rows with a cancellation invoice with a positive quantity and no rows with a non-cancellation invoice with a negative quantity:

nrow(onlineRetail[startsWith(tolower(onlineRetail$InvoiceNo), "c") ^ onlineRetail$Quantity < 0, ])

## [1] 0

The cancelled rows do not show which other invoice row they cancel. In some cases a probable original invoice row can be found manually, like the suspicious observation about the minimum and maximum quantity made earlier:

onlineRetail[abs(onlineRetail$Quantity) >= 40000, ]

In other cases the cancelled original invoice row cannot be found. For example listing all the onlineRetail rows for customer 17548 show three invoices where one of them, C536391, is a cancellation but there is no rows showing that the customer ever ordered these items - perhaps because the original order was made before the first date of this data:

onlineRetail[onlineRetail$CustomerID == 17548, ]

I decide not to delete cancellation invoice rows in general because the influence the quantity and the amount actually spend by each customer. However, I decide to remove two exceptional orders with Quantity larger than 70000 because they exceptionally large, they were clearly mistakes that were cancelled within 15 minutes, and I do not want them to skew the data:

onlineRetail = onlineRetail[abs(onlineRetail$Quantity) < 40000, ]

Quantity

All quantities are integer numbers as expected as shown by the following zero count:

nrow(onlineRetail[onlineRetail$Quantity != as.integer(onlineRetail$Quantity), ])

## [1] 0

and no quantity is zero:

nrow(onlineRetail[onlineRetail$Quantity == 0, ])

## [1] 0

As seen earlier the lowest quantity is negative and the highest quantity is the positive of the same number. Furthermore, the two largest quantities, 80995 and 74215, are cancelled and I decide to leave the data as is.

onlineRetail[2500 < abs(onlineRetail$Quantity), ]

InvoiceDate

The following example of the lines making up a full invoice can have different timestamps. It looks like the rows have been keyed-in by hand and that it took a little while for each row:

onlineRetail[onlineRetail$InvoiceNo == 544926, ]

The date, however, is constant on all of the invoices, as is seen from the following sorted list of the invoices having the most different dates, and they show that all invoices have at most 1 date:

onlineRetail %>%
    group_by(InvoiceNo) %>%
    summarize(numberOfDates = length(unique(as.Date(InvoiceDate, "%Y-%m-%d")))) %>%
    arrange(desc(numberOfDates)) %>%
    head(5)

UnitPrice

No unit prices are negative but 33 are zero:

onlineRetail[onlineRetail$UnitPrice <= 0, ]

Apparently the company gives away free stuff from time to time. If the items were actually ordered by the customer and then given free to the customer I would want them to the stay in the data to reflect the customers intention. If the items were not ordered by the customer but just given as a free promotional gift they say nothing about the customer and I would delete them from the data. The problem is that I do not know which one is true. Because there are only 33 invoice rows where the unit price is zero I decide to leave them in the data.

CustomerID

The data contains 4362 different customer ID’s. The customer ID’s are all five digit integer as said in the documentation of the data and there is no need to clean-up anything here:

nrow(onlineRetail[onlineRetail$CustomerID != as.integer(onlineRetail$CustomerID), ])

## [1] 0

nrow(onlineRetail[onlineRetail$CustomerID < 10000 | 99999 < onlineRetail$CustomerID, ])

## [1] 0

Each invoice row has a customer and it could be that invoice rows making up a full invoice would have several different customers. This is not so, however, and it is easy to see, that all invoices have no more than one customer. The following is a sorted list of the invoices having the most different customers, and they show that all invoices have at most 1 customer:

onlineRetail %>%
    group_by(InvoiceNo) %>%
    summarize(numberOfCustomers = length(unique(CustomerID))) %>%
    arrange(desc(numberOfCustomers)) %>%
    head(5)

Country

The countries include an ‘Unspecified’ and all seems Ok except that it also includes ‘European Community’ that includes several of the other countries:

unique(onlineRetail$Country)

##  [1] "United Kingdom"       "France"               "Australia"
##  [4] "Netherlands"          "Germany"              "Norway"
##  [7] "EIRE"                 "Switzerland"          "Spain"
## [10] "Poland"               "Portugal"             "Italy"
## [13] "Belgium"              "Lithuania"            "Japan"
## [16] "Iceland"              "Channel Islands"      "Denmark"
## [19] "Cyprus"               "Sweden"               "Austria"
## [22] "Israel"               "Finland"              "Greece"
## [25] "Singapore"            "Lebanon"              "United Arab Emirates"
## [28] "Saudi Arabia"         "Czech Republic"       "Canada"
## [31] "Unspecified"          "Brazil"               "USA"
## [34] "European Community"   "Bahrain"              "Malta"
## [37] "RSA"

It it seen, though, that only one customer (with four invoices) has ‘European Community’ as country and I decide to leave it in the data:

onlineRetail[onlineRetail$Country == 'European Community', ] %>%
    distinct(InvoiceNo, CustomerID)

The Data Types

Finally, I change some of the fields from being strings or numerics to be categorial data. For example the CustomerID that is read as a number from the original data, but it of course make no sense to do arithmetic on customer ID’s, you cannot add two customer ID’s for example. And all in all I end up with the following data ready to be analysed:

## Observations: 404,905
## Variables: 8
## $ InvoiceNo   <fct> 536365, 536365, 536365, 536365, 536365, 536365, 5363…
## $ StockCode   <fct> 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730,…
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL L…
## $ Quantity    <int> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, …
## $ InvoiceDate <dttm> 2010-12-01 07:26:00, 2010-12-01 07:26:00, 2010-12-0…
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85…
## $ CustomerID  <fct> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 178…
## $ Country     <fct> UnitedKingdom, UnitedKingdom, UnitedKingdom, UnitedK…

Sizes after Cleaning the Data

After cleaning the dataset it contains 21784 different invoices and 4362 different customers.

Part 1: Analysis of Customer RFM and Customer Segments

Introduction

RFM is an often used way to measure the value of a customer. The three letters stand for Recency, Frequency, and Monetary Value. For the dataset onlineRetail I use the following:

Recency: the number of days before the last date of the period
Frequency: the number of different invoices and cancellations over the period
Monetary Value: the total amount paid over the period (does take cancellations inside the period into account)

That is, I take the full year-long dataset as the period to compute for. In a real business setting you would get new data every day, and you could periodically recalculate the results I get in this report on data for the last year or two.

My goal is to cluster the customers from their RFM, first using a traditional, fixed algorithm, and second using a machine learning clustering algorithm and finally compare the two customer segmentations.

Calculate RFM for each Customer

I build a new dataset CustomerRFM that contains the RFM values for each customer from the onlineRetail dataset:

RecencyDate = as.Date(max(onlineRetail$InvoiceDate))
CustomerRFM <- onlineRetail %>%
    group_by(CustomerID) %>%
    summarize(Recency = as.numeric(RecencyDate - as.Date(max(InvoiceDate))),
             Frequency = n_distinct(InvoiceNo),
             Monetary = sum(Quantity * UnitPrice))
head(CustomerRFM)

summary(CustomerRFM)

##    CustomerID      Recency         Frequency          Monetary
##  12347  :   1   Min.   :  0.00   Min.   :  1.000   Min.   : -1192.2
##  12348  :   1   1st Qu.: 16.00   1st Qu.:  1.000   1st Qu.:   294.3
##  12349  :   1   Median : 50.00   Median :  3.000   Median :   645.7
##  12350  :   1   Mean   : 91.68   Mean   :  4.994   Mean   :  1899.7
##  12352  :   1   3rd Qu.:143.00   3rd Qu.:  5.000   3rd Qu.:  1596.4
##  12353  :   1   Max.   :373.00   Max.   :242.000   Max.   :278778.0
##  (Other):4356

I need to scale the values before using them. If for example the first feature in a dataset ranges from 0 to 1 and the second feature ranges from 0 to 1000, the computed Euclidean distance between two points, or rows, from the dataset will be nearly totally dominated by the second feature. To avoid this problem the features must be scaled to the same ranges before using them in a clustering algorithm.

I use the following to scale the three feature R, F, and M. The method I use assumes that the values in each feature is close to be normal distributed and scale the values to have mean 0 and standard deviation 1. The values in CustomerRFM are not normally distributed but scaling this way is much better than not scaling:

CustomerRFM$Recency <- scale(CustomerRFM$Recency)
CustomerRFM$Frequency <- scale(CustomerRFM$Frequency)
CustomerRFM$Monetary <- scale(CustomerRFM$Monetary)
head(CustomerRFM)

And we end as expected up with scaled values for each feature:

## [1] "Recency:   mean = -0.000000, sd = 1.000000"

## [1] "Frequency: mean = 0.000000, sd = 1.000000"

## [1] "Monetary:  mean = 0.000000, sd = 1.000000"

Visually Illustrating the Values and Principal Component Analysis

In the following plot it is seen that all three features are pairwise correlated but not in a simple, for example, linear way. Recency, though, seems to be correlated in somewhat the same way to both Frequency and Monetary which suggests that much of the variability in the data can be expressed with only two features, if we can find the suitable features.

pairs(~Recency+Frequency+Monetary,data=CustomerRFM,
    main="Simple Scatterplot Matrix", pch = 20)

To illustrate the RFM values for the customers I first draw a three-dimensional plot where each point is a customer and each of the three coordinates is one of the three RFM features such that the points position is the RFM values of that customer.

scatterplot3d(CustomerRFM$Recency, CustomerRFM$Frequency, CustomerRFM$Monetary,
              pch = 20,
              main="3D Scatterplot",
              xlab = "Recency", ylab = "Frequency", zlab = "Monetary")

It is worth noting that a big part of the space is empty. For example there are no customers that buy frquently but just not recently. A possible explanation is that freqency and recency is connected in that if a customer has not bought in a while then the frequency falls.

It is hard, at least for me, to see the exact three-dimensional positions of the points in this three-dimensional plot. It would be much easier to look at in a turnable, and possibly interactive, three-dimensional plot but this is not possible in the pdf format of this report.

So in addition to the three dimensional plot I also illustrate the positions of the points in a two-dimensional plot where I map all the points from three dimensions into two dimensions using Principal Component Analysis (PCA). PCA computes a new three-dimensional coordinate system where each coordinate, in order, covers as much variability as possible. So dropping the least important coordinate after PCA will usually give a much better result, and never be worse, than just dropping one of the coordinates without PCA.

CustomerRFM_PCA <- prcomp(CustomerRFM[c(2,3,4)])
summary(CustomerRFM_PCA)

## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.2928 0.9498 0.6532
## Proportion of Variance 0.5571 0.3007 0.1422
## Cumulative Proportion  0.5571 0.8578 1.0000

The three new dimensions are sorted by the amount of variability of the data in the coordinates direction (standard deviation). It is seen that the two first coordinates together explain close to 86% of the variance of the data, which means that using the first two coordinates will give a reasonable illustration of the customers RFM values. However, it is an approximation only and there is no need to reduce the number of features from three to two in the further calculations. So I use the PCA for illustration only.

In the following plot it is easy to see, that most of the customers clusters together in one big angular formed cluster with fewer customers radiating out of the big cluster. I had hoped that a small number, five to ten, clusters formed by the positions of the points, i.e. customers, were clearly visible. This is however not the case and any customer segmentation will only be a rough and simplifying description of the much more continuously distributed customers.

plot(CustomerRFM_PCA$x[,1], CustomerRFM_PCA$x[,2],
     xlab="PC1 (56%)", ylab = "PC2 (30%)",
     main = "PC1 / PC2 - plot",
     pch = 20, col = "black")

Customer Segmentation using a Traditional Fixed Method

The Internet contains many sources that explain their way to split customers into Customer Segment using their RFM values. While there seems to be many different ways to segment customers based on RFM, all I have found is based on the same principles and differ only on the details and on the definition of the final customer segments. I use the process from https://www.putler.com/rfm-analysis/ because their method is close to all other methods that I found and they describe it detailed.

First a score 1 to 5 is assigned to each of R, F, and M using quintiles, i.e. it scores 5 if the value is ranked in the best 20%, it scores 4 if ranked in the best 20-40%, etc. For R lower values are best, for F and M higher values are best. I do not need to scale the original RFM values but because the rank is not changed by scaling it does not matter and I use the scaled CustomerRFM that I already have build.

n = nrow(CustomerRFM)
CustomerRFM$RecencyScore   <- as.integer(1 + 5 * (1 - rank(CustomerRFM$Recency) / n))
CustomerRFM$FrequencyScore <- as.integer(1 + 5 * rank(CustomerRFM$Frequency) / n)
CustomerRFM$MonetaryScore  <- as.integer(1 + 5 * rank(CustomerRFM$Monetary) / n)
head(CustomerRFM)

The customer are now assigned to one of eleven customer segments dependent on their R, F, and M scores. For example customers with R, F, and M scores of 5, 5, and 5 are in the Champions customer segment - they bought recently, buy often and spend a lot. The eleven customer segments are called: “Champions”, “Loyal Customers”, “Potential Loyalist”, “Recent Customers”, “Promising”, “Customers Needing Attention”, “About To Sleep”, “At Risk”, “Can’t Lose Them”, “Hibernating”, “Lost”.

Most of the eleven customer segments combine different R, F, and M scores. The following function computes them all:

putlerSegment <- function(row) {
    fm <- as.integer((as.integer(row["FrequencyScore"]) + as.integer(row["MonetaryScore"])) / 2) - 1
    idx = 1 + 5 * (as.integer(row["RecencyScore"]) - 1) + fm
    switch(idx,
        "Lost",
        "Lost",
        "AtRisk",
        "AtRisk",
        "CantLoseThem",
        "Lost",
        "Hibernating",
        "AtRisk",
        "AtRisk",
        "AtRisk",
        "AboutToSleep",
        "AboutToSleep",
        "NeedingAttention",
        "LoyalCustomers",
        "LoyalCustomers",
        "Promising",
        "PotentialLoyalist",
        "PotentialLoyalist",
        "LoyalCustomers",
        "LoyalCustomers",
        "RecentCustomers",
        "PotentialLoyalist",
        "PotentialLoyalist",
        "LoyalCustomers",
        "Champions")
}

and after computing the Putler Customer Segments the data looks like:

CustomerRFM$putlerSegment <- apply(CustomerRFM, 1, putlerSegment)
head(CustomerRFM)

Cleaning up the sums and balances:

CustomerRFM$RecencyScore <- NULL
CustomerRFM$FrequencyScore <- NULL
CustomerRFM$MonetaryScore <- NULL
CustomerRFM$putlerSegment <- as.factor(CustomerRFM$putlerSegment)

head(CustomerRFM)

The following two plots illustrate the customers Putler segments with each segment being a different color:

colors <- colorpalette[CustomerRFM$putlerSegment]

plot3d <- scatterplot3d(CustomerRFM$Recency, CustomerRFM$Frequency, CustomerRFM$Monetary,
                        xlab = "Recency", ylab = "Frequency", zlab = "Monetary",
                        main="Putler Customer Segments", color = colors, pch = 20)
legend(plot3d$xyz.convert(2.9, 10, 40),
       legend = levels(CustomerRFM$putlerSegment),
       col = colorpalette, pch = 16)

colors <- colorpalette[CustomerRFM$putlerSegment]

plot(CustomerRFM_PCA$x[,1], CustomerRFM_PCA$x[,2],
     xlab="PC1 (56%)", ylab = "PC2 (30%)",
     main = "Putler Customer Segments (mapped to 2D)",
     pch = 20, col = colors)
legend("topleft", legend = levels(CustomerRFM$putlerSegment),
       col = colorpalette, pch = 16)

The plots show that the Putler segments seems to differ a lot in how many customers are in each segment and especially the Champions segment seems to cover a much larger area than the other segments.

The number of customers in each Putler segment is:

CustomerRFM %>%
    group_by(putlerSegment) %>%
    summarize(NumberOfCustomers = n()) %>%
    arrange(desc(NumberOfCustomers))

Customer Segmentation using K-means Machine Learning Clustering Method

Clustering is an unsupervised machine learning technique which means that no examples of a correct solution are given beforehand. Instead the goal is to find clusters in the given data using some measure of distance. Several different algorithms exist to find clusters. I choose to use the K-means algorithm with Euclidean distance because it is easy to understand and fast enough for the use here.

K-means starts by selecting k cluster centers randomly or in some predefined pattern. In step 1 all the data points are assigned to its closest cluster center. In step 2 the cluster centers are moved to the mean of all the data points in the cluster. Step1 and 2 are repeated until the cluster centers have converged or a maximal predefined number of steps have been executed. Actually there are implementations of several variations of the basic algorithm but I just use the default.

First I want to find the optimal number of clusters to use. The goal is to have high similarity within each cluster and low similarity across clusters and I use the socalled elbow method where I plot the the percentage of explained variance against the number of clusters. At a certain point adding one more cluster does not increase the percentage of explained variance as much as before. This is the socalled elbow point and the number of clusters at the elbow point is optimal in this sense.

set.seed(9)
pExplainedVar = rep(0,20)
df <- select(CustomerRFM, Recency, Frequency, Monetary)
for (i in 2:20) {
    fit = kmeans(df, centers = i, iter.max = 15, nstart = 20)
    pExplainedVar[i] = 100 * fit$betweenss / fit$totss
}

plot(1:20, pExplainedVar, type="b", pch = 20, xlab="Number of Clusters", ylab="Percentage of Explained Variance by Clusters")
text(1:20, pExplainedVar, labels = 1:20, cex = 0.7, pos = 3)

From the above plot I think that the flattening-out of the increase begins somewhere around seven or eight and I choose to continue to find seven clusters using the K-means algorithm.

set.seed(9)
df <- select(CustomerRFM, Recency, Frequency, Monetary)
k_result <- kmeans(df, centers = 7, iter.max = 15, nstart = 25)

k_result$centers

##      Recency  Frequency    Monetary
## 1 -0.7583631  1.3418178  0.41970033
## 2 -0.5613750 -0.1462504 -0.09410573
## 3 -0.8054969  4.0383745  4.18725744
## 4 -0.8787235  6.1646198 29.13352530
## 5  2.0342762 -0.3861998 -0.18589284
## 6  0.6569728 -0.2858412 -0.14727425
## 7 -0.9001920 17.5936359  6.38494369

Adding the found clusters to the CustomerRFM and draw the same plots as for the Putler segments:

CustomerRFM$kmeansSegment <- as.factor(k_result$cluster)
head(CustomerRFM)

The following two plots illustrate the customers K-means segments with each segment being a different color:

colors <- colorpalette[CustomerRFM$kmeansSegment]

plot3d <- scatterplot3d(CustomerRFM$Recency, CustomerRFM$Frequency, CustomerRFM$Monetary,
                        xlab = "Recency", ylab = "Frequency", zlab = "Monetary",
                        main="K-means Customer Segments", color = colors, pch = 20)
legend(plot3d$xyz.convert(2.9, 10, 40),
       legend = levels(CustomerRFM$kmeansSegment),
       col = colorpalette, pch = 20)

colors <- colorpalette[CustomerRFM$kmeansSegment]

plot(CustomerRFM_PCA$x[,1], CustomerRFM_PCA$x[,2],
     xlab="PC1 (56%)", ylab = "PC2 (30%)",
     main = "K-means Customer Segments (mapped to 2D)",
     pch = 20, col = colors)
legend("topleft", legend = levels(CustomerRFM$kmeansSegment),
       col = colorpalette, pch = 16)

The plots show that the K-means segments seems like the Putler segments to differ a lot in how many customers are in each segment and in how large area they cover.

The number of customers in each K-means segment is:

CustomerRFM %>%
    group_by(kmeansSegment) %>%
    summarize(NumberOfCustomers = n()) %>%
    arrange(desc(NumberOfCustomers))

To be useful we need a business intepretation of the seven K-means clusters. I make that by looking at three projections of the three dimesional plot of the K-means clusters by leaving out one of the coordinates.

colors <- colorpalette[CustomerRFM$kmeansSegment]
plot(CustomerRFM$Recency, CustomerRFM$Frequency,
                        xlab = "Recency", ylab = "Frequency",
                        main="K-means Customer Segments", col = colors, pch = 20)
legend("topright", legend = levels(CustomerRFM$kmeansSegment),
       col = colorpalette, pch = 16)

colors <- colorpalette[CustomerRFM$kmeansSegment]
plot(CustomerRFM$Recency, CustomerRFM$Monetary,
                        xlab = "Recency", ylab = "Monetary",
                        main="K-means Customer Segments", col = colors, pch = 20)
legend("topright", legend = levels(CustomerRFM$kmeansSegment),
       col = colorpalette, pch = 16)

colors <- colorpalette[CustomerRFM$kmeansSegment]
plot(CustomerRFM$Frequency, CustomerRFM$Monetary,
                        xlab = "Frequency", ylab = "Monetary",
                        main="K-means Customer Segments", col = colors, pch = 20)
legend("topright", legend = levels(CustomerRFM$kmeansSegment),
       col = colorpalette, pch = 16)

From looking at the three two-dimsional projections of the K-means clusters I suggest the following description of the K-means segments:

Number	Name	Desciption
1	Small and Occasional	Recently active but spends low and not very often
2	New	Fairly recently active but spended low and have not been active earlier
3	Good Average	Active and buys from seldom to often and spends from low to average
4	Big Spenders	Spend most money of all and they recently did it again
5	Inactive	Not active for a long time and spended low
6	Fading Away	Not active for some time and spended low
7	Good and Loyal	Spend average or more money and buy very often

To compare with the Putler segments I computer for each K-means cluster the percentual composition of customers that are in each of the Putler clusters:

CustomerRFM %>%
    group_by(kmeansSegment) %>%
    count(putlerSegment) %>%
    mutate(percent = 100*n/sum(n))

It is seen from the table that especially K-means segment 7 overlap with many Putler segments, and it is seen that for example the Putler Champions segment is split into several of the K-means segments. All in all the two sets of segments parts the customers in very different ways.

Part 2: Analysis of Predicting Customer Segments from their First Order

Introduction

I want to predict in which segment a new customer ends up at the end of the period of the given data. I will use all the given features of the customer available at the time of the the customers first purchase. This a classification problem and it is grouped as a supervised machine learning problem because we train a model from a training set where the correct answer is given.

I train two different models for each of the K-means segment prediction and the Putler segment prediction. First I use k-nearest neighbors and then I use random forest because I want to try more than one algorithm to see how they compare. Also I compare to a simple random approach of guessing the segment randomly according to the segment sizes.

For this investigation I assume that each customers first non-cancellation order is the customers first order, even though I cannot know if the customer has been a customer for years before the first date in my dataset.

The Class Imbalance Problem

We saw earlier that the number of customers in the Putler segments varies from a few to about 1000, and that the numbers of customers in the Kmeans segments varies from a few to about 2500. The large difference of the class sizes, the imbalanced classes, makes it very possible that the classification algorithms optimises the accuracy at the expense of the precision, and we risk an algorithm that always predicts the most populated category.

Different approaches to prevent this exists, but it is not the purpose of this report to compare different possible solution to the class imbalance problem and I just choose a method that often works and after the training of the models I check if the predictions are spread on several categories. The approach I choose is oversampling of the small categories. It is chosen in the caret train method by adding sampling = "up" to the trainControl.

Feature extraction

First I construct a data frame that holds information about the first purchase for each customer. I ignore all the cancellations because I only want the first non-cancellation order for each customer. In the process we loose about twenty customers who only had cancellations in the dataset.

firstOrder <- onlineRetail %>%
    filter(!startsWith(tolower(as.character(InvoiceNo)), "c")) %>%
    group_by(CustomerID, InvoiceNo) %>%
    summarize(N = n(), TotalQuantity = sum(Quantity),
              TimeOfDay = (60 * hour(min(InvoiceDate)) + minute(min(InvoiceDate))),
              TotalPrice = sum(Quantity*UnitPrice), Country = first(Country),
              InvoiceDate = min(InvoiceDate)) %>%
    slice(which.min(InvoiceDate)) %>%
    ungroup() %>%
    inner_join(CustomerRFM, by = "CustomerID") %>%
    select(-CustomerID, -Recency, -Frequency, -Monetary, -InvoiceDate, -InvoiceNo)

print(nrow(firstOrder))

## [1] 4334

head(firstOrder)

Because I want to use the k-nearest neighbor algorithm, see more about this in a moment, and because the k-nearest neighbor algorithm use the distance between points I need to scale the numeric data before I train the models. Also I change the Country feature to dummy variable, aka one hot encoding, to make it numeric:

dummy <- dummyVars("~ Country", data = firstOrder, sep = NULL)
firstOrderEncoded <- cbind(firstOrder, predict(dummy, newdata = firstOrder))

firstOrderEncoded <- firstOrderEncoded %>%
    select(-Country) %>%
    mutate_if(is.numeric, function(clm) { as.vector(scale(clm)) })

glimpse(firstOrderEncoded)

## Observations: 4,334
## Variables: 43
## $ N                         <dbl> 0.29595976, -0.33561666, 2.02226866, -…
## $ TotalQuantity             <dbl> 0.17380512, 2.42427831, 0.92719281, -0…
## $ TimeOfDay                 <dbl> 0.80532341, 2.59495502, -1.36780069, 1…
## $ TotalPrice                <dbl> 0.560181157, 0.453026216, 1.914849214,…
## $ putlerSegment             <fct> Champions, AtRisk, PotentialLoyalist, …
## $ kmeansSegment             <fct> 2, 2, 2, 5, 2, 6, 5, 6, 2, 2, 2, 2, 2,…
## $ CountryAustralia          <dbl> -0.04561189, -0.04561189, -0.04561189,…
## $ CountryAustria            <dbl> -0.04561189, -0.04561189, -0.04561189,…
## $ CountryBahrain            <dbl> -0.02148427, -0.02148427, -0.02148427,…
## $ CountryBelgium            <dbl> -0.07461341, -0.07461341, -0.07461341,…
## $ CountryBrazil             <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountryCanada             <dbl> -0.03039037, -0.03039037, -0.03039037,…
## $ CountryChannelIslands     <dbl> -0.04561189, -0.04561189, -0.04561189,…
## $ CountryCyprus             <dbl> -0.04021661, -0.04021661, -0.04021661,…
## $ CountryCzechRepublic      <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountryDenmark            <dbl> -0.04299833, -0.04299833, -0.04299833,…
## $ CountryEIRE               <dbl> -0.02631579, -0.02631579, -0.02631579,…
## $ CountryEuropeanCommunity  <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountryFinland            <dbl> -0.05268635, 18.97586794, -0.05268635,…
## $ CountryFrance             <dbl> -0.1431095, -0.1431095, -0.1431095, -0…
## $ CountryGermany            <dbl> -0.1488781, -0.1488781, -0.1488781, -0…
## $ CountryGreece             <dbl> -0.03039037, -0.03039037, -0.03039037,…
## $ CountryIceland            <dbl> 65.81793244, -0.01518992, -0.01518992,…
## $ CountryIsrael             <dbl> -0.02631579, -0.02631579, -0.02631579,…
## $ CountryItaly              <dbl> -0.05692094, -0.05692094, 17.56417464,…
## $ CountryJapan              <dbl> -0.04299833, -0.04299833, -0.04299833,…
## $ CountryLebanon            <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountryLithuania          <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountryMalta              <dbl> -0.02148427, -0.02148427, -0.02148427,…
## $ CountryNetherlands        <dbl> -0.04561189, -0.04561189, -0.04561189,…
## $ CountryNorway             <dbl> -0.04808472, -0.04808472, -0.04808472,…
## $ CountryPoland             <dbl> -0.03722904, -0.03722904, -0.03722904,…
## $ CountryPortugal           <dbl> -0.06634929, -0.06634929, -0.06634929,…
## $ CountryRSA                <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountrySaudiArabia        <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountrySingapore          <dbl> -0.01518992, -0.01518992, -0.01518992,…
## $ CountrySpain              <dbl> -0.08062912, -0.08062912, -0.08062912,…
## $ CountrySweden             <dbl> -0.04299833, -0.04299833, -0.04299833,…
## $ CountrySwitzerland        <dbl> -0.06808083, -0.06808083, -0.06808083,…
## $ CountryUnitedArabEmirates <dbl> -0.02148427, -0.02148427, -0.02148427,…
## $ CountryUnitedKingdom      <dbl> -3.060435, -3.060435, -3.060435, -3.06…
## $ CountryUnspecified        <dbl> -0.03039037, -0.03039037, -0.03039037,…
## $ CountryUSA                <dbl> -0.03039037, -0.03039037, -0.03039037,…

Using the R caret package I first split my data into test and training sets with 70% of the original data being the training set and 30% being the test set:

set.seed(9)
test_index <- createDataPartition(firstOrderEncoded$kmeansSegment, times = 1, p = 0.3, list = FALSE)

test_set <- firstOrderEncoded[test_index, ]
train_set <- firstOrderEncoded[-test_index, ]

cat("Size of test_set =", nrow(test_set), "\n")

## Size of test_set = 1303

cat("Size of train_set =", nrow(train_set), "\n")

## Size of train_set = 3031

Random Guessing for Putler Segments

As a baseline I use a simple predictor that for any customer always predicts randomly according to the sizes of each segment. First for the Putler segments:

putlerSegments <- CustomerRFM %>%
    group_by(putlerSegment) %>%
    summarize(NumberOfCustomers = n())

and when predicting the testset we get the following accuracy:

set.seed(9)
idx <- sample(nrow(putlerSegments), size = nrow(test_set), replace=TRUE, prob = putlerSegments$NumberOfCustomers)
test_pred <- putlerSegments[idx, ]$putlerSegment
accuracy <- mean(test_pred == test_set$putlerSegment)
accuracy_random_onPutler = sprintf("%.2f", accuracy)
accuracy_random_onPutler

## [1] "0.13"

Random Guessing for Kmeans Segments

The baseline for the Kmeans segments is:

kmeansSegments <- CustomerRFM %>%
    group_by(kmeansSegment) %>%
    summarize(NumberOfCustomers = n())

and when predicting the testset we get the following accuracy:

set.seed(9)
idx <- sample(nrow(kmeansSegments), size = nrow(test_set), replace=TRUE, prob = kmeansSegments$NumberOfCustomers)
test_pred <- kmeansSegments[idx, ]$kmeansSegment
accuracy <- mean(test_pred == test_set$kmeansSegment)
accuracy_random_onKmeans = sprintf("%.2f", accuracy)
accuracy_random_onKmeans

## [1] "0.35"

Train and test K-nearest Neigbor Algorithm for Putler Segments

Training

I use the caret package that takes care of cross validation and running on different values of k, the number of neighbors to look at, to find the best value. First I try to find the best k for Putler segments using 5-fold cross validation:

set.seed(9)
trained_knn_onPutler <- train(putlerSegment ~ .,
                              method = "knn",
                              data = select(train_set, -kmeansSegment),
                              trControl = trainControl(method = "cv", number = 5),
                              tuneGrid = data.frame(k = seq(2, 150, 2)))

plot(trained_knn_onPutler, pch = 20)

The algorithm finds the best k = 2 which I find is surprisingly low.

Results

Using k = 2 I now find the accuracy when trying to predict the testset:

test_pred <- predict(trained_knn_onPutler, newdata = test_set)
accuracy <- mean(test_pred == test_set$putlerSegment)
accuracy_knn_onPutler = sprintf("%.2f", accuracy)
accuracy_knn_onPutler

## [1] "0.14"

Looking at the Confusion Matrix it is seen that the method predicts all segments and in different frequency as expected and there seems to be no problems with class imbalance:

confusionMatrix(test_pred, test_set$putlerSegment)$table

##                    Reference
## Prediction          AboutToSleep AtRisk CantLoseThem Champions Hibernating
##   AboutToSleep                18     15            0        12          14
##   AtRisk                      16     24            0        15          14
##   CantLoseThem                 0      1            0         0           0
##   Champions                   14     15            0        14           7
##   Hibernating                  8     17            0        13           4
##   Lost                        16      8            0         9           8
##   LoyalCustomers              10     19            0        20          11
##   NeedingAttention             5     10            0         9           4
##   PotentialLoyalist           21     24            0        14          14
##   Promising                    8      2            0         2           5
##   RecentCustomers              2      1            0         4           2
##                    Reference
## Prediction          Lost LoyalCustomers NeedingAttention PotentialLoyalist
##   AboutToSleep        36             33                3                26
##   AtRisk              38             36                6                23
##   CantLoseThem         2              0                0                 0
##   Champions           27             25                8                21
##   Hibernating         13             28                2                14
##   Lost                44             25                6                23
##   LoyalCustomers      43             46               12                29
##   NeedingAttention    19             17                2                10
##   PotentialLoyalist   49             29                7                25
##   Promising           21              9                2                11
##   RecentCustomers      5              3                2                 3
##                    Reference
## Prediction          Promising RecentCustomers
##   AboutToSleep             11               6
##   AtRisk                    4               3
##   CantLoseThem              0               0
##   Champions                 4               1
##   Hibernating               2               1
##   Lost                      5               5
##   LoyalCustomers            6               2
##   NeedingAttention          1               1
##   PotentialLoyalist         7               5
##   Promising                 5               0
##   RecentCustomers           2               0

Train and Test K-nearest Neigbor Algorithm for Kmeans Segments

Training

Again I first try to find the best k for Kmeans segments using 5-fold cross validation:

set.seed(9)
trained_knn_onKmeans <- train(kmeansSegment ~ .,
                              method = "knn",
                              data = select(train_set, -putlerSegment),
                              trControl = trainControl(method = "cv", number = 5),
                              tuneGrid = data.frame(k = seq(2, 150, 2)))

plot(trained_knn_onKmeans, pch = 20)

As for the Putler segments the best k is surprisingly low at k = 2.

Results

Using k = 2 I now find the accuracy when trying to predict the testset:

test_pred <- predict(trained_knn_onKmeans, newdata = test_set)
accuracy <- mean(test_pred == test_set$kmeansSegment)
accuracy_knn_onKmeans = sprintf("%.2f", accuracy)
accuracy_knn_onKmeans

## [1] "0.31"

Looking at the Confusion Matrix it is seen that the method predicts all segments and in different frequency as expected and there seems to be no problems with class imbalance:

confusionMatrix(test_pred, test_set$kmeansSegment)$table

##           Reference
## Prediction   1   2   3   4   5   6   7
##          1  20  82   3   0  15  32   0
##          2  49 276   4   0  78  75   1
##          3   0   9   0   0   1   3   0
##          4   0   1   0   0   0   0   0
##          5  22 169   2   1  37  52   0
##          6  30 218   4   0  47  70   1
##          7   0   1   0   0   0   0   0

Train and Test Random Forest Algorithm for Putler Segments

Training

A Random Forest works by building many different Decision Trees during training. The predictions are made by majority vote by the decision trees.

The Random Forest algorithm has several parameters - the one that is said to have most influence on the result is mtry that is the number of variables randomly sampled as candidates at each split of decision tress.

Using 5-fold cross validation and the default of 500 decision trees the training is:

set.seed(9)
trained_Forest_onPutler <- train(putlerSegment ~ .,
                              method = "rf",
                              data = select(train_set, -kmeansSegment),
                              trControl = trainControl(method = "cv", number = 5),
                              tuneGrid = data.frame(mtry = seq(1,15,1)))

plot(trained_Forest_onPutler, pch = 20)

Results

The algorithm found the best mtry = 15 but it is at the end of my search interval. From different other peoples experience I expected mtry to be lower than 15 but I was wrong. Because the training is very slow and because the curve seems to flatten at 15 I choose to keep mtry = 15. Using this value I now find the accuracy when trying to predict the testset:

test_pred <- predict(trained_Forest_onPutler, newdata = test_set)
accuracy <- mean(test_pred == test_set$putlerSegment)
accuracy_Forest_onPutler = sprintf("%.2f", accuracy)
accuracy_Forest_onPutler

## [1] "0.18"

Looking at the Confusion Matrix it is seen that the method predicts all segments and in different frequency as expected and there seems to be no problems with class imbalance:

confusionMatrix(test_pred, test_set$putlerSegment)$table

##                    Reference
## Prediction          AboutToSleep AtRisk CantLoseThem Champions Hibernating
##   AboutToSleep                 9     16            0        12           9
##   AtRisk                       9     17            0        13          10
##   CantLoseThem                 0      0            0         0           1
##   Champions                   15     20            0        23           8
##   Hibernating                  3      9            0         6           2
##   Lost                        25     21            0        19          18
##   LoyalCustomers              13     25            0        22          15
##   NeedingAttention            10      8            0         7           3
##   PotentialLoyalist           12     11            0         7          11
##   Promising                   15      6            0         3           3
##   RecentCustomers              7      3            0         0           3
##                    Reference
## Prediction          Lost LoyalCustomers NeedingAttention PotentialLoyalist
##   AboutToSleep        24             27                6                28
##   AtRisk              17             31                4                12
##   CantLoseThem         1              1                0                 0
##   Champions           34             25                7                19
##   Hibernating          7             19                0                 8
##   Lost                91             38               11                35
##   LoyalCustomers      31             58                9                31
##   NeedingAttention    21             10                2                12
##   PotentialLoyalist   35             20                7                17
##   Promising           25             15                2                14
##   RecentCustomers     11              7                2                 9
##                    Reference
## Prediction          Promising RecentCustomers
##   AboutToSleep             10               4
##   AtRisk                    2               1
##   CantLoseThem              0               0
##   Champions                 1               0
##   Hibernating               0               1
##   Lost                      7               6
##   LoyalCustomers            4               3
##   NeedingAttention          5               0
##   PotentialLoyalist         5               4
##   Promising                 9               2
##   RecentCustomers           4               3

Train and test Random Forest Algorithm for Kmeans Segments

Training

For the Kmeans segments I train:

set.seed(9)
trained_Forest_onKmeans <- train(kmeansSegment ~ .,
                              method = "rf",
                              data = select(train_set, -putlerSegment),
                              trControl = trainControl(method = "cv", number = 5),
                              tuneGrid = data.frame(mtry = seq(1,15,1)))

plot(trained_Forest_onKmeans, pch = 20)

Results

Again the algorithm found the best mtry = 15 and as before I choose to keep mtry = 15. Using this value I now find the accuracy when trying to predict the testset:

test_pred <- predict(trained_Forest_onKmeans, newdata = test_set)
accuracy <- mean(test_pred == test_set$kmeansSegment)
accuracy_Forest_onKmeans = sprintf("%.2f", accuracy)
accuracy_Forest_onKmeans

## [1] "0.39"

Looking at the Confusion Matrix it is seen that the method predicts all segments and in different frequency as expected and there seems to be no problems with class imbalance:

confusionMatrix(test_pred, test_set$kmeansSegment)$table

##           Reference
## Prediction   1   2   3   4   5   6   7
##          1  21  83   2   0  15  19   0
##          2  66 417   5   0  89 124   1
##          3   0  12   0   0   2   4   0
##          4   0   1   0   0   0   0   0
##          5  20 136   4   1  40  52   0
##          6  14 106   2   0  32  33   1
##          7   0   1   0   0   0   0   0

Collected Results of Predicting

Method	Putler Segments Accuracy	Kmeans Segments Accurary
Random	0.13	0.35
K-nearest neighbor	0.14	0.31
Random Forest	0.18	0.39

Conclusion

Customer Segments

The traditional, fixed RFM analysis that I used defines eleven different customer segments. The Machine Learning generated customer segments turn out to have only seven customer segments. The analysis shows that the two sets of customer segments splits the customers in very different ways.

The machine learning method created a reasonable sized set of segments that when interpreted seems to make good business sense. It is, however, impossible to say which set of the segment sets that are best because it of course depends on the intended use. If we want a set of segments that is good at describing the states a customer goes through during time, it might be possible to make an analysis of the development over time of customers. In this report I assumed that the customers were close to static over the year of the dataset. I belive that finding dynamic trends or customer lifecycles requires a dataset that covers a longer period of at least several years for a business like the one, I analysed in this report.

Predicting Customer Segment

Using the K-nearest neigbor algorithm to predict the future customer segment from customers first order and their country did worse than random for the Kmeans segment! However, the Random Forest algorithm predicted the future customer segment from customers first order and their country a little better than random for both segment sets. Even though better than random, the accuracy is still low for both segment sets. Predicting using Random Forest might, though, be a beginning of helping the business to a finer segmentation of new customers.

It is clearly easier to predict the Kmeans segments - this is quite naturally because of the fewer segments and the very high number of customers in a single segment.

In this report I did not use the information of the content of the first order. It would be interesting to investigate if using this data, or perhaps even making a Market Basket Analysis, could help increase the accuracy of the predicting. Also I should refine the Random Forest analyses to be sure to find the really best value of mtry.

Data Science Professional Certificate/HarvardX Capstone Project 2: Online Retail

Paw Hermansen

August 21, 2019

Executive Summary

The Data

The Goals

Part 1: Customer Segmentation (Clustering)

Part 2: Predicting Customer Segment for new Customers (Classification)

The Results

The Result Part 1: Customer Segmentation

The Result Part 2: Predicting Customer Segment

Conclusion

Customer Segmentation

Predicting Customer Segment

The Data and the Goal

Data Exploration and Cleaning

Basic Information

Explore and Clean the Data

Missing data

InvoiceNo

StockCode

Cancelled Invoices

Quantity

InvoiceDate

UnitPrice

CustomerID

Country

The Data Types

Sizes after Cleaning the Data

Part 1: Analysis of Customer RFM and Customer Segments

Introduction

Calculate RFM for each Customer

Visually Illustrating the Values and Principal Component Analysis

Customer Segmentation using a Traditional Fixed Method

Customer Segmentation using K-means Machine Learning Clustering Method

Part 2: Analysis of Predicting Customer Segments from their First Order

Introduction

The Class Imbalance Problem

Feature extraction

Random Guessing for Putler Segments

Random Guessing for Kmeans Segments

Train and test K-nearest Neigbor Algorithm for Putler Segments

Training

Results

Train and Test K-nearest Neigbor Algorithm for Kmeans Segments

Training

Results

Train and Test Random Forest Algorithm for Putler Segments

Training

Results

Train and test Random Forest Algorithm for Kmeans Segments

Training

Results

Collected Results of Predicting

Conclusion

Customer Segments

Predicting Customer Segment