I am a fan of Thomas Davenport, who writes about advanced business analytics for publishers such as the Harvard Business Review and Forbes. Davenport has a simple, uncomplicated view of analytics moving from classical business intelligence (descriptive summaries and simple charts) through to predictive and prescriptive analytics (predicting the future, and making decisions based on those predictions, respectively). This theme of evolution in business intelligence (BI) is similarly reflected in the various reports from Gartner, material from SAS, whitepapers from EY and even in the emerging sales pitch for Microsoft Azure and Azure Machine Learning.
Across all of these media, there is a consistent emphasis on prediction: the prediction of customer churn, the prediction of house prices, the prediction of credit scores, etc. However, prediction alone is a fairly mindless exercise, which often fails to highlight the factors that are truly driving business trends. There is a large bridge to build between raw prediction and business insight / decision making.
In this post, I want to try and give some substance to this idea and demonstrate how exploratory analysis can help cross that gap between typical business intelligence and predictive analytics. I will use a time series dataset from AdventureWorksDW2012 to demonstrate the appeal of very simple exploratory techniques to build upon simple charts and help uncover patterns and relationships in this data that are not revealed via predictive analytics.
To be concise, I am less interested in predicting next month's sales than I am in understanding which products are selling well, which are underperforming and whether there are clear groups of products which behave similarly. Whilst I am a huge fan of machine learning, hopefully this post will remain as a reminder to myself, and others, that simple exploratory techniques can yield a wealth of information.
The Data & Classical BI

For this post, we are going to work with a time series dataset from AdventureWorksDW2012. This dataset plays a central role in Microsoft's Data Mining Tutorial. Everything we do here will be in R. I have created an ODBC connection to the server which hosts this data warehouse, and can extract the data as shown below:
    library(RODBC)

    # Connect via the ODBC data source, pull the time series view, then close the connection
    conn <- odbcConnect("R-SQL")
    vTimeSeries <- sqlQuery(conn, "select * from AdventureWorksDW2012.dbo.vTimeSeries")
    odbcClose(conn)

The data describes the monthly sales of 4 different products, across 3 regions between 2005 and 2008. The raw data looks like this:
Monthly sales by region

    ModelRegion         TimeIndex  Quantity  Amount    CalendarYear  Month  ReportingDate
    R750 North America  200712     62        33479.38  2007          12     2007-12-25
    M200 Europe         200604     8         27099.92  2006          4      2006-04-25
    R750 Europe         200802     35        18899.65  2008          2      2008-02-25
    R250 Europe         200803     16        39093.60  2008          3      2008-03-25
    R250 Europe         200604     27        96613.29  2006          4      2006-04-25

The goal of Microsoft's tutorial is to predict future sales. However, we intend to dig a little deeper and describe the sales of each product across all 3 regions. To begin with, it seems sensible to plot the trend over time:
    library(ggplot2)

    # One panel per product / region, with sales plotted in thousands of dollars
    ggplot(vTimeSeries, aes(x = factor(TimeIndex), y = Amount / 1000, group = ModelRegion)) +
      geom_point(colour = "darkblue") +
      geom_line(colour = "darkblue", alpha = 0.5) +
      facet_wrap(~ ModelRegion, ncol = 3) +
      theme_minimal() +
      theme(panel.grid = element_blank(), axis.text.x = element_blank()) +
      ylab("$ (x 1000)") +
      xlab("Time") +
      ggtitle("Product Sales by Region") +
      guides(group = "none", colour = "none")

This code produces the graph below.

This is nice, clean data (which we would expect from a tutorial). All the same, there are some interesting trends above. We can see that the M200 product line shows sales growth in Europe and North America. There has been a sharp decline in the sales of the R250 product in North America, and a less extreme decline in Europe and the Pacific. The sales of R750 look like they have slowly increased in all three regions. The T1000 product has performed quite well since its introduction in the later part of this period. There are some interesting peaks and troughs which could be explored with the business, but let's try to capture the major sales trends.
Product Clustering

It's great that this dataset is so clean, and that the trends are so easy to see. This will make it easier for us to understand the following two plots and come to terms with what they represent. In any dataset, there are three questions that I really want answered: which items (products in this case) are similar, which are dissimilar, and what is it that separates groups of products from each other? We are going to use hierarchical clustering to provide some initial answers to these questions, and principal components analysis to visualise the contrasts.
Currently, the data is held in a long format: one row per observation. This is ideal for machine learning algorithms, but not for most exploratory methods. We have flipped this data around so that there is one row per product / region, and separate columns for every time point. In the end, we have a matrix with 12 rows (4 products x 3 regions) and 36 columns (monthly data over 3 years). To keep this post short and to the point, I have excluded the code for this, but there are many ways of achieving it.
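One possible way of doing this long-to-wide flip can be sketched in base R. This is a minimal sketch, not the code used for this post: the `toy` data frame and its values are made up for illustration, and only the column names (`ModelRegion`, `TimeIndex`, `Amount`) are borrowed from the data shown earlier.

```r
# Toy stand-in for vTimeSeries: invented values, NOT taken from AdventureWorksDW2012
set.seed(1)
toy <- expand.grid(
  ModelRegion = c("M200 Europe", "M200 Pacific", "R250 Europe"),
  TimeIndex   = c(200601, 200602, 200603, 200604),
  stringsAsFactors = FALSE
)
toy$Amount <- round(runif(nrow(toy), 10000, 90000), 2)

# Flip from long to wide: one row per product / region, one column per time point.
# tapply() returns a matrix when given two grouping variables; sum() is harmless
# here because each product / region / month cell holds a single observation.
wide <- tapply(toy$Amount, list(toy$ModelRegion, toy$TimeIndex), sum)

dim(wide)       # 3 rows (product / region) by 4 columns (time points)
rownames(wide)  # the ModelRegion labels
```

Assuming the real data holds one row per product / region per month, the same `tapply()` call on `vTimeSeries` should give the 12 x 36 matrix described above. Any product / region with no sales recorded in a given month would appear as `NA`, which would need handling before clustering.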
To identify groups of similar / dissimilar products, we are going to use hierarchical clustering. Hierarchical clustering builds a tree, where similar products will end up in the same branches and dissimilar products will end up in other branches. We are going to use R's heatmap function for this, and ask it to compare products using a correlation distance (note that the default is Euclidean distance, but as this is longitudinal data, a correlation dista