Beyond prediction with R & SQL Server

I am a fan of Thomas Davenport, who writes about advanced business analytics for publishers such as the Harvard Business Review and Forbes. Davenport has a simple, uncomplicated view of analytics moving from classical business intelligence (descriptive summaries and simple charts) through to predictive and prescriptive analytics (predicting the future and making decisions based on those predictions, respectively). There is a theme of evolution in business intelligence (BI) here that is similarly reflected in the various reports from Gartner, material from SAS, whitepapers from EY and even in the emerging sales pitch for Microsoft Azure and Azure Machine Learning.

Across all of these media, there is a consistent emphasis on prediction: the prediction of customer churn, the prediction of house prices, the prediction of credit scores and so on. However, prediction alone is a pretty mindless exercise, which often fails to highlight the factors that are truly driving business trends. There is a large bridge to build between raw prediction and business insight / decision making.

In this post, I want to try and give some substance to this idea and demonstrate how exploratory analysis can help cross that gap between typical business intelligence and predictive analytics. I will use a time series dataset from AdventureWorksDW2012 to demonstrate the appeal of very simple exploratory techniques to build upon simple charts and help uncover patterns and relationships in this data that are not revealed via predictive analytics.

To be concise, I am less interested in predicting next month's sales than I am in understanding which products are selling well, which are underperforming and whether there are clear groups of products which behave similarly. Whilst I am a huge fan of machine learning, hopefully this post will remain as a reminder to myself, and others, that simple exploratory techniques can yield a wealth of information.

The Data & Classical BI

For this post, we are going to work with a time series dataset from AdventureWorksDW2012. This dataset plays a central role in Microsoft's Data Mining Tutorial. Everything we do here will be in R. I have created an ODBC connection to the server which hosts this data warehouse, and can extract the data as shown below:

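library(RODBC)

# "R-SQL" is the name of the ODBC data source created for the server
conn <- odbcConnect("R-SQL")

# Pull the whole time series view into a data frame, then close the connection
vTimeSeries <- sqlQuery(conn, "select * from AdventureWorksDW2012.dbo.vTimeSeries")
odbcClose(conn)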

The data describes the monthly sales of 4 different products, across 3 regions between 2005 and 2008. The raw data looks like this:

Monthly sales by region

ModelRegion         TimeIndex  Quantity  Amount    CalendarYear  Month  ReportingDate
R750 North America  200712     62        33479.38  2007          12     2007-12-25
M200 Europe         200604     8         27099.92  2006          4      2006-04-25
R750 Europe         200802     35        18899.65  2008          2      2008-02-25
R250 Europe         200803     16        39093.60  2008          3      2008-03-25
R250 Europe         200604     27        96613.29  2006          4      2006-04-25

The goal of Microsoft's tutorial is to predict future sales. However, we intend to dig a little deeper and describe the sales of each product across all 3 regions. To begin with, it seems sensible to plot the trend over time:

library(ggplot2)

ggplot(vTimeSeries, aes(x = factor(TimeIndex), y = Amount / 1000, group = ModelRegion)) +
  geom_point(colour = "darkblue") +
  geom_line(colour = "darkblue", alpha = 0.5) +
  facet_wrap(~ ModelRegion, ncol = 3) +
  theme_minimal() +
  theme(panel.grid = element_blank(), axis.text.x = element_blank()) +
  ylab("$ (x 1000)") +
  xlab("Time") +
  ggtitle("Product Sales by Region") +
  guides(group = "none", colour = "none")

This code produces the graph below.

This is clean, nice data (which we would expect from a tutorial). All the same, there are some interesting trends above. We can see that the M200 product line shows sales growth in Europe and North America. There has been a sharp decline in the sales of the R250 product in North America, and a less extreme decline in Europe and the Pacific. The sales of R750 look like they have slowly increased in all three regions. The T1000 product has performed quite well since its introduction in the later part of this period. There are some interesting peaks and troughs which could be explored with the business, but let's try to capture the major sales trends.

Product Clustering

It's great that this dataset is so clean, and that the trends are so easy to see. This will make it easier for us to understand the following two plots and come to terms with what they represent. In any dataset, there are three questions that I really want answered: which items (products in this case) are similar, which are dissimilar, and what is it that separates groups of products from each other? We are going to use hierarchical clustering to provide some initial answers to these questions, and principal components analysis to visualise the contrasts.

Currently, the data is held in a long format: one row per observation. This is ideal for machine learning algorithms, but not for most exploratory methods. We have flipped this data around so that there is one row per product / region, and separate columns for every time point. In the end, we have a matrix with 12 rows (4 products x 3 regions) and 36 columns (monthly data over 3 years). To keep this post short and to the point, I have excluded the code for this, but there are many ways of achieving it.
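For readers who want to try the reshaping step, here is one possible sketch in base R. It is not the code used for this post. It runs on a small made-up data frame (`toy`) that borrows only the column names ModelRegion, TimeIndex and Amount from vTimeSeries, and it assumes that a missing product / month combination means zero sales, which may or may not be appropriate for the real data.

```r
# Toy stand-in for vTimeSeries. The values are invented; only the column
# names (ModelRegion, TimeIndex, Amount) come from the real view.
set.seed(1)
toy <- expand.grid(
  ModelRegion = c("M200 Europe", "M200 Pacific", "R250 Europe"),
  TimeIndex   = c(200701, 200702, 200703, 200704),
  stringsAsFactors = FALSE
)
toy$Amount <- round(runif(nrow(toy), 1000, 5000), 2)

# Drop one row to mimic a product with no recorded sales in a given month.
toy <- toy[!(toy$ModelRegion == "R250 Europe" & toy$TimeIndex == 200701), ]

# Long -> wide: one row per product / region, one column per time point.
sales_wide <- with(toy, tapply(Amount, list(ModelRegion, TimeIndex), sum))

# Assumption: an empty cell means no sales, so replace NA with zero.
sales_wide[is.na(sales_wide)] <- 0

dim(sales_wide)
```

Applied to the full dataset, the same `tapply` call on vTimeSeries should give one row per ModelRegion and one column per TimeIndex. Packages such as reshape2 or tidyr offer equivalent reshaping functions.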

To identify groups of similar / dissimilar products, we are going to use hierarchical clustering. Hierarchical clustering builds a tree, where similar products will end up in the same branches and dissimilar products will end up in other branches. We are going to use R's heatmap function for this, and ask it to compare products using a correlation distance (note, the default is euclidean distance, but as this is longitudinal data, a correlation dista
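The clustering idea described above can be sketched as follows. This is an illustration under assumptions rather than the exact call used for this post: the matrix `toy` is synthetic (two rising and two falling series with invented names), `cor_dist` is a helper defined here, and the choice to leave the columns in time order (`Colv = NA`) is mine.

```r
# Synthetic wide matrix: one row per (hypothetical) product / region,
# one column per month. Two series rise over time, two fall.
set.seed(42)
months <- 1:24
toy <- rbind(
  "A Europe"  = 10 + 2 * months + rnorm(24),
  "A Pacific" =  5 + 1 * months + rnorm(24),
  "B Europe"  = 80 - 2 * months + rnorm(24),
  "B Pacific" = 40 - 1 * months + rnorm(24)
)
colnames(toy) <- paste0("t", months)

# Correlation distance between rows: 0 when two series move together,
# 2 when they move in exactly opposite directions.
cor_dist <- function(m) as.dist(1 - cor(t(m)))

# Build the tree directly and cut it into two groups.
tree   <- hclust(cor_dist(toy))
groups <- cutree(tree, k = 2)

# heatmap() accepts the same distance via distfun; Colv = NA keeps the
# columns in their original (time) order instead of clustering them.
heatmap(toy, distfun = cor_dist, Colv = NA, scale = "row")
```

With a correlation distance, series are grouped by the shape of their trend rather than by their overall sales volume, so "A Europe" and "A Pacific" land in the same branch even though their levels differ.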
