
Telco Customer Churn with R in SQL Server 2016


by Fang Zhou, Data Scientist at Microsoft, and Wee Hyong Tok, Senior Data Scientist Manager at Microsoft

In today’s fast-paced world, mobile phone customers have many choices and can easily switch between service providers. Reducing customer attrition and enhancing the customer experience are valuable ways to reduce customer acquisition costs and maintain high-quality service.

Many industries, including mobile phone service providers, use churn models to predict which customers are most likely to leave, and to understand which factors cause customers to stop using their service. In this blog post, we sketch a solution to help providers, especially telecommunication companies, predict customer churn. The solution provides guidance on how to harness advanced analytics in SQL Server 2016, leveraging its new R Services feature together with familiar SQL tools and technologies. To be specific, it covers:

1. How a data scientist can make use of SQL Server R Services to do data science, and
2. How to operationalize the telco churn models via T-SQL stored procedures to build an intelligent application.
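To give a feel for the operationalization step, scoring a trained model can be wrapped in a T-SQL call to sp_execute_external_script, the mechanism SQL Server R Services exposes for running R in-database. The sketch below is illustrative only: the tables (dbo.ChurnModels, dbo.NewCustomers), column names and model name are hypothetical placeholders, and it assumes a serialized model was previously saved to a table.

```sql
-- Sketch: scoring a previously trained churn model from T-SQL.
-- Table, column and model names are hypothetical placeholders.
DECLARE @model varbinary(max) =
    (SELECT TOP 1 model FROM dbo.ChurnModels WHERE name = 'churn_model');

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Deserialize the model and score the incoming batch
        mdl <- unserialize(model)
        scores <- predict(mdl, newdata = InputDataSet)
        OutputDataSet <- data.frame(customerid = InputDataSet$customerid,
                                    churn_probability = scores)
    ',
    @input_data_1 = N'SELECT customerid, callfailurerate, totalminutes,
                             numcomplaints FROM dbo.NewCustomers',
    @params = N'@model varbinary(max)',
    @model = @model
WITH RESULT SETS ((customerid int, churn_probability float));
```

Wrapping this call in a stored procedure is what lets an application request churn scores with a single T-SQL statement, keeping both the data and the model inside the database.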

The usage and comparison of CRAN R and RevoScaleR packages are also highlighted here.

Data Science with R Services

In the case of telco customer churn, we collected a combination of call detail record data and customer profile data from a mobile carrier, and then followed the data science process -- data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation -- in order to achieve the churn prediction. With a churn indicator in the dataset taking the value 1 when the customer has churned and 0 otherwise, we addressed the problem as a binary classification problem and tried various tree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much smaller than the number of non-churned customers (making the dataset quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to raise the proportion of the minority class relative to the majority class in the training dataset, further improving model performance, especially precision and recall.
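As a minimal sketch of this setup -- a 0/1 churn label on heavily unbalanced data, a train/test split, and SMOTE-style rebalancing -- the following uses synthetic data with illustrative column names; the ubSMOTE call (from the "unbalanced" package, used later in this solution) is guarded so the sketch still runs where that package is not installed, and its parameters are illustrative:

```r
# Sketch: unbalanced binary churn data, 7:3 split, SMOTE-style rebalancing.
# Synthetic data; column names are illustrative only.
set.seed(42)
n <- 5000
df <- data.frame(
  callfailurerate = runif(n, 0, 0.2),
  numcomplaints   = rpois(n, 1)
)
# Churn indicator: 1 = churned, 0 = non-churned, ~9% churn as in the raw data
df$churn <- rbinom(n, 1, 0.09)

# 7:3 train/test split
idx   <- sample(n, size = round(0.7 * n))
train <- df[idx, ]
test  <- df[-idx, ]

mean(train$churn)  # roughly 0.09 before rebalancing

if (requireNamespace("unbalanced", quietly = TRUE)) {
  # perc.over / perc.under control how many synthetic minority examples
  # are created and how much of the majority class is kept
  sm <- unbalanced::ubSMOTE(X = train[, c("callfailurerate", "numcomplaints")],
                            Y = factor(train$churn),
                            perc.over = 200, perc.under = 400, k = 5)
  balanced <- cbind(sm$X, churn = sm$Y)
  mean(balanced$churn == "1")  # noticeably higher minority proportion
}
```

The rebalanced training set, not the test set, is what the tree-based learners are then fit on, so evaluation still reflects the true class distribution.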

All of the above data science procedures could be implemented with base R. Rather than moving the data out of the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external-memory algorithms in the RevoScaleR library. In what follows, we describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.
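The in-database pattern looks roughly like the sketch below. The connection string, table name and formula are hypothetical placeholders, and the RevoScaleR steps are guarded because they only run on a machine with SQL Server R Services installed:

```r
# Sketch: pushing R computation into SQL Server via RevoScaleR.
# Connection string, table and formula are hypothetical placeholders.
connStr <- "Driver=SQL Server;Server=.;Database=TelcoChurnDB;Trusted_Connection=yes"

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # The data stays in the database; RxSqlServerData is only a reference to it.
  churnData <- RxSqlServerData(table = "dbo.ChurnTraining",
                               connectionString = connStr)

  # Switch the compute context so training happens inside SQL Server.
  rxSetComputeContext(RxInSqlServer(connectionString = connStr))

  # Train a RevoScaleR decision forest in-database.
  forest <- rxDForest(churn ~ callfailurerate + totalminutes + numcomplaints,
                      data = churnData, nTree = 100)

  rxSetComputeContext("local")  # back to local execution when done
}
```

Only the compute context changes between local and in-database execution; the rx* modeling calls themselves stay the same.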

In order to explore and visualize the data, we could leverage a range of CRAN R packages for manipulation and visualization, such as dplyr, ggplot2 and d3heatmap. To avoid the memory limitations of CRAN R and avoid moving the raw data out of the database, we instead used RevoScaleR functions for data aggregation. The rxCube and rxCrossTabs functions, used under the in-SQL compute context, are all that is needed for the visualization work. Below is a sample interactive visualization dashboard created by adding the power of leaflet and shiny.


[Figure: interactive visualization dashboard for telco customer churn]
These graphs not only helped us gain a visual understanding of the raw dataset but also provided insights for choosing the right methodologies for the subsequent analysis. For instance, the rarity of churn events (a proportion of 9% among all observations), indicated by the pie chart, led us to pursue data rebalancing methods like SMOTE before modeling. The impact of education, call failure rate, state and number of complaints on customer churn, shown by the bar chart and the map, implied that these are likely strong influencers.

During the data preparation and feature engineering step, we split the data into training and test datasets in a ratio of 7:3 and, more importantly, applied SMOTE to the training dataset using the function ubSMOTE from the R package "unbalanced". This yielded a churn proportion of 23% in the training data. After that, we tried different sets of tree-based algorithms in order to build a good classification model, including the CRAN R algorithms randomForest and xgboost, and the RevoScaleR algorithms rxDForest and rxBTrees. In terms of model performance, we found that xgboost outperformed all the others, but the RevoScaleR algorithms take advantage of in-database analytics. The confusion table and model performance metrics for the xgboost model are shown below.

                     Predicted non-churn    Predicted churn
Actual non-churn     5138                   405
Actual churn         300                    279

Confusion Table for XGBoost. Note: Accuracy=0.885; Precision=0.407; Recall=0.482; F-Score=0.441; AUC=0.85
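As a quick arithmetic check (ours, not part of the original analysis), the reported metrics follow directly from the four counts in the confusion table above:

```r
# Recompute the xgboost metrics from the confusion-table counts
tn <- 5138  # actual non-churn, predicted non-churn
fp <- 405   # actual non-churn, predicted churn
fn <- 300   # actual churn, predicted non-churn
tp <- 279   # actual churn, predicted churn

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_score   <- 2 * precision * recall / (precision + recall)

round(c(accuracy, precision, recall, f_score), 3)
```

Any difference in the last digit versus the reported figures comes down to rounding.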

The comparison among the different tree-based algorithms is summarized in the following table.

Package   Algorithm      Accuracy   Precision   Recall   F-score   AUC    Computation Location
CRAN R    randomForest   0.907      0.552       0.117    0.194     0.76   Memory
CRAN R    xgboost        0.885      0.407       0.482    0.441     0.85   Memory
ScaleR    rxDForest      0.903      0.475       0.145

