
Telco Customer Churn with R in SQL Server 2016


by Fang Zhou, Data Scientist at Microsoft, and Wee Hyong Tok, Senior Data Scientist Manager at Microsoft

In today’s fast-paced world, mobile phone customers have many choices and can easily switch between service providers. Reducing customer attrition and enhancing the customer experience are valuable ways to reduce customer acquisition costs and maintain high-quality service.

Many industries, including mobile phone service providers, use churn models to predict which customers are most likely to leave, and to understand which factors cause customers to stop using their service. In this blog post, we sketch a solution to help providers, especially telecommunication companies, predict customer churn. The solution provides guidance on how to harness advanced analytics in SQL Server 2016, leveraging its new R Services feature together with familiar SQL tools and technologies. To be specific, it covers:

1. How a data scientist can make use of SQL Server R Services to do data science, and
2. How to operationalize the telco churn models via T-SQL stored procedures to build an intelligent application.
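To give a feel for the operationalization step, scoring a trained model can be wrapped in a T-SQL call to sp_execute_external_script, the mechanism SQL Server R Services exposes for running R in-database. The sketch below is illustrative only: the tables (dbo.ChurnModels, dbo.NewCustomers), column names and model name are hypothetical placeholders, and it assumes a serialized model was previously saved to a table.

```sql
-- Sketch: scoring a previously trained churn model from T-SQL.
-- Table, column and model names are hypothetical placeholders.
DECLARE @model varbinary(max) =
    (SELECT TOP 1 model FROM dbo.ChurnModels WHERE name = 'churn_model');

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Deserialize the model and score the incoming batch
        mdl <- unserialize(model)
        scores <- predict(mdl, newdata = InputDataSet)
        OutputDataSet <- data.frame(customerid = InputDataSet$customerid,
                                    churn_probability = scores)
    ',
    @input_data_1 = N'SELECT customerid, callfailurerate, totalminutes,
                             numcomplaints FROM dbo.NewCustomers',
    @params = N'@model varbinary(max)',
    @model = @model
WITH RESULT SETS ((customerid int, churn_probability float));
```

Wrapping this call in a stored procedure is what lets an application request churn scores with a single T-SQL statement, keeping both the data and the model inside the database.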

The usage and comparison of CRAN R and RevoScaleR packages are also highlighted here.

Data Science with R Services

In the case of telco customer churn, we collected a combination of call detail record data and customer profile data from a mobile carrier, and then followed the data science process -- data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation -- in order to achieve the churn prediction. With a churn indicator in the dataset taking the value 1 when the customer has churned and 0 otherwise, we addressed the problem as a binary classification problem and tried various tree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much smaller than the number of non-churned customers (making the dataset quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to raise the proportion of the minority class relative to the majority class in the training dataset, further improving model performance, especially precision and recall.
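As a minimal sketch of this setup -- a 0/1 churn label on heavily unbalanced data, a train/test split, and SMOTE-style rebalancing -- the following uses synthetic data with illustrative column names; the ubSMOTE call (from the "unbalanced" package, used later in this solution) is guarded so the sketch still runs where that package is not installed, and its parameters are illustrative:

```r
# Sketch: unbalanced binary churn data, 7:3 split, SMOTE-style rebalancing.
# Synthetic data; column names are illustrative only.
set.seed(42)
n <- 5000
df <- data.frame(
  callfailurerate = runif(n, 0, 0.2),
  numcomplaints   = rpois(n, 1)
)
# Churn indicator: 1 = churned, 0 = non-churned, ~9% churn as in the raw data
df$churn <- rbinom(n, 1, 0.09)

# 7:3 train/test split
idx   <- sample(n, size = round(0.7 * n))
train <- df[idx, ]
test  <- df[-idx, ]

mean(train$churn)  # roughly 0.09 before rebalancing

if (requireNamespace("unbalanced", quietly = TRUE)) {
  # perc.over / perc.under control how many synthetic minority examples
  # are created and how much of the majority class is kept
  sm <- unbalanced::ubSMOTE(X = train[, c("callfailurerate", "numcomplaints")],
                            Y = factor(train$churn),
                            perc.over = 200, perc.under = 400, k = 5)
  balanced <- cbind(sm$X, churn = sm$Y)
  mean(balanced$churn == "1")  # noticeably higher minority proportion
}
```

The rebalanced training set, not the test set, is what the tree-based learners are then fit on, so evaluation still reflects the true class distribution.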

All of the above data science procedures could be implemented with base R. Rather than moving the data out of the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external-memory algorithms in the RevoScaleR library. In what follows, we describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.
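The in-database pattern looks roughly like the sketch below. The connection string, table name and formula are hypothetical placeholders, and the RevoScaleR steps are guarded because they only run on a machine with SQL Server R Services installed:

```r
# Sketch: pushing R computation into SQL Server via RevoScaleR.
# Connection string, table and formula are hypothetical placeholders.
connStr <- "Driver=SQL Server;Server=.;Database=TelcoChurnDB;Trusted_Connection=yes"

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # The data stays in the database; RxSqlServerData is only a reference to it.
  churnData <- RxSqlServerData(table = "dbo.ChurnTraining",
                               connectionString = connStr)

  # Switch the compute context so training happens inside SQL Server.
  rxSetComputeContext(RxInSqlServer(connectionString = connStr))

  # Train a RevoScaleR decision forest in-database.
  forest <- rxDForest(churn ~ callfailurerate + totalminutes + numcomplaints,
                      data = churnData, nTree = 100)

  rxSetComputeContext("local")  # back to local execution when done
}
```

Only the compute context changes between local and in-database execution; the rx* modeling calls themselves stay the same.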

In order to explore and visualize the data, we could leverage a range of CRAN R packages for manipulation and visualization, such as dplyr, ggplot2 and d3heatmap. To avoid the memory limitations of CRAN R and avoid moving the raw data out of the database, we instead used RevoScaleR functions for data aggregation. The rxCube and rxCrossTabs functions, used under the in-SQL compute context, are all that is needed for the visualization work. Below is a sample interactive visualization dashboard created by adding the power of leaflet and shiny.


[Figure: interactive visualization dashboard for telco customer churn]
These graphs not only helped us gain a visual understanding of the raw dataset but also provided insights for choosing the right methodologies for the subsequent analysis. For instance, the rarity of churn events (a proportion of 9% among all observations), indicated by the pie chart, led us to pursue data rebalancing methods like SMOTE before modeling. The impact of education, call failure rate, state and number of complaints on customer churn, shown by the bar chart and the map, implied that these are likely strong influencers.

During the data preparation and feature engineering step, we split the data into training and test datasets in a ratio of 7:3 and, more importantly, applied SMOTE to the training dataset using the function ubSMOTE from the R package "unbalanced". This yielded a churn proportion of 23% in the training data. After that, we tried different sets of tree-based algorithms in order to build a good classification model, including the CRAN R algorithms randomForest and xgboost, and the RevoScaleR algorithms rxDForest and rxBTrees. In terms of model performance, we found that xgboost outperformed all the others, but the RevoScaleR algorithms take advantage of in-database analytics. The confusion table and model performance metrics for the xgboost model are shown below.

                     Predicted non-churn    Predicted churn
Actual non-churn     5138                   405
Actual churn         300                    279

Confusion Table for XGBoost. Note: Accuracy=0.885; Precision=0.407; Recall=0.482; F-Score=0.441; AUC=0.85
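As a quick arithmetic check (ours, not part of the original analysis), the reported metrics follow directly from the four counts in the confusion table above:

```r
# Recompute the xgboost metrics from the confusion-table counts
tn <- 5138  # actual non-churn, predicted non-churn
fp <- 405   # actual non-churn, predicted churn
fn <- 300   # actual churn, predicted non-churn
tp <- 279   # actual churn, predicted churn

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_score   <- 2 * precision * recall / (precision + recall)

round(c(accuracy, precision, recall, f_score), 3)
```

Any difference in the last digit versus the reported figures comes down to rounding.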

The comparison among the different tree-based algorithms is summarized in the following table.

Package   Algorithm      Accuracy   Precision   Recall   F-score   AUC    Computation Location
CRAN R    randomForest   0.907      0.552       0.117    0.194     0.76   Memory
CRAN R    xgboost        0.885      0.407       0.482    0.441     0.85   Memory
ScaleR    rxDForest      0.903      0.475       0.145

