Quantcast
Channel: CodeSection,代码区,SQL Server(mssql)数据库 技术分享 - CodeSec
Viewing all articles
Browse latest Browse all 3160

Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple d ...

$
0
0

Long I wanted to test a simple data manipulation tasks and compare the execution time, ease of writing the code and simplicity between T-SQL and R package for data manipulation.

Couple of packages I will mention for data manipulations are plyr, dplyr and data.table and compare the execution time, simplicity and ease of writing with general T-SQL code and RevoScaleR package. For this blog post I will use R package dplyr and T-SQL with possibilites of RevoScaleR computation functions.

My initial query will be. Available in WideWorldImportersDW database. No other alterations have been done to underlying tables (fact.sale or dimension.city).

USE WideWorldImportersDW; GO SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN [Dimension].[City] AS c ON c.[City Key] = fs.[City Key];

For the metrics of comparison I will be using simple time statistics (metrics:elapsed time) and every code will be run from SSMS.

-- Example SET STATISTICS TIME ON SELECT GETDATE() SET STATISTICS TIME OFF SELECT

First simple test is just printing the data.

PRINT ' ' PRINT 'STATISTICS WITH T-SQL' PRINT ' ' SET STATISTICS TIME ON SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key] SET STATISTICS TIME OFF PRINT ' ' PRINT 'STATISTICS WITH R' PRINT ' ' SET STATISTICS TIME ON DECLARE @TSQL AS NVARCHAR(MAX) SET @TSQL = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLARE @RScript AS NVARCHAR(MAX) SET @RScript = N'OutputDataSet <- InputDataSet' EXEC sys.sp_execute_external_script @language = N'R' ,@script = @RScript ,@input_data_1 = @TSQL WITH RESULT SETS (( [City Key] INT ,[City] NVARCHAR(200) ,[Sales Territory] NVARCHAR(200) ,[Customer Key] INT ,[Invoice Date Key] DATE ,[Quantity] INT ,[Unit Price] DECIMAL(18,3) ,[Tax Amount] DECIMAL(18,3) ,[Total Including Tax] DECIMAL(18,2) ,[Profit] DECIMAL(18,2) )); SET STATISTICS TIME OFF

This test is very much simple. Outputting the data. In case to T-SQL it is just a simple SELECT statement and in case of R-TSQL is just passing data to R Engine and returning in a data-frame. The test itself is trivial and “absurd” but very important to get the base-line and time related results.


Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple d ...

Running the query couple of times on my client, I have some base comparison. T-SQL outperforms R-TSQL and it is almost two-times faster. I would expect higher difference in terms of time, because it involves a lot of data migration and no computations.

Filtering / WHERE

Now let’s perform a simple WHERE clause or some simple filtering.

PRINT ' ' PRINT 'STATISTICS WITH T-SQL' PRINT ' ' -- SIMPLE T-SQL SET STATISTICS TIME ON SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key] WHERE [Sales Territory] <> 'Southeast' AND fs.[Customer Key] <> 0 AND Profit <= 1000 SET STATISTICS TIME OFF PRINT ' ' PRINT 'STATISTICS WITH R' PRINT ' ' -- R Package dplyr and T-SQL SET STATISTICS TIME ON DECLARE @TSQL AS NVARCHAR(MAX) SET @TSQL = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] AS SalesTerritory ,fs.[Customer Key] AS CustomerKey ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLARE @RScript AS NVARCHAR(MAX) SET @RScript = N' library(dplyr) OutputDataSet <- InputDataSet %>% filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)' EXEC sys.sp_execute_external_script @language = N'R' ,@script = @RScript ,@input_data_1 = @TSQL WITH RESULT SETS (( [City Key] INT ,[City] NVARCHAR(200) ,[Sales Territory] NVARCHAR(200) ,[Customer Key] INT ,[Invoice Date Key] DATETIME ,[Quantity] INT ,[Unit Price] DECIMAL(18,3) ,[Tax Amount] DECIMAL(18,3) ,[Total Including Tax] DECIMAL(18,2) ,[Profit] DECIMAL(18,2) )); SET STATISTICS TIME OFF
Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple d ...

Again results of both executions are the same but difference is getting bigger and bigger. Main reason given is that all the data must be transferred to R engine, before being manipulated and results returned.

Now I can perform a simple comparison with dplyr package and RevoscaleR package. In this manner we will have same ground base for comparison.

PRINT ' ' PRINT 'STATISTICS WITH R dpylr' PRINT ' ' SET STATISTICS TIME ON DECLARE @TSQL AS NVARCHAR(MAX) SET @TSQL = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] AS SalesTerritory ,fs.[Customer Key] AS CustomerKey ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLARE @RScript AS NVARCHAR(MAX) SET @RScript = N' library(dplyr) OutputDataSet <- InputDataSet %>% filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)' EXEC sys.sp_execute_external_script @language = N'R' ,@script = @RScript ,@input_data_1 = @TSQL WITH RESULT SETS (( [City Key] INT ,[City] NVARCHAR(200) ,[Sales Territory] NVARCHAR(200) ,[Customer Key] INT ,[Invoice Date Key] DATETIME ,[Quantity] INT ,[Unit Price] DECIMAL(18,3) ,[Tax Amount] DECIMAL(18,3) ,[Total Including Tax] DECIMAL(18,2) ,[Profit] DECIMAL(18,2) )); SET STATISTICS TIME OFF PRINT ' ' PRINT 'STATISTICS WITH R RevoScaleR' PRINT ' ' SET STATISTICS TIME ON DECLARE @TSQL1 AS NVARCHAR(MAX) SET @TSQL1 = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] AS SalesTerritory ,fs.[Customer Key] AS CustomerKey ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLA

Viewing all articles
Browse latest Browse all 3160

Trending Articles