For a long time I have wanted to test some simple data manipulation tasks and compare the execution time, ease of writing the code, and simplicity between T-SQL and R packages for data manipulation.
A couple of packages worth mentioning for data manipulation are plyr, dplyr and data.table; they can be compared on execution time, simplicity and ease of writing against general T-SQL code and the RevoScaleR package. For this blog post I will use the R package dplyr and T-SQL, along with the computation functions available in RevoScaleR.
My initial query, shown below, runs against the WideWorldImportersDW database. No alterations have been made to the underlying tables (Fact.Sale or Dimension.City).
    USE WideWorldImportersDW;
    GO

    SELECT
         fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
    JOIN [Dimension].[City] AS c
        ON c.[City Key] = fs.[City Key];

As the metric for comparison I will use simple time statistics (elapsed time), and all of the code will be run from SSMS.
    -- Example
    SET STATISTICS TIME ON
    SELECT GETDATE()
    SET STATISTICS TIME OFF

SELECT
The first simple test is just printing the data.
    PRINT ' '
    PRINT 'STATISTICS WITH T-SQL'
    PRINT ' '

    SET STATISTICS TIME ON
    SELECT
         fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    SET STATISTICS TIME OFF

    PRINT ' '
    PRINT 'STATISTICS WITH R'
    PRINT ' '

    SET STATISTICS TIME ON
    DECLARE @TSQL AS NVARCHAR(MAX)
    SET @TSQL = N'SELECT
         fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

    DECLARE @RScript AS NVARCHAR(MAX)
    SET @RScript = N'OutputDataSet <- InputDataSet'

    EXEC sys.sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @TSQL
    WITH RESULT SETS ((
         [City Key] INT
        ,[City] NVARCHAR(200)
        ,[Sales Territory] NVARCHAR(200)
        ,[Customer Key] INT
        ,[Invoice Date Key] DATE
        ,[Quantity] INT
        ,[Unit Price] DECIMAL(18,3)
        ,[Tax Amount] DECIMAL(18,3)
        ,[Total Including Tax] DECIMAL(18,2)
        ,[Profit] DECIMAL(18,2)
    ));
    SET STATISTICS TIME OFF

This test is very simple: it just outputs the data. In the case of T-SQL it is just a simple SELECT statement, and in the case of R-TSQL it is just passing the data to the R engine and returning it as a data frame. The test itself is trivial and “absurd”, but very important for getting the baseline and time-related results.

Running the query a couple of times on my client gives me a baseline comparison. T-SQL outperforms R-TSQL and is almost twice as fast. I would have expected a bigger difference in time, because the R path involves a lot of data movement and no computation.

Filtering / WHERE
Now let’s perform a simple WHERE clause, that is, some simple filtering.
    PRINT ' '
    PRINT 'STATISTICS WITH T-SQL'
    PRINT ' '

    -- SIMPLE T-SQL
    SET STATISTICS TIME ON
    SELECT
         fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
        AND fs.[Customer Key] <> 0
        AND Profit <= 1000
    SET STATISTICS TIME OFF

    PRINT ' '
    PRINT 'STATISTICS WITH R'
    PRINT ' '

    -- R Package dplyr and T-SQL
    SET STATISTICS TIME ON
    DECLARE @TSQL AS NVARCHAR(MAX)
    SET @TSQL = N'SELECT
         fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

    DECLARE @RScript AS NVARCHAR(MAX)
    SET @RScript = N'
        library(dplyr)
        OutputDataSet <- InputDataSet %>%
            filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

    EXEC sys.sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @TSQL
    WITH RESULT SETS ((
         [City Key] INT
        ,[City] NVARCHAR(200)
        ,[Sales Territory] NVARCHAR(200)
        ,[Customer Key] INT
        ,[Invoice Date Key] DATETIME
        ,[Quantity] INT
        ,[Unit Price] DECIMAL(18,3)
        ,[Tax Amount] DECIMAL(18,3)
        ,[Total Including Tax] DECIMAL(18,2)
        ,[Profit] DECIMAL(18,2)
    ));
    SET STATISTICS TIME OFF
Again, the results of both executions are the same, but the difference in elapsed time is getting bigger. The main reason is that all the data must be transferred to the R engine before it can be manipulated and the results returned.
Now I can perform a simple comparison between the dplyr package and the RevoScaleR package. This way we have the same baseline for comparison.
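To see how much of that gap comes from data transfer rather than from dplyr itself, one option is to push the predicate into the query passed as @input_data_1, so that only the matching rows ever reach the R engine. The following is a minimal sketch under that assumption and is not one of the measured tests in this post; it reuses the same tables, columns and result set as the baseline test, and the R script becomes a plain pass-through.

```sql
SET STATISTICS TIME ON

-- Assumption: the filter is applied by SQL Server inside the input query,
-- so the R engine only receives rows that already satisfy the predicate.
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT
     fs.[City Key]
    ,c.City
    ,c.[Sales Territory]
    ,fs.[Customer Key]
    ,fs.[Invoice Date Key]
    ,fs.[Quantity]
    ,fs.[Unit Price]
    ,fs.[Tax Amount]
    ,fs.[Total Including Tax]
    ,fs.Profit
FROM [Fact].[Sale] AS fs
JOIN [Dimension].[City] AS c
    ON c.[City Key] = fs.[City Key]
WHERE
    c.[Sales Territory] <> ''Southeast''
    AND fs.[Customer Key] <> 0
    AND fs.Profit <= 1000'

-- The R script no longer filters; it only returns what it was given.
DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet'

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @TSQL
WITH RESULT SETS ((
     [City Key] INT
    ,[City] NVARCHAR(200)
    ,[Sales Territory] NVARCHAR(200)
    ,[Customer Key] INT
    ,[Invoice Date Key] DATE
    ,[Quantity] INT
    ,[Unit Price] DECIMAL(18,3)
    ,[Tax Amount] DECIMAL(18,3)
    ,[Total Including Tax] DECIMAL(18,2)
    ,[Profit] DECIMAL(18,2)
));

SET STATISTICS TIME OFF
```

Note that this variant measures something different from the dplyr test: the filtering work moves back to T-SQL, so it is only useful for separating the transfer cost from the cost of the R-side filter.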
    PRINT ' '
    PRINT 'STATISTICS WITH R dplyr'
    PRINT ' '

    SET STATISTICS TIME ON
    DECLARE @TSQL AS NVARCHAR(MAX)
    SET @TSQL = N'SELECT
         fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

    DECLARE @RScript AS NVARCHAR(MAX)
    SET @RScript = N'
        library(dplyr)
        OutputDataSet <- InputDataSet %>%
            filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

    EXEC sys.sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @TSQL
    WITH RESULT SETS ((
         [City Key] INT
        ,[City] NVARCHAR(200)
        ,[Sales Territory] NVARCHAR(200)
        ,[Customer Key] INT
        ,[Invoice Date Key] DATETIME
        ,[Quantity] INT
        ,[Unit Price] DECIMAL(18,3)
        ,[Tax Amount] DECIMAL(18,3)
        ,[Total Including Tax] DECIMAL(18,2)
        ,[Profit] DECIMAL(18,2)
    ));
    SET STATISTICS TIME OFF

    PRINT ' '
    PRINT 'STATISTICS WITH R RevoScaleR'
    PRINT ' '

    SET STATISTICS TIME ON
    DECLARE @TSQL1 AS NVARCHAR(MAX)
    SET @TSQL1 = N'SELECT
         fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

    DECLA