In data science, understanding and preparing data is critical. In this article of the series, we’ll discuss understanding and preparing data by using SQL transpose and pivot techniques.
Transposing a matrix means swapping its rows and columns. Data frames in R and Python are essentially matrices, so transposing them is not a problem. A SQL Server table is a slightly different structure, where rows and columns are not equivalent and interchangeable: a row represents an entity, and a column an attribute of an entity. However, you might receive data in SQL Server in the form of a matrix from other systems and need to transpose it. Transposing data does not involve aggregations.
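To make the idea of a pure transposition concrete, here is a minimal sketch in plain Python. The matrix values and the helper name transpose() are made up for this illustration; nothing is aggregated, each value only changes its position.

```python
# Illustrative 2 x 3 matrix; the values are made up for this sketch.
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]


def transpose(rows):
    """Return a new matrix whose rows are the columns of `rows`."""
    # zip(*rows) pairs up the i-th element of every row, i.e. the i-th column.
    return [list(column) for column in zip(*rows)]


transposed = transpose(matrix)
print(transposed)  # [[1, 4], [2, 5], [3, 6]]
```

Transposing twice returns the original matrix, which is a quick way to check that no information is lost.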
Pivoting is a similar operation. You need three columns for the operation. The first column gives the row groups. The second column is transposed; its values are used for the new columns’ names. The third column gives the values, which are aggregated over rows and columns.
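The three roles described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the rows and the function name pivot_sum() are invented for the example, and the sum is just one possible aggregate.

```python
from collections import defaultdict

# Made-up rows: (row group, pivoted column, value).
rows = [
    ("Australia", 2012, 100.0),
    ("Australia", 2012, 50.0),
    ("Australia", 2013, 200.0),
    ("Canada", 2012, 30.0),
]


def pivot_sum(records):
    """Sum the values for every (row group, pivoted column) pair."""
    result = defaultdict(dict)
    for group, column, value in records:
        # The pivoted column's value becomes the name of a new column.
        result[group][column] = result[group].get(column, 0.0) + value
    return dict(result)


pivoted = pivot_sum(rows)
for country, by_year in pivoted.items():
    print(country, by_year)
```

The two Australia rows for 2012 collapse into a single cell, which is exactly where pivoting differs from a plain transposition.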
T-SQL Pivot operator

As you are probably already used to in my data science articles, I will start with preparing some data. I am using the data from the AdventureWorksDW2017 database. Note that I switched from the SQL Server 2016 version of this database I used in the previous articles to the SQL Server 2017 version. Don’t worry, the structure of the database and the data is nearly the same.

USE AdventureWorksDW2017;
-- Data preparation
SELECT g.EnglishCountryRegionName AS Country,
 g.StateProvinceName AS State,
 g.EnglishCountryRegionName + ' ' + g.StateProvinceName AS CountryState,
 d.CalendarYear AS CYear,
 SUM(s.SalesAmount) AS Sales
INTO dbo.SalesGeoYear
FROM dbo.FactInternetSales s
 INNER JOIN dbo.DimDate d
  ON d.DateKey = s.OrderDateKey
 INNER JOIN dbo.DimCustomer c
  ON c.CustomerKey = s.CustomerKey
 INNER JOIN dbo.DimGeography g
  ON g.GeographyKey = c.GeographyKey
WHERE g.EnglishCountryRegionName IN (N'Australia', N'Canada')
GROUP BY g.EnglishCountryRegionName,
 g.StateProvinceName,
 d.CalendarYear;
GO

You can make a quick overview of the data with the following two queries. Note the distinct years returned by the second query.

SELECT TOP 5 *
FROM dbo.SalesGeoYear;
SELECT DISTINCT CYear
FROM dbo.SalesGeoYear;

The distinct years with sales are from 2010 to 2014. Besides years, I aggregated sales over countries and states as well. I also added a combined column, CountryState, to the dbo.SalesGeoYear table; I will use it later in this article.
Let me start with the T-SQL PIVOT operator. The following query calculates the sum of the sales over countries and years. Please note the syntax for the PIVOT operator. The Sales column is used for the aggregation, and the CYear column for the labels of the new pivoted columns. Grouping is implicit; all other columns, those not used for pivoting or aggregation, are used in an implicit GROUP BY.

WITH PCTE AS
(
SELECT Country, CYear, Sales
FROM dbo.SalesGeoYear
)
SELECT Country, [2010], [2011], [2012], [2013], [2014]
FROM PCTE
 PIVOT (SUM(Sales) FOR CYear
  IN ([2010], [2011], [2012], [2013], [2014])) AS P;

Here is the result of the query.

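Of course, you can change the aggregate function. For example, the following query calculates the count of the sales over countries and years. Because dbo.SalesGeoYear is already aggregated to one row per country, state, and year, the count is the number of those rows, not the number of individual orders.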

WITH PCTE AS
(
SELECT Country, CYear, Sales
FROM dbo.SalesGeoYear
)
SELECT Country, [2010], [2011], [2012], [2013], [2014]
FROM PCTE
 PIVOT (COUNT(Sales) FOR CYear
  IN ([2010], [2011], [2012], [2013], [2014])) AS P;

You probably noticed that I used a common table expression to prepare the rowset for pivoting; I am not using the table directly. This is due to the implicit grouping. Whoever defined the syntax for the PIVOT operator wanted to make the code shorter; however, because of that, you need to write more code to be on the safe side. Columns that are not used for pivoting and aggregating are used for grouping. What happens if you read one more column, like in the following query?

WITH PCTE AS
(
SELECT Country, State, CYear, Sales
FROM dbo.SalesGeoYear
)
SELECT Country, [2010], [2011], [2012], [2013], [2014]
FROM PCTE
 PIVOT (SUM(Sales) FOR CYear
  IN ([2010], [2011], [2012], [2013], [2014])) AS P;

I also read the State column in the CTE. I am not using it in the outer query. However, the result is quite different from the previous one.

The query did implicit grouping over two columns, Country and State.
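The effect of the implicit grouping can be imitated outside SQL Server. The following plain Python sketch is only an approximation of the behavior described above, not the PIVOT implementation: the function pivot_implicit() and the sample figures are hypothetical. Every input column that is neither pivoted nor aggregated becomes part of the grouping key.

```python
from collections import defaultdict

# Made-up rows shaped like dbo.SalesGeoYear; the figures are illustrative only.
rows = [
    {"Country": "Australia", "State": "Victoria", "CYear": 2013, "Sales": 10.0},
    {"Country": "Australia", "State": "Queensland", "CYear": 2013, "Sales": 20.0},
    {"Country": "Canada", "State": "Alberta", "CYear": 2013, "Sales": 5.0},
]


def pivot_implicit(records, pivot_col, value_col):
    """Sum value_col per pivot_col value, grouping by all remaining columns."""
    result = defaultdict(lambda: defaultdict(float))
    for rec in records:
        # Every column that is neither pivoted nor aggregated joins the group key.
        key = tuple(
            (name, value)
            for name, value in rec.items()
            if name not in (pivot_col, value_col)
        )
        result[key][rec[pivot_col]] += rec[value_col]
    return {key: dict(cells) for key, cells in result.items()}


# Input with Country only: one output row per country.
narrow = [{k: r[k] for k in ("Country", "CYear", "Sales")} for r in rows]
by_country = pivot_implicit(narrow, "CYear", "Sales")

# Input that also carries State: one output row per country and state.
by_country_state = pivot_implicit(rows, "CYear", "Sales")

print(len(by_country), len(by_country_state))  # 2 3
```

Restricting the input to exactly the columns you need, as the CTE does in the queries above, is what keeps the grouping under control.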
The PIVOT operator is not really intended for transposing a table. You always need to have an aggregate function. However, you can simulate transposing with the MIN() or MAX() aggregate functions when each combination of row group and pivoted column holds a single value. For example, the following query, which has no aggregate function, does not work:

WITH PCTE AS
(
SELECT CountryState, CYear, Sales
FROM dbo.SalesGeoYear
)
SELECT CountryState, [2010], [2011], [2012], [2013], [2014]
FROM PCTE
 PIVOT (Sales FOR CYear
  IN ([2010], [2011], [2012], [2013], [2014])) AS P;

But, as mentioned, it is easy to change it to a query that does work and just transposes the data, without real aggregation. Or, to be precise, the aggregation exists, but the MAX() function is applied to a single value, so it returns the value itself.

WITH PCTE AS
(
SELECT CountryState, CYear, Sales
FROM dbo.SalesGeoYear
)
SELECT CountryState, [2010], [2011], [2012], [2013], [2014]
FROM PCTE
 PIVOT (MAX(Sales) FOR CYear
  IN ([2010], [2011], [2012], [2013], [2014])) AS P;

The PIVOT operator is a proprietary T-SQL operator. It is not part of the ANSI SQL standard. You can write pivoting queries with ANSI standard SQL as well, using the CASE expression, as the following query shows.

SELECT Country,
 SUM(CASE WHEN CYear = 2010 THEN Sales END) AS [2010],
 SUM(CASE WHEN CYear = 2011 THEN Sales END) AS [2011],
 SUM(CASE WHEN CYear = 2012 THEN Sales END) AS [2012],
 SUM(CASE WHEN CYear = 2013 THEN Sales END) AS [2013],
 SUM(CASE WHEN CYear = 2014 THEN Sales END) AS [2014]
FROM dbo.SalesGeoYear
GROUP BY Country;

Besides implicit grouping, there is another problem with the PIVOT operator. You can’t get the list of the distinct values of the pivoted column dynamically, with a subquery. You need to use dynamic SQL for this task. I am showing how to create a pivoting query dynamically in the following code. Note that I create the concatenated list of pivoted column names with the STRING_AGG() function, which is new in SQL Server 2017.

DECLARE @stmtvar AS NVARCHAR(4000);
SET @stmtvar = N'
WITH PCTE AS
(
SELECT Country, CYear, Sales
FROM dbo.SalesGeoYear
)
SELECT *
FROM PCTE
 PIVOT (SUM(Sales) FOR CYear
  IN (' +
 -- Build the bracketed, comma-separated list of distinct years
 (SELECT STRING_AGG(CYear, N', ')
   WITHIN GROUP (ORDER BY CYear)
  FROM
   (
    SELECT DISTINCT QUOTENAME(CYear) AS CYear
    FROM dbo.SalesGeoYear
   ) AS Y)
 + N')) AS P;';
EXEC sys.sp_executesql @stmt = @stmtvar;

This will be enough T-SQL for this article; I am switching to R now.
Transposing and Pivoting in R

As always, I am starting with reading the data from SQL Server into an R data frame. I am also using the View() function to show it immediately.

library(RODBC)
con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")
SGY <- as.data.frame(sqlQuery(con,
 "SELECT Country, State, CountryState, CYear, Sales
  FROM dbo.SalesGeoYear;"),
 stringsAsFactors = TRUE)
close(con)
View(SGY)

The simplest way to transpose the data is with the t() function from the base package:

t(SGY)

The transposed matrix is not very readable. Here is the partial result.