Data science in SQL Server: pivoting and transposing data

In data science, understanding and preparing data is critical. In this article of the series, we'll discuss how to prepare data by using SQL transpose and pivot techniques.

Transposing a matrix means swapping rows and columns. Data frames in R and Python behave much like matrices, so transposing them is not a problem. A SQL Server table is a slightly different structure, where rows and columns are not equivalent and interchangeable. A row represents an entity, and a column an attribute of an entity. However, you might get data into SQL Server in the form of a matrix from other systems, and then need to transpose it. Transposing data does not involve aggregations.
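To make the idea concrete, here is a minimal Python sketch of transposing a small matrix. The `transpose` helper and the numbers are hypothetical and only serve as an illustration; they are not part of the AdventureWorks data.

```python
# Toy matrix: each inner list is one row (made-up values for illustration).
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]


def transpose(rows):
    """Swap rows and columns: the element at [i][j] moves to [j][i]."""
    return [list(column) for column in zip(*rows)]


transposed = transpose(matrix)
print(transposed)  # [[1, 4], [2, 5], [3, 6]]
```

Note that no value is combined with any other value: every cell only changes its position, which is why transposing needs no aggregation.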

Pivoting is a similar operation. You need three columns for the operation. The first column gives the row groups. The second column is transposed; its values are used as the names of the new columns. The third column gives the values, which are aggregated for each combination of row group and new column.
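The three roles can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `pivot_sum` helper and the sample rows are hypothetical, and SUM is picked as the aggregate.

```python
from collections import defaultdict

# Hypothetical input rows: (row group, pivoted column, value).
rows = [
    ("Australia", 2010, 100.0),
    ("Australia", 2010, 50.0),
    ("Australia", 2011, 200.0),
    ("Canada", 2010, 30.0),
]


def pivot_sum(rows):
    """Sum the value for every (row group, new column label) pair."""
    result = defaultdict(dict)
    for group, label, value in rows:
        result[group][label] = result[group].get(label, 0.0) + value
    return {group: dict(cells) for group, cells in result.items()}


pivoted = pivot_sum(rows)
print(pivoted)
# {'Australia': {2010: 150.0, 2011: 200.0}, 'Canada': {2010: 30.0}}
```

The two Australia rows for 2010 collapse into a single cell, which is the aggregation step that distinguishes pivoting from plain transposing.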

T-SQL Pivot operator

As you are probably already used to in my data science articles, I will start by preparing some data. I am using the data from the AdventureWorksDW2017 database. Note that I switched from the SQL Server 2016 version of this database I used in the previous articles to the SQL Server 2017 version. Don't worry, the structure of the database and the data are nearly the same.

    USE AdventureWorksDW2017;
    -- Data preparation
    SELECT g.EnglishCountryRegionName AS Country,
     g.StateProvinceName AS State,
     g.EnglishCountryRegionName + ' ' + g.StateProvinceName AS CountryState,
     d.CalendarYear AS CYear,
     SUM(s.SalesAmount) AS Sales
    INTO dbo.SalesGeoYear
    FROM dbo.FactInternetSales s
     INNER JOIN dbo.DimDate d
      ON d.DateKey = s.OrderDateKey
     INNER JOIN dbo.DimCustomer c
      ON c.CustomerKey = s.CustomerKey
     INNER JOIN dbo.DimGeography g
      ON g.GeographyKey = c.GeographyKey
    WHERE g.EnglishCountryRegionName IN (N'Australia', N'Canada')
    GROUP BY g.EnglishCountryRegionName,
     g.StateProvinceName,
     d.CalendarYear;
    GO

You can get a quick overview of the data with the following two queries. Note the distinct years returned by the second query.

    SELECT TOP 5 *
    FROM dbo.SalesGeoYear;
    SELECT DISTINCT CYear
    FROM dbo.SalesGeoYear;

The distinct years with sales are from 2010 to 2014. Besides years, I aggregated sales over countries and states as well. I also added a combined column, CountryState, to the table dbo.SalesGeoYear; I will use it later in this article.

Let me start with the T-SQL PIVOT operator. The following query calculates the sum of the sales over countries and years. Please note the syntax for the PIVOT operator. The Sales column is used for the aggregation, and the CYear column for the labels of the new pivoted columns. Grouping is implicit; all other columns, those not used for pivoting or aggregation, are used in an implicit GROUP BY.

    WITH PCTE AS
    (
    SELECT Country, CYear, Sales
    FROM dbo.SalesGeoYear
    )
    SELECT Country, [2010], [2011], [2012], [2013], [2014]
    FROM PCTE
     PIVOT (SUM(Sales) FOR CYear
            IN ([2010], [2011], [2012], [2013], [2014])) AS P;

Here is the result of the query.

Of course, you can change the aggregate function. For example, the following query calculates the count of the sales over countries and years.

    WITH PCTE AS
    (
    SELECT Country, CYear, Sales
    FROM dbo.SalesGeoYear
    )
    SELECT Country, [2010], [2011], [2012], [2013], [2014]
    FROM PCTE
     PIVOT (COUNT(Sales) FOR CYear
            IN ([2010], [2011], [2012], [2013], [2014])) AS P;

You probably noticed that I used a common table expression to prepare the rowset for pivoting; I am not using the table directly. This is due to the implicit grouping. Whoever defined the syntax for the PIVOT operator wanted to make the code shorter; however, because of that, you need to write more code to be on the safe side. Columns that are not used for pivoting and aggregating are used for grouping. What happens if you read one more column, like in the following query?

    WITH PCTE AS
    (
    SELECT Country, State, CYear, Sales
    FROM dbo.SalesGeoYear
    )
    SELECT Country, [2010], [2011], [2012], [2013], [2014]
    FROM PCTE
     PIVOT (SUM(Sales) FOR CYear
            IN ([2010], [2011], [2012], [2013], [2014])) AS P;

I also read the column State in the CTE. I am not using it in the outer query. However, the result is quite different from the previous one.

The query did implicit grouping over two columns, Country and State.
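The effect of the implicit grouping can be imitated with a short Python sketch. The `pivot_implicit` helper and the two sample rows are hypothetical; the helper only mimics the rule that every column not used for pivoting or aggregating becomes part of the grouping key, and it is not how SQL Server implements the operator.

```python
def pivot_implicit(rows, pivot_col, value_col):
    """Group by every key except pivot_col and value_col, summing value_col."""
    out = {}
    for row in rows:
        # All remaining columns form the (implicit) grouping key.
        key = tuple((k, v) for k, v in row.items()
                    if k not in (pivot_col, value_col))
        cells = out.setdefault(key, {})
        cells[row[pivot_col]] = cells.get(row[pivot_col], 0) + row[value_col]
    return out


# Made-up rows for illustration only.
data = [
    {"Country": "Canada", "State": "Alberta", "CYear": 2013, "Sales": 10},
    {"Country": "Canada", "State": "Ontario", "CYear": 2013, "Sales": 20},
]

with_state = pivot_implicit(data, "CYear", "Sales")
without_state = pivot_implicit(
    [{k: v for k, v in row.items() if k != "State"} for row in data],
    "CYear", "Sales",
)
print(len(with_state), len(without_state))  # 2 1
```

With State present in the input, Canada is split into one row per state, even though State is never shown in the output; this is why restricting the input columns in a CTE is the safe habit.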

The PIVOT operator is not really intended for transposing the table. You always need to have an aggregate function. However, you can simulate transposing when you have a single value over rows and pivoted columns with the MIN() or MAX() aggregate functions. For example, the following query does not work:

    WITH PCTE AS
    (
    SELECT CountryState, CYear, Sales
    FROM dbo.SalesGeoYear
    )
    SELECT CountryState, [2010], [2011], [2012], [2013], [2014]
    FROM PCTE
     PIVOT (Sales FOR CYear
            IN ([2010], [2011], [2012], [2013], [2014])) AS P;

But, as mentioned, it is easy to change it into a query that works and just transposes the data, without real aggregation. To be precise, the aggregation still exists: the MAX() function is applied to a single value and returns the value itself.

    WITH PCTE AS
    (
    SELECT CountryState, CYear, Sales
    FROM dbo.SalesGeoYear
    )
    SELECT CountryState, [2010], [2011], [2012], [2013], [2014]
    FROM PCTE
     PIVOT (MAX(Sales) FOR CYear
            IN ([2010], [2011], [2012], [2013], [2014])) AS P;

The PIVOT operator is a proprietary T-SQL operator. It is not part of the ANSI SQL standard. You can write pivoting queries with ANSI standard SQL as well, using the CASE expression, as the following query shows.

    SELECT Country,
     SUM(CASE WHEN CYear = 2010 THEN Sales END) AS [2010],
     SUM(CASE WHEN CYear = 2011 THEN Sales END) AS [2011],
     SUM(CASE WHEN CYear = 2012 THEN Sales END) AS [2012],
     SUM(CASE WHEN CYear = 2013 THEN Sales END) AS [2013],
     SUM(CASE WHEN CYear = 2014 THEN Sales END) AS [2014]
    FROM dbo.SalesGeoYear
    GROUP BY Country;
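Because the CASE-based form uses only standard SQL, the same pattern can be tried outside SQL Server. The following sketch runs it on an in-memory SQLite database through Python's built-in sqlite3 module. The tiny table and its values are made up for the illustration, and the column aliases use standard double quotes instead of the T-SQL square brackets.

```python
import sqlite3

# In-memory database with a made-up, tiny version of the table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesGeoYear (Country TEXT, CYear INTEGER, Sales REAL)"
)
conn.executemany(
    "INSERT INTO SalesGeoYear VALUES (?, ?, ?)",
    [
        ("Australia", 2010, 100.0),
        ("Australia", 2010, 50.0),
        ("Australia", 2011, 200.0),
        ("Canada", 2011, 30.0),
    ],
)

# One SUM(CASE ...) expression per pivoted column, explicit GROUP BY.
query = """
SELECT Country,
       SUM(CASE WHEN CYear = 2010 THEN Sales END) AS "2010",
       SUM(CASE WHEN CYear = 2011 THEN Sales END) AS "2011"
FROM SalesGeoYear
GROUP BY Country
ORDER BY Country;
"""
case_pivot = conn.execute(query).fetchall()
conn.close()

print(case_pivot)
# [('Australia', 150.0, 200.0), ('Canada', None, 30.0)]
```

A cell with no matching rows comes back as NULL (None in Python), because the CASE expression without ELSE yields NULL and SUM over only NULLs is NULL. The grouping here is explicit, so an extra column in the table cannot silently change the result.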

Besides implicit grouping, there is another problem with the PIVOT operator. You can’t get the list of the distinct values of the pivoted column dynamically, with a subquery. You need to use dynamic SQL for this task. I am showing how to create a pivoting query dynamically in the following code. Note that I create the concatenated list of pivoted column names with the STRING_AGG() function, which is new in SQL Server 2017.

    DECLARE @stmtvar AS NVARCHAR(4000);
    SET @stmtvar = N'
    WITH PCTE AS
    (
    SELECT Country, CYear, Sales
    FROM dbo.SalesGeoYear
    )
    SELECT *
    FROM PCTE
     PIVOT (SUM(Sales) FOR CYear
            IN (' +
     -- Build the bracketed, comma-separated list of distinct years
     (SELECT STRING_AGG(CYear, N', ')
       WITHIN GROUP (ORDER BY CYear)
      FROM
      (
       SELECT DISTINCT QUOTENAME(CYear) AS CYear
       FROM dbo.SalesGeoYear
      ) AS Y)
     + N')) AS P;';
    EXEC sys.sp_executesql @stmt = @stmtvar;
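The same two-step idea, first reading the distinct values and then generating the statement text, can be sketched with Python and SQLite. This is an illustration under assumptions, not a replacement for the T-SQL code: the sample rows are made up, SQLite has no PIVOT operator, so the generated statement uses CASE expressions, and the years are forced to integers before they are placed into the statement text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesGeoYear (Country TEXT, CYear INTEGER, Sales REAL)"
)
conn.executemany(
    "INSERT INTO SalesGeoYear VALUES (?, ?, ?)",
    [
        ("Australia", 2010, 100.0),
        ("Australia", 2012, 40.0),
        ("Canada", 2011, 30.0),
    ],
)

# Step 1: read the distinct values of the pivoted column.
years = [
    int(row[0])  # int() guards against anything but numbers in the SQL text
    for row in conn.execute(
        "SELECT DISTINCT CYear FROM SalesGeoYear ORDER BY CYear"
    )
]

# Step 2: generate one aggregated CASE expression per distinct value.
columns = ",\n       ".join(
    f'SUM(CASE WHEN CYear = {y} THEN Sales END) AS "{y}"' for y in years
)
stmt = (
    f"SELECT Country,\n       {columns}\n"
    "FROM SalesGeoYear\nGROUP BY Country\nORDER BY Country;"
)

# Step 3: execute the generated statement.
cursor = conn.execute(stmt)
header = [d[0] for d in cursor.description]
dynamic_pivot = cursor.fetchall()
conn.close()

print(header)         # ['Country', '2010', '2011', '2012']
print(dynamic_pivot)
```

Whenever a statement is assembled from data, the values that end up in the text need to be sanitized; the QUOTENAME() call plays that role in the T-SQL version, and the int() conversion plays it in this sketch.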

This is enough T-SQL for this article. I am switching to R now.

Transposing and Pivoting in R

As always, I am starting with reading the data from SQL Server in an R data frame. I am also using the View() function to show it immediately.

    library(RODBC)
    con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")
    SGY <- as.data.frame(sqlQuery(con,
      "SELECT Country, State, CountryState, CYear, Sales
       FROM dbo.SalesGeoYear;"),
      stringsAsFactors = TRUE)
    close(con)
    View(SGY)

The simplest way to transpose the data is with the t() function from the base package:

t(SGY)

The transposed matrix is not very readable. Here is the partial result.
