
Export from Hadoop File System to a SQL Server Database Table

Problem

I need to export data from the Hadoop Distributed File System (HDFS) to a SQL Server database table. How can I do this?

Solution

Apache Sqoop can import data from a database such as SQL Server into the Hadoop Distributed File System (HDFS), and export data from the HDFS back to a database table. In this tip we will use Sqoop's export functionality to move data stored on the HDFS to a SQL Server database table. As with previous tips in this series, the examples were built using SQL Server 2014 and a Cloudera virtual machine running on a laptop.
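A quick way to confirm that Sqoop is installed and to see which release your cluster is running is the version tool; the job output later in this tip reports 1.4.5-cdh5.2.0 on the Cloudera VM.

# Print the installed Sqoop release (the Cloudera VM used in this tip reports 1.4.5-cdh5.2.0).
sqoop version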

The contents of the file on the HDFS are shown below. There are three columns in this comma-separated value file: the first is a unique integer identifier, the second is a string description, and the third is the unit cost. The destination table will have columns to accommodate these three columns and their data types.

[hdfs@localhost:/mssqltips]$ hdfs dfs -cat mssqlTips/linuxDataFile.csv
1,Product A,1.01
2,Product B,2.02
3,Product C,3.03
4,Product D,4.04
5,Product E,5.05
6,Product F,6.06
7,Product G,7.07
8,Product H,8.08
9,Product I,9.09
10,Product J,10.10
11,Product K,11.11
12,Product L,12.12
13,Product M,13.13
14,Product N,14.14
15,Product O,15.15
16,Product P,16.16
17,Product Q,17.17
18,Product R,18.18
19,Product S,19.19
20,Product T,20.20
21,Product U,21.21
22,Product V,22.22
23,Product W,23.23
24,Product X,24.24
25,Product Y,25.25
26,Product Z,26.26
[hdfs@localhost:/mssqltips]$
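The tip assumes the file is already sitting in the mssqlTips directory on the HDFS. If you need to stage such a file yourself, a minimal sketch using the standard hdfs dfs commands, assuming linuxDataFile.csv is in the local working directory:

# Create the HDFS directory (relative to the hdfs user's home) if it does not already exist.
hdfs dfs -mkdir -p mssqlTips
# Copy the local CSV file up to the HDFS directory.
hdfs dfs -put linuxDataFile.csv mssqlTips/
# Verify the file landed as expected.
hdfs dfs -cat mssqlTips/linuxDataFile.csv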

The T-SQL below was used to create the destination table for this tip.

create table dbo.tblProductData
(
    ProductKey int not null PRIMARY KEY,
    ProductName varchar(50),
    UnitCost money
)
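The Sqoop command described below authenticates as a SQL Server login named sqoopLogin, mapped to a database user named sqoopUser that is a member of the db_datawriter role with a default schema of dbo. The tip does not show how these principals were created; a hedged T-SQL sketch of the setup, with a placeholder password:

-- Hypothetical setup for the mixed-mode login and database user used by Sqoop in this tip.
use master;
create login sqoopLogin with password = 'PlaceholderPassword1!';
go
use AdventureWorks2014;
create user sqoopUser for login sqoopLogin with default_schema = dbo;
alter role db_datawriter add member sqoopUser;
go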

The image below shows the destination table in the AdventureWorks2014 database as displayed in the Object Explorer.


The Sqoop command we will examine is listed below, and we will break down each argument in the bullet points that follow. Note that the command must be entered on one complete line, or with a backslash (the Linux command line continuation character) at the end of each line except the last.

sqoop export --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' --username 'sqoopLogin' -P --export-dir 'mssqlTips' --table 'tblProductData' -- --schema dbo

* sqoop export - The executable is named sqoop, and we are instructing it to export the data from an HDFS file to a database table.
* --connect - With the --connect argument, we pass in the JDBC connect string for SQL Server. In this case, we use the IP address, port number, and database name.
* --username - In this example, the user name is a SQL Server login, not a Windows login, because our database is set up to authenticate in mixed mode. We have a server login named sqoopLogin, and a database user named sqoopUser which is a member of the db_datawriter role and has a default schema of dbo.
* -P - This will prompt the command line user for the password. If Sqoop is executed rarely, this might be a good option. There are multiple other ways to pass the password to the command automatically, but we are keeping it simple for this tip.
* --export-dir - The export directory tells Sqoop in which directory on the HDFS the source file is stored. This argument is required by Sqoop when using the export option.
* --table - This argument provides Sqoop with the destination table in the SQL Server database. It is also required by Sqoop when using the export option.
* -- - This is not a typographical error. The lone double dash tells Sqoop that the arguments that follow are database specific.
* --schema - This argument specifies the schema in which our table resides.
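In the output that follows, the command is invoked through a small wrapper script named sqoopExportCommand.sh. The tip does not show the script's contents; a minimal sketch of what it might contain, using the backslash continuation style described above (aaa.bbb.ccc.ddd and pppp remain placeholders for the SQL Server host and port):

#!/bin/bash
# Hypothetical contents of sqoopExportCommand.sh: the export command from above,
# split across lines with backslash continuations for readability.
sqoop export \
  --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
  --username 'sqoopLogin' \
  -P \
  --export-dir 'mssqlTips' \
  --table 'tblProductData' \
  -- --schema dbo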

The output from the sqoop command is shown below.

[hdfs@localhost:/mssqltips]$ ./sqoopExportCommand.sh
16/09/19 12:03:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password:
16/09/19 12:03:25 INFO manager.SqlManager: Using default fetchSize of 1000
16/09/19 12:03:25 INFO manager.SQLServerManager: We will use schema dbo
16/09/19 12:03:25 INFO tool.CodeGenTool: Beginning code generation
16/09/19 12:03:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [dbo].[tblProductData] AS t WHERE 1=0
16/09/19 12:03:26 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/3eaf9ad50f5ebf5ef893fba0e41859b1/tblProductData.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/09/19 12:03:29 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/3eaf9ad50f5ebf5ef893fba0e41859b1/tblProductData.jar
16/09/19 12:03:29 INFO mapreduce.ExportJobBase: Beginning export of tblProductData
16/09/19 12:03:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/09/19 12:03:32 INFO input.FileInputFormat: Total input paths to process : 1
16/09/19 12:03:32 INFO input.FileInputFormat: Total input paths to process : 1
16/09/19 12:03:33 INFO mapred.JobClient: Running job: job_201609071028_0005
16/09/19 12:03:34 INFO mapred.JobClient: map 0% reduce 0%
16/09/19 12:03:48 INFO mapred.JobClient: map 25% reduce 0%
16/09/19 12:03:49 INFO mapred.JobClient: map 50% reduce 0%
16/09/19 12:04:00 INFO mapred.JobClient: map 100% reduce 0%
16/09/19 12:04:02 INFO mapred.JobClient: Job complete: job_201609071028_0005
16/09/19 12:04:02 INFO mapred.JobClient: Counters: 24
16/09/19 12:04:02 INFO mapred.JobClient: File System Counters
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of bytes read=0
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of bytes written=1185776
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of read operations=0
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of large read operations=0
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of write operations=0
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of bytes read=1778
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of bytes written=0
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of read operations=16
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of large read operations=0
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of write operations=0
16/09/19 12:04:02 INFO mapred.JobClient: Job Counters
16/09/19 12:04:02 INFO mapred.JobClient: Launched map tasks=4
16/09/19 12:04:02 INFO mapred.JobClient: Data-local map tasks=4
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=47822
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient: Map-Reduce Framework
16/09/19 12:04:02 INFO mapred.JobClient: Map input records=26
16/09/19 12:04:02 INFO mapred.JobClient: Map output records=26
16/09/19 12:04:02 INFO mapred.JobClient: Input split bytes=576
16/09/19 12:04:02 INFO mapred.JobClient: Spilled Records=0
16/09/19 12:04:02 INFO mapred.JobClient: CPU time spent (ms)=4580
16/09/19 12:04:02 INFO mapred.JobClient: Physical memory (bytes) snapshot=452489216
16/09/19 12:04:02 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2948849664
16/09/19 12:04:02 INFO mapred.JobClient: Total committed heap usage (bytes)=127401984
16/09/19 12:04:02 INFO mapreduce.ExportJobBase: Transferred 1.7363 KB in 31.5282 seconds (56.394 bytes/sec)
16/09/19 12:04:02 INFO mapreduce.ExportJobBase: Exported 26 records.
[hdfs@localhost:/mssqltips]$

Notice that the last line of output above shows 26 records were exported, corresponding to the 26 records in the file on the HDFS. When we query the count of rows in our database table, we see that 26 rows are present, as displayed in the image below, which also shows the first 17 records returned by the query.
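The tip does not reproduce the queries behind that image, but verifying the load would look something like the following hedged sketch:

-- Confirm that all 26 rows arrived.
select count(*) as RecordCount from dbo.tblProductData;

-- Inspect the loaded rows.
select ProductKey, ProductName, UnitCost
from dbo.tblProductData
order by ProductKey;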



Next Steps

Please experiment with different data types. Also, please check out these other tips and tutorials on big data and Sqoop on MSSQLTips.com.

* SQL Server Big Data Tips
* Sqoop Runtime Exception: Cannot Load SQL Server Driver
* Use Sqoop to Load Data from a SQL Server Table to a Hadoop Distributed File System
* Using Sqoop WHERE Argument to Filter Data from a SQL Server
* Our complete tutorial list

Last Update: 10/10/2016



About the author

Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience. View all my tips

Related Resources

More SQL Server DBA Tips...
