
Load SQL Server T-SQL Query Results to Hadoop Using Sqoop

Problem

I need to load the results of a SQL Server T-SQL query into the Hadoop Distributed File System (HDFS). How can I do this? How can I validate that the data was loaded correctly?

Solution

Apache Sqoop has a --query argument that allows you to pass in a SQL Server T-SQL query, so you can select specific columns and rows to be loaded into the Hadoop Distributed File System (HDFS). You can also join tables and use T-SQL functions for formatting. In this example, we will join the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables in the SQL Server 2014 AdventureWorks2014 database to return the SalesOrderDetailID column, the OrderDate column formatted as YYYYMMDD, and the OrderQty and LineTotal columns, for rows where the LineTotal is 100.00 or greater. As with previous tips in this series, this tip was written using SQL Server 2014 and a Cloudera virtual machine running on a laptop.
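Before running Sqoop, it can help to preview the result set by running the same query directly against AdventureWorks2014, for example in SQL Server Management Studio. This is simply the query described above, shown on its own:

-- Preview of the rows Sqoop will export: one row per order detail line
-- with a LineTotal of 100.00 or more, and the order date formatted as YYYYMMDD.
SELECT d.SalesOrderDetailID,
       CONVERT(nvarchar(8), h.OrderDate, 112) AS OrderDate,  -- style 112 = YYYYMMDD
       d.OrderQty,
       d.LineTotal
FROM Sales.SalesOrderDetail AS d
INNER JOIN Sales.SalesOrderHeader AS h
    ON d.SalesOrderID = h.SalesOrderID
WHERE d.LineTotal >= 100.00;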

The Sqoop command we will examine is listed below, and we will break down each argument in turn. Please note that the command must be entered on one complete line, or with a backslash (the Linux command-line continuation character) at the end of each line except the last.

sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' --username 'sqoopLogin' -P --target-dir 'Sales100OrGreater' --split-by 'SalesOrderDetailID' --query "select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate,
d.OrderQty, d.LineTotal
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
on d.SalesOrderID = h.SalesOrderID
where d.LineTotal>=100.00 and \$CONDITIONS"

sqoop import - The executable is named sqoop, and we are instructing it to import data from a table or view in a database to HDFS.

--connect - With the --connect argument, we pass in the JDBC connect string for SQL Server. In this case, we use the IP address, port number, and database name.

--username - In this example, the user name is a SQL Server login, not a Windows login. Our database is set up to authenticate in mixed mode. We have a server login named sqoopLogin, and a database user named sqoopUser which is a member of the db_datareader role and has a default schema of dbo.

-P - This will prompt the command line user for the password. If Sqoop is rarely executed, this might be a good option. There are multiple other ways to pass the password to the command automatically, but we are trying to keep it simple for this tip.

--target-dir - The target directory tells Sqoop which directory on HDFS to store the selected data in. This argument is required by Sqoop when using a free-form query.

--split-by - Even though we are selecting the primary key of the SalesOrderDetail table, we still have to provide Sqoop with a unique identifier to help it distribute the workload.

--query - This is the argument in which we supply our T-SQL query. The query above is enclosed in double quotes. Also notice that there is no backslash (the line continuation character) in the multiple lines containing the query. Finally, notice the "and \$CONDITIONS" token at the end of the WHERE clause. This token is required because Sqoop automatically replaces $CONDITIONS with a unique condition expression for each mapper, as sketched below. Please do not define $CONDITIONS as an environment variable.
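To make the role of the \$CONDITIONS token concrete, the sketch below shows how Sqoop rewrites the query at run time: first with (1 = 0) to read only column metadata, then a bounding values query on the --split-by column, and finally one query per mapper with a range predicate. The split boundaries shown here are made-up values for illustration; Sqoop computes the real ones from the bounding query, as the log output further down confirms.

-- 1. Metadata-only run: $CONDITIONS becomes (1 = 0), so no rows are returned
--    and Sqoop only reads the column names and types.
SELECT d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) AS OrderDate,
       d.OrderQty, d.LineTotal
FROM Sales.SalesOrderDetail AS d
INNER JOIN Sales.SalesOrderHeader AS h ON d.SalesOrderID = h.SalesOrderID
WHERE d.LineTotal >= 100.00 AND (1 = 0);

-- 2. Bounding values query on the --split-by column (SalesOrderDetailID).
SELECT MIN(SalesOrderDetailID), MAX(SalesOrderDetailID)
FROM (SELECT d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) AS OrderDate,
             d.OrderQty, d.LineTotal
      FROM Sales.SalesOrderDetail AS d
      INNER JOIN Sales.SalesOrderHeader AS h ON d.SalesOrderID = h.SalesOrderID
      WHERE d.LineTotal >= 100.00 AND (1 = 1)) AS t1;

-- 3. Each mapper then runs the query with $CONDITIONS replaced by its own range,
--    for example (the boundary values 1 and 30000 are illustrative only):
SELECT d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) AS OrderDate,
       d.OrderQty, d.LineTotal
FROM Sales.SalesOrderDetail AS d
INNER JOIN Sales.SalesOrderHeader AS h ON d.SalesOrderID = h.SalesOrderID
WHERE d.LineTotal >= 100.00 AND (SalesOrderDetailID >= 1 AND SalesOrderDetailID < 30000);

This is also why --split-by should point to a column whose values are reasonably evenly distributed; a skewed split column leads to unevenly sized map tasks.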

The output from the sqoop command is shown below.

[hdfs@localhost ~]$ ./sqoopCommand.sh
16/08/23 08:07:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password:
16/08/23 08:07:53 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/23 08:07:53 INFO tool.CodeGenTool: Beginning code generation
16/08/23 08:07:54 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and (1 = 0)
16/08/23 08:07:55 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and (1 = 0)
16/08/23 08:07:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/23 08:07:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.jar
16/08/23 08:07:59 INFO mapreduce.ImportJobBase: Beginning query import.
16/08/23 08:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/23 08:08:08 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/23 08:08:08 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(SalesOrderDetailID), MAX(SalesOrderDetailID) FROM (select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and (1 = 1) ) AS t1
16/08/23 08:08:09 INFO mapred.JobClient: Running job: job_201608230640_0002
16/08/23 08:08:10 INFO mapred.JobClient: map 0% reduce 0%
16/08/23 08:08:37 INFO mapred.JobClient: map 50% reduce 0%
16/08/23 08:08:51 INFO mapred.JobClient: map 75% reduce 0%
16/08/23 08:08:52 INFO mapred.JobClient: map 100% reduce 0%
16/08/23 08:08:55 INFO mapred.JobClient: Job complete: job_201608230640_0002
16/08/23 08:08:55 INFO mapred.JobClient: Counters: 23
16/08/23 08:08:55 INFO mapred.JobClient: File System Counters
16/08/23 08:08:55 INFO mapred.JobClient: FILE: Number of bytes read=0
16/08/23 08:08:55 INFO mapred.JobClient: FILE: Number of bytes written=1190836
16/08/23 08:08:55 INFO mapred.JobClient: FILE: Number of read operations=0
16/08/23 08:08:55 INFO mapred.JobClient: FILE: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient: FILE: Number of write operations=0
16/08/23 08:08:55 INFO mapred.JobClient: HDFS: Number of bytes read=534
16/08/23 08:08:55 INFO mapred.JobClient: HDFS: Number of bytes written=1814075
16/08/23 08:08:55 INFO mapred.JobClient: HDFS: Number of read operations=12
16/08/23 08:08:55 INFO mapred.JobClient: HDFS: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient: HDFS: Number of write operations=6
16/08/23 08:08:55 INFO mapred.JobClient: Job Counters
16/08/23 08:08:55 INFO mapred.JobClient: Launched map tasks=4
16/08/23 08:08:55 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=69593
16/08/23 08:08:55 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient: Map-Reduce Framework
16/08/23 08:08:55 INFO mapred.JobClient: Map input records=63605
16/08/23 08:08:55 INFO mapred.JobClient: Map output records=63605
16/08/23 08:08:55 INFO mapred.JobClient: Input split bytes=534
16/08/23 08:08:55 INFO mapred.JobClient: Spilled Records=0
16/08/23 08:08:55 INFO mapred.JobClient: CPU time spent (ms)=11250
16/08/23 08:08:55 INFO mapred.JobClient: Physical memory (
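The counters above report Map output records=63605. As one possible sanity check (assuming the AdventureWorks2014 data has not changed since the import), the same filter can be counted on the SQL Server side and compared with the number of records Sqoop wrote to the Sales100OrGreater directory:

-- Row count on the source side; this should match the Map output records
-- counter (63605) reported by the Sqoop job.
SELECT COUNT(*) AS ExpectedRowCount
FROM Sales.SalesOrderDetail AS d
INNER JOIN Sales.SalesOrderHeader AS h
    ON d.SalesOrderID = h.SalesOrderID
WHERE d.LineTotal >= 100.00;

On the Hadoop side, counting the lines in the files written under Sales100OrGreater (for example with hadoop fs -cat piped to wc -l) should return the same number.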
