使用Sqoop在HDFS和RDBMS之间导数据


SQOOP是一款开源的工具,主要用于在Hadoop与传统的数据库间进行数据的传递,下面从SQOOP用户手册上摘录一段描述

Sqoopis a tool designed to transfer data between Hadoop and relational databases.You can use Sqoop to import data from a relational database management system(RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System(HDFS),transform the data in Hadoop MapReduce, and then export the data backinto an RDBMS.

SQOOP是Cloudera公司开源的一款在HDFS以及数据库之间传输数据的软件。内部通过JDBC连接HADOOP以及数据库,因此从理论上来讲,只要是支持JDBC的数据库,SQOOP都可以兼容。并且,SQOOP不仅能把数据以文件的形式导入到HDFS上,还可以直接导入数据到HBASE或者HIVE中。

下面是一些性能测试数据,仅供参考:

表名:tb_keywords

行数:11628209

数据文件大小:1.4G

 

HDFS –> DB

DB -> HDFS

SQOOP

428s

166s

HDFS<->FILE<->DB

209s

105s

从结果上来看,以FILE作为中转方式性能是要高于SQOOP的。原因如下:

1、 本质上SQOOP使用的是JDBC,效率不会比MYSQL自带的到导入\导出工具效率高

2、 以导入数据到DB为例,SQOOP的设计思想是分阶段提交,也就是说假设一个表有1K行,那么它会先读出100行(默认值),然后插入,提交,再读取100行……如此往复

即便如此,SQOOP也是有优势的,比如说使用的便利性,任务执行的容错性等。在一些测试环境中如果需要的话可以考虑把它拿来作为一个工具使用。

下面是一些操作记录

[wanghai01@tc-crm-rd01.tc.baidu.com bin]$ sh export.sh
Fri Sep 23 20:15:47 CST 2011
11/09/23 20:15:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
11/09/23 20:15:48 INFO tool.CodeGenTool: Beginning code generation
11/09/23 20:15:48 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:15:48 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:15:48 INFO orm.CompilationManager: HADOOP_HOME is /home/wanghai01/hadoop/hadoop-0.20.2/bin/..
11/09/23 20:15:48 INFO orm.CompilationManager: Found hadoop core jar at: /home/wanghai01/hadoop/hadoop-0.20.2/bin/../hadoop-0.20.2-core.jar
11/09/23 20:15:49 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-wanghai01/compile/eb16aae87a119b93acb3bc6ea74b5e97/tb_keyword_data_201104.java to /home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/./tb_keyword_data_201104.java
11/09/23 20:15:49 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wanghai01/compile/eb16aae87a119b93acb3bc6ea74b5e97/tb_keyword_data_201104.jar
11/09/23 20:15:49 INFO mapreduce.ExportJobBase: Beginning export of tb_keyword_data_201104
11/09/23 20:15:49 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:15:49 INFO input.FileInputFormat: Total input paths to process : 1
11/09/23 20:15:49 INFO input.FileInputFormat: Total input paths to process : 1
11/09/23 20:15:49 INFO mapred.JobClient: Running job: job_201109211521_0012
11/09/23 20:15:50 INFO mapred.JobClient:  map 0% reduce 0%
11/09/23 20:16:04 INFO mapred.JobClient:  map 1% reduce 0%
11/09/23 20:16:10 INFO mapred.JobClient:  map 2% reduce 0%
11/09/23 20:16:13 INFO mapred.JobClient:  map 3% reduce 0%
11/09/23 20:16:19 INFO mapred.JobClient:  map 4% reduce 0%
11/09/23 20:16:22 INFO mapred.JobClient:  map 5% reduce 0%
11/09/23 20:16:25 INFO mapred.JobClient:  map 6% reduce 0%
11/09/23 20:16:31 INFO mapred.JobClient:  map 7% reduce 0%
11/09/23 20:16:34 INFO mapred.JobClient:  map 8% reduce 0%
11/09/23 20:16:41 INFO mapred.JobClient:  map 9% reduce 0%
11/09/23 20:16:44 INFO mapred.JobClient:  map 10% reduce 0%
11/09/23 20:16:50 INFO mapred.JobClient:  map 11% reduce 0%
11/09/23 20:16:53 INFO mapred.JobClient:  map 12% reduce 0%
11/09/23 20:16:56 INFO mapred.JobClient:  map 13% reduce 0%
11/09/23 20:17:02 INFO mapred.JobClient:  map 14% reduce 0%
11/09/23 20:17:05 INFO mapred.JobClient:  map 15% reduce 0%
11/09/23 20:17:11 INFO mapred.JobClient:  map 16% reduce 0%
11/09/23 20:17:14 INFO mapred.JobClient:  map 17% reduce 0%
11/09/23 20:17:17 INFO mapred.JobClient:  map 18% reduce 0%
11/09/23 20:17:23 INFO mapred.JobClient:  map 19% reduce 0%
11/09/23 20:17:25 INFO mapred.JobClient:  map 20% reduce 0%
11/09/23 20:17:28 INFO mapred.JobClient:  map 21% reduce 0%
11/09/23 20:17:34 INFO mapred.JobClient:  map 22% reduce 0%
11/09/23 20:17:37 INFO mapred.JobClient:  map 23% reduce 0%
11/09/23 20:17:43 INFO mapred.JobClient:  map 24% reduce 0%
11/09/23 20:17:46 INFO mapred.JobClient:  map 25% reduce 0%
11/09/23 20:17:49 INFO mapred.JobClient:  map 26% reduce 0%
11/09/23 20:17:55 INFO mapred.JobClient:  map 27% reduce 0%
11/09/23 20:17:58 INFO mapred.JobClient:  map 28% reduce 0%
11/09/23 20:18:04 INFO mapred.JobClient:  map 29% reduce 0%
11/09/23 20:18:07 INFO mapred.JobClient:  map 30% reduce 0%
11/09/23 20:18:10 INFO mapred.JobClient:  map 31% reduce 0%
11/09/23 20:18:16 INFO mapred.JobClient:  map 32% reduce 0%
11/09/23 20:18:19 INFO mapred.JobClient:  map 33% reduce 0%
11/09/23 20:18:25 INFO mapred.JobClient:  map 34% reduce 0%
11/09/23 20:18:28 INFO mapred.JobClient:  map 35% reduce 0%
11/09/23 20:18:31 INFO mapred.JobClient:  map 36% reduce 0%
11/09/23 20:18:37 INFO mapred.JobClient:  map 37% reduce 0%
11/09/23 20:18:40 INFO mapred.JobClient:  map 38% reduce 0%
11/09/23 20:18:46 INFO mapred.JobClient:  map 39% reduce 0%
11/09/23 20:18:49 INFO mapred.JobClient:  map 40% reduce 0%
11/09/23 20:18:52 INFO mapred.JobClient:  map 41% reduce 0%
11/09/23 20:18:58 INFO mapred.JobClient:  map 42% reduce 0%
11/09/23 20:19:01 INFO mapred.JobClient:  map 43% reduce 0%
11/09/23 20:19:04 INFO mapred.JobClient:  map 44% reduce 0%
11/09/23 20:19:10 INFO mapred.JobClient:  map 45% reduce 0%
11/09/23 20:19:13 INFO mapred.JobClient:  map 46% reduce 0%
11/09/23 20:19:19 INFO mapred.JobClient:  map 47% reduce 0%
11/09/23 20:19:22 INFO mapred.JobClient:  map 48% reduce 0%
11/09/23 20:19:25 INFO mapred.JobClient:  map 49% reduce 0%
11/09/23 20:19:34 INFO mapred.JobClient:  map 50% reduce 0%
11/09/23 20:19:37 INFO mapred.JobClient:  map 52% reduce 0%
11/09/23 20:19:40 INFO mapred.JobClient:  map 53% reduce 0%
11/09/23 20:19:43 INFO mapred.JobClient:  map 54% reduce 0%
11/09/23 20:19:46 INFO mapred.JobClient:  map 55% reduce 0%
11/09/23 20:19:49 INFO mapred.JobClient:  map 56% reduce 0%
11/09/23 20:19:52 INFO mapred.JobClient:  map 57% reduce 0%
11/09/23 20:19:55 INFO mapred.JobClient:  map 58% reduce 0%
11/09/23 20:19:58 INFO mapred.JobClient:  map 59% reduce 0%
11/09/23 20:20:01 INFO mapred.JobClient:  map 60% reduce 0%
11/09/23 20:20:04 INFO mapred.JobClient:  map 62% reduce 0%
11/09/23 20:20:07 INFO mapred.JobClient:  map 63% reduce 0%
11/09/23 20:20:10 INFO mapred.JobClient:  map 64% reduce 0%
11/09/23 20:20:13 INFO mapred.JobClient:  map 65% reduce 0%
11/09/23 20:20:16 INFO mapred.JobClient:  map 66% reduce 0%
11/09/23 20:20:19 INFO mapred.JobClient:  map 67% reduce 0%
11/09/23 20:20:22 INFO mapred.JobClient:  map 68% reduce 0%
11/09/23 20:20:25 INFO mapred.JobClient:  map 69% reduce 0%
11/09/23 20:20:28 INFO mapred.JobClient:  map 70% reduce 0%
11/09/23 20:20:31 INFO mapred.JobClient:  map 72% reduce 0%
11/09/23 20:20:34 INFO mapred.JobClient:  map 73% reduce 0%
11/09/23 20:20:37 INFO mapred.JobClient:  map 74% reduce 0%
11/09/23 20:20:40 INFO mapred.JobClient:  map 75% reduce 0%
11/09/23 20:20:43 INFO mapred.JobClient:  map 76% reduce 0%
11/09/23 20:20:46 INFO mapred.JobClient:  map 77% reduce 0%
11/09/23 20:20:49 INFO mapred.JobClient:  map 78% reduce 0%
11/09/23 20:20:52 INFO mapred.JobClient:  map 80% reduce 0%
11/09/23 20:20:55 INFO mapred.JobClient:  map 81% reduce 0%
11/09/23 20:20:58 INFO mapred.JobClient:  map 82% reduce 0%
11/09/23 20:21:01 INFO mapred.JobClient:  map 83% reduce 0%
11/09/23 20:21:04 INFO mapred.JobClient:  map 84% reduce 0%
11/09/23 20:21:07 INFO mapred.JobClient:  map 85% reduce 0%
11/09/23 20:21:10 INFO mapred.JobClient:  map 86% reduce 0%
11/09/23 20:21:13 INFO mapred.JobClient:  map 87% reduce 0%
11/09/23 20:21:22 INFO mapred.JobClient:  map 88% reduce 0%
11/09/23 20:21:28 INFO mapred.JobClient:  map 89% reduce 0%
11/09/23 20:21:37 INFO mapred.JobClient:  map 90% reduce 0%
11/09/23 20:21:47 INFO mapred.JobClient:  map 91% reduce 0%
11/09/23 20:21:53 INFO mapred.JobClient:  map 92% reduce 0%
11/09/23 20:22:02 INFO mapred.JobClient:  map 93% reduce 0%
11/09/23 20:22:11 INFO mapred.JobClient:  map 94% reduce 0%
11/09/23 20:22:17 INFO mapred.JobClient:  map 95% reduce 0%
11/09/23 20:22:26 INFO mapred.JobClient:  map 96% reduce 0%
11/09/23 20:22:32 INFO mapred.JobClient:  map 97% reduce 0%
11/09/23 20:22:41 INFO mapred.JobClient:  map 98% reduce 0%
11/09/23 20:22:47 INFO mapred.JobClient:  map 99% reduce 0%
11/09/23 20:22:53 INFO mapred.JobClient:  map 100% reduce 0%
11/09/23 20:22:55 INFO mapred.JobClient: Job complete: job_201109211521_0012
11/09/23 20:22:55 INFO mapred.JobClient: Counters: 6
11/09/23 20:22:55 INFO mapred.JobClient:   Job Counters
11/09/23 20:22:55 INFO mapred.JobClient:     Launched map tasks=4
11/09/23 20:22:55 INFO mapred.JobClient:     Data-local map tasks=4
11/09/23 20:22:55 INFO mapred.JobClient:   FileSystemCounters
11/09/23 20:22:55 INFO mapred.JobClient:     HDFS_BYTES_READ=1392402240
11/09/23 20:22:55 INFO mapred.JobClient:   Map-Reduce Framework
11/09/23 20:22:55 INFO mapred.JobClient:     Map input records=11628209
11/09/23 20:22:55 INFO mapred.JobClient:     Spilled Records=0
11/09/23 20:22:55 INFO mapred.JobClient:     Map output records=11628209
11/09/23 20:22:55 INFO mapreduce.ExportJobBase: Transferred 1.2968 GB in 425.642 seconds (3.1198 MB/sec)
11/09/23 20:22:55 INFO mapreduce.ExportJobBase: Exported 11628209 records.
Fri Sep 23 20:22:55 CST 2011

###############

[wanghai01@tc-crm-rd01.tc.baidu.com bin]$ sh import.sh
Fri Sep 23 20:40:33 CST 2011
11/09/23 20:40:33 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
11/09/23 20:40:33 INFO tool.CodeGenTool: Beginning code generation
11/09/23 20:40:33 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:40:33 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:40:33 INFO orm.CompilationManager: HADOOP_HOME is /home/wanghai01/hadoop/hadoop-0.20.2/bin/..
11/09/23 20:40:33 INFO orm.CompilationManager: Found hadoop core jar at: /home/wanghai01/hadoop/hadoop-0.20.2/bin/../hadoop-0.20.2-core.jar
11/09/23 20:40:34 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-wanghai01/compile/a913cede5621df95376a26c1af737ee2/tb_keyword_data_201104.java to /home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/./tb_keyword_data_201104.java
11/09/23 20:40:34 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wanghai01/compile/a913cede5621df95376a26c1af737ee2/tb_keyword_data_201104.jar
11/09/23 20:40:34 WARN manager.MySQLManager: It looks like you are importing from mysql.
11/09/23 20:40:34 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
11/09/23 20:40:34 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
11/09/23 20:40:34 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
11/09/23 20:40:34 INFO mapreduce.ImportJobBase: Beginning import of tb_keyword_data_201104
11/09/23 20:40:34 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:40:40 INFO mapred.JobClient: Running job: job_201109211521_0014
11/09/23 20:40:41 INFO mapred.JobClient:  map 0% reduce 0%
11/09/23 20:40:54 INFO mapred.JobClient:  map 25% reduce 0%
11/09/23 20:40:57 INFO mapred.JobClient:  map 50% reduce 0%
11/09/23 20:41:36 INFO mapred.JobClient:  map 75% reduce 0%
11/09/23 20:42:00 INFO mapred.JobClient:  map 100% reduce 0%
11/09/23 20:43:19 INFO mapred.JobClient: Job complete: job_201109211521_0014
11/09/23 20:43:19 INFO mapred.JobClient: Counters: 5
11/09/23 20:43:19 INFO mapred.JobClient:   Job Counters
11/09/23 20:43:19 INFO mapred.JobClient:     Launched map tasks=4
11/09/23 20:43:19 INFO mapred.JobClient:   FileSystemCounters
11/09/23 20:43:19 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1601269219
11/09/23 20:43:19 INFO mapred.JobClient:   Map-Reduce Framework
11/09/23 20:43:19 INFO mapred.JobClient:     Map input records=11628209
11/09/23 20:43:19 INFO mapred.JobClient:     Spilled Records=0
11/09/23 20:43:19 INFO mapred.JobClient:     Map output records=11628209
11/09/23 20:43:19 INFO mapreduce.ImportJobBase: Transferred 1.4913 GB in 165.0126 seconds (9.2544 MB/sec)
11/09/23 20:43:19 INFO mapreduce.ImportJobBase: Retrieved 11628209 records.
Fri Sep 23 20:43:19 CST 2011

import.sh和export.sh中的主要命令如下

/home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/sqoop import --connect jdbc:mysql://XXXX/crm --username XX --password XX --table tb_keyword_data_201104 --split-by winfo_id --target-dir /user/wanghai01/data/ --fields-terminated-by '\t' --lines-terminated-by '\n' --input-null-string '' --input-null-non-string ''
/home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/sqoop export --connect jdbc:mysql://XXXX/crm --username XX --password XX --table tb_keyword_data_201104 --export-dir /user/wanghai01/data/ --fields-terminated-by '\t' --lines-terminated-by '\n' --input-null-string '' --input-null-non-string ''

相关内容