Markus Kemper created SQOOP-3130:
------------------------------------

             Summary: Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers requested
                 Key: SQOOP-3130
                 URL: https://issues.apache.org/jira/browse/SQOOP-3130
             Project: Sqoop
          Issue Type: Bug
            Reporter: Markus Kemper
When using Sqoop (export + --export-dir) with Avro data files, Sqoop does not obey the requested --num-mappers value; instead it creates one map task per Avro data file. One known workaround for this issue is to use the Sqoop --hcatalog options. Please see the test case below demonstrating the issue and the workaround.

*Test Case*
{noformat}
#################
# STEP 01 - Create Data
#################

for i in {1..100000}; do d=`date +"%Y-%m-%d %H:%M:%S" --date="+$i days"`; echo "$i,$d,row data" >> ./data.csv; done
ls -l ./*; wc data.csv
hdfs dfs -mkdir -p /user/root/external/t1
hdfs dfs -put ./data.csv /user/root/external/t1/data.csv
hdfs dfs -ls -R /user/root/external/t1/

Output:
-rw-r--r-- 1 root root 3488895 Feb 1 11:20 ./data.csv
~~~~~
100000 300000 3488895 data.csv
~~~~~
-rw-r--r-- 3 root root 3488895 2017-02-01 11:26 /user/root/external/t1/data.csv

#################
# STEP 02 - Create RDBMS Table and Export Data
#################

export MYCONN=jdbc:oracle:thin:@oracle.cloudera.com:1521/db11g
export MYUSER=sqoop
export MYPSWD=cloudera
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1_text (c1 int, c2 date, c3 varchar(10))"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1 --input-fields-terminated-by ',' --num-mappers 1
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Transferred 3.3274 MB in 24.8037 seconds (137.3688 KB/sec)
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------

#################
# STEP 03 - Import Data as Text, Creating 10 HDFS Files
#################

sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_text --delete-target-dir --num-mappers 10 --split-by C1 --as-textfile
hdfs dfs -ls /user/root/external/t1_text/part*

Output:
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Transferred 3.518 MB in 57.0517 seconds (63.1434 KB/sec)
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r-- 3 root root 358894 2017-02-01 11:38 /user/root/external/t1_text/part-m-00000
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00001
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00002
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00003
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00004
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00005
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00006
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00007
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00008
-rw-r--r-- 3 root root 370001 2017-02-01 11:38 /user/root/external/t1_text/part-m-00009

#################
# STEP 04 - Export 10 Text-Formatted Data Files Using 2 Splits
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
------------------------
| COUNT(*)             |
------------------------
| 0                    |
------------------------
~~~~~
17/02/01 11:47:26 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 11:47:26 INFO mapreduce.JobSubmitter: number of splits:2
<SNIP>
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 31.7104 seconds (113.6324 KB/sec)
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------

#################
# STEP 05 - Import Data as Avro, Creating 10 HDFS Files
#################

sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_avro --delete-target-dir --num-mappers 10 --split-by C1 --as-avrodatafile
hdfs dfs -ls /user/root/external/t1_avro/*.avro

Output:
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Transferred 2.3703 MB in 68.454 seconds (35.4568 KB/sec)
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r-- 3 root root 231119 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00000.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00001.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00002.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00003.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00004.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00005.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00006.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00007.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00008.avro
-rw-r--r-- 3 root root 250478 2017-02-01 11:56 /user/root/external/t1_avro/part-m-00009.avro

#################
# STEP 06 - Export 10 Avro-Formatted Data Files Using 2 Splits (reproduction of issue)
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_avro --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
17/02/01 12:01:07 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 12:01:08 INFO mapreduce.JobSubmitter: number of splits:10   <================== not correct; should be 2, not 10
<SNIP>
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Transferred 2.4497 MB in 57.1965 seconds (43.8575 KB/sec)
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------

#################
# STEP 07 - Export 10 Avro-Formatted Data Files Using 2 Splits via HCat Options (workaround)
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
beeline -u "jdbc:hive2://hs2.coe.cloudera.com:10000" -n user1 -e "use default; drop table t1_avro; create external table t1_avro (c1 int, c2 string, c3 string) row format delimited fields terminated by ',' stored as avro location 'hdfs:///user/root/external/t1_avro'; select count(*) from t1_avro;"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --hcatalog-database default --hcatalog-table t1_avro --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
------------------------
| COUNT(*)             |
------------------------
| 0                    |
------------------------
~~~~~
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
~~~~~
17/02/01 13:41:54 INFO mapred.FileInputFormat: Total input paths to process : 10
17/02/01 13:41:54 INFO mapreduce.JobSubmitter: number of splits:2   <================ correct!
<SNIP>
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Transferred 2.5225 MB in 48.7286 seconds (53.0082 KB/sec)
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)