Markus Kemper created SQOOP-3046:
------------------------------------

             Summary: Add support for (import + --hcatalog* + --as-parquetfile)
                 Key: SQOOP-3046
                 URL: https://issues.apache.org/jira/browse/SQOOP-3046
             Project: Sqoop
          Issue Type: Improvement
          Components: hive-integration
            Reporter: Markus Kemper
This is a request to identify a way to support Sqoop import with the --hcatalog* options when writing Parquet data files. The test case below demonstrates the issue.

CODE SNIP:
{noformat}
../MapredParquetOutputFormat.java
69   @Override
70   public RecordWriter<Void, ParquetHiveRecord> getRecordWriter(
71       final FileSystem ignored,
72       final JobConf job,
73       final String name,
74       final Progressable progress
75       ) throws IOException {
76     throw new RuntimeException("Should never be used");
77   }
{noformat}

TEST CASE:
{noformat}
STEP 01 - Create MySQL Table

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1 (c_int int, c_date date, c_timestamp timestamp)"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "describe t1"
---------------------------------------------------------------------------------------------------------
| Field       | Type      | Null | Key | Default           | Extra                       |
---------------------------------------------------------------------------------------------------------
| c_int       | int(11)   | YES  |     | (null)            |                             |
| c_date      | date      | YES  |     | (null)            |                             |
| c_timestamp | timestamp | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
---------------------------------------------------------------------------------------------------------

STEP 02 - Insert and Select Row

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "insert into t1 values (1, current_date(), current_timestamp())"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select * from t1"
--------------------------------------------------
| c_int | c_date     | c_timestamp           |
--------------------------------------------------
| 1     | 2016-10-26 | 2016-10-26 14:30:33.0 |
--------------------------------------------------

STEP 03 - Import Into HCatalog Parquet Table (fails)

beeline -u jdbc:hive2:// -e "use default; drop table t1"
sqoop import -Dmapreduce.map.log.level=DEBUG \
  --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 \
  --hcatalog-database default --hcatalog-table t1 --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as parquet' --num-mappers 1

[sqoop console debug]
16/11/02 20:25:15 INFO mapreduce.Job: Task Id : attempt_1478089149450_0046_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Should never be used
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:76)
	at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:102)
	at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:767)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

[yarn maptask debug]
2016-11-02 20:25:15,565 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: 1=1 AND 1=1
2016-11-02 20:25:15,583 DEBUG [main] org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat: Creating db record reader for db product: MYSQL
2016-11-02 20:25:15,613 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2016-11-02 20:25:15,614 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
2016-11-02 20:25:15,620 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
2016-11-02 20:25:15,633 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Should never be used
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:76)
	at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:102)
	at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:767)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
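A note on the failure path, as suggested by the snippet and stack traces above: Hive's MapredParquetOutputFormat does its real work through the Hive-specific getHiveRecordWriter entry point and stubs out the plain mapred getRecordWriter (line 76), while HCatalog's FileOutputFormatContainer drives output through the mapred-level getRecordWriter, so the import dies with "Should never be used". A minimal stand-alone Java sketch of the two entry points (hypothetical class and method shapes for illustration, not Hive's actual API):

```java
// Sketch of the two writer entry points involved in this issue.
// Class/interface names here are hypothetical stand-ins.
public class HcatParquetPathDemo {

    interface SketchRecordWriter {
        void write(String record);
    }

    // Stand-in for org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.
    static class ParquetOutputFormatSketch {
        // Hive-specific entry point: the path Hive itself uses; it works.
        SketchRecordWriter getHiveRecordWriter() {
            return record -> System.out.println("parquet write: " + record);
        }

        // Plain mapred entry point: deliberately stubbed out,
        // mirroring MapredParquetOutputFormat.java line 76.
        SketchRecordWriter getRecordWriter() {
            throw new RuntimeException("Should never be used");
        }
    }

    public static void main(String[] args) {
        ParquetOutputFormatSketch fmt = new ParquetOutputFormatSketch();

        // Hive's own write path succeeds.
        fmt.getHiveRecordWriter().write("1,2016-10-26,2016-10-26 14:30:33.0");

        // HCatalog's FileOutputFormatContainer goes through the mapred-level
        // getRecordWriter, so the import fails as in the stack trace above.
        try {
            fmt.getRecordWriter();
        } catch (RuntimeException e) {
            System.out.println("hcatalog path: " + e.getMessage());
        }
    }
}
```

One plausible direction for this improvement, then, would be routing HCatalog Parquet output through the Hive-level writer rather than the mapred-level one.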