[ https://issues.apache.org/jira/browse/HIVE-15178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-15178:
------------------------------------
    Attachment: HIVE-15178.patch

Not sure how it ever worked, because these settings are never propagated in the merge task. Maybe it worked because the default split size is large and we always want to merge into one file.
[~prasanth_j] small patch

> ORC stripe merge may produce many MR jobs and no merge if split size is small
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-15178
>                 URL: https://issues.apache.org/jira/browse/HIVE-15178
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-15178.patch
>
>
> orc_createas1 logs the following:
> {noformat}
> 2016-11-10T13:38:54,366 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask: Processing split: Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2400+100InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> 2016-11-10T13:38:54,373 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask: Processing split: Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2500+100InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> 2016-11-10T13:38:54,380 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask: Processing split: Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2600+100InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> 2016-11-10T13:38:54,387 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask: Processing split: Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2700+100InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> ...
> {noformat}
> It tries to merge 2 files, but instead ends up running tons of MR tasks, one for every 100 bytes, and produces 2 files again (I assume most tasks don't produce files because a split at a random 100-byte offset is invalid).
> {noformat}
> 2016-11-10T13:38:53,985 INFO [LocalJobRunner Map Task Executor #0] OrcFileMergeOperator: Merged stripe from file pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000000_0 [ offset : 3 length: 2770 row: 500 ]
> 2016-11-10T13:38:53,995 INFO [LocalJobRunner Map Task Executor #0] exec.AbstractFileMergeOperator: renamed path pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_task_tmp.-ext-10002/_tmp.000002_0 to pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_tmp.-ext-10002/000002_0 . File size is 2986
> 2016-11-10T13:38:54,206 INFO [LocalJobRunner Map Task Executor #0] OrcFileMergeOperator: Merged stripe from file pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0 [ offset : 3 length: 2770 row: 500 ]
> 2016-11-10T13:38:54,215 INFO [LocalJobRunner Map Task Executor #0] exec.AbstractFileMergeOperator: renamed path pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_task_tmp.-ext-10002/_tmp.000030_0 to pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_tmp.-ext-10002/000030_0 . File size is 2986
> {noformat}
> This is because the test sets the max split size to 100. The merge job is supposed to override that, but somehow that doesn't happen.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
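
For illustration only, here is a rough sketch of the split arithmetic behind the report. It is not Hive or Hadoop code: the class and method names (`SplitSketch`, `computeSplits`, `splitsOwningStripe`) are hypothetical, the 2986-byte input length is a stand-in borrowed from the merged output size in the log, the 256 MB "large" split size is an arbitrary example value, and the rule that a stripe is handled by the split containing its start offset is an assumption. Under those assumptions, a 100-byte max split size yields 30 splits for one small file, of which only one covers the stripe at offset 3.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Cut a file into consecutive byte ranges of at most maxSplitSize bytes.
    // Each element is {start, length}. Simplified: real input formats may
    // apply extra rules (minimum size, slop factor, combining).
    public static List<long[]> computeSplits(long fileLength, long maxSplitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += maxSplitSize) {
            long length = Math.min(maxSplitSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    // Count splits whose range [start, start + length) contains the stripe's
    // start offset. Assumption: only such a split processes the stripe.
    public static int splitsOwningStripe(List<long[]> splits, long stripeOffset) {
        int count = 0;
        for (long[] s : splits) {
            if (stripeOffset >= s[0] && stripeOffset < s[0] + s[1]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long fileLength = 2986;   // assumed input size (stand-in from the log)
        long stripeOffset = 3;    // stripe offset reported in the log
        long[] maxSplitSizes = {100L, 256L * 1024 * 1024}; // test value vs. example large value

        for (long maxSplitSize : maxSplitSizes) {
            List<long[]> splits = computeSplits(fileLength, maxSplitSize);
            System.out.println("max split size " + maxSplitSize + ": "
                + splits.size() + " split(s), "
                + splitsOwningStripe(splits, stripeOffset)
                + " containing the stripe start");
        }
    }
}
```

With the small value, almost every task receives a byte range that contains no stripe start, which is consistent with the guess above that most tasks produce no file.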