[ 
https://issues.apache.org/jira/browse/HIVE-15178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15178:
------------------------------------
    Attachment: HIVE-15178.patch

Not sure how this ever worked, because these settings are never propagated to 
the merge task. Maybe it worked because the default split size is large and we 
always want to merge into one file.

[~prasanth_j] small patch

> ORC stripe merge may produce many MR jobs and no merge if split size is small
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-15178
>                 URL: https://issues.apache.org/jira/browse/HIVE-15178
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-15178.patch
>
>
> orc_createas1
> logs the following:
> {noformat}
> 2016-11-10T13:38:54,366  INFO [LocalJobRunner Map Task Executor #0] 
> mapred.MapTask: Processing split: 
> Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2400+100InputFormatClass:
>  org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> 2016-11-10T13:38:54,373  INFO [LocalJobRunner Map Task Executor #0] 
> mapred.MapTask: Processing split: 
> Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2500+100InputFormatClass:
>  org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> 2016-11-10T13:38:54,380  INFO [LocalJobRunner Map Task Executor #0] 
> mapred.MapTask: Processing split: 
> Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2600+100InputFormatClass:
>  org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> 2016-11-10T13:38:54,387  INFO [LocalJobRunner Map Task Executor #0] 
> mapred.MapTask: Processing split: 
> Paths:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0:2700+100InputFormatClass:
>  org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat
> ...
> {noformat}
> It tries to merge 2 files, but instead ends up running a separate MR task for 
> every 100 bytes and produces 2 files again (I assume most tasks don't produce 
> any files because a split at an arbitrary 100-byte offset is invalid).
> {noformat}
> 2016-11-10T13:38:53,985  INFO [LocalJobRunner Map Task Executor #0] 
> OrcFileMergeOperator: Merged stripe from file 
> pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000000_0
>  [ offset : 3 length: 2770 row: 500 ]
> 2016-11-10T13:38:53,995  INFO [LocalJobRunner Map Task Executor #0] 
> exec.AbstractFileMergeOperator: renamed path 
> pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_task_tmp.-ext-10002/_tmp.000002_0
>  to 
> pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_tmp.-ext-10002/000002_0
>  . File size is 2986
> 2016-11-10T13:38:54,206  INFO [LocalJobRunner Map Task Executor #0] 
> OrcFileMergeOperator: Merged stripe from file 
> pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/-ext-10004/000001_0
>  [ offset : 3 length: 2770 row: 500 ]
> 2016-11-10T13:38:54,215  INFO [LocalJobRunner Map Task Executor #0] 
> exec.AbstractFileMergeOperator: renamed path 
> pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_task_tmp.-ext-10002/_tmp.000030_0
>  to 
> pfile:/Users/sergey/git/hivegit2/itests/qtest/target/warehouse/.hive-staging_hive_2016-11-10_13-38-52_334_1323113125332102866-1/_tmp.-ext-10002/000030_0
>  . File size is 2986
> {noformat}
> This is because the test sets the max split size to 100. The merge job is 
> supposed to override that, but for some reason the override doesn't happen.
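
To make the arithmetic in the description concrete, here is a rough, self-contained sketch of how a 100-byte max split size cuts a small file into many splits, only one of which contains the start of the single stripe at offset 3. It is not Hive or Hadoop code: the class and method names are made up, the 2986-byte file length is an assumed figure (the log only reports that size for the merged output), and the rule that a stripe belongs to the split containing its start offset is an assumption about how the merge reader picks stripes. The real FileInputFormat split logic also has details (such as slop on the last split) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hive or Hadoop.
final class SplitSketch {

    /** A byte range [start, start + length) of a file. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        boolean contains(long offset) {
            return offset >= start && offset < start + length;
        }
    }

    /** Cuts a file into consecutive splits of at most maxSplitSize bytes. */
    static List<Split> computeSplits(long fileLength, long maxSplitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += maxSplitSize) {
            splits.add(new Split(start, Math.min(maxSplitSize, fileLength - start)));
        }
        return splits;
    }

    /**
     * Counts the splits that contain the start offset of at least one stripe.
     * Assumption: a stripe is handled by the split its start offset falls in.
     */
    static int countSplitsWithStripes(List<Split> splits, long[] stripeOffsets) {
        int count = 0;
        for (Split split : splits) {
            for (long offset : stripeOffsets) {
                if (split.contains(offset)) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long assumedFileLength = 2986;   // assumed; the input file size is not in the log
        long[] stripeOffsets = {3};      // from the log: [ offset : 3 length: 2770 row: 500 ]
        long[] maxSplitSizes = {100, 256L * 1024 * 1024};

        for (long maxSplitSize : maxSplitSizes) {
            List<Split> splits = computeSplits(assumedFileLength, maxSplitSize);
            System.out.println("max split size " + maxSplitSize
                + ": " + splits.size() + " split(s), "
                + countSplitsWithStripes(splits, stripeOffsets) + " containing a stripe start");
        }
    }
}
```

Under these assumptions, a max split size of 100 gives 30 splits per file with only one of them covering a stripe start, while a large split size gives a single split per file, which is what the merge job would want.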



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
