[jira] [Commented] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

Hive QA (JIRA) Tue, 20 Nov 2018 05:50:11 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693256#comment-16693256
 ]


Hive QA commented on HIVE-20330:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12948838/HIVE-20330.0.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 15551 tests 
executed
*Failed tests:*
{noformat}
TestMiniDruidCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=197)
        [druidmini_masking.q,druidmini_joins.q,druid_timestamptz.q]
org.apache.hive.jdbc.TestActivePassiveHA.testActivePassiveHA (batchId=259)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/15009/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/15009/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-15009/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12948838 - PreCommit-HIVE-Build

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-20330
>                 URL: https://issues.apache.org/jira/browse/HIVE-20330
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: HIVE-20330.0.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoaderFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

Reply via email to