[ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20664:
--------------------------------
    Attachment: HIVE-20664.2.patch
        Status: Patch Available  (was: In Progress)

I have added a log statement as suggested. Also fixed a couple of issues with 
the patch.

I have tweaked the condition to try to simplify the flow, but 
{{firstStripeIndex > lastStripeIndex}} remains the same. The condition is now:
 {{if (firstStripeIndex > lastStripeIndex || firstStripeIndex == -1)}}

In the example I used in the description, if there's a 3rd stripe in the file, 
1000-1500, the lastStripeIndex will be 1 and the firstStripeIndex will be 2.

So basically the condition is: either the first stripe is found after the last 
stripe (the split is within any of the stripes of the file except the last one) 
or the first stripe is not found at all (the entire split is within the last 
stripe).

Regarding testing this fix, I tried to create a test case using 
TestInputOutputFormat.MockFileSystem to simulate multiple blocks in a file. We 
could then have a split generated per block, but things get tricky because we 
create an OrcReader in findMinMax which needs an actual file to be present. One 
way is to write an orc file with multiple stripes, then use MockFileSystem to 
simulate having multiple blocks in it and using that fs to generate the splits. 
I'm sure there is an easier/cleaner way to create such a test case - any 
suggestions?

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20664
>                 URL: https://issues.apache.org/jira/browse/HIVE-20664
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Saurabh Seth
>            Assignee: Saurabh Seth
>            Priority: Minor
>         Attachments: HIVE-20664.2.patch, HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
>     for(int i = 0; i < stripes.size(); i++) {
>       StripeInformation stripe = stripes.get(i);
>       long stripeEnd = stripe.getOffset() + stripe.getLength();
>       if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
>         firstStripeIndex = i;
>       }
>       if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>           stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
>         //the last condition is for when both splitStart and splitEnd are in
>         // the same stripe
>         lastStripeIndex = i;
>       }
>     }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstSripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop stripe.getOffset() is 500, stripeEnd is 
> 1000, The first if statement condition will not be met in this case because 
> stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to