[ https://issues.apache.org/jira/browse/HIVE-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denys Kuzmenko updated HIVE-27497:
----------------------------------
    Priority: Major  (was: Critical)

> Vector groupby operation produces incorrect result when malicious 
> configuration is intentionally set
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27497
>                 URL: https://issues.apache.org/jira/browse/HIVE-27497
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: ConfX
>            Priority: Major
>         Attachments: reproduce.sh
>
>
> h2. What happened:
> Hive's VectorGroupByOperator produces incorrect group-by results when the 
> configuration parameters hive.vectorized.groupby.maxentries and 
> hive.groupby.mapaggr.checkinterval are set to bad values, for example 
> negative ones. While Hive is still aggregating in hash mode, such a 
> configuration can force a premature switch to streaming mode: the groups 
> accumulated in hash mode are flushed out and never merged with the groups 
> later aggregated in streaming mode, so the group-by result is incorrect.
> This issue is severe because anyone who can set these two parameters can 
> trigger the incorrect behavior; there should also be value checks for such 
> parameters, at the very least a negative-value check for these two (a 
> sketch of such a guard follows).
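> As an illustration only, a minimal sketch of the kind of guard we have in 
> mind; the helper requirePositive is hypothetical and not part of Hive:
> {code:java}
> // Hypothetical negative-value guard, sketched for illustration only.
> static int requirePositive(String key, int value) {
>   if (value <= 0) {
>     throw new IllegalArgumentException(
>         key + " must be a positive integer, got " + value);
>   }
>   return value;
> }
>
> // e.g. requirePositive("hive.vectorized.groupby.maxentries", maxEntries);
> //      requirePositive("hive.groupby.mapaggr.checkinterval", checkInterval);
> {code}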
> h2. Buggy code:
> In hash mode, the vector groupby operator checks hash efficiency at the end 
> of each processBatch. 
> {code:java}
> @Override
> public void processBatch(VectorizedRowBatch batch) throws HiveException {
>   ...
>   if (this instanceof ProcessingModeHashAggregate) {
>     // Check if we should turn into streaming mode
>     ((ProcessingModeHashAggregate)this).checkHashModeEfficiency();
>   }
> }
> {code}
> In checkHashModeEfficiency, there are three if-conditions that control 
> whether to switch to streaming mode.
> {code:java}
> private void checkHashModeEfficiency() throws HiveException {
>   // Condition 1: enough rows have been seen since the last check.
>   if (lastModeCheckRowCount > numRowsCompareHashAggr) {
>     ...
>     final long inputRecords = sumBatchSize;
>     final long outputRecords = numEntriesHashTable + numFlushedOutEntriesBeforeFinalFlush;
>     final float ratio = (outputRecords) / (inputRecords * 1.0f);
>     // Condition 2: the hash table is not reducing the row count enough.
>     if (ratio > minReductionHashAggr) {
>       // Condition 3: more input rows than the hash table may hold.
>       if (inputRecords > maxHtEntries) {
>         flush(true);
>         changeToStreamingMode();
>       }
>     }
>   }
> }
> {code}
> Here numRowsCompareHashAggr comes from hive.groupby.mapaggr.checkinterval 
> and maxHtEntries comes from hive.vectorized.groupby.maxentries. If these 
> two configuration parameters are set to small enough values, e.g. negative 
> values, the first and third if-conditions are effectively always true; once 
> the ratio check also passes, flush is called and the operator switches to 
> streaming mode.
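> To make this concrete, a standalone sketch (plain Java, using the values 
> from the reproduction steps below) showing that with negative settings the 
> first and third conditions hold even for trivial inputs:
> {code:java}
> public class NegativeConfigDemo {
>   public static void main(String[] args) {
>     int numRowsCompareHashAggr = -1152319368; // hive.groupby.mapaggr.checkinterval
>     int maxHtEntries = -84396285;             // hive.vectorized.groupby.maxentries
>
>     long lastModeCheckRowCount = 0; // no rows processed yet
>     long inputRecords = 1;          // a single input row
>
>     // Both gating conditions pass immediately:
>     System.out.println(lastModeCheckRowCount > numRowsCompareHashAggr); // true
>     System.out.println(inputRecords > maxHtEntries);                    // true
>   }
> }
> {code}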
> This premature mode change is what leads to the incorrect group-by 
> aggregation.
> Take the following example input, one (key 1, key 2, key 3, value) tuple 
> per column:
> | key 1 (int)    | null | 1   | 1   | null | 2    | 2    | null |
> | key 2 (string) | A    | A   | A   | C    | null | null | A    |
> | key 3 (int)    | null | 2   | 2   | null | 2    | 2    | null |
> | value (double) | 1.0  | 2.0 | 4.0 | 8.0  | 16.0 | 32.0 | 64.0 |
> If we now sum the values grouped by the three keys, the expected result is:
> | key 1 (int)    | 1   | 2    | null | null |
> | key 2 (string) | A   | null | A    | C    |
> | key 3 (int)    | 2   | 2    | null | null |
> | sum (double)   | 6.0 | 48.0 | 65.0 | 8.0  |
> However, if we intentionally set hive.vectorized.groupby.maxentries to 
> -84396285 and hive.groupby.mapaggr.checkinterval to -1152319368, the output 
> becomes:
> | key 1 (int)    | 1   | null | 1   | null | 2    | null |
> | key 2 (string) | A   | A    | A   | C    | null | A    |
> | key 3 (int)    | 2   | null | 2   | null | 2    | null |
> | sum (double)   | 2.0 | 1.0  | 4.0 | 8.0  | 48.0 | 46.0 |
> This happens because checkHashModeEfficiency flushes the hash table and 
> switches to streaming mode partway through processBatch. The groups already 
> accumulated in hash mode are emitted as-is and never merged with the groups 
> subsequently built in streaming mode, so the same key can appear more than 
> once with partial sums; only the streaming-mode portion is aggregated 
> correctly.
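> As an illustration only (not Hive code), a minimal sketch of this failure 
> pattern: aggregate into a map, flush it mid-stream, then keep emitting rows 
> without merging, so key A comes out twice with partial sums:
> {code:java}
> import java.util.LinkedHashMap;
> import java.util.Map;
>
> public class FlushMidStreamDemo {
>   public static void main(String[] args) {
>     String[] keys = {"A", "A", "B", "A"};
>     double[] vals = {1.0, 2.0, 3.0, 4.0};
>
>     // Hash mode: aggregate the first two rows.
>     Map<String, Double> hashAgg = new LinkedHashMap<>();
>     for (int i = 0; i < 2; i++) {
>       hashAgg.merge(keys[i], vals[i], Double::sum);
>     }
>
>     // Premature flush: emit partial aggregates, then forget them.
>     hashAgg.forEach((k, v) -> System.out.println(k + " -> " + v)); // A -> 3.0
>     hashAgg.clear();
>
>     // Streaming mode: remaining rows are emitted without consulting
>     // what was already flushed.
>     for (int i = 2; i < keys.length; i++) {
>       System.out.println(keys[i] + " -> " + vals[i]); // B -> 3.0, A -> 4.0
>     }
>     // The correct result would have been A -> 7.0, B -> 3.0.
>   }
> }
> {code}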
> h2. How to reproduce:
> The above example is taken from a test in the Hive ql module, 
> org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt. 
> Follow these steps to reproduce the bug:
> (1) Set hive.vectorized.groupby.maxentries to -84396285 and 
> hive.groupby.mapaggr.checkinterval to -1152319368 (a programmatic way to do 
> this is sketched below).
> (2) Run the test 
> org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt.
> For an easier reproduction, run reproduce.sh from the attachment.
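> For reference, one way to inject the two values programmatically, assuming 
> a plain HiveConf (string keys avoid depending on the exact 
> HiveConf.ConfVars enum constant names):
> {code:java}
> import org.apache.hadoop.hive.conf.HiveConf;
>
> public class ReproConf {
>   // Build a HiveConf carrying the two bad values.
>   static HiveConf buggyConf() {
>     HiveConf conf = new HiveConf();
>     conf.set("hive.vectorized.groupby.maxentries", "-84396285");
>     conf.set("hive.groupby.mapaggr.checkinterval", "-1152319368");
>     return conf;
>   }
> }
> {code}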
> You should observe the following failure:
> {code}
> java.lang.AssertionError: [1, A, 2] expected:<6.0> but was:<2.0>
>         at org.junit.Assert.fail(Assert.java:89)
>         at org.junit.Assert.failNotEquals(Assert.java:835)
>         at org.junit.Assert.assertEquals(Assert.java:120)
>         at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$ValueValidator.validate(TestVectorGroupByOperator.java:2773)
>         at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$12.inspectRow(TestVectorGroupByOperator.java:2463)
>         at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureOutputOperator.process(FakeCaptureOutputOperator.java:94)
>         at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureVectorToRowOutputOperator.process(FakeCaptureVectorToRowOutputOperator.java:162)
>         at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.flushOutput(VectorGroupByOperator.java:1347)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.closeOp(VectorGroupByOperator.java:1355)
>         at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:686)
>         at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKey(TestVectorGroupByOperator.java:2479)
>         at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKeyIntStringInt(TestVectorGroupByOperator.java:896)
> {code}
> We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
