[ https://issues.apache.org/jira/browse/HIVE-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denys Kuzmenko updated HIVE-27497:
----------------------------------
    Priority: Major  (was: Critical)

> Vector groupby operation produces incorrect result when malicious configuration is intentionally set
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27497
>                 URL: https://issues.apache.org/jira/browse/HIVE-27497
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: ConfX
>            Priority: Major
>         Attachments: reproduce.sh
>
>
> h2. What happened:
> Hive's VectorGroupByOperator produces incorrect groupby results when the configuration parameters hive.vectorized.groupby.maxentries and hive.groupby.mapaggr.checkinterval are set to bad values, for example negative ones. When Hive begins the aggregation in hash mode, the malicious configuration can forcibly switch processing to streaming mode; the groups already accumulated in hash mode are flushed out and are never aggregated with the rows processed later in streaming mode, which makes the groupby result incorrect.
> This issue is severe because an attacker who controls these two parameters can trigger the malicious behavior at will. There should also be value checkers for such parameters, at least a negative-value check for these two.
> h2. Buggy code:
> In hash mode, the vector groupby operator checks hash efficiency at the end of each processBatch.
> {code:java}
> @Override
> public void processBatch(VectorizedRowBatch batch) throws HiveException {
>   ...
>   if (this instanceof ProcessingModeHashAggregate) {
>     // Check if we should turn into streaming mode
>     ((ProcessingModeHashAggregate)this).checkHashModeEfficiency();
>   }
> }
> {code}
> In checkHashModeEfficiency, three if-conditions control whether to switch to streaming mode.
> {code:java}
> private void checkHashModeEfficiency() throws HiveException {
>   if (lastModeCheckRowCount > numRowsCompareHashAggr) {
>     ...
>     final long inputRecords = sumBatchSize;
>     final long outputRecords = numEntriesHashTable +
>         numFlushedOutEntriesBeforeFinalFlush;
>     final float ratio = (outputRecords) / (inputRecords * 1.0f);
>     if (ratio > minReductionHashAggr) {
>       if (inputRecords > maxHtEntries) {
>         flush(true);
>         changeToStreamingMode();
>       }
>     }
>   }
> }
> {code}
> Here numRowsCompareHashAggr is read from hive.groupby.mapaggr.checkinterval, and maxHtEntries is read from hive.vectorized.groupby.maxentries. If these two configuration parameters are set to small enough values, e.g. negative values, then the first and third if-conditions are virtually always true. In this way, flush is called and the operator is changed to streaming mode.
> This mode change leads to incorrect groupby aggregation.
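> To make the effect of the negative settings concrete, the guard comparisons can be replayed outside Hive. The standalone sketch below is ours, not Hive code: the variable names mirror checkHashModeEfficiency, the two negative values are the ones used in this report, and the batch statistics (7 rows, 4 distinct keys, a 0.5f threshold) are made-up sample numbers.
> {code:java}
> public class NegativeGuardDemo {
>   public static void main(String[] args) {
>     // The two negative values from this report.
>     long numRowsCompareHashAggr = -1152319368L; // hive.groupby.mapaggr.checkinterval
>     long maxHtEntries = -84396285L;             // hive.vectorized.groupby.maxentries
>     float minReductionHashAggr = 0.5f;          // illustrative ratio threshold
>
>     // Made-up statistics after a single 7-row batch with 4 distinct keys.
>     long lastModeCheckRowCount = 7;
>     long inputRecords = 7;   // sumBatchSize
>     long outputRecords = 4;  // hash-table entries + already-flushed entries
>     float ratio = outputRecords / (inputRecords * 1.0f);
>
>     // All three guards in checkHashModeEfficiency pass immediately:
>     System.out.println(lastModeCheckRowCount > numRowsCompareHashAggr); // true: 7 > -1152319368
>     System.out.println(ratio > minReductionHashAggr);                   // true: 0.57 > 0.5
>     System.out.println(inputRecords > maxHtEntries);                    // true: 7 > -84396285
>   }
> }
> {code}
> Because numRowsCompareHashAggr and maxHtEntries are negative, the first and third comparisons hold for any non-negative row count, so the hash table is flushed at the very first efficiency check.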
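> Regarding the suggested value checkers: below is a minimal sketch of the kind of guard we have in mind, assuming it would run wherever the operator reads its configuration. The helper readPositiveIntVar and its fall-back-to-default policy are hypothetical, not existing Hive API; the default value shown is only illustrative.
> {code:java}
> import org.apache.hadoop.hive.conf.HiveConf;
>
> public class GroupByConfGuard {
>
>   // Hypothetical helper: read an int configuration value and reject
>   // non-positive settings by falling back to a supplied sane default.
>   // A real patch might prefer to throw or log a warning instead.
>   static int readPositiveIntVar(HiveConf conf, HiveConf.ConfVars var, int saneDefault) {
>     int value = conf.getIntVar(var);
>     return (value > 0) ? value : saneDefault;
>   }
>
>   public static void main(String[] args) {
>     HiveConf conf = new HiveConf();
>     conf.setInt("hive.vectorized.groupby.maxentries", -84396285);
>
>     // The negative setting is rejected and the illustrative default is used.
>     int maxHtEntries = readPositiveIntVar(conf,
>         HiveConf.ConfVars.HIVE_VECTORIZATION_GROUPBY_MAXENTRIES, 1000000);
>     System.out.println(maxHtEntries); // 1000000
>   }
> }
> {code}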
> Take a look at the following example. Each table row below lists one column of the data: the first cell gives the column type, followed by the values of the seven input rows (the three key columns int, string, int, and the double value column):
> | type   |      |      |      |      |      |      |      |
> | int    | null | 1    | 1    | null | 2    | 2    | null |
> | string | A    | A    | A    | C    | null | null | A    |
> | int    | null | 2    | 2    | null | 2    | 2    | null |
> | double | 1.0  | 2.0  | 4.0  | 8.0  | 16.0 | 32.0 | 64.0 |
> Now, if we sum the double column grouped by the three key columns, the expected result would be:
> | type   |      |      |      |      |
> | int    | 1    | 2    | null | null |
> | string | A    | null | A    | C    |
> | int    | 2    | 2    | null | null |
> | double | 6.0  | 48.0 | 65.0 | 8.0  |
> However, if we intentionally set hive.vectorized.groupby.maxentries to -84396285 and hive.groupby.mapaggr.checkinterval to -1152319368, the output becomes:
> | type   |      |      |      |      |      |      |
> | int    | 1    | null | 1    | null | 2    | null |
> | string | A    | A    | A    | C    | null | A    |
> | int    | 2    | null | 2    | null | 2    | null |
> | double | 2.0  | 1.0  | 4.0  | 8.0  | 48.0 | 46.0 |
> This happens because, in the middle of processBatch, checkHashModeEfficiency flushes the hash table and switches to streaming mode: the groups flushed from the hash table are never summed together with the rows that arrive afterwards, and the later rows are aggregated in streaming mode only partially correctly.
> h2. How to reproduce:
> The above example is taken from a test in the Hive ql module, org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt. You can follow these steps to reproduce the bug:
> (1) Set hive.vectorized.groupby.maxentries to -84396285 and hive.groupby.mapaggr.checkinterval to -1152319368.
> (2) Run the test org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt.
> For an easy reproduction, run the reproduce.sh in the attachment.
> You should observe the following failure:
> {code}
> java.lang.AssertionError: [1, A, 2] expected:<6.0> but was:<2.0>
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:120)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$ValueValidator.validate(TestVectorGroupByOperator.java:2773)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$12.inspectRow(TestVectorGroupByOperator.java:2463)
>     at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureOutputOperator.process(FakeCaptureOutputOperator.java:94)
>     at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureVectorToRowOutputOperator.process(FakeCaptureVectorToRowOutputOperator.java:162)
>     at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.flushOutput(VectorGroupByOperator.java:1347)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.closeOp(VectorGroupByOperator.java:1355)
>     at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:686)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKey(TestVectorGroupByOperator.java:2479)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKeyIntStringInt(TestVectorGroupByOperator.java:896)
> {code}
> We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)