-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71456/
-----------------------------------------------------------
Review request for hive, Ashutosh Chauhan, Jesús Camacho Rodríguez, and Slim
Bouguerra.
Bugs: HIVE-22055
https://issues.apache.org/jira/browse/HIVE-22055
Repository: hive-git
Description
-------
Rows can go missing from query results when tez.grouping.min-size is set to a
small value (for example 1), so that the split size calculated from the file
size is used. That size changes as the table grows, so each select may use a
different split size.
load 90 records from f1
select count(1) gives back 90
load 90 records from f2
select count(1) gives back 172 // 8 records missing
When the second select runs, the split size is larger, and
SerDeLowLevelCacheImpl is already populated with stripes from the first select
(when the split size was smaller).
There is a problem with how LineRecordReader works together with the cache. If
a larger split is requested and an overlapping smaller one is already in the
cache, SerDeEncodedDataReader will try to extend the existing split by reading
the difference between the large and the small split. But it starts reading
right after the point where the last stripe physically ends, and
LineRecordReader always skips the first row unless it is at the beginning of
the file. This line-skipping behaviour is not taken into account at one point,
which is why some rows are missing.
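The row loss can be reproduced with a small stand-alone model. The sketch below
is only an illustration under simplifying assumptions: readSplit is a
hypothetical stand-in that mimics the two line-reader rules involved (skip the
first line unless the split starts at offset 0, and keep reading while the
current line starts at or before the split end). It is not the Hive or Hadoop
code, and the class name, method names, and sample data are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; not the actual Hive/Hadoop implementation.
class SplitExtensionSketch {

    // Builds a file of rowCount newline-terminated rows: "row0\n", "row1\n", ...
    static byte[] sampleFile(int rowCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rowCount; i++) {
            sb.append("row").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns the offset just past the next '\n' at or after pos (or end of data).
    static int nextLineStart(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Reads the rows that belong to the split [start, end) under the modelled rules.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> rows = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not at the beginning of the file: the first line is always skipped.
            pos = nextLineStart(data, pos);
        }
        // A line is read if it starts at or before the split end.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int len = next - pos;
            if (data[next - 1] == '\n') {
                len--; // drop the line terminator
            }
            rows.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos = next;
        }
        return rows;
    }

    // Models the problematic extension: keep the rows of the cached small split
    // [0, smallEnd) and read the rest starting where the cached data physically ends.
    static List<String> naiveExtend(byte[] data, int smallEnd, int largeEnd) {
        List<String> rows = new ArrayList<>(readSplit(data, 0, smallEnd));
        int physicalEnd = 0;
        for (String row : rows) {
            // Assumes every cached row was terminated by a single '\n'.
            physicalEnd += row.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        // The start offset is non-zero, so the first row after the cache is skipped.
        rows.addAll(readSplit(data, physicalEnd, largeEnd));
        return rows;
    }

    public static void main(String[] args) {
        byte[] data = sampleFile(10); // 10 rows of 5 bytes each, 50 bytes total
        System.out.println("full read: " + readSplit(data, 0, data.length).size() + " rows");
        System.out.println("cache + naive extension: "
                + naiveExtend(data, 22, data.length).size() + " rows");
    }
}
```

With this sample data the full read returns 10 rows, while the cached small
split [0, 22) plus the naive extension returns 9: the cached rows end physically
at offset 25, and row5, which starts exactly there, is dropped by the
skip-first-line rule.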
Diffs
-----
itests/src/test/resources/testconfiguration.properties 98280c52fe9
llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java 462b25fa234
ql/src/test/queries/clientpositive/mm_loaddata_split_change.q PRE-CREATION
ql/src/test/results/clientpositive/llap/mm_loaddata_split_change.q.out PRE-CREATION
Diff: https://reviews.apache.org/r/71456/diff/1/
Testing
-------
Tested with a q test (mm_loaddata_split_change.q).
Thanks,
Attila Magyar