Review Request 71456: select count gives incorrect result after loading data from text file

Attila Magyar Mon, 09 Sep 2019 08:28:11 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71456/
-----------------------------------------------------------


Review request for hive, Ashutosh Chauhan, Jesús Camacho Rodríguez, and Slim 
Bouguerra.


Bugs: HIVE-22055
    https://issues.apache.org/jira/browse/HIVE-22055


Repository: hive-git


Description
-------

This happens when tez.grouping.min-size is set to a small value (for example 1), 
so that the split size calculated from the file size is used. That size changes 
as the table grows, so each select may run with a different split size.

load 90 records from f1
select count(1) gives back 90
load 90 records from f2
select count(1) gives back 172 // 8 records missing


When the second select runs, the split size is larger, and 
SerDeLowLevelCacheImpl is already populated with stripes from the first select 
(when the split size was smaller).


The problem is in how LineRecordReader works together with the cache. If a 
larger split is requested and an overlapping smaller one is already in the 
cache, SerDeEncodedDataReader tries to extend the existing split by reading the 
difference between the large and the small split. But it starts reading right 
after the point where the last stripe physically ends, and LineRecordReader 
always skips the first row unless it is at the beginning of the file. This 
line-skipping behaviour is not accounted for at one point, which is why some 
rows are missing.
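
The row loss described above can be reproduced with a small standalone model. 
This is only a sketch, not Hive or Hadoop code: the class SplitSkipSketch, the 
helpers readSplit and sampleData, and the 3-byte rows are all made up for 
illustration. It assumes a reader that discards the first line whenever the 
split does not start at offset 0 and keeps reading while the current position 
is at or before the split end. The "adjusted" read at the end is one 
hypothetical way to compensate for the skip and is not necessarily what the 
patch does.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSkipSketch {

    // Simplified, hypothetical model of a line-oriented split reader:
    // - if the split does not start at offset 0, the first line is discarded;
    // - lines are read while the current position is <= the split end.
    static List<String> readSplit(String data, int start, int end) {
        List<String> rows = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Discard everything up to and including the next newline.
            int nl = data.indexOf('\n', start);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        while (pos <= end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            rows.add(data.substring(pos, lineEnd));
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        return rows;
    }

    // Builds n rows "r0\n", "r1\n", ... of 3 bytes each (valid for n <= 10).
    static String sampleData(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('r').append(i).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String data = sampleData(6); // 18 bytes, 6 rows

        // First query: small split [0, 7). The row that starts at offset 6
        // is read to completion, so the cached data physically ends at 9.
        List<String> cached = readSplit(data, 0, 7);
        int cachedEnd = 9;

        // Second query: larger split [0, 18). Extending the cached part by
        // starting exactly at cachedEnd makes the reader skip row r3.
        List<String> naive = readSplit(data, cachedEnd, 18);

        // Hypothetical compensation: start one byte earlier, so the skipped
        // "first line" is only the trailing newline of the last cached row.
        List<String> adjusted = readSplit(data, cachedEnd - 1, 18);

        System.out.println(cached + " + " + naive);    // [r0, r1, r2] + [r4, r5]
        System.out.println(cached + " + " + adjusted); // [r0, r1, r2] + [r3, r4, r5]
    }
}
```

In this model the naive extension returns 5 of the 6 rows. This is the same 
kind of shortfall as the 172 instead of 180 in the example above.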


Diffs
-----

  itests/src/test/resources/testconfiguration.properties 98280c52fe9 
  llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java 462b25fa234 
  ql/src/test/queries/clientpositive/mm_loaddata_split_change.q PRE-CREATION 
  ql/src/test/results/clientpositive/llap/mm_loaddata_split_change.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/71456/diff/1/


Testing
-------

Tested with the new q test (mm_loaddata_split_change.q).


Thanks,

Attila Magyar
