[ https://issues.apache.org/jira/browse/HIVE-12877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101328#comment-15101328 ]
yangfang commented on HIVE-12877:
---------------------------------

I created a partitioned table and loaded .gz files into the partition:

    CREATE EXTERNAL TABLE IF NOT EXISTS if_pmt_note_staging (
      apsdactno string,
      date_tr string,
      apsdjrnno string,
      apsdseqno string,
      province_code string
    )
    partitioned by (batch_id string);

    alter table if_pmt_note_staging add partition (batch_id='201510') location '/hive/if_pmt_note_staging';

The location '/hive/if_pmt_note_staging' contains several .gz files, such as 1.gz, 2.gz and so on.

Then I created the indexes:

    CREATE INDEX index_if_pmt_note_staging_date_tr
    ON TABLE if_pmt_note_staging (date_tr)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD
    IN TABLE t_index_if_pmt_note_staging_date_tr;

    alter index index_if_pmt_note_staging_date_tr on if_pmt_note_staging rebuild;

    CREATE INDEX index_apsh_province_code_apsdprocod_apsdactno_tr
    ON TABLE apsh (apsdprocod)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD
    IN TABLE t_index_province_code_apsdprocod_apsdactno_tr;

When I execute a query that uses the index on date_tr:

    select * from if_pmt_note_staging where date_tr='20121205';

I found that some rows that should be returned are missing from the result, such as the matching rows in the 3.gz file. The Hive logs print the following:

    split start : 10336
    split end : 59916
    ...................

Indeed, the hiveIndexResult.contains function filters out some files in HiveIndexedInputFormat. The function is listed below:

    public boolean contains(FileSplit split) throws HiveException {
      ....................
      for (Long offset : bucket.getOffsets()) {
        if ((offset >= split.getStart())
            && (offset <= split.getStart() + split.getLength())) {
          return true;
        }
      }
    }

The offsets are based on the length of the file after decompression, but split.getLength() is the length of the file before decompression, so some files may be filtered out by this function. It seems this section of code isn't necessary; we can delete it.
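To make the mismatch concrete, here is a minimal standalone sketch of the same range check. It does not use any Hive or Hadoop classes. The class and method names are made up for illustration, and the offset 250000 is a hypothetical value standing in for a row position in the decompressed data. Only the split bounds come from the log above.

```java
import java.util.Arrays;
import java.util.List;

// Standalone model of the range check quoted above; names are hypothetical.
class SplitOffsetCheck {

    // Same comparison as in the quoted contains(): does any indexed offset
    // fall inside [splitStart, splitStart + splitLength]?
    static boolean containsAnyOffset(List<Long> offsets, long splitStart, long splitLength) {
        for (long offset : offsets) {
            if (offset >= splitStart && offset <= splitStart + splitLength) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Split bounds taken from the log lines in the comment
        // (measured on the compressed file).
        long splitStart = 10336L;
        long splitLength = 59916L - 10336L;

        // Hypothetical index offset: a position in the decompressed data
        // that lies beyond the end of the compressed split.
        List<Long> offsets = Arrays.asList(250000L);

        // prints false: the split is skipped although it holds the matching row
        System.out.println(containsAnyOffset(offsets, splitStart, splitLength));
    }
}
```

Because the decompressed data is usually larger than the compressed file, any offset past the compressed length fails this check, which would explain why rows from files such as 3.gz go missing.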
> Hive use index for queries will lose some data if the Query file is compressed.
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-12877
>                 URL: https://issues.apache.org/jira/browse/HIVE-12877
>             Project: Hive
>          Issue Type: Bug
>          Components: Indexing
>    Affects Versions: 1.2.1
>         Environment: This problem exists in all Hive versions, no matter what platform
>            Reporter: yangfang
>
> Hive creates the index using the extracted file length when the file is
> compressed, but when dividing the data into splits in MapReduce, Hive compares
> the compressed file length with the extracted file length. If it finds that
> these two lengths do not match, it filters out the file, so the query loses
> some data.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)