[ 
https://issues.apache.org/jira/browse/HIVE-12877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101328#comment-15101328
 ] 

yangfang commented on HIVE-12877:
---------------------------------

I created a partitioned table and added a partition whose location contains .gz files:
CREATE EXTERNAL TABLE IF NOT EXISTS if_pmt_note_staging (
apsdactno string, date_tr string, apsdjrnno string, apsdseqno string, 
province_code string
) partitioned by (batch_id string);

alter table if_pmt_note_staging add partition (batch_id='201510') location 
'/hive/if_pmt_note_staging';

The location '/hive/if_pmt_note_staging' contains several .gz files, such as 1.gz, 
2.gz, and so on.

Then I created an index:

CREATE INDEX index_if_pmt_note_staging_date_tr 
ON TABLE if_pmt_note_staging (date_tr) 
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD 
IN TABLE t_index_if_pmt_note_staging_date_tr; 

alter index index_if_pmt_note_staging_date_tr on if_pmt_note_staging rebuild; 

CREATE INDEX index_apsh_province_code_apsdprocod_apsdactno_tr
ON TABLE apsh (apsdprocod) 
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD
IN TABLE t_index_province_code_apsdprocod_apsdactno_tr;

When I execute a query that uses the index on date_tr:
select * from if_pmt_note_staging where date_tr='20121205';

I found that some rows that should have been returned were missing from the 
result, for example the matching rows in the 3.gz file.

The Hive logs print the following:
split start : 10336
split end : 59916
...................

Indeed, the hiveIndexResult.contains function filters out some files in 
HiveIndexedInputFormat. The function is listed below:

  public boolean contains(FileSplit split) throws HiveException {

    ....................
    // Keep the split only if some indexed offset falls inside its byte range.
    for (Long offset : bucket.getOffsets()) {
      if ((offset >= split.getStart())
          && (offset <= split.getStart() + split.getLength())) {
        return true;
      }
    }
    return false;
  }
The offsets stored in the index are positions in the file after decompression, 
but split.getLength() is the length of the file before decompression. An offset 
can therefore be larger than split.getStart() + split.getLength(), and files 
that do contain matching rows are filtered out by this function.
This section of code does not seem necessary; we could delete it.
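
To illustrate the mismatch, here is a minimal standalone sketch. It is not Hive 
code: the class SplitOffsetCheck, its methods, and the sample row are 
hypothetical. It only repeats the range test from the loop above on plain 
numbers and compares an offset in the decompressed data with the length of the 
gzip-compressed bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Hive.
public class SplitOffsetCheck {

    // Same range test as the loop in contains(), applied to plain longs.
    public static boolean contains(long splitStart, long splitLength, long[] offsets) {
        for (long offset : offsets) {
            if (offset >= splitStart && offset <= splitStart + splitLength) {
                return true;
            }
        }
        return false;
    }

    // Number of bytes the data occupies once gzip-compressed.
    public static long gzipLength(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static void main(String[] args) {
        // Made-up, highly repetitive rows so that gzip shrinks them a lot.
        String row = "a001,20121205,j01,s01,p01\n";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append(row);
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.US_ASCII);

        long compressedLength = gzipLength(raw);
        // Offset of the last row, measured in the decompressed data.
        long offset = raw.length - row.length();

        System.out.println("decompressed length = " + raw.length);
        System.out.println("compressed length   = " + compressedLength);
        System.out.println("offset of last row  = " + offset);
        // The split covers the whole .gz file, but its length is the compressed one.
        System.out.println("kept by range test  = "
            + contains(0, compressedLength, new long[] {offset}));
    }
}
```

With data like this the offset of the last row is far beyond the compressed 
length, so the range test rejects the file even though it holds a matching row.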

> Hive use index for queries will lose some data if the Query file is 
> compressed.
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-12877
>                 URL: https://issues.apache.org/jira/browse/HIVE-12877
>             Project: Hive
>          Issue Type: Bug
>          Components: Indexing
>    Affects Versions: 1.2.1
>         Environment: This problem exists in all Hive versions, no matter what 
> platform
>            Reporter: yangfang
>
> When a file is compressed, Hive builds the index using positions in the 
> decompressed data, but when it divides the data into splits for MapReduce, 
> Hive compares those positions against the compressed file length. If the two 
> do not match, it filters out the file, so the query loses some data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
