[ https://issues.apache.org/jira/browse/HIVE-5590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
eye updated HIVE-5590:
----------------------

    Description: 
We occasionally have compressed files larger than 160 MB in .deflate format. One such file was loaded into Hive through an external table, say T_A. When we run "select count(*) from T_A" we get roughly 70% more records than "hadoop fs -text /xxxxx | wc -l" reports for the same file. Any clue what is going on? How could this happen?

The oversized .deflate file was the result of imperfect upstream processing; once we fixed that and produced files smaller than 64 MB, the problem stopped occurring. But since we cannot guarantee that a larger file will never show up again, is there any way to avoid this issue?

Cheers!
eye

  was:
We occasionally have compressed files larger than 160 MB in .deflate format. One such file was loaded into Hive through an external table, say T_A. When we run "select count(*) from T_A" we get roughly 70% more records than "hadoop fs -text /xxxxx | wc -l" reports for the same file. Any clue what is going on?

The oversized .deflate file was the result of imperfect upstream processing; once we fixed that and produced files smaller than 64 MB, the problem stopped occurring. But since we cannot guarantee that a larger file will never show up again, is there any way to avoid this issue?

Cheers!
eye


> select returns duplicated records in Hive when a .deflate file larger than
> 64 MB is loaded into a Hive table
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5590
>                 URL: https://issues.apache.org/jira/browse/HIVE-5590
>             Project: Hive
>          Issue Type: Bug
>         Environment: cdh4
>            Reporter: eye
>              Labels: 64M, count(*), duplicated, hdfs, hive, records
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> We occasionally have compressed files larger than 160 MB in .deflate format. One such file was loaded into Hive
> through an external table, say T_A.
> When we run "select count(*) from T_A" we get roughly 70% more records than "hadoop fs -text /xxxxx | wc -l"
> reports for the same file.
> Any clue what is going on? How could this happen?
> The oversized .deflate file was the result of imperfect upstream processing; once we fixed that and produced
> files smaller than 64 MB, the problem stopped occurring. But since we cannot guarantee that a larger file will
> never show up again, is there any way to avoid this issue?
> Cheers!
> eye


--
This message was sent by Atlassian JIRA
(v6.1#6144)
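
A hedged sketch of one possible workaround, assuming the extra rows come from Hive splitting the non-splittable .deflate file across several map tasks (a common cause of this symptom, though not confirmed in this ticket). The table name T_A and the /xxxxx placeholder are taken from the report; the 512 MB minimum split size is an illustrative assumption, and the properties shown are standard Hadoop/Hive settings rather than anything specific to this issue.

    # Baseline count straight from HDFS, as in the report (/xxxxx left as the report's placeholder)
    hadoop fs -text /xxxxx | wc -l

    # Same count through Hive with splitting of large files suppressed for the session.
    # 536870912 (512 MB) is an assumed value chosen to exceed the 160 MB file;
    # mapred.min.split.size is the MR1 property name, mapreduce.input.fileinputformat.split.minsize
    # the MR2 name, and HiveInputFormat avoids CombineHiveInputFormat's split merging.
    hive -e "
    SET mapred.min.split.size=536870912;
    SET mapreduce.input.fileinputformat.split.minsize=536870912;
    SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    SELECT COUNT(*) FROM T_A;
    "

If the two counts then match, a longer-term option (again an assumption, not something from the ticket) is to keep the upstream job producing files under the HDFS block size, as the reporter's fix effectively did, or to recompress oversized outputs with a splittable codec such as bzip2.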