[ https://issues.apache.org/jira/browse/HIVE-5590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
eye updated HIVE-5590:
----------------------

    Description: 
We occasionally have compressed files larger than 160 MB in .deflate format. One such file was loaded into Hive through an external table, say T_A. When we run "select count(*) from T_A" we get roughly 70% more records than "hadoop fs -text /xxxxx | wc -l" reports for the same file. Any clue what is going on? How could this happen?

The oversized .deflate file was the result of imperfect upstream processing; once we fixed that and produced files smaller than 64 MB, the problem stopped occurring. But since we cannot guarantee that a larger file will never show up again, is there any way to avoid this issue?

Cheers!
eye

  was:
We occasionally have compressed files larger than 160 MB in .deflate format. One such file was loaded into Hive through an external table, say T_A. When we run "select count(*) from T_A" we get roughly 70% more records than "hadoop fs -text /xxxxx | wc -l" reports for the same file. Any clue what is going on?

The oversized .deflate file was the result of imperfect upstream processing; once we fixed that and produced files smaller than 64 MB, the problem stopped occurring. But since we cannot guarantee that a larger file will never show up again, is there any way to avoid this issue?

Cheers!
eye


> select returns duplicated records in Hive when a .deflate file larger than
> 64 MB is loaded into a Hive table
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5590
>                 URL: https://issues.apache.org/jira/browse/HIVE-5590
>             Project: Hive
>          Issue Type: Bug
>         Environment: cdh4
>            Reporter: eye
>              Labels: 64M, count(*), duplicated, hdfs, hive, records
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> We occasionally have compressed files larger than 160 MB in .deflate format. One such file was loaded into Hive
> through an external table, say T_A.
> When we run "select count(*) from T_A" we get roughly 70% more records than "hadoop fs -text /xxxxx | wc -l"
> reports for the same file.
> Any clue what is going on? How could this happen?
> The oversized .deflate file was the result of imperfect upstream processing; once we fixed that and produced
> files smaller than 64 MB, the problem stopped occurring. But since we cannot guarantee that a larger file will
> never show up again, is there any way to avoid this issue?
> Cheers!
> eye


--
This message was sent by Atlassian JIRA
(v6.1#6144)
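
A hedged sketch of one possible workaround, assuming the extra rows come from Hive splitting the non-splittable .deflate file across several map tasks (a common cause of this symptom, though not confirmed in this ticket). The table name T_A and the /xxxxx placeholder are taken from the report; the 512 MB minimum split size is an illustrative assumption, and the properties shown are standard Hadoop/Hive settings rather than anything specific to this issue.

    # Baseline count straight from HDFS, as in the report (/xxxxx left as the report's placeholder)
    hadoop fs -text /xxxxx | wc -l

    # Same count through Hive with splitting of large files suppressed for the session.
    # 536870912 (512 MB) is an assumed value chosen to exceed the 160 MB file;
    # mapred.min.split.size is the MR1 property name, mapreduce.input.fileinputformat.split.minsize
    # the MR2 name, and HiveInputFormat avoids CombineHiveInputFormat's split merging.
    hive -e "
    SET mapred.min.split.size=536870912;
    SET mapreduce.input.fileinputformat.split.minsize=536870912;
    SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    SELECT COUNT(*) FROM T_A;
    "

If the two counts then match, a longer-term option (again an assumption, not something from the ticket) is to keep the upstream job producing files under the HDFS block size, as the reporter's fix effectively did, or to recompress oversized outputs with a splittable codec such as bzip2.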