The copy_N naming should have been fixed in 0.8.0: https://issues.apache.org/jira/browse/HIVE-2296
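
If upgrading isn't an option right away, one workaround is the existence check Gabi mentions below: test whether a file with the same name is already in the partition's directory before loading, so a duplicate shows up as an explicit skip instead of a silent _copy_N. A rough, untested sketch in bash; the warehouse path here is an assumption, so check the partition's real LOCATION (e.g. with DESCRIBE FORMATTED) first:

  # Skip the load if a file with the same name is already in the partition dir.
  # PART_DIR is assumed; verify it against the partition's LOCATION.
  PART_DIR=/user/hive/warehouse/logs/ds=2012-03-19/hr=23
  FILE=test_b.bz2
  if hadoop fs -test -e "$PART_DIR/$(basename "$FILE")"; then
    echo "duplicate: $FILE already loaded, skipping" >&2
  else
    /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH '$FILE' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
  fi

It costs one extra namenode round trip per file, which should be small next to the load itself.
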
On Tue, Mar 20, 2012 at 4:12 AM, Sean McNamara <sean.mcnam...@webtrends.com> wrote:
> Gabi-
>
> Glad to know I'm not the only one scratching my head on this one! The
> changed behavior caught us off guard.
>
> I haven't found a solution in my sleuthing tonight. Indeed, any help would
> be greatly appreciated on this!
>
> Sean
>
> From: Gabi D <gabi...@gmail.com>
> Reply-To: <user@hive.apache.org>
> Date: Tue, 20 Mar 2012 10:03:04 +0200
> To: <user@hive.apache.org>
> Subject: Re: LOAD DATA problem
>
> Hi Vikas,
> we are facing the same problem that Sean reported and have also noticed that
> this behavior changed with a newer version of Hive. Previously, when you
> inserted a file with the same name into a partition/table, Hive would fail
> the request (with yet another of its cryptic messages, an issue in itself),
> while now it loads the file and appends _copy_N to the suffix.
> I have to say that, normally, we do not check for the existence of a file
> with the same name in our HDFS directories. Our files arrive with unique
> names, and if we try to insert the same file again it is because of some
> failure in one of the steps in our flow (e.g., files that were handled and
> loaded into Hive were not removed from our work directory for some reason,
> so in the next run of our load process they were reloaded). We do not want
> to add a step that checks whether a file with the same name already exists
> in HDFS: this is costly and most of the time (hopefully all of it)
> unnecessary. What we would like is to get some 'duplicate file' error and
> be able to disregard it, knowing that the file is already safely in its
> place.
> Note that having duplicate files causes us to double-count rows, which is
> unacceptable for many applications.
> Moreover, we use gz files, and since this behavior changes the suffix of the
> file (from gz to gz_copy_N), when this happens we seem to get all sorts of
> strange data, because Hadoop can't recognize that it is a zipped file and
> does not decompress it before reading it.
> Any help or suggestions on this issue would be much appreciated; we have
> been unable to find any so far.
>
>
> On Tue, Mar 20, 2012 at 9:29 AM, hadoop hive <hadooph...@gmail.com> wrote:
>>
>> Hey Sean,
>>
>> It's because you are appending a file into the same partition with the
>> same name (which is not possible); you must change the file name before
>> appending it into the same partition.
>>
>> AFAIK, I don't think there is any other way to do that: you can change
>> either the partition name or the file name.
>>
>> Thanks
>> Vikas Srivastava
>>
>>
>> On Tue, Mar 20, 2012 at 6:45 AM, Sean McNamara
>> <sean.mcnam...@webtrends.com> wrote:
>>>
>>> Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1
>>> to logs that already exist in a partition? If the log is already in
>>> HDFS/Hive I'd rather it fail and give me a return code or output saying
>>> that the log already exists.
>>>
>>> For example, if I run these queries:
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>>
>>> I end up with:
>>> test_a.bz2
>>> test_b.bz2
>>> test_b_copy_1.bz2
>>> test_b_copy_2.bz2
>>>
>>> However, if I use OVERWRITE it will nuke all the data in the partition
>>> (including test_a.bz2) and I end up with just:
>>> test_b.bz2
>>>
>>> I recall that older versions of Hive would not do this. How do I handle
>>> this case? Is there a safe, atomic way to do this?
>>>
>>> Sean
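
On Vikas's rename-before-load suggestion in the quoted thread: if a reload of a same-named file really is intentional, the main thing is to keep the .gz/.bz2 suffix at the end of the new name, since Hadoop chooses the decompression codec by file extension (which is why the gz_copy_N files come back as unreadable data). A rough sketch, with a purely illustrative naming scheme:

  # Give an intentional reload a unique name but keep the .bz2 suffix so the
  # codec is still picked up by extension. The timestamp suffix is illustrative.
  SRC=test_b.bz2
  UNIQ="$(basename "$SRC" .bz2)_$(date +%s).bz2"
  cp "$SRC" "/tmp/$UNIQ"
  /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH '/tmp/$UNIQ' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
  rm -f "/tmp/$UNIQ"

This doesn't help with the accidental double loads Gabi describes, though; for those, the existence check above (or the 0.8.0 fix) seems like the only way to avoid double counting.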