Hi Edward, thanks for looking into this. What the fix in HIVE-2296 does is not so good. It kind of messes with my filename; it would be better to concatenate it as <filename>.copy_n.gz (rather than <filename>_copy_n.gz), but that request might be considered petty... Still, what I think Sean is asking for, and so am I, is the option to tell Hive to reject duplicate files altogether (preferably returning an error code). That could be through some addition to the syntax or a Hive setup parameter; it doesn't really matter which. I will also look into Hive query hooks as you suggested.
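To make the naming difference concrete, here is a small shell sketch of the three schemes that come up in this thread. The function name copy_name and the scheme labels are made up for illustration; this is not Hive's code, and it assumes the file name carries a single extension such as .gz or .bz2. Hadoop picks the decompression codec from the file name's extension, which is why a name ending in .gz_copy_1 is read without being decompressed.

```shell
# Hypothetical helper (not part of Hive): print the name a duplicate file
# would get under each naming scheme discussed in this thread.
#   $1 = original file name (assumed to have one extension, e.g. test_b.bz2)
#   $2 = copy number
#   $3 = scheme: suffix | underscore | dot
copy_name() {
  local name="$1" n="$2" scheme="$3"
  local base="${name%.*}"   # part before the last dot
  local ext="${name##*.}"   # part after the last dot
  case "$scheme" in
    # Older behavior described below: the extension itself is changed
    # (gz becomes gz_copy_N), so the codec is no longer recognized.
    suffix)     printf '%s_copy_%s\n' "$name" "$n" ;;
    # Behavior seen in Sean's listing: test_b_copy_1.bz2
    underscore) printf '%s_copy_%s.%s\n' "$base" "$n" "$ext" ;;
    # Variant suggested above: test_b.copy_1.bz2
    dot)        printf '%s.copy_%s.%s\n' "$base" "$n" "$ext" ;;
    *)          return 1 ;;
  esac
}
```

For example, `copy_name test_b.bz2 1 underscore` prints test_b_copy_1.bz2, the name Sean ended up with, while `copy_name test_b.bz2 1 dot` prints test_b.copy_1.bz2. Both keep the extension last, so the file is still recognized as compressed.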
On Tue, Mar 20, 2012 at 3:05 PM, Edward Capriolo <[email protected]> wrote:

> The copy_n should have been fixed in 0.8.0
>
> https://issues.apache.org/jira/browse/HIVE-2296
>
> On Tue, Mar 20, 2012 at 4:12 AM, Sean McNamara
> <[email protected]> wrote:
> > Gabi-
> >
> > Glad to know I'm not the only one scratching my head on this one! The
> > changed behavior caught us off guard.
> >
> > I haven't found a solution in my sleuthing tonight. Indeed, any help would
> > be greatly appreciated on this!
> >
> > Sean
> >
> > From: Gabi D <[email protected]>
> > Reply-To: <[email protected]>
> > Date: Tue, 20 Mar 2012 10:03:04 +0200
> > To: <[email protected]>
> > Subject: Re: LOAD DATA problem
> >
> > Hi Vikas,
> > we are facing the same problem that Sean reported and have also noticed that
> > this behavior changed with a newer version of hive. Previously, when you
> > inserted a file with the same name into a partition/table, hive would fail
> > the request (with yet another of its cryptic messages, an issue in itself)
> > while now it does load the file and adds the _copy_N addition to the suffix.
> > I have to say that, normally, we do not check for existance of a file with
> > the same name in our hdfs directories. Our files arrive with unique names
> > and if we try to insert the same file again it is because of some failure in
> > one of the steps in our flow (e.g., files that were handled and loaded into
> > hive have not been removed from our work directory for some reason hence in
> > the next run of our load process they were reloaded). We do not want to add
> > a step that checks whether a file with the same name already exists in hdfs
> > - this is costly and most of the time (hopefully all of it) unnecessary.
> > What we would like is to get some 'duplicate file' error and be able to
> > disregard it, knowing that the file is already safely in its place.
> > Note, that having duplicate files causes us to double count rows which is
> > unacceptable for many applications.
> > Moreover, we use gz files and since this behavior changes the suffix of the
> > file (from gz to gz_copy_N) when this happens we seem to be getting all
> > sorts of strange data since hadoop can't recognize that this is a zipped
> > file and does not decompress it before reading it ...
> > Any help or suggestions on this issue would be much appreciated, we have
> > been unable to find any so far.
> >
> >
> > On Tue, Mar 20, 2012 at 9:29 AM, hadoop hive <[email protected]> wrote:
> >>
> >> hey Sean,
> >>
> >> its becoz you are appending the file in same partition with the same
> >> name(which is not possible) you must change the file name before appending
> >> into same partition.
> >>
> >> AFAIK, i don't think that there is any other way to do that, either you
> >> can you partition name or the file name.
> >>
> >> Thanks
> >> Vikas Srivastava
> >>
> >>
> >> On Tue, Mar 20, 2012 at 6:45 AM, Sean McNamara
> >> <[email protected]> wrote:
> >>>
> >>> Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1
> >>> to logs that already exist in a partition? If the log is already in
> >>> hdfs/hive I'd rather it fail and give me an return code or output saying
> >>> that the log already exists.
> >>>
> >>> For example, if I run these queries:
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>>
> >>> I end up with:
> >>> test_a.bz2
> >>> test_b.bz2
> >>> test_b_copy_1.bz2
> >>> test_b_copy_2.bz2
> >>>
> >>> However, If I use OVERWRITE it will nuke all the data in the partition
> >>> (including test_a.bz2) and I end up with just:
> >>> test_b.bz2
> >>>
> >>> I recall that older versions of hive would not do this. How do I handle
> >>> this case? Is there a safe atomic way to do this?
> >>>
> >>> Sean
