Hi Andre,

The reason is that I want those partitions to feed into other queries. If the individual files are only a few MB, then performance will be sub-optimal. As far as I understand, the individual files need to be at least around 140MB for the map tasks to work properly.
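For reference, the INSERT OVERWRITE approach suggested further down the thread could look roughly like the sketch below. It is only a sketch under assumptions: the table name `export_table`, the data columns `col_a` and `col_b`, and the partition columns `year`, `month`, `day` are hypothetical (guessed from the shape of the S3 path), and whether the merge settings take effect for dynamic partitions on S3 with Hive 0.8.1 would need to be verified.

```sql
-- Hypothetical table and column names; adjust to the real schema.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Ask Hive to merge small output files at the end of the job.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;

-- Rewrite one partition onto itself so that its files are consolidated.
-- DISTRIBUTE BY sends all rows of a partition to the same reducer,
-- which should leave a single output file per partition.
INSERT OVERWRITE TABLE export_table PARTITION (year, month, day)
SELECT col_a, col_b, year, month, day
FROM export_table
WHERE year = '2014' AND month = '01' AND day = '23'
DISTRIBUTE BY year, month, day;
```

Running this as a separate compaction step after the daily jobs would keep the regular INSERT INTO queries unchanged, at the cost of rewriting the partition's data once more.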
------------------------------------
*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35

On Wed, Jan 29, 2014 at 2:53 AM, Andre Araujo <ara...@pythian.com> wrote:

> Why do you need exactly one file? This is transparent to Hive and it
> should treat it seamlessly. Unless you have external requirements (reading
> files from somewhere else) it shouldn't matter.
>
> HDFS support for file append is not a solid standard afaik, and will depend
> on the distribution and version you're using. In some versions file append
> is not available and the only way to add data to an existing Hive table is
> to create an additional file under the table's directory in HDFS. I haven't
> looked at the code but it may be that Hive developers chose this to be the
> default way for appending data so it works with all HDFS distributions and
> versions.
>
> If you need to merge multiple files under the same partition you can
> select everything from that partition and INSERT OVERWRITE the data again.
>
> But again, unless you have requirements external to Hive, you shouldn't be
> concerned about that.
>
>
> On 29 January 2014 11:32, Cosmin Cătălin Sanda <cosmincata...@gmail.com> wrote:
>
>> Hi Andre,
>>
>> So the thing is like this: the first time the query runs, it generates
>> one file per dynamic partition. The next time the query runs and it needs
>> to write to the same partition, it will generate another file instead of
>> merging with the existing one.
>>
>> Eg:
>> 1. The partitioned S3 path looks like this: s3://bucket/export/2014/01/23
>> 2. I run the query on some data and I ultimately end up having a file in
>> the above mentioned partition.
>> 3. I run the same query on some other data which ends up writing to the
>> same partition as above, only it doesn't take the existing file from there
>> and merge with it; it generates a second file in the same partition.
>>
>> On Wed, Jan 29, 2014 at 1:16 AM, Andre Araujo <ara...@pythian.com> wrote:
>>
>>> Hi, Cosmin,
>>>
>>> Have you tried using DISTRIBUTE BY to distribute the query's data by the
>>> partitioning columns?
>>> That way all the data for each partition should be sent to the same
>>> reducer and should be written to a single file in each partition, I think.
>>>
>>> If your data is being distributed by different criteria, you will
>>> potentially have multiple reducers writing to the same partitions.
>>>
>>> Andre
>>>
>>> On 29 January 2014 10:51, Cosmin Cătălin Sanda
>>> <cosmincata...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a number of Hive jobs that run during a day. Each individual
>>>> job is outputting data to Amazon S3. The Hive jobs use dynamic
>>>> partitioning.
>>>>
>>>> The problem is that when different jobs need to write to the same
>>>> dynamic partition, they will each generate one file.
>>>>
>>>> What I would like is for the subsequent jobs to load the existing data
>>>> and merge it with the new data. Can this be achieved somehow? Is there an
>>>> option that needs to be enabled? I already set:
>>>>
>>>> SET hive.merge.mapredfiles = true;
>>>> SET hive.exec.dynamic.partition = true;
>>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>>
>>>> I should mention that the query that actually outputs to S3 is an INSERT
>>>> INTO TABLE query. The Hive version is 0.8.1.
>>>>
>>>> Thank you,
>>>> Cosmin
>>>
>>> --
>>> André Araújo
>>> Big Data Consultant/Solutions Architect
>>> The Pythian Group - Australia - www.pythian.com
>>>
>>> Office (calls from within Australia): 1300 366 021 x1270
>>> Office (international): +61 2 8016 7000 x270 *OR* +1 613 565 8696 x1270
>>> Mobile: +61 410 323 559
>>> Fax: +61 2 9805 0544
>>> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>>>
>>> “Success is not about standing at the top, it's the steps you leave
>>> behind.” — Iker Pou (rock climber)