Hi Andre,

I think this is indeed the direction I am going to take, unless anyone else has other ideas :)
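For reference, a rough and untested sketch of what the periodic coalesce could look like. The table name (export_data), the data columns (col_a, col_b) and the partition column names are placeholders made up for illustration, and the partition columns are assumed to be strings; the real schema will differ. It combines the two suggestions from this thread: re-selecting the recent partitions with INSERT OVERWRITE, and DISTRIBUTE BY on the partitioning columns.

```sql
-- Hypothetical table and column names; adjust to the real schema.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Rewrite the recent partitions in place. With dynamic partitioning the
-- partition columns must come last in the SELECT, in the same order as in
-- the PARTITION clause.
-- DISTRIBUTE BY sends all rows of one partition to the same reducer, so each
-- rewritten partition should end up with a single file.
INSERT OVERWRITE TABLE export_data PARTITION (part_year, part_month, part_day)
SELECT col_a, col_b, part_year, part_month, part_day
FROM export_data
WHERE part_year = '2014' AND part_month = '01'
DISTRIBUTE BY part_year, part_month, part_day;
```

Only the partitions that match the WHERE clause are rewritten, so this could run once a week or month over the most recent partitions, as suggested.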
------------------------------------
*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35

On Wed, Jan 29, 2014 at 10:43 AM, Andre Araujo <ara...@pythian.com> wrote:

> Hi, Cosmin,
>
> Functionally the subsequent queries will work just fine (they will
> return the correct results). But you're correct in saying that it's not
> optimal.
> If the jobs always generate very small files you might end up with a huge
> number of small files, which will have an impact on the name node's memory
> usage as well.
> In that case I think you could periodically "coalesce" the recent
> partitions. Once a week/month you can select from the more recent
> partitions and insert overwrite, which will convert all those small files
> into bigger ones.
>
> However, if the jobs are creating files that are already around the
> cluster block size, it should be fine to leave them as is.
>
> Maybe someone else has some other ideas...
>
>
> On 29 January 2014 18:05, Cosmin Cătălin Sanda <cosmincata...@gmail.com> wrote:
>
>> Hi Andre,
>>
>> The reason is that I want those partitions to go into other queries. If
>> the individual files are only a few MB then the performance will be
>> sub-optimal. As far as I understood, the individual files need to be at
>> least around 140MB for the Maps to work properly.
>>
>>
>> On Wed, Jan 29, 2014 at 2:53 AM, Andre Araujo <ara...@pythian.com> wrote:
>>
>>> Why do you need exactly one file? This is transparent to Hive and it
>>> should treat it seamlessly. Unless you have external requirements (reading
>>> files from somewhere else) it shouldn't matter.
>>>
>>> HDFS support for file append is not a solid standard afaik, and will
>>> depend on the distribution and version you're using. In some versions file
>>> append is not available and the only way to add data to an existing Hive
>>> table is to create an additional file under the table's directory in HDFS.
>>> I haven't looked at the code but it may be that Hive developers chose this
>>> to be the default way for appending data so it works with all HDFS
>>> distributions and versions.
>>>
>>> If you need to merge multiple files under the same partition you can
>>> select everything from that partition and INSERT OVERWRITE the data again.
>>>
>>> But again, unless you have requirements external to Hive, you shouldn't
>>> be concerned about that.
>>>
>>>
>>> On 29 January 2014 11:32, Cosmin Cătălin Sanda
>>> <cosmincata...@gmail.com> wrote:
>>>
>>>> Hi Andre,
>>>>
>>>> So the thing is like this: the first time the query runs, it generates
>>>> one file per dynamic partition. The next time the query runs and it needs
>>>> to write to the same partition, it will generate another file instead of
>>>> merging with the existing one.
>>>>
>>>> Eg:
>>>> 1. The partitioned S3 path looks like this: s3://bucket/export/2014/01/23
>>>> 2. I run the query on some data and I ultimately end up having a file
>>>> in the above mentioned partition.
>>>> 3. I run the same query on some other data which ends up writing to the
>>>> same partition as above, but instead of taking the existing file and
>>>> merging with it, it generates a second file in the same partition.
>>>>
>>>>
>>>> On Wed, Jan 29, 2014 at 1:16 AM, Andre Araujo <ara...@pythian.com> wrote:
>>>>
>>>>> Hi, Cosmin,
>>>>>
>>>>> Have you tried using DISTRIBUTE BY to distribute the query's data by
>>>>> the partitioning columns?
>>>>> That way all the data for each partition should be sent to the same
>>>>> reducer and should be written to a single file in each partition, I think.
>>>>>
>>>>> If your data is being distributed by different criteria, you will
>>>>> potentially have multiple reducers writing to the same partitions.
>>>>>
>>>>> Andre
>>>>>
>>>>>
>>>>> On 29 January 2014 10:51, Cosmin Cătălin Sanda <
>>>>> cosmincata...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a number of Hive jobs that run during a day. Each individual
>>>>>> job is outputting data to Amazon S3. The Hive jobs use dynamic
>>>>>> partitioning.
>>>>>>
>>>>>> The problem is that when different jobs need to write to the same
>>>>>> dynamic partition, they will each generate one file.
>>>>>>
>>>>>> What I would like is for the subsequent jobs to load the existing
>>>>>> data and merge it with the new data. Can this be achieved somehow? Is
>>>>>> there an option that needs to be enabled? I already set:
>>>>>>
>>>>>> SET hive.merge.mapredfiles = true;
>>>>>> SET hive.exec.dynamic.partition = true;
>>>>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>>>>
>>>>>> I should mention that the query that actually outputs to S3 is an INSERT
>>>>>> INTO TABLE query. The Hive version is 0.8.1.
>>>>>>
>>>>>>
>>>>>> Thank you,
>>>>>> Cosmin
>>>>>>
>>>>>
>>>>> --
>>>>> André Araújo
>>>>> Big Data Consultant/Solutions Architect
>>>>> The Pythian Group - Australia - www.pythian.com
>>>>>
>>>>> Office (calls from within Australia): 1300 366 021 x1270
>>>>> Office (international): +61 2 8016 7000 x270 *OR* +1 613 565 8696
>>>>> x1270
>>>>> Mobile: +61 410 323 559
>>>>> Fax: +61 2 9805 0544
>>>>> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>>>>>
>>>>> “Success is not about standing at the top, it's the steps you leave
>>>>> behind.” — Iker Pou (rock climber)