Re: Hive dynamic partitions generate multiple files

2014-01-29 Thread Cosmin Cătălin Sanda
Hi Andre, I think this is indeed the direction in which I am going to go, unless anyone else has some other ideas :)

Re: Hive dynamic partitions generate multiple files

2014-01-29 Thread Andre Araujo
Hi, Cosmin, Functionally the subsequent queries will work just fine (they will return the correct results). But you're correct in saying that it's not optimal. If the jobs always generate very small files you might end up with a huge number of small files, which will have an impact on the name

Re: Hive dynamic partitions generate multiple files

2014-01-28 Thread Cosmin Cătălin Sanda
Hi Andre, The reason is that I want those partitions to go into other queries. If the individual files are only a few MB then the performance will be sub-optimal. As far as I understood, the individual files need to be at least around 140MB for the Maps to work properly. -

Re: Hive dynamic partitions generate multiple files

2014-01-28 Thread Andre Araujo
Why do you need exactly one file? This is transparent to Hive and it should treat it seamlessly. Unless you have external requirements (reading files from somewhere else) it shouldn't matter. HDFS support for file append is not a solid standard afaik, and will depend on the distribution and version

Re: Hive dynamic partitions generate multiple files

2014-01-28 Thread Cosmin Cătălin Sanda
Hi Andre, So the thing is like this: the first time the query runs, it generates one file per dynamic partition. The next time the query runs and it needs to write to the same partition, it will generate another file instead of merging with the existing one. E.g.: 1. The partitioned S3 path looks li

Re: Hive dynamic partitions generate multiple files

2014-01-28 Thread Andre Araujo
Hi, Cosmin, Have you tried using DISTRIBUTE BY to distribute the query's data by the partitioning columns? That way all the data for each partition should be sent to the same reducer and should be written to a single file in each partition, I think. If your data is being distributed by a differen
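The DISTRIBUTE BY idea in this message can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Hive's implementation: the function names, the `dt` partition column, and the sample rows are hypothetical, and CRC32 stands in for whatever hash Hive uses internally. It shows that routing rows by a hash of the partition value sends each partition to exactly one reducer (so one output file per partition per job), while routing that ignores the partition value spreads a partition across several reducers.

```python
import zlib
from collections import defaultdict


def reducer_for(partition_value, num_reducers):
    # Stand-in for hash-based routing (assumption: Hive's real hash differs).
    return zlib.crc32(partition_value.encode("utf-8")) % num_reducers


def writers_with_distribute_by(rows, partition_col, num_reducers):
    """Map each partition value to the set of reducers that would write to it
    when rows are routed by a hash of the partition column."""
    writers = defaultdict(set)
    for row in rows:
        key = row[partition_col]
        writers[key].add(reducer_for(key, num_reducers))
    return dict(writers)


def writers_round_robin(rows, partition_col, num_reducers):
    """Same mapping, but rows are routed by position, ignoring the partition."""
    writers = defaultdict(set)
    for i, row in enumerate(rows):
        writers[row[partition_col]].add(i % num_reducers)
    return dict(writers)


# Hypothetical sample data: three date partitions, 40 rows each, interleaved.
rows = [
    {"dt": "2014-01-%02d" % day, "value": i}
    for i in range(40)
    for day in (27, 28, 29)
]

by_partition = writers_with_distribute_by(rows, "dt", num_reducers=4)
by_position = writers_round_robin(rows, "dt", num_reducers=4)

for dt in sorted(by_partition):
    print(dt, "hash-routed writers:", len(by_partition[dt]),
          "| position-routed writers:", len(by_position[dt]))
```

Each reducer that receives rows for a partition writes its own file there, so the number of writers per partition corresponds to the number of files produced by a single job run. This does not address files accumulating across separate runs, which is the issue raised in the reply to this message.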

Hive dynamic partitions generate multiple files

2014-01-28 Thread Cosmin Cătălin Sanda
Hi, I have a number of Hive jobs that run during a day. Each individual job is outputting data to Amazon S3. The Hive jobs use dynamic partitioning. The problem is that when different jobs need to write to the same dynamic partition, they will each generate one file. What I would like is for th