Hi Andre,
I think this is indeed the direction in which I am going to go, unless
anyone else has some other ideas :)
*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35
On Wed, Jan 29, 2014 at 10:43 AM, Andre Araujo wrote:
Hi, Cosmin,
Functionally the subsequent queries will work just fine (they will
return the correct results). But you're correct in saying that it's not
optimal.
If the jobs always generate very small files you might end up with a huge
number of small files, which will have an impact on the name
Hi Andre,
The reason is that I want those partitions to go into other queries. If the
individual files are only a few MB then the performance will be
sub-optimal. As far as I understood, the individual files need to be at
least around 140MB for the Maps to work properly.
-
Why do you need exactly one file? This is transparent to Hive and it should
treat it seamlessly. Unless you have external requirements (reading files
from somewhere else) it shouldn't matter.
HDFS support for file append is not a solid standard afaik, and will depend
on the distribution and version
Hi Andre,
So the thing is like this: the first time the query runs, it generates one
file per dynamic partition. The next time the query runs and it needs to
write to the same partition, it will generate another file instead of
merging with the existing one.
Eg:
1.The partitioned S3 path looks li
Hi, Cosmin,
Have you tried using DISTRIBUTE BY to distribute the query's data by the
partitioning columns?
That way all the data for each partition should be sent to the same reducer
and should be written to a single file in each partition, I think.
If your data is being distributed by a differen
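To illustrate the DISTRIBUTE BY suggestion above, here is a rough Python
simulation (not Hive itself). It assumes each reducer writes one file per
partition it receives rows for, and it uses a CRC32 hash as a stand-in for
however Hive actually assigns rows to reducers. The function name, the sample
rows and the partition values "p1" and "p2" are all made up for the example.

```python
from collections import defaultdict
import zlib


def files_per_partition(rows, num_reducers, route_key):
    """Count output files per partition for one simulated job run.

    rows: iterable of (partition_value, payload) tuples (hypothetical data).
    route_key: function returning the value used to pick a reducer for a row.
    Assumption: a reducer writes exactly one file per partition it sees.
    """
    reducers_seen = defaultdict(set)
    for partition, payload in rows:
        key = str(route_key(partition, payload)).encode("utf-8")
        # Stand-in hash; Hive's real reducer assignment may differ.
        reducer = zlib.crc32(key) % num_reducers
        reducers_seen[partition].add(reducer)
    return {p: len(r) for p, r in reducers_seen.items()}


# Two hypothetical partitions with 50 rows each.
rows = [("p1", i) for i in range(50)] + [("p2", i) for i in range(50)]

# Routing on the partition column: every row of a partition reaches the
# same reducer, so each partition gets a single file in this run.
by_partition = files_per_partition(rows, 4, lambda part, payload: part)

# Routing on some other column: a partition's rows can be spread over
# several reducers, giving up to num_reducers files per partition.
by_payload = files_per_partition(rows, 4, lambda part, payload: payload)

print(by_partition)
print(by_payload)
```

This only models a single run. As described elsewhere in the thread, a later
run that writes to the same partition would still add another file next to
the existing one.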
Hi,
I have a number of Hive jobs that run during a day. Each individual job is
outputting data to Amazon S3. The Hive jobs use dynamic partitioning.
The problem is that when different jobs need to write to the same dynamic
partition, they will each generate one file.
What I would like is for th