I've noticed that it takes a while for each map task to be set up in Hive, and with the way I set up the job there were as many map tasks as files/buckets.
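One thing worth trying is to have Hive pack small files into larger input splits so the map count no longer tracks the file count. A minimal sketch, assuming CombineHiveInputFormat is available in your Hive build (the split sizes are guesses; tune them for your cluster):

    -- combine many small files into fewer, larger input splits
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- upper bound on a combined split, in bytes (~256 MB here)
    set mapred.max.split.size=256000000;
    -- prefer grouping files that sit on the same node/rack
    set mapred.min.split.size.per.node=128000000;
    set mapred.min.split.size.per.rack=128000000;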
I read a recommendation somewhere to design jobs such that they take at least a minute.

Cheers,
-Ajo.

On Mon, Jan 31, 2011 at 8:08 AM, <hi...@gmx.de> wrote:
> Hello,
>
> I would like to do reporting with Hive on something like tracking data.
> The raw data, about 2 GB or more a day, is what I want to query with Hive.
> This already works for me, no problem.
> I also want to cascade the reporting data down to something like client and
> date, in Hive something like partitioned by (client String, date String).
> That means I have multiple aggregation levels. I would like to do all levels in
> Hive so I have a consistent reporting source.
> And here is the thing: might it be a problem if it comes to many small files?
> The aggregation level client/date, for example, might produce files of about
> 1 MB, around 1000 of them a day.
> Is this a problem? I have read about the "too many open files" problem with
> Hadoop. And might this lead to bad Hive/MapReduce performance?
> Maybe someone has some clues for that...
>
> Thanks in advance
> labtrax
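Regarding the layout labtrax describes, the partitioned table plus a per-partition aggregation might look roughly like the sketch below. The table and column names are made up for illustration; the merge settings ask Hive to concatenate small result files after the job, which should ease the 1000-files-a-day concern:

    -- hypothetical aggregate table, one partition per client/date
    CREATE TABLE report_client_date (
      page STRING,
      hits BIGINT
    )
    PARTITIONED BY (client STRING, dt STRING);

    -- merge small output files after the aggregation job runs
    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;

    -- fill one partition from a hypothetical raw table
    INSERT OVERWRITE TABLE report_client_date
      PARTITION (client='acme', dt='2011-01-31')
    SELECT page, count(1)
    FROM raw_tracking
    WHERE client = 'acme' AND dt = '2011-01-31'
    GROUP BY page;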