I've noticed that it takes a while for each map task to be set up in Hive, and with the way I set up the job there were as many map tasks as files/buckets.
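One thing worth trying is to have Hive pack small files into larger input splits so the map count no longer tracks the file count. A minimal sketch, assuming CombineHiveInputFormat is available in your Hive build (the split sizes are guesses; tune them for your cluster):

    -- combine many small files into fewer, larger input splits
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- upper bound on a combined split, in bytes (~256 MB here)
    set mapred.max.split.size=256000000;
    -- prefer grouping files that sit on the same node/rack
    set mapred.min.split.size.per.node=128000000;
    set mapred.min.split.size.per.rack=128000000;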
I read a recommendation somewhere to design jobs such that they take at least a minute.

Cheers,
-Ajo.

On Mon, Jan 31, 2011 at 8:08 AM, <hi...@gmx.de> wrote:
> Hello,
>
> I would like to do reporting with Hive on something like tracking data.
> The raw data, about 2 GB or more a day, is what I want to query with Hive.
> This already works for me, no problem.
> I also want to cascade the reporting data down to something like client and
> date, in Hive something like partitioned by (client String, date String).
> That means I have multiple aggregation levels. I would like to do all levels in
> Hive so I have a consistent reporting source.
> And here is the thing: might it be a problem if it comes to many small files?
> The aggregation level client/date, for example, might produce files of about
> 1 MB, around 1000 of them a day.
> Is this a problem? I have read about the "too many open files" problem with
> Hadoop. And might this lead to bad Hive/MapReduce performance?
> Maybe someone has some clues for that...
>
> Thanks in advance
> labtrax
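Regarding the layout labtrax describes, the partitioned table plus a per-partition aggregation might look roughly like the sketch below. The table and column names are made up for illustration; the merge settings ask Hive to concatenate small result files after the job, which should ease the 1000-files-a-day concern:

    -- hypothetical aggregate table, one partition per client/date
    CREATE TABLE report_client_date (
      page STRING,
      hits BIGINT
    )
    PARTITIONED BY (client STRING, dt STRING);

    -- merge small output files after the aggregation job runs
    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;

    -- fill one partition from a hypothetical raw table
    INSERT OVERWRITE TABLE report_client_date
      PARTITION (client='acme', dt='2011-01-31')
    SELECT page, count(1)
    FROM raw_tracking
    WHERE client = 'acme' AND dt = '2011-01-31'
    GROUP BY page;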