Hello,

I'd like to do reporting with Hive on something like tracking data. The raw data is about 2 GB or more per day, and querying it with Hive already works fine for me. I also want to cascade the reporting data down to levels like client/date, i.e. tables in Hive partitioned by (client STRING, date STRING). That means I have multiple aggregation levels, and I'd like to keep all of them in Hive so there is a single consistent reporting source.

And here is the thing: could this become a problem when it comes to many small files? An aggregation level such as client/date might produce around 1000 files of about 1 MB each per day. Is this a problem? I have read about the "too many open files" problem with Hadoop. Could this also lead to bad Hive/MapReduce performance? Maybe someone has some clues for that...
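To make the setup concrete, here is a rough sketch of what I have in mind (table and column names are made up for illustration; the `hive.merge.*` settings are Hive's built-in small-file merging options, whose defaults vary by version):

```sql
-- Hypothetical aggregation table, partitioned by client and date.
-- ("dt" instead of "date" to avoid clashing with the keyword.)
CREATE TABLE report_client_daily (
  hits BIGINT,
  visitors BIGINT
)
PARTITIONED BY (client STRING, dt STRING);

-- Hive can merge small output files at the end of a job:
SET hive.merge.mapfiles=true;      -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;   -- merge small files from map-reduce jobs
SET hive.merge.smallfiles.avgsize=16000000;  -- merge if avg file size is below ~16 MB
SET hive.merge.size.per.task=256000000;      -- target size of the merged files

-- One daily aggregation run per client/date partition
-- (raw_tracking and visitor_id are placeholder names):
INSERT OVERWRITE TABLE report_client_daily
  PARTITION (client='acme', dt='2011-01-01')
SELECT count(*), count(DISTINCT visitor_id)
FROM raw_tracking
WHERE client='acme' AND dt='2011-01-01';
```

My worry is about what accumulates under each partition directory over many such runs.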
Thanks in advance,
labtrax