Hello,

I would like to do reporting with Hive on something like tracking data.
The raw data, about 2 GB or more per day, I want to query with Hive.
This already works for me, no problem.
I also want to cascade the reporting data down to levels like client and date,
using something in Hive like PARTITIONED BY (client STRING, date STRING).
That means I have multiple aggregation levels. I would like to build all levels
in Hive so that I have one consistent reporting source.
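To make the setup concrete, here is a minimal sketch of what I have in mind. All table and column names are just examples, and I use `dt` instead of `date` to avoid clashing with the keyword:

```sql
-- Hypothetical aggregation table, one partition per client/day
CREATE TABLE report_client_daily (
  page_views BIGINT,
  visits     BIGINT
)
PARTITIONED BY (client STRING, dt STRING);

-- Daily aggregation run, writing one client/date partition
INSERT OVERWRITE TABLE report_client_daily
  PARTITION (client = 'acme', dt = '2011-06-01')
SELECT COUNT(1), COUNT(DISTINCT session_id)
FROM raw_tracking
WHERE client = 'acme' AND dt = '2011-06-01';
```

Each such INSERT would write its own partition directory, which is where the many small files would come from.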
And here is the thing: could this become a problem once there are many small files?
An aggregation level such as client/date might produce files of about 1 MB each,
around 1000 of them per day.
Is this a problem? I have read about the "too many open files" problem with Hadoop.
And could this lead to bad Hive/MapReduce performance?
Maybe someone has some clues about this...

Thanks in advance
labtrax
