On Mon, Jan 31, 2011 at 11:08 AM, <hi...@gmx.de> wrote:
> Hello,
>
> I would like to do reporting with Hive on something like tracking data.
> The raw data, which is about 2 GB or more a day, is what I want to query
> with Hive. This already works for me, no problem.
> I also want to cascade the reporting data down to something like client and
> date, in Hive something like partitioned by (client String, date String).
> That means I have multiple aggregation levels. I would like to do all levels
> in Hive so I have a consistent reporting source.
> And here is the thing: might it become a problem if it leads to many small
> files? An aggregation level such as client/date might produce files of about
> 1 MB each, and around 1000 of them a day.
> Is this a problem? I read about the "too many open files" problem with Hadoop.
> And might this lead to bad Hive/MapReduce performance?
> Maybe someone has some clues on that...
>
> Thanks in advance
> labtrax
You probably do not want to partition on something with high cardinality such as client_id. Many small partitions are bad for the NameNode and bad for MapReduce performance. If you have 1000 client ids, that is 1000+ files per day, and that is trouble over a long period of time. One option is to bucket the table on client_id, say into 64 buckets. Hive can use the buckets to prune the amount of data that gets table-scanned for a query. It is a compromise between many small files and really large files. Generally you want big files so Hadoop can use brute-force table scans.

Edward
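To make the suggestion concrete, here is a rough sketch of what a daily-partitioned, client-bucketed table could look like. The table and column names (tracking, raw_tracking, client_id, url, ts, dt) and the bucket count of 64 are only placeholders for illustration, not anything from the original thread:

    CREATE TABLE tracking (
      client_id STRING,
      url       STRING,
      ts        BIGINT
    )
    PARTITIONED BY (dt STRING)                 -- one partition per day: low cardinality
    CLUSTERED BY (client_id) INTO 64 BUCKETS   -- clients spread over a fixed number of files
    STORED AS SEQUENCEFILE;

    -- have Hive enforce the bucketing when loading
    SET hive.enforce.bucketing = true;

    INSERT OVERWRITE TABLE tracking PARTITION (dt = '2011-01-31')
    SELECT client_id, url, ts
    FROM raw_tracking
    WHERE dt = '2011-01-31';

    -- reading only the bucket a given client hashes into, instead of the whole partition
    SELECT * FROM tracking TABLESAMPLE(BUCKET 1 OUT OF 64 ON client_id)
    WHERE dt = '2011-01-31';

This way each day adds 64 files per partition regardless of how many clients you have, instead of one or more files per client per day.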