On Mon, Jan 31, 2011 at 11:08 AM,  <hi...@gmx.de> wrote:
> Hello,
>
> I would like to do reporting with Hive on something like tracking data.
> The raw data, which is about 2 GB or more a day, I want to query with Hive.
> This already works for me, no problem.
> I also want to cascade the reporting data down to something like client and
> date, i.e. in Hive something like PARTITIONED BY (client STRING, date STRING).
> That means I have multiple aggregation levels. I would like to do all levels in
> Hive for a consistent reporting source.
> And here is the thing: might it be a problem if it comes to many small files?
> The aggregation level e.g. client/date might produce files of about 1 MB, and
> around 1000 of them a day.
> Is this a problem? I read about the "too many open files" problem with Hadoop.
> And might this lead to bad Hive/MapReduce performance?
> Maybe someone has some clues on this...
>
> Thanks in advance
> labtrax
>

You probably do not want to partition on something with high
cardinality such as client_id. Many small partitions are bad for the
NameNode and bad for MapReduce performance. So if you have 1000 client
ids, that is 1000+ files per day, and that is trouble over a long
period of time.
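
For example, something along these lines (a rough sketch, table and
column names are made up) keeps the partition count at one per day and
leaves client_id as an ordinary column:

  -- Hypothetical table: partition only on the date so the partition
  -- count stays small; client_id is just a regular column.
  CREATE TABLE tracking (
    client_id STRING,
    event     STRING,
    payload   STRING
  )
  PARTITIONED BY (dt STRING)
  STORED AS SEQUENCEFILE;

  -- Queries that filter on dt still get partition pruning:
  SELECT count(*) FROM tracking WHERE dt = '2011-01-31' AND client_id = '42';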

One option is to bucket on client_id, say into 64 buckets. Hive can
use the buckets to prune the amount of data that gets table-scanned.
It is a compromise between many small files and a few really large
files.
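
A rough sketch of the bucketed variant (again the names are made up,
and 64 is just an example bucket count):

  -- Hypothetical bucketed table: within each daily partition, rows are
  -- hashed on client_id into 64 files instead of one file per client.
  CREATE TABLE tracking_bucketed (
    client_id STRING,
    event     STRING,
    payload   STRING
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (client_id) INTO 64 BUCKETS
  STORED AS SEQUENCEFILE;

  -- Make inserts honor the CLUSTERED BY clause:
  SET hive.enforce.bucketing = true;
  INSERT OVERWRITE TABLE tracking_bucketed PARTITION (dt = '2011-01-31')
  SELECT client_id, event, payload FROM tracking WHERE dt = '2011-01-31';

  -- TABLESAMPLE can then scan only the bucket a client_id hashes into:
  SELECT count(*)
  FROM tracking_bucketed TABLESAMPLE (BUCKET 1 OUT OF 64 ON client_id)
  WHERE dt = '2011-01-31';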

Generally you want big files so Hadoop can use brute-force table scans.
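
If the per-client/date aggregation jobs still produce lots of small
output files, Hive's merge settings can coalesce them into bigger ones
(these are the usual knobs; exact behavior and defaults depend on your
Hive version):

  -- Merge small output files at the end of a job:
  SET hive.merge.mapfiles = true;            -- merge map-only job output
  SET hive.merge.mapredfiles = true;         -- merge map-reduce job output
  SET hive.merge.size.per.task = 256000000;  -- aim for ~256 MB merged files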

Edward
