Hive File Sizes, Merging, and Splits

John Omernik Tue, 25 Sep 2012 11:57:40 -0700

I am really struggling to make heads or tails of how to optimize the data
in my tables for the best query times.  I have a partition that is
compressed (Gzip) RCFile data in two files:


total 421877
263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 000000_0
158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 000001_0



No matter what I set my split settings to prior to the job, I always get
three mappers.  My block size is 268435456 but the setting doesn't seem to
change anything. I can set the split size huge or small with no apparent
effect on the number of mappers.
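
As a rough check on the numbers above, here is a minimal Python sketch that
counts how many HDFS blocks each file would occupy. It assumes both files
were written with the 268435456-byte block size mentioned above; the helper
name blocks_needed is made up for illustration. It only counts blocks and
does not model how Hive's input format actually builds splits.

```python
BLOCK_SIZE = 268435456  # block size quoted in the message (256 MiB)

# File sizes in bytes, copied from the directory listing above.
FILE_SIZES = {"000000_0": 270044140, "000001_0": 161956948}


def blocks_needed(size_bytes, block_size=BLOCK_SIZE):
    """Hypothetical helper: number of blocks a file of this size occupies."""
    # Integer ceiling division, avoids floating-point rounding.
    return -(-size_bytes // block_size)


blocks_per_file = {name: blocks_needed(size) for name, size in FILE_SIZES.items()}
total_blocks = sum(blocks_per_file.values())

# How far the larger file runs past a single block.
overhang = FILE_SIZES["000000_0"] - BLOCK_SIZE

print(blocks_per_file)
print("total blocks:", total_blocks)
print("overhang bytes:", overhang)
```

Under that assumption the first file runs 1608684 bytes past one block, so
the partition spans three blocks in total. That might be related to the
three mappers, but it is only a guess and is not confirmed here.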


I know there are many esoteric items here, but is there any good
documentation on setting these things to make my queries on this data more
efficient?  I am not sure why it needs three map tasks on this data; it
should really just grab two mappers. Not to mention, I thought gzip wasn't
splittable anyhow.  So, from that standpoint, how does it even send data to
three mappers?  If you know of some secret cache of documentation for Hive,
I'd love to read it.

Thanks
