Isn't there an overhead associated with each map task? Based on that, my hypothesis is that if I pay attention to my data, merge small files after load, and ensure split sizes are close to file sizes, I can keep the number of map tasks to an absolute minimum.
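To make the split-size reasoning concrete, here is a minimal sketch of how the map-task count could be estimated. It assumes the splitting rule of Hadoop's old-API FileInputFormat (split size = max(min size, min(goal size, block size)), with a 10% slop on the last chunk) and a split-count hint of 2. Both are assumptions on my part: Hive may go through a different input format (for example a combining one), and the function and parameter names below are made up for illustration.

```python
# Sketch only: mimics the old-API FileInputFormat split arithmetic.
# Assumption: this is the code path in play; Hive may use another input format.

SPLIT_SLOP = 1.1  # a trailing chunk up to 10% over the split size is not split again


def compute_split_size(goal_size, min_size, block_size):
    """Split size = max(min_size, min(goal_size, block_size))."""
    return max(min_size, min(goal_size, block_size))


def count_splits(file_sizes, block_size, num_splits_hint=2, min_size=1):
    """Estimate the number of input splits (one map task per split).

    num_splits_hint is a hypothetical stand-in for the requested number of
    map tasks; empty files are ignored in this sketch.
    """
    total_size = sum(file_sizes)
    goal_size = total_size // (num_splits_hint or 1)
    split_size = compute_split_size(goal_size, min_size, block_size)

    splits = 0
    for size in file_sizes:
        remaining = size
        # Carve off full-size splits while the remainder exceeds the slop.
        while remaining / split_size > SPLIT_SLOP:
            splits += 1
            remaining -= split_size
        # Whatever is left becomes one final split for this file.
        if remaining != 0:
            splits += 1
    return splits


if __name__ == "__main__":
    files = [270044140, 161956948]   # the two file sizes from the quoted listing
    block = 268435456                # the block size from the quoted message
    print(count_splits(files, block, num_splits_hint=2))
    print(count_splits(files, block, num_splits_hint=1))
```

Under these assumptions the goal size (total bytes divided by the hint) ends up smaller than the block size, so the larger file is cut in two and the two files yield three splits; with a hint of 1 they yield two. That would be consistent with the three mappers described in the quoted message, but I have not confirmed that this is the path Hive actually takes.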

On Tue, Sep 25, 2012 at 2:35 PM, Connell, Chuck <chuck.conn...@nuance.com> wrote:

> Why do you think the current generated code is inefficient?
>
> From: John Omernik [mailto:j...@omernik.com]
> Sent: Tuesday, September 25, 2012 2:57 PM
> To: user@hive.apache.org
> Subject: Hive File Sizes, Merging, and Splits
>
> I am really struggling to make heads or tails of how to
> optimize the data in my tables for best query times. I have a partition
> that is compressed (Gzip) RCFile data in two files:
>
> total 421877
> 263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 000000_0
> 158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 000001_0
>
> No matter what I set my split settings to prior to the job, I always get
> three mappers. My block size is 268435456, but the setting doesn't seem to
> change anything. I can set split size huge or small with no apparent effect
> on the data.
>
> I know there are many esoteric items here, but is there any good
> documentation on setting these things to make my queries on this data more
> efficient? I am not sure why it needs three map tasks on this data; it
> should really just grab two mappers. Not to mention, I thought gzip wasn't
> splittable anyhow. So, from that standpoint, how does it even send data to
> three mappers? If you know of some secret cache of documentation for Hive,
> I'd love to read it.
>
> Thanks