Isn't there an overhead associated with each map task?  Based on that, my
hypothesis is that if I pay attention to my data, merge small files after
load, and ensure split sizes are close to file sizes, I can keep the
number of map tasks to an absolute minimum.
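
As a rough sketch of the arithmetic behind that hypothesis (an assumption,
not something confirmed in this thread): if every HDFS block of a file
becomes its own split, the two files in the listing quoted below would
yield three splits, because the first file is slightly larger than one
268435456-byte block. The function names are made up for illustration, and
the real count depends on the input format and split settings Hive applies.

```python
# Values copied from the directory listing and block size quoted below.
BLOCK_SIZE = 268435456
FILE_SIZES = [270044140, 161956948]


def blocks_spanned(file_size, block_size):
    """Number of blocks a file of file_size bytes occupies (at least 1)."""
    # Integer ceiling division, avoids floating-point rounding.
    return max(1, -(-file_size // block_size))


def estimate_splits(file_sizes, block_size):
    """ASSUMPTION: one split per block, with no combining across blocks."""
    return sum(blocks_spanned(size, block_size) for size in file_sizes)


if __name__ == "__main__":
    for size in FILE_SIZES:
        print(size, "bytes ->", blocks_spanned(size, BLOCK_SIZE), "block(s)")
    print("estimated splits:", estimate_splits(FILE_SIZES, BLOCK_SIZE))
```

Under that assumption, rewriting the first file so it fits inside a single
block would bring the estimate down to two.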


On Tue, Sep 25, 2012 at 2:35 PM, Connell, Chuck <chuck.conn...@nuance.com> wrote:

>  Why do you think the current generated code is inefficient?
>
> *From:* John Omernik [mailto:j...@omernik.com]
> *Sent:* Tuesday, September 25, 2012 2:57 PM
> *To:* user@hive.apache.org
> *Subject:* Hive File Sizes, Merging, and Splits
>
> I am really struggling to make heads or tails of how to optimize the
> data in my tables for best query times.  I have a partition that is
> compressed (Gzip) RCFile data in two files:
>
> total 421877
> 263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 000000_0
> 158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 000001_0
>
> No matter what I set my split settings to prior to the job, I always get
> three mappers.  My block size is 268435456 but the setting doesn't seem to
> change anything. I can set split size huge or small with no apparent effect
> on the data.
>
> I know there are many esoteric items here, but is there any good
> documentation on setting these things to make my queries on this data more
> efficient? I am not sure why it needs three map tasks on this data; it
> should really just grab two mappers. Not to mention, I thought gzip wasn't
> splittable anyhow.  So, from that standpoint, how does it even send data to
> three mappers?  If you know of some secret cache of documentation for Hive,
> I'd love to read it.
>
> Thanks