Sweet, great answers, thanks.

Indeed, I have a small number of partitions, but lots of small files (~20 MB
each). I'll make sure to combine them. Also, increasing the heap size of the
CLI process has already helped speed things up.
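
For anyone else hitting this, the kind of settings involved look roughly
like this (the values are illustrative, not exactly what I used):

# give the Hive CLI / launcher JVM more heap (e.g. in hive-env.sh)
export HADOOP_HEAPSIZE=4096        # MB; the default is often 1024
# or, for just the client JVM:
export HADOOP_CLIENT_OPTS="-Xmx4g $HADOOP_CLIENT_OPTS"

-- in the Hive session, merge small output files going forward
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;  -- aim for ~128 MB output files
-- existing small files still need a one-off rewrite (e.g. INSERT OVERWRITE)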

Thanks, again.


On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

> The planning phase needs to do work for every Hive partition and every
> Hadoop file. If you have a lot of 'small' files or many partitions, this
> can take a long time.
> Also, the planning phase that happens on the job tracker is single-threaded,
> and the new YARN stuff requires back-and-forth to allocate containers.
>
> Sometimes raising the heap for the hive-cli/launching process helps,
> because the default heap of 1 GB may not be enough space to deal with all
> of the partition information, and the extra headroom makes this go faster.
> Sometimes setting the min split size higher launches fewer map tasks, which
> speeds everything up.
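>
> For example, something along these lines would be the idea (the exact
> sizes are just illustrative and depend on your data):
>
> SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> SET mapred.min.split.size=268435456;   -- ~256 MB per split
> SET mapred.max.split.size=268435456;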
>
> So the answer... Try to tune everything. Start Hive like this:
>
> bin/hive -hiveconf hive.root.logger=DEBUG,console
>
> Then note where the longest pauses with no output are; that is what you
> should try to tune first.
>
>
>
>
> On Fri, Jul 18, 2014 at 9:36 AM, diogo <di...@uken.com> wrote:
>
>> This is probably a simple question, but I'm noticing that for queries
>> that run on 1+TB of data, it can take Hive up to 30 minutes to actually
>> start the first map-reduce stage. What is it doing? I imagine it's
>> gathering information about the data somehow, since this 'startup' time is
>> clearly a function of the amount of data I'm trying to process.
>>
>> Cheers,
>>
>
>
