Re: Extremely slow throughput with dynamic partitions using Hive 0.8.1 in Amazon Elastic Mapreduce

Ted Xu Mon, 17 Jun 2013 02:50:26 -0700

Hi Shaun,

Your findings are valid. Hive uses Hadoop job counters to report fatal
errors, so the client can kill the MapReduce job before it completes.


With regard to your case, because Hive wants to kill the MapReduce job when
Dynamic Partitioning creates too many partitions, counter reporting is
forcibly enabled. IMHO, fatal error reporting should not depend on the "job
progress" switch. You can file a JIRA ticket on this one.
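
To illustrate the decoupling I have in mind, here is a minimal sketch in
plain Java. It does not use the Hadoop or Hive APIs: JobCounters,
PartitionWriter and shouldKill are hypothetical names made up for this
example, and the partition limit is just a constructor argument. The point is
only that the fatal-error counter can be updated and checked even when the
progress switch is off.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class FatalCounterSketch {

    /** Hypothetical stand-in for the job counters shared by tasks and client. */
    static final class JobCounters {
        final AtomicLong fatalErrors = new AtomicLong();     // always maintained
        final AtomicLong progressUpdates = new AtomicLong(); // only when progress is on
    }

    /** Hypothetical stand-in for a task writing rows into dynamic partitions. */
    static final class PartitionWriter {
        private final JobCounters counters;
        private final int maxPartitions;
        private final boolean reportProgress;
        private final Set<String> partitions = new HashSet<>();

        PartitionWriter(JobCounters counters, int maxPartitions, boolean reportProgress) {
            this.counters = counters;
            this.maxPartitions = maxPartitions;
            this.reportProgress = reportProgress;
        }

        void write(String partitionKey) {
            boolean isNew = partitions.add(partitionKey);
            // Fatal condition: flagged regardless of the progress switch.
            if (isNew && partitions.size() > maxPartitions) {
                counters.fatalErrors.incrementAndGet();
            }
            // Per-row bookkeeping: only paid for when progress reporting is on.
            if (reportProgress) {
                counters.progressUpdates.incrementAndGet();
            }
        }
    }

    /** Client-side check: kill the job as soon as any task flagged a fatal error. */
    static boolean shouldKill(JobCounters counters) {
        return counters.fatalErrors.get() > 0;
    }

    public static void main(String[] args) {
        JobCounters counters = new JobCounters();
        PartitionWriter writer = new PartitionWriter(counters, 2, false);
        for (String key : new String[] {"2013-04", "2013-05", "2013-05", "2013-06"}) {
            writer.write(key);
        }
        System.out.println("kill job: " + shouldKill(counters));
        System.out.println("progress updates: " + counters.progressUpdates.get());
    }
}
```

In this sketch the third distinct partition key trips the limit of 2, so the
client would kill the job, while no per-row progress counter is touched
because the switch is off.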


On Fri, Jun 7, 2013 at 1:55 PM, Shaun Clowes <sclo...@atlassian.com> wrote:

> Hi Ted, All,
>
> Unfortunately profiling turns out to be extremely slow, so it's not very
> fruitful for determining what's going on here.
>
> On the other hand I seem to have traced this problem down to the
> "hive.task.progress" configuration variable. When this is set to true (as
> it is automatically when a dynamic partition insert is used), the insert is
> drastically slower than it is otherwise.
>
> In SemanticAnalyzer.java it forces this task tracking on as follows:
>
>           // turn on hive.task.progress to update # of partitions created to the JT
>           HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVEJOBPROGRESS, true);
>
> Does anyone know why this must be turned on? What is the need for the
> number of partitions created to be reported? The end result is a lot more
> than just the number of partitions having their statistics reported.
>
> I'm not sure why the insert is so very slow when it's on. Perhaps it is the
> retrieval of the current time in millis in Operator.java:
>
>   /**
>    * this is called after operator process to buffer some counters.
>    */
>   private void postProcessCounter() {
>     if (counterNameToEnum != null) {
>       totalTime += (System.currentTimeMillis() - beginTime);
>     }
>   }
>
> Thanks,
> Shaun
>
>
> On 6 June 2013 19:00, Ted Xu <t...@gopivotal.com> wrote:
>
>> Hi Shaun,
>>
>> This is weird. I'm not sure whether something else (e.g., a very
>> complex UDF?) is causing this issue, but it would be best if you could
>> profile the job <http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Profiling>
>> to see whether there is a hot spot.
>>
>>
>> On Thu, Jun 6, 2013 at 4:38 PM, Shaun Clowes <sclo...@atlassian.com> wrote:
>>
>>> Hi Ted,
>>>
>>> It's actually just one partition being created which is what makes it so
>>> weird.
>>>
>>> Thanks,
>>> Shaun
>>>
>>>
>>> On 6 June 2013 18:36, Ted Xu <t...@gopivotal.com> wrote:
>>>
>>>> Hi Shaun,
>>>>
>>>> Too many partitions in dynamic partitioning may slow down the MapReduce
>>>> job. Can you estimate how many partitions will be generated by the insert?
>>>>
>>>>
>>>> On Thu, Jun 6, 2013 at 4:24 PM, Shaun Clowes <sclo...@atlassian.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Does anyone know what performance impact dynamic partitions should
>>>>> be expected to have?
>>>>>
>>>>> I have a table that is partitioned by a string in the form 'YYYY-MM'.
>>>>> When I insert into this table (from an external table that is just an S3
>>>>> bucket containing gzipped logs) using dynamic partitioning I get very slow
>>>>> performance with each node in the cluster unable to process more than 2MB
>>>>> per second. When I run the exact same query with static partition values I
>>>>> get about 30-40MB/s on each node.
>>>>>
>>>>> I've never seen this type of problem with our internal cluster running
>>>>> Hive 0.7.1 (CDH3u4), but it happens every time in EMR.
>>>>>
>>>>> Thanks,
>>>>> Shaun
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Ted Xu
>>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Ted Xu
>>
>
>


-- 
Regards,
Ted Xu
