Hi Ted, All,

Unfortunately profiling turns out to be extremely slow, so it's not very
fruitful for determining what's going on here.

On the other hand, I seem to have traced this problem to the
"hive.task.progress" configuration variable. When this is set to true (as
it is automatically when a dynamic partition insert is used), the insert is
drastically slower than it is otherwise.

In SemanticAnalyzer.java it forces this task tracking on as follows:

          // turn on hive.task.progress to update # of partitions created to the JT
          HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVEJOBPROGRESS, true);

Does anyone know why this must be turned on? Why does the number of
partitions created need to be reported? In practice, enabling it causes far
more than just the partition count to be tracked and reported.

I'm not sure why the insert is so slow when it's on. Perhaps the cause is
the retrieval of the current time in milliseconds in Operator.java:

  /**
   * this is called after operator process to buffer some counters.
   */
  private void postProcessCounter() {
    if (counterNameToEnum != null) {
      totalTime += (System.currentTimeMillis() - beginTime);
    }
  }
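
One way to test that guess in isolation might be a small standalone
comparison like the sketch below. It is not Hive code: the class name,
processRow() and the row count are all made up for illustration, and it
assumes beginTime is refreshed once per row, which the snippet above does
not show. It only measures how much a pair of System.currentTimeMillis()
calls per row adds to a trivial loop on a given JVM/OS.

```java
public class TimerOverheadSketch {
    // Accumulated "operator time", mirroring the totalTime field above.
    static long totalTime = 0;

    // Hypothetical stand-in for per-row operator work; not Hive code.
    static long processRow(long row) {
        return row * 31 + 7;
    }

    // Baseline: process every row with no timing calls.
    static long runPlain(int rows) {
        long acc = 0;
        for (int i = 0; i < rows; i++) {
            acc += processRow(i);
        }
        return acc;
    }

    // Same loop, but each row is bracketed by two currentTimeMillis()
    // calls (assumed pattern: beginTime set before, totalTime updated after).
    static long runTimed(int rows) {
        long acc = 0;
        for (int i = 0; i < rows; i++) {
            long beginTime = System.currentTimeMillis();
            acc += processRow(i);
            totalTime += (System.currentTimeMillis() - beginTime);
        }
        return acc;
    }

    public static void main(String[] args) {
        int rows = 5_000_000; // arbitrary row count for the comparison

        // Warm up both paths so the JIT has compiled them before measuring.
        runPlain(rows);
        runTimed(rows);

        long t0 = System.nanoTime();
        long plain = runPlain(rows);
        long t1 = System.nanoTime();
        long timed = runTimed(rows);
        long t2 = System.nanoTime();

        System.out.println("plain loop: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("timed loop: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("results match: " + (plain == timed));
    }
}
```

If the timed loop is not dramatically slower on the EMR nodes, the clock
call is probably not the culprit and the counter reporting itself would be
the next thing to look at.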

Thanks,
Shaun


On 6 June 2013 19:00, Ted Xu <t...@gopivotal.com> wrote:

> Hi Shaun,
>
> This is weird. I'm not sure whether something else (e.g., a very complex
> UDF?) is causing this issue, but it would be best if you could do some
> profiling<http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Profiling>
> to see if there is a hot spot.
>
>
> On Thu, Jun 6, 2013 at 4:38 PM, Shaun Clowes <sclo...@atlassian.com> wrote:
>
>> Hi Ted,
>>
>> It's actually just one partition being created which is what makes it so
>> weird.
>>
>> Thanks,
>> Shaun
>>
>>
>> On 6 June 2013 18:36, Ted Xu <t...@gopivotal.com> wrote:
>>
>>> Hi Shaun,
>>>
>>> Too many partitions in dynamic partitioning may slow down the mapreduce
>>> job. Can you estimate how many partitions will be generated after insert?
>>>
>>>
>>> On Thu, Jun 6, 2013 at 4:24 PM, Shaun Clowes <sclo...@atlassian.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Does anyone know what performance impact dynamic partitions should be
>>>> expected to have?
>>>>
>>>> I have a table that is partitioned by a string in the form 'YYYY-MM'.
>>>> When I insert into this table (from an external table that is just an S3
>>>> bucket containing gzipped logs) using dynamic partitioning, I get very slow
>>>> performance, with each node in the cluster unable to process more than 2MB
>>>> per second. When I run the exact same query with static partition values, I
>>>> get about 30-40MB/s on each node.
>>>>
>>>> I've never seen this type of problem with our internal cluster running
>>>> Hive 0.7.1 (CDH3u4), but it happens every time in EMR.
>>>>
>>>> Thanks,
>>>> Shaun
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ted Xu
>>>
>>
>>
>
>
> --
> Regards,
> Ted Xu
>
