Re: Increasing parallelism skews/increases overall job processing time linearly

Chakravarthy varaga Tue, 10 Jan 2017 02:38:39 -0800

Hi Guys,

    I understand that you are extremely busy but any pointers here is
highly appreciated. I can proceed forward towards concluding the activity !


Best Regards
CVP

On Mon, Jan 9, 2017 at 11:43 AM, Chakravarthy varaga <
chakravarth...@gmail.com> wrote:

> Anything that I could check or collect for you for investigation ?
>
> On Sat, Jan 7, 2017 at 1:35 PM, Chakravarthy varaga <
> chakravarth...@gmail.com> wrote:
>
>> Hi Stephen
>>
>> . Kafka version is: 0.9.0.1 the connector is flinkconsumer09
>> . The flatmap n coflatmap are connected by keyBy
>> . No data is broadcasted and the data is not exploded based on the
>> parallelism
>>
>> Cvp
>>
>> On 6 Jan 2017 20:16, "Stephan Ewen" <se...@apache.org> wrote:
>>
>>> Hi!
>>>
>>> You are right, parallelism 2 should be faster than parallelism 1 ;-) As
>>> ChenQin pointed out, having only 2 Kafka Partitions may prevent further
>>> scaleout.
>>>
>>> Few things to check:
>>>   - How are you connecting the FlatMap and CoFlatMap? Default, keyBy,
>>> broadcast?
>>>   - Broadcast for example would multiply the data based on parallelism,
>>> can lead to slowdown when saturating the network.
>>>   - Are you using the standard Kafka Source (which Kafka version)?
>>>   - Is there any part in the program that multiplies data/effort with
>>> higher parallelism (does the FlatMap explode data based on parallelism)?
>>>
>>> Stephan
>>>
>>>
>>> On Fri, Jan 6, 2017 at 7:27 PM, Chen Qin <qinnc...@gmail.com> wrote:
>>>
>>>> Just noticed there are only two partitions per topic. Regardless of how
>>>> large parallelism set. Only two of those will get partition assigned at
>>>> most.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jan 6, 2017, at 02:40, Chakravarthy varaga <chakravarth...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>>     Any updates on this?
>>>>
>>>> Best Regards
>>>> CVP
>>>>
>>>> On Thu, Jan 5, 2017 at 1:21 PM, Chakravarthy varaga <
>>>> chakravarth...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have a job as attached.
>>>>>
>>>>> I have a 16 Core blade running RHEL 7. The taskmanager default number
>>>>> of slots is set to 1. The source is a kafka stream and each of the 2
>>>>> sources(topic) have 2 partitions each.
>>>>>
>>>>>
>>>>> *What I notice is that when I deploy a job to run with #parallelism=2
>>>>> the total processing time doubles the time it took when the same job was
>>>>> deployed with #parallelism=1. It linearly increases with the parallelism.*
>>>>> Since the numberof slots is set to 1 per TM, I would assume that the
>>>>> job would be processed in parallel in 2 different TMs and that each
>>>>> consumer in each TM is connected to 1 partition of the topic. This
>>>>> therefore should have kept the overall processing time the same or less 
>>>>> !!!
>>>>>
>>>>> The co-flatmap connects the 2 streams & uses ValueState (checkpointed
>>>>> in FS). I think this is distributed among the TMs. My understanding is 
>>>>> that
>>>>> the search of values state could be costly between TMs.  Do you sense
>>>>> something wrong here?
>>>>>
>>>>> Best Regards
>>>>> CVP
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>

Re: Increasing parallelism skews/increases overall job processing time linearly

Reply via email to