Re: Increasing parallelism skews/increases overall job processing time linearly

Till Rohrmann Wed, 11 Jan 2017 05:12:13 -0800

Hi CVP,

changing the parallelism from 1 to 2 with every TM having only one slot
will inevitably introduce another network shuffle operation between the
sources and the keyed co flat map. This might be the source of your slow
down, because before everything was running on one machine without any
network communication (apart from reading from Kafka).


Do you also observe a further degradation when increasing the parallelism
from 2 to 4, for example (given that you've increased the number of topic
partitions to at least the maximum parallelism in your topology)?

Cheers,
Till

On Tue, Jan 10, 2017 at 11:37 AM, Chakravarthy varaga <
chakravarth...@gmail.com> wrote:

> Hi Guys,
>
>     I understand that you are extremely busy but any pointers here is
> highly appreciated. I can proceed forward towards concluding the activity !
>
> Best Regards
> CVP
>
> On Mon, Jan 9, 2017 at 11:43 AM, Chakravarthy varaga <
> chakravarth...@gmail.com> wrote:
>
>> Anything that I could check or collect for you for investigation ?
>>
>> On Sat, Jan 7, 2017 at 1:35 PM, Chakravarthy varaga <
>> chakravarth...@gmail.com> wrote:
>>
>>> Hi Stephen
>>>
>>> . Kafka version is: 0.9.0.1 the connector is flinkconsumer09
>>> . The flatmap n coflatmap are connected by keyBy
>>> . No data is broadcasted and the data is not exploded based on the
>>> parallelism
>>>
>>> Cvp
>>>
>>> On 6 Jan 2017 20:16, "Stephan Ewen" <se...@apache.org> wrote:
>>>
>>>> Hi!
>>>>
>>>> You are right, parallelism 2 should be faster than parallelism 1 ;-) As
>>>> ChenQin pointed out, having only 2 Kafka Partitions may prevent further
>>>> scaleout.
>>>>
>>>> Few things to check:
>>>>   - How are you connecting the FlatMap and CoFlatMap? Default, keyBy,
>>>> broadcast?
>>>>   - Broadcast for example would multiply the data based on parallelism,
>>>> can lead to slowdown when saturating the network.
>>>>   - Are you using the standard Kafka Source (which Kafka version)?
>>>>   - Is there any part in the program that multiplies data/effort with
>>>> higher parallelism (does the FlatMap explode data based on parallelism)?
>>>>
>>>> Stephan
>>>>
>>>>
>>>> On Fri, Jan 6, 2017 at 7:27 PM, Chen Qin <qinnc...@gmail.com> wrote:
>>>>
>>>>> Just noticed there are only two partitions per topic. Regardless of
>>>>> how large parallelism set. Only two of those will get partition assigned 
>>>>> at
>>>>> most.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jan 6, 2017, at 02:40, Chakravarthy varaga <
>>>>> chakravarth...@gmail.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>>     Any updates on this?
>>>>>
>>>>> Best Regards
>>>>> CVP
>>>>>
>>>>> On Thu, Jan 5, 2017 at 1:21 PM, Chakravarthy varaga <
>>>>> chakravarth...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a job as attached.
>>>>>>
>>>>>> I have a 16 Core blade running RHEL 7. The taskmanager default number
>>>>>> of slots is set to 1. The source is a kafka stream and each of the 2
>>>>>> sources(topic) have 2 partitions each.
>>>>>>
>>>>>>
>>>>>> *What I notice is that when I deploy a job to run with #parallelism=2
>>>>>> the total processing time doubles the time it took when the same job was
>>>>>> deployed with #parallelism=1. It linearly increases with the 
>>>>>> parallelism.*
>>>>>> Since the numberof slots is set to 1 per TM, I would assume that the
>>>>>> job would be processed in parallel in 2 different TMs and that each
>>>>>> consumer in each TM is connected to 1 partition of the topic. This
>>>>>> therefore should have kept the overall processing time the same or less 
>>>>>> !!!
>>>>>>
>>>>>> The co-flatmap connects the 2 streams & uses ValueState (checkpointed
>>>>>> in FS). I think this is distributed among the TMs. My understanding is 
>>>>>> that
>>>>>> the search of values state could be costly between TMs.  Do you sense
>>>>>> something wrong here?
>>>>>>
>>>>>> Best Regards
>>>>>> CVP
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

Re: Increasing parallelism skews/increases overall job processing time linearly

Reply via email to