Sorry, I wanted to mention https://github.com/lyft/flinkk8soperator (I don't know which of the 2 is better)
On Wed, Mar 11, 2020 at 10:19 AM Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Have you tried to use existing operators such as
> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator or
> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator?
>
> On Wed, Mar 11, 2020 at 4:46 AM Xintong Song <tonysong...@gmail.com> wrote:
>
>> Hi Eleanore,
>>
>> That doesn't sound like a scaling issue. It's probably data skew: the
>> data volume on some of the keys is significantly higher than on others.
>> I'm not familiar with this area though, and have copied Jark for you, who
>> is one of the community experts in this area.
>>
>> Thank you~
>>
>> Xintong Song
>>
>> On Wed, Mar 11, 2020 at 10:37 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>
>>> Hi Xintong,
>>>
>>> Thanks for the prompt reply! To answer your questions:
>>>
>>> - Which Flink version are you using?
>>>
>>> v1.8.2
>>>
>>> - Is this skew observed only after a scale-up? What happens if the
>>> parallelism is initially set to the scaled-up value?
>>>
>>> I also tried this; the skew happens even if I do not change the
>>> parallelism, so it may not be caused by scale-up/down.
>>>
>>> - Keeping the job running for a while after the scale-up, does the skew
>>> ease?
>>>
>>> The skew happens in such a way that some partitions' lag drops to 0,
>>> while other partitions are still at a lag of around 10,000, and the back
>>> pressure looks ok.
>>>
>>> Thanks a lot!
>>> Eleanore
>>>
>>> On Tue, Mar 10, 2020 at 7:03 PM Xintong Song <tonysong...@gmail.com> wrote:
>>>
>>>> Hi Eleanore,
>>>>
>>>> I have a few more questions regarding your issue.
>>>>
>>>> - Which Flink version are you using?
>>>> - Is this skew observed only after a scale-up? What happens if the
>>>> parallelism is initially set to the scaled-up value?
>>>> - Keeping the job running for a while after the scale-up, does the skew
>>>> ease?
>>>>
>>>> I suspect the performance difference might be an outcome of some
>>>> warm-up issues. E.g., the existing TMs might have some files already
>>>> localized, or some memory buffers already promoted to the JVM tenured
>>>> area, while the new TMs have not.
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>> On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>
>>>>> Hi Experts,
>>>>>
>>>>> I have my Flink application running on Kubernetes, initially with 1
>>>>> Job Manager and 2 Task Managers.
>>>>>
>>>>> We have a custom operator that watches the CRD; when the CRD replicas
>>>>> change, it patches the Flink Job Manager deployment's parallelism and
>>>>> max parallelism according to the replicas from the CRD (parallelism can
>>>>> be configured via env variables for our application). This causes the
>>>>> Job Manager to restart, and hence a new Flink job. But the consumer
>>>>> group does not change, so it continues from the offsets where it left
>>>>> off.
>>>>>
>>>>> In addition, the operator also updates the Task Manager deployment's
>>>>> replicas and adjusts the pod count.
>>>>>
>>>>> In case of a scale-up, the existing Task Manager pods are not killed,
>>>>> but new Task Manager pods are created.
>>>>>
>>>>> We observed a skew in the partition offsets consumed: e.g. some
>>>>> partitions have huge lags and other partitions have small lags
>>>>> (observed from Burrow).
>>>>>
>>>>> This is also validated by the metrics from the Flink UI, showing that
>>>>> the throughput differs across slots.
>>>>>
>>>>> Any clue why this is the case?
>>>>>
>>>>> Thanks a lot!
>>>>> Eleanore
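
For readers trying to picture the setup Eleanore describes, here is a minimal sketch of a Flink 1.8-style job that takes its parallelism from env variables (which an operator could patch on the JobManager deployment) and keeps a fixed Kafka consumer group, so a restarted job resumes from the committed offsets of that group. The env variable names, topic, group id, and bootstrap servers below are assumptions for illustration, not taken from the thread.

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class ScalingJobSketch {

        public static void main(String[] args) throws Exception {
            // Parallelism comes from env variables patched by the operator
            // (the variable names are hypothetical).
            int parallelism =
                Integer.parseInt(System.getenv().getOrDefault("JOB_PARALLELISM", "2"));
            int maxParallelism =
                Integer.parseInt(System.getenv().getOrDefault("JOB_MAX_PARALLELISM", "128"));

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(parallelism);
            env.setMaxParallelism(maxParallelism);

            // Fixed group.id: when the JobManager restart produces a new job,
            // the source picks up from this group's committed offsets.
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");
            props.setProperty("group.id", "my-app-group");

            env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
               .name("kafka-source")
               .print();

            env.execute("scaling-sketch");
        }
    }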
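
And a toy illustration of the kind of key skew Xintong suspects: after a keyBy, all records with the same key are processed by a single subtask, so a few hot keys can leave some subtasks overloaded while others sit nearly idle. The keys and values are invented purely to show the effect; keyBy by field position is the older API that still exists in 1.8.

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KeySkewSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4);

            // Most records share the "hot" key, so one subtask receives most of
            // the work after keyBy while the other subtasks stay nearly idle.
            env.fromElements(
                    Tuple2.of("hot", 1), Tuple2.of("hot", 1), Tuple2.of("hot", 1),
                    Tuple2.of("hot", 1), Tuple2.of("hot", 1), Tuple2.of("hot", 1),
                    Tuple2.of("hot", 1), Tuple2.of("hot", 1), Tuple2.of("hot", 1),
                    Tuple2.of("cold-a", 1), Tuple2.of("cold-b", 1))
                .keyBy(0)
                .sum(1)
                .print();

            env.execute("key-skew-sketch");
        }
    }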