I ran git bisect and the first commit that shows the regression is:

https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90


On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <ykt...@gmail.com> wrote:

> From my experience, Java profilers are sometimes not accurate enough to
> find the root cause of a performance regression. In this case, I would
> suggest you try out Intel VTune Amplifier to look at more detailed metrics.
>
> Best,
> Kurt
>
>
> On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <t...@apache.org> wrote:
>
> > The cause of the issue is anything but clear.
> >
> > Previously I had mentioned that there is no suspect change to the Kinesis
> > connector and that I had reverted the AWS SDK change to no effect.
> >
> > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another
> > regression in the previous release and is present before and after.
> >
> > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> > connector to 1.10.1: nothing changes, i.e. the regression is still present.
> > Therefore we will need to look elsewhere for the root cause.
> >
> > Regarding the time spent in snapshotState, repeat runs reveal a wide range
> > for both versions, 1.10 and 1.11. So again, this does not point to a root
> > cause.
> >
> > At this point, I have no ideas remaining other than doing a bisect to find
> > the culprit. Any other suggestions?
> >
> > Thomas
> >
> >
> > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> >
> > > Hi Thomas,
> > >
> > > Thanks for your further profiling information, and glad to see we have
> > > already narrowed down the location causing the regression.
> > > Actually I was also suspicious of #snapshotState in previous discussions,
> > > since it can indeed block normal operator processing for quite some time.
> > > Based on your feedback below, the sleep time during #snapshotState might
> > > be the main concern, so I also dug into the implementation of
> > > FlinkKinesisProducer#snapshotState:
> > > while (producer.getOutstandingRecordsCount() > 0) {
> > >    producer.flush();
> > >    try {
> > >       Thread.sleep(500);
> > >    } catch (InterruptedException e) {
> > >       LOG.warn("Flushing was interrupted.");
> > >       break;
> > >    }
> > > }
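> > >
> > > (As a rough sketch, not the actual Flink code: the wait could be
> > > instrumented like this to see how quickly the KPL backlog drains during a
> > > checkpoint, reusing the sink's existing producer and LOG fields.)
> > >
> > > long flushStart = System.nanoTime();
> > > while (producer.getOutstandingRecordsCount() > 0) {
> > >    producer.flush();
> > >    try {
> > >       Thread.sleep(500);
> > >    } catch (InterruptedException e) {
> > >       LOG.warn("Flushing was interrupted.");
> > >       break;
> > >    }
> > >    // log how many records are still buffered and how long we have waited so far
> > >    LOG.info("{} records still outstanding after {} ms",
> > >       producer.getOutstandingRecordsCount(),
> > >       (System.nanoTime() - flushStart) / 1_000_000);
> > > }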
> > > It seems that the sleep time is mainly determined by the internal
> > > operations of the KinesisProducer implementation provided by amazonaws,
> > > which I am not very familiar with.
> > > But I noticed there were two related upgrades in release-1.11.0: one
> > > upgrading amazon-kinesis-producer to 0.14.0 [1] and another upgrading the
> > > aws-sdk-version to 1.11.754 [2].
> > > You mentioned that you already reverted the SDK upgrade and verified no
> > > change. Did you also revert [1] to verify?
> > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> > >
> > > Best,
> > > Zhijiang
> > > ------------------------------------------------------------------
> > > From:Thomas Weise <t...@apache.org>
> > > Send Time:July 17, 2020 (Friday) 05:29
> > > To:dev <dev@flink.apache.org>
> > > Cc:Zhijiang <wangzhijiang...@aliyun.com>; Stephan Ewen <se...@apache.org>;
> > > Arvid Heise <ar...@ververica.com>; Aljoscha Krettek <aljos...@apache.org>
> > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release
> > > candidate #4)
> > >
> > > Sorry for the delay.
> > >
> > > I confirmed that the regression is due to the sink (unsurprising, since
> > > another job with the same consumer, but not the producer, runs as
> > > expected).
> > >
> > > As promised, I did CPU profiling on the problematic application, which
> > > gives more insight into the regression [1].
> > >
> > > The screenshots show that the average time for snapshotState increases
> > > from ~9s to ~28s. The data also shows the increase in sleep time during
> > > snapshotState.
> > >
> > > Does anyone, based on changes made in 1.11, have a theory why?
> > >
> > > I had previously looked at the changes to the Kinesis connector and also
> > > reverted the SDK upgrade, which did not change the situation.
> > >
> > > It will likely be necessary to drill into the sink / checkpointing details
> > > to understand the cause of the problem.
> > >
> > > Let me know if anyone has specific questions that I can answer from the
> > > profiling results.
> > >
> > > Thomas
> > >
> > > [1]
> > > https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> > >
> > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <t...@apache.org> wrote:
> > >
> > > > + dev@ for visibility
> > > >
> > > > I will investigate further today.
> > > >
> > > >
> > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <aljos...@apache.org> wrote:
> > > >
> > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> > > >> >    - Did sink checkpoint notifications change in a relevant way, for
> > > >> > example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> > > >>
> > > >> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
> > > >> one bug I discovered on the way was about the Task reaper.
> > > >>
> > > >>
> > > >> On 07.07.20 17:51, Zhijiang wrote:
> > > >> > Sorry for my misunderstanding of the previous information, Thomas. I
> > > >> > was assuming that the sync checkpoint duration increased after the
> > > >> > upgrade, as it was mentioned before.
> > > >> >
> > > >> > If I remember correctly, the memory state backend also has the same
> > > >> > issue? If so, we can dismiss the RocksDB state changes. As slot sharing
> > > >> > is enabled, the downstream and upstream should probably be deployed
> > > >> > into the same slot, so there is no network shuffle effect.
> > > >> >
> > > >> > I think we need to find out whether any other symptoms changed besides
> > > >> > the performance regression, to further narrow down the scope.
> > > >> > E.g. any metric changes, or changes in the number of TaskManagers and
> > > >> > the number of slots per TaskManager in the deployment.
> > > >> > A 40% regression is really big; I guess the change should also be
> > > >> > reflected in other places.
> > > >> >
> > > >> > I am not sure whether we can reproduce the regression in our AWS
> > > >> > environment by writing arbitrary Kinesis jobs, since there are also
> > > >> > normal Kinesis jobs after the upgrade, as Thomas mentioned.
> > > >> > So it probably touches some corner case. I am very willing to provide
> > > >> > any help for debugging if possible.
> > > >> >
> > > >> >
> > > >> > Best,
> > > >> > Zhijiang
> > > >> >
> > > >> >
> > > >> > ------------------------------------------------------------------
> > > >> > From:Thomas Weise <t...@apache.org>
> > > >> > Send Time:July 7, 2020 (Tuesday) 23:01
> > > >> > To:Stephan Ewen <se...@apache.org>
> > > >> > Cc:Aljoscha Krettek <aljos...@apache.org>; Arvid Heise <ar...@ververica.com>;
> > > >> > Zhijiang <wangzhijiang...@aliyun.com>
> > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > >> > release candidate #4)
> > > >> >
> > > >> > We are deploying our apps with FlinkK8sOperator. We have one job that
> > > >> > works as expected after the upgrade and the one discussed here that has
> > > >> > the performance regression.
> > > >> >
> > > >> > "The performance regression is obviously caused by the long duration of
> > > >> > the sync checkpoint process in the Kinesis sink operator, which blocks
> > > >> > normal data processing until back pressure reaches the source."
> > > >> >
> > > >> > That's a constant. Before (1.10) and after the upgrade have the same
> > > >> > sync checkpointing time. The question is what change came in with the
> > > >> > upgrade.
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <se...@apache.org> wrote:
> > > >> >
> > > >> > @Thomas Just one thing real quick: Are you using the standalone setup
> > > >> > scripts (like start-cluster.sh, and the former "slaves" file)?
> > > >> > Be aware that this is now called "workers" to avoid the sensitive name.
> > > >> > In one internal benchmark we saw quite a lot of slowdown initially,
> > > >> > before seeing that the cluster was not a distributed cluster any more ;-)
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <wangzhijiang...@aliyun.com> wrote:
> > > >> > Thanks for kicking this off and helping with the analysis, Stephan!
> > > >> > Thanks for the further feedback and investigation, Thomas!
> > > >> >
> > > >> > The performance regression is obviously caused by the long duration of
> > > >> > the sync checkpoint process in the Kinesis sink operator, which blocks
> > > >> > normal data processing until back pressure reaches the source.
> > > >> > Maybe we could dig into the sync part of the checkpoint, e.g. break
> > > >> > down the steps inside the respective operator#snapshotState to see
> > > >> > which operation costs most of the time; then we might find the root
> > > >> > cause of that cost.
> > > >> >
> > > >> > Look forward to the further progress. :)
> > > >> >
> > > >> > Best,
> > > >> > Zhijiang
> > > >> >
> > > >> > ------------------------------------------------------------------
> > > >> > From:Stephan Ewen <se...@apache.org>
> > > >> > Send Time:July 7, 2020 (Tuesday) 14:52
> > > >> > To:Thomas Weise <t...@apache.org>
> > > >> > Cc:Stephan Ewen <se...@apache.org>; Zhijiang <wangzhijiang...@aliyun.com>;
> > > >> > Aljoscha Krettek <aljos...@apache.org>; Arvid Heise <ar...@ververica.com>
> > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > >> > release candidate #4)
> > > >> >
> > > >> > Thank you for digging so deeply.
> > > >> > Mysterious thing, this regression.
> > > >> >
> > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <t...@apache.org> wrote:
> > > >> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
> > > >> > between 1.10 and 1.11 for the specific pipeline).
> > > >> >
> > > >> > I verified that increasing the checkpointing interval does not make a
> > > >> > difference.
> > > >> >
> > > >> > I looked at the Kinesis connector changes since 1.10.1 and don't see
> > > >> > anything that could cause this.
> > > >> >
> > > >> > Another pipeline that is using the Kinesis consumer (but not the
> > > >> > producer) performs as expected.
> > > >> >
> > > >> > I tried reverting the AWS SDK version change, symptoms remain unchanged:
> > > >> >
> > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
> > > >> > index a6abce23ba..741743a05e 100644
> > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> > > >> > @@ -33,7 +33,7 @@ under the License.
> > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> > > >> >          <name>flink-connector-kinesis</name>
> > > >> >          <properties>
> > > >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> > > >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> > > >> >                 <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> > > >> >                 <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> > > >> >                 <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> > > >> >
> > > >> > I'm planning to take a look with a profiler next.
> > > >> >
> > > >> > Thomas
> > > >> >
> > > >> >
> > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <se...@apache.org> wrote:
> > > >> > Hi all!
> > > >> >
> > > >> > Forking this thread out of the release vote thread.
> > > >> > From what Thomas describes, it really sounds like a sink-specific issue.
> > > >> >
> > > >> > @Thomas: When you say the sink has a long synchronous checkpoint time,
> > > >> > you mean the time that is shown as "sync time" in the metrics and web
> > > >> > UI? That does not include any network buffer related operations; it is
> > > >> > purely the operator's time.
> > > >> >
> > > >> > Can we dig into the changes we did in sinks:
> > > >> >    - Kinesis version upgrade, AWS library updates
> > > >> >
> > > >> >    - Could it be that some call (checkpoint complete) that was
> > > >> > previously (1.10) in a separate thread is now in the mailbox, and this
> > > >> > simply reduces the number of threads that do the work?
> > > >> >
> > > >> >    - Did sink checkpoint notifications change in a relevant way, for
> > > >> > example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> > > >> >
> > > >> > Best,
> > > >> > Stephan
> > > >> >
> > > >> >
> > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> > > >> > Hi Thomas,
> > > >> >
> > > >> >   Regarding [2], there are more details in the Jira description
> > > >> >   (https://issues.apache.org/jira/browse/FLINK-16404).
> > > >> >
> > > >> >   I can also give some basic explanations here to dismiss the concern.
> > > >> >   1. In the past, the buffers following the barrier were cached on the
> > > >> >   downstream side before alignment.
> > > >> >   2. In 1.11, the upstream does not send the buffers following the
> > > >> >   barrier. When the downstream finishes the alignment, it notifies the
> > > >> >   upstream to continue sending the following buffers, since it can
> > > >> >   process them after alignment.
> > > >> >   3. The only difference is whether the temporarily blocked buffers are
> > > >> >   cached on the downstream side or on the upstream side before alignment.
> > > >> >   4. The side effect is the additional notification cost for every
> > > >> >   barrier alignment. If the downstream and upstream are deployed in
> > > >> >   separate TaskManagers, the cost is the network transport delay (the
> > > >> >   effect can be ignored based on our testing with a 1s checkpoint
> > > >> >   interval). For slot sharing, as in your case, the cost is only one
> > > >> >   method call in the processor and can be ignored as well.
> > > >> >
> > > >> >   You mentioned "In this case, the downstream task has a high average
> > > >> >   checkpoint duration (~30s, sync part)." This duration does not reflect
> > > >> >   the changes above; it only indicates the time spent calling the
> > > >> >   operator's snapshotState.
> > > >> >   If this duration is beyond your expectation, you can check or debug
> > > >> >   whether the source/sink operations take more time to finish
> > > >> >   `snapshotState` in practice. E.g. you can make the implementation of
> > > >> >   this method empty to further verify the effect.
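> > > >> >
> > > >> >   As a rough sketch only (it assumes snapshotState is overridable and
> > > >> >   uses the (SerializationSchema, Properties) constructor; skipping the
> > > >> >   flush breaks the flush-on-checkpoint guarantee, so this is purely for
> > > >> >   the experiment), such a test-only variant could look like:
> > > >> >
> > > >> >   import java.util.Properties;
> > > >> >   import org.apache.flink.api.common.serialization.SerializationSchema;
> > > >> >   import org.apache.flink.runtime.state.FunctionSnapshotContext;
> > > >> >   import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
> > > >> >
> > > >> >   public class NoFlushKinesisProducer<T> extends FlinkKinesisProducer<T> {
> > > >> >      public NoFlushKinesisProducer(SerializationSchema<T> schema, Properties configProps) {
> > > >> >         super(schema, configProps);
> > > >> >      }
> > > >> >
> > > >> >      @Override
> > > >> >      public void snapshotState(FunctionSnapshotContext context) throws Exception {
> > > >> >         // intentionally empty: skip the flush/wait to see whether the
> > > >> >         // regression disappears
> > > >> >      }
> > > >> >   }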
> > > >> >
> > > >> >   Best,
> > > >> >   Zhijiang
> > > >> >
> > > >> >
> > > >> >
> > > >> >   ------------------------------------------------------------------
> > > >> >   From:Thomas Weise <t...@apache.org>
> > > >> >   Send Time:July 5, 2020 (Sunday) 12:22
> > > >> >   To:dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com>
> > > >> >   Cc:Yingjie Cao <kevin.ying...@gmail.com>
> > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >
> > > >> >   Hi Zhijiang,
> > > >> >
> > > >> >   Could you please point me to more details regarding: "[2]: Delay send
> > > >> >   the following buffers after checkpoint barrier on upstream side until
> > > >> >   barrier alignment on downstream side."
> > > >> >
> > > >> >   In this case, the downstream task has a high average checkpoint
> > > >> >   duration (~30s, sync part). If there was a change to hold buffers
> > > >> >   depending on downstream performance, could this possibly apply to this
> > > >> >   case (even when there is no shuffle that would require alignment)?
> > > >> >
> > > >> >   Thanks,
> > > >> >   Thomas
> > > >> >
> > > >> >
> > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> > > >> >
> > > >> >   > Hi Thomas,
> > > >> >   >
> > > >> >   > Thanks for the further update information.
> > > >> >   >
> > > >> >   > I guess we can dismiss the network stack changes, since in your
> > > >> >   > case the downstream and upstream would probably be deployed in the
> > > >> >   > same slot, bypassing the network data shuffle.
> > > >> >   > I also guess release-1.11 does not bring a general performance
> > > >> >   > regression in the runtime engine, as we did performance testing for
> > > >> >   > the general cases via [1] in a real cluster before, and the results
> > > >> >   > met expectations. But we indeed have not tested the specific source
> > > >> >   > and sink connectors yet, as far as I know.
> > > >> >   >
> > > >> >   > Regarding your ~40% performance regression, I wonder whether it is
> > > >> >   > related to specific source/sink changes (e.g. Kinesis) or to an
> > > >> >   > environment issue in a corner case.
> > > >> >   > If possible, it would be helpful to further locate whether the
> > > >> >   > regression is caused by Kinesis, by replacing the Kinesis source &
> > > >> >   > sink and keeping everything else the same.
> > > >> >   >
> > > >> >   > As you said, it would be efficient to contact you directly next week
> > > >> >   > to further discuss this issue. We are willing/eager to provide any
> > > >> >   > help to resolve it soon.
> > > >> >   >
> > > >> >   > Besides that, I guess this issue should not block the release, since
> > > >> >   > it is probably a corner case based on the current analysis.
> > > >> >   > If we really conclude that anything needs to be resolved after the
> > > >> >   > final release, then we can also make the next minor release, 1.11.1,
> > > >> >   > come soon.
> > > >> >   >
> > > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > > >> >   >
> > > >> >   > Best,
> > > >> >   > Zhijiang
> > > >> >   >
> > > >> >   >
> > > >> >   >
> > > >> >   > ------------------------------------------------------------------
> > > >> >   > From:Thomas Weise <t...@apache.org>
> > > >> >   > Send Time:July 4, 2020 (Saturday) 12:26
> > > >> >   > To:dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com>
> > > >> >   > Cc:Yingjie Cao <kevin.ying...@gmail.com>
> > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >   >
> > > >> >   > Hi Zhijiang,
> > > >> >   >
> > > >> >   > It will probably be best if we connect next week and discuss the
> > > >> >   > issue directly since this could be quite difficult to reproduce.
> > > >> >   >
> > > >> >   > Before the testing result on our side comes out for your respective
> > > >> >   > job case, I have some other questions to confirm for further
> > > >> >   > analysis:
> > > >> >   >     -  How much percentage regression did you find after switching
> > > >> >   > to 1.11?
> > > >> >   >
> > > >> >   > ~40% throughput decline
> > > >> >   >
> > > >> >   >     -  Is there any network bottleneck in your cluster? E.g. is the
> > > >> >   > network bandwidth saturated by other jobs? If so, it might have more
> > > >> >   > effect via [2] above.
> > > >> >   >
> > > >> >   > The test runs on a k8s cluster that is also used for other production
> > > >> >   > jobs. There is no reason to believe network is the bottleneck.
> > > >> >   >
> > > >> >   >     -  Did you adjust the default network buffer settings? E.g.
> > > >> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > >> >   > "taskmanager.network.memory.buffers-per-channel"
> > > >> >   >
> > > >> >   > The job is using the defaults, i.e. we don't configure the settings.
> > > >> >   > If you want me to try specific settings in the hope that it will help
> > > >> >   > to isolate the issue, please let me know.
> > > >> >   >
> > > >> >   >     -  I guess the topology has three vertices "KinesisConsumer ->
> > > >> >   > Chained FlatMap -> KinesisProducer", and the partition mode for
> > > >> >   > "KinesisConsumer -> FlatMap" and "FlatMap -> KinesisProducer" is
> > > >> >   > "forward" in both cases? If so, the edge connection is one-to-one,
> > > >> >   > not all-to-all, and then [1][2] above should have no effect in theory
> > > >> >   > with the default network buffer settings.
> > > >> >   >
> > > >> >   > There are only 2 vertices and the edge is "forward".
> > > >> >   >
> > > >> >   >     - With slot sharing, I guess these three vertices' parallel tasks
> > > >> >   > would probably be deployed into the same slot, and then the data
> > > >> >   > shuffle is via an in-memory queue, not the network stack. If so, [2]
> > > >> >   > above should have no effect.
> > > >> >   >
> > > >> >   > Yes, vertices share slots.
> > > >> >   >
> > > >> >   >     - I also saw some Jira changes for Kinesis in this release; could
> > > >> >   > you confirm that these changes would not affect performance?
> > > >> >   >
> > > >> >   > I will need to take a look. 1.10 already had a regression introduced
> > > >> >   > by the Kinesis producer update.
> > > >> >   >
> > > >> >   >
> > > >> >   > Thanks,
> > > >> >   > Thomas
> > > >> >   >
> > > >> >   >
> > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> > > >> >   >
> > > >> >   > > Hi Thomas,
> > > >> >   > >
> > > >> >   > > Thanks for your reply with rich information!
> > > >> >   > >
> > > >> >   > > We are trying to reproduce your case in our cluster to further
> > > >> >   > > verify it, and @Yingjie Cao is working on it now.
> > > >> >   > > As we do not have a Kinesis consumer and producer internally, we
> > > >> >   > > will construct a common source and sink instead for the
> > > >> >   > > backpressure case.
> > > >> >   > >
> > > >> >   > > Firstly, we can dismiss the RocksDB factor in this release, since
> > > >> >   > > you also mentioned that "filesystem leads to same symptoms".
> > > >> >   > >
> > > >> >   > > Secondly, if my understanding is right, you emphasized that the
> > > >> >   > > regression only exists for jobs with a low checkpoint interval
> > > >> >   > > (10s).
> > > >> >   > > Based on that, I have two suspicions regarding the network related
> > > >> >   > > changes in this release:
> > > >> >   > >     - [1]: Limited the maximum backlog value (default 10) in the
> > > >> >   > > subpartition queue.
> > > >> >   > >     - [2]: Delay send the following buffers after checkpoint
> > > >> >   > > barrier on upstream side until barrier alignment on downstream
> > > >> >   > > side.
> > > >> >   > >
> > > >> >   > > These changes are motivated by reducing the in-flight buffers to
> > > >> >   > > speed up checkpoints, especially in the case of backpressure.
> > > >> >   > > In theory they should have a very minor performance effect, and we
> > > >> >   > > also tested them in a cluster before merging to verify they were
> > > >> >   > > within expectations, but maybe there are corner cases we have not
> > > >> >   > > thought of before.
> > > >> >   > >
> > > >> >   > > Before the testing result on our side comes out for your respective
> > > >> >   > > job case, I have some other questions to confirm for further
> > > >> >   > > analysis:
> > > >> >   > >     -  How much percentage regression did you find after switching
> > > >> >   > > to 1.11?
> > > >> >   > >     -  Is there any network bottleneck in your cluster? E.g. is the
> > > >> >   > > network bandwidth saturated by other jobs? If so, it might have
> > > >> >   > > more effect via [2] above.
> > > >> >   > >     -  Did you adjust the default network buffer settings? E.g.
> > > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
> > > >> >   > >     -  I guess the topology has three vertices "KinesisConsumer ->
> > > >> >   > > Chained FlatMap -> KinesisProducer", and the partition mode for
> > > >> >   > > "KinesisConsumer -> FlatMap" and "FlatMap -> KinesisProducer" is
> > > >> >   > > "forward" in both cases? If so, the edge connection is one-to-one,
> > > >> >   > > not all-to-all, and then [1][2] above should have no effect in
> > > >> >   > > theory with the default network buffer settings.
> > > >> >   > >     - With slot sharing, I guess these three vertices' parallel
> > > >> >   > > tasks would probably be deployed into the same slot, and then the
> > > >> >   > > data shuffle is via an in-memory queue, not the network stack. If
> > > >> >   > > so, [2] above should have no effect.
> > > >> >   > >     - I also saw some Jira changes for Kinesis in this release;
> > > >> >   > > could you confirm that these changes would not affect performance?
> > > >> >   > >
> > > >> >   > > Best,
> > > >> >   > > Zhijiang
> > > >> >   > >
> > > >> >   > >
> > > >> >   > >
> > > >> >   > > ------------------------------------------------------------------
> > > >> >   > > From:Thomas Weise <t...@apache.org>
> > > >> >   > > Send Time:July 3, 2020 (Friday) 01:07
> > > >> >   > > To:dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com>
> > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >   > >
> > > >> >   > > Hi Zhijiang,
> > > >> >   > >
> > > >> >   > > The performance degradation manifests in backpressure which leads
> > > >> >   > > to a growing backlog in the source. I switched a few times between
> > > >> >   > > 1.10 and 1.11 and the behavior is consistent.
> > > >> >   > >
> > > >> >   > > The DAG is:
> > > >> >   > >
> > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) -------- forward ---------> KinesisProducer
> > > >> >   > >
> > > >> >   > > Parallelism: 160
> > > >> >   > > No shuffle/rebalance.
> > > >> >   > >
> > > >> >   > > Checkpointing config:
> > > >> >   > >
> > > >> >   > > Checkpointing Mode: Exactly Once
> > > >> >   > > Interval: 10s
> > > >> >   > > Timeout: 10m 0s
> > > >> >   > > Minimum Pause Between Checkpoints: 10s
> > > >> >   > > Maximum Concurrent Checkpoints: 1
> > > >> >   > > Persist Checkpoints Externally: Enabled (delete on cancellation)
> > > >> >   > >
> > > >> >   > > State backend: rocksdb (filesystem leads to same symptoms)
> > > >> >   > > Checkpoint size is tiny (500KB)
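> > > >> >   > >
> > > >> >   > > (For reference, a rough sketch of the equivalent programmatic
> > > >> >   > > configuration; the job may set some of this via flink-conf.yaml
> > > >> >   > > instead, and the snippet is meant to run inside a main() that
> > > >> >   > > throws Exception.)
> > > >> >   > >
> > > >> >   > > StreamExecutionEnvironment env =
> > > >> >   > >       StreamExecutionEnvironment.getExecutionEnvironment();
> > > >> >   > > // 10s interval, exactly-once mode
> > > >> >   > > env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
> > > >> >   > > CheckpointConfig cc = env.getCheckpointConfig();
> > > >> >   > > cc.setCheckpointTimeout(10 * 60 * 1000);   // 10m timeout
> > > >> >   > > cc.setMinPauseBetweenCheckpoints(10_000);  // 10s minimum pause
> > > >> >   > > cc.setMaxConcurrentCheckpoints(1);
> > > >> >   > > cc.enableExternalizedCheckpoints(
> > > >> >   > >       CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
> > > >> >   > > // state backend is RocksDB, e.g. via flink-conf.yaml: state.backend: rocksdb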
> > > >> >   > >
> > > >> >   > > An interesting difference to another job that I had upgraded
> > > >> >   > > successfully is the low checkpointing interval.
> > > >> >   > >
> > > >> >   > > Thanks,
> > > >> >   > > Thomas
> > > >> >   > >
> > > >> >   > >
> > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> > > >> >   > >
> > > >> >   > > > Hi Thomas,
> > > >> >   > > >
> > > >> >   > > > Thanks for the efficient feedback.
> > > >> >   > > >
> > > >> >   > > > Regarding the suggestion of adding the release notes document, I
> > > >> >   > > > agree with your point. Maybe we should adjust the vote template
> > > >> >   > > > accordingly in the respective wiki to guide the following release
> > > >> >   > > > processes.
> > > >> >   > > >
> > > >> >   > > > Regarding the performance regression, could you provide some more
> > > >> >   > > > details so we can better measure or reproduce it on our side?
> > > >> >   > > > E.g. I guess the topology only includes two vertices, source and
> > > >> >   > > > sink? What is the parallelism of each vertex?
> > > >> >   > > > Does the upstream shuffle data to the downstream via the rebalance
> > > >> >   > > > partitioner or another one?
> > > >> >   > > > Is the checkpoint mode exactly-once with the RocksDB state backend?
> > > >> >   > > > Did backpressure happen in this case?
> > > >> >   > > > How much percentage regression is there in this case?
> > > >> >   > > >
> > > >> >   > > > Best,
> > > >> >   > > > Zhijiang
> > > >> >   > > >
> > > >> >   > > >
> > > >> >   > > >
> > > >> >   > > >
> > > >> >   > > > ------------------------------------------------------------------
> > > >> >   > > > From:Thomas Weise <t...@apache.org>
> > > >> >   > > > Send Time:July 2, 2020 (Thursday) 09:54
> > > >> >   > > > To:dev <dev@flink.apache.org>
> > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >   > > >
> > > >> >   > > > Hi Till,
> > > >> >   > > >
> > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> > > >> >   > > >
> > > >> >   > > > Generally, we carry forward the existing configuration and any
> > > >> >   > > > change to default configuration values would impact the upgrade.
> > > >> >   > > >
> > > >> >   > > > Yes, since it is an incompatible change I would state it in the
> > > >> >   > > > release notes.
> > > >> >   > > >
> > > >> >   > > > Thanks,
> > > >> >   > > > Thomas
> > > >> >   > > >
> > > >> >   > > > BTW I found a performance regression while trying to upgrade
> > > >> >   > > > another pipeline with this RC. It is a simple Kinesis to Kinesis
> > > >> >   > > > job. Wasn't able to pin it down yet; symptoms include increased
> > > >> >   > > > checkpoint alignment time.
> > > >> >   > > >
> > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <trohrm...@apache.org> wrote:
> > > >> >   > > >
> > > >> >   > > > > Hi Thomas,
> > > >> >   > > > >
> > > >> >   > > > > just to confirm: When starting the image in local mode, you
> > > >> >   > > > > don't have any of the JobManager memory configuration settings
> > > >> >   > > > > configured in the effective flink-conf.yaml, right? Does this
> > > >> >   > > > > mean that you have explicitly removed `jobmanager.heap.size:
> > > >> >   > > > > 1024m` from the default configuration? If this is the case, then
> > > >> >   > > > > I believe it was more of an unintentional artifact that it worked
> > > >> >   > > > > before, and it has been corrected now so that one needs to
> > > >> >   > > > > specify the memory of the JM process explicitly. Do you think it
> > > >> >   > > > > would help to explicitly state this in the release notes?
> > > >> >   > > > >
> > > >> >   > > > > Cheers,
> > > >> >   > > > > Till
> > > >> >   > > > >
> > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <t...@apache.org> wrote:
> > > >> >   > > > >
> > > >> >   > > > > > Thanks for preparing another RC!
> > > >> >   > > > > >
> > > >> >   > > > > > As mentioned in the previous RC thread, it would be super
> > > >> >   > > > > > helpful if the release notes that are part of the documentation
> > > >> >   > > > > > could be included [1]. It's a significant time-saver to have
> > > >> >   > > > > > read those first.
> > > >> >   > > > > >
> > > >> >   > > > > > I found one more non-backward-compatible change that would be
> > > >> >   > > > > > worth addressing/mentioning:
> > > >> >   > > > > >
> > > >> >   > > > > > It is now necessary to configure the jobmanager heap size in
> > > >> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
> > > >> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to do
> > > >> >   > > > > > that anyways? Well, we set it dynamically for a cluster
> > > >> >   > > > > > deployment via the flinkk8soperator, but the container image
> > > >> >   > > > > > can also be used for testing with local mode
> > > >> >   > > > > > (./bin/jobmanager.sh start-foreground local). That will fail
> > > >> >   > > > > > if the heap wasn't configured, and that's how I noticed it.
> > > >> >   > > > > >
> > > >> >   > > > > > Thanks,
> > > >> >   > > > > > Thomas
> > > >> >   > > > > >
> > > >> >   > > > > > [1]
> > > >> >   > > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > > >> >   > > > > >
> > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> > > >> >   > > > > >
> > > >> >   > > > > > > Hi everyone,
> > > >> >   > > > > > >
> > > >> >   > > > > > > Please review and vote on the release candidate #4 for the
> > > >> >   > > > > > > version 1.11.0, as follows:
> > > >> >   > > > > > > [ ] +1, Approve the release
> > > >> >   > > > > > > [ ] -1, Do not approve the release (please provide specific
> > > >> >   > > > > > > comments)
> > > >> >   > > > > > >
> > > >> >   > > > > > > The complete staging area is available for your review, which
> > > >> >   > > > > > > includes:
> > > >> >   > > > > > > * JIRA release notes [1],
> > > >> >   > > > > > > * the official Apache source release and binary convenience
> > > >> >   > > > > > > releases to be deployed to dist.apache.org [2], which are
> > > >> >   > > > > > > signed with the key with fingerprint
> > > >> >   > > > > > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > >> >   > > > > > > * all artifacts to be deployed to the Maven Central
> > > >> >   > > > > > > Repository [4],
> > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > > >> >   > > > > > > * website pull request listing the new release and adding
> > > >> >   > > > > > > announcement blog post [6].
> > > >> >   > > > > > >
> > > >> >   > > > > > > The vote will be open for at least 72 hours. It is adopted by
> > > >> >   > > > > > > majority approval, with at least 3 PMC affirmative votes.
> > > >> >   > > > > > >
> > > >> >   > > > > > > Thanks,
> > > >> >   > > > > > > Release Manager
> > > >> >   > > > > > >
> > > >> >   > > > > > > [1]
> > > >> >   > > > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > >> >   > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > >> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > >> >   > > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > >> >   > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
> > > >> >   > > > > > >
> > > >> >   > > > > > >
> > > >> >   > > > > >
> > > >> >   > > > >
> > > >> >   > > >
> > > >> >   > > >
> > > >> >   > >
> > > >> >   > >
> > > >> >   >
> > > >> >   >
> > > >> >
> > > >> >
> > > >> >
> > > >>
> > > >>
> > >
> > >
> >
>
