Just another update: the duration of snapshotState is capped by the Kinesis producer's "RecordTtl" setting (default 30s); records that cannot be delivered within that TTL are failed and no longer count as outstanding, which bounds the flush loop. The 500 ms sleep in flushSync does not contribute to the observed behavior.
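For anyone who wants to experiment with this: RecordTtl, like the other KPL settings, can be passed through the producer Properties that FlinkKinesisProducer forwards to KinesisProducerConfiguration. A minimal sketch of what that looks like (region, stream name and the 15s value are placeholders, not the settings we actually run):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

Properties producerConfig = new Properties();
producerConfig.put(AWSConfigConstants.AWS_REGION, "us-west-2");
// Non-AWS keys are forwarded to KinesisProducerConfiguration.
// RecordTtl (default 30000 ms) bounds how long a record may sit in the KPL
// queue before it is failed, and therefore also bounds how long the
// getOutstandingRecordsCount() > 0 loop in snapshotState can spin.
producerConfig.put("RecordTtl", "15000");

FlinkKinesisProducer<String> producer =
    new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
producer.setDefaultStream("example-stream");
producer.setDefaultPartition("0");

Note that lowering RecordTtl only shortens the sync phase by failing records that cannot be delivered in time (and with setFailOnError(true) that fails the job), so it is a trade-off rather than a fix.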
I guess the open question is why, with the same settings, is 1.11 since commit 355184d69a8519d29937725c8d85e8465d7e3a90 processing more checkpoints? On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise <t...@apache.org> wrote: > Hi Roman, > > Here are the checkpoint summaries for both commits: > > > https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0 > > The config: > > CheckpointConfig checkpointConfig = env.getCheckpointConfig(); > checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE); > checkpointConfig.setCheckpointInterval(*10_000*); > checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*); > checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION); > checkpointConfig.setCheckpointTimeout(600_000); > checkpointConfig.setMaxConcurrentCheckpoints(1); > checkpointConfig.setFailOnCheckpointingErrors(true); > > The values marked bold when changed to *60_000* make the symptom > disappear. I meanwhile also verified that with the 1.11.0 release commit. > > I will take a look at the sleep time issue. > > Thanks, > Thomas > > > On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <ro...@data-artisans.com> > wrote: > >> Hi Thomas, >> >> Thanks for your reply! >> >> I think you are right, we can remove this sleep and improve >> KinesisProducer. >> Probably, it's snapshotState can also be sped up by forcing records flush >> more often. >> Do you see that 30s checkpointing duration is caused by KinesisProducer >> (or maybe other operators)? >> >> I'd also like to understand the reason behind this increase in checkpoint >> frequency. >> Can you please share these values: >> - execution.checkpointing.min-pause >> - execution.checkpointing.max-concurrent-checkpoints >> - execution.checkpointing.timeout >> >> And what is the "new" observed checkpoint frequency (or how many >> checkpoints are created) compared to older versions? >> >> >> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <t...@apache.org> wrote: >> >>> Hi Roman, >>> >>> Indeed there are more frequent checkpoints with this change! The >>> application was configured to checkpoint every 10s. With 1.10 ("good >>> commit"), that leads to fewer completed checkpoints compared to 1.11 >>> ("bad >>> commit"). Just to be clear, the only difference between the two runs was >>> the commit 355184d69a8519d29937725c8d85e8465d7e3a90 >>> >>> Since the sync part of checkpoints with the Kinesis producer always takes >>> ~30 seconds, the 10s configured checkpoint frequency really had no effect >>> before 1.11. I confirmed that both commits perform comparably by setting >>> the checkpoint frequency and min pause to 60s. >>> >>> I still have to verify with the final 1.11.0 release commit. >>> >>> It's probably good to take a look at the Kinesis producer. Is it really >>> necessary to have 500ms sleep time? What's responsible for the ~30s >>> duration in snapshotState? >>> >>> As things stand it doesn't make sense to use checkpoint intervals < 30s >>> when using the Kinesis producer. >>> >>> Thanks, >>> Thomas >>> >>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan < >>> ro...@data-artisans.com> >>> wrote: >>> >>> > Hi Thomas, >>> > >>> > Thanks a lot for the analysis. >>> > >>> > The first thing that I'd check is whether checkpoints became more >>> frequent >>> > with this commit (as each of them adds at least 500ms if there is at >>> least >>> > one not sent record, according to FlinkKinesisProducer.snapshotState). 
>>> > >>> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs >>> > first "bad" commits)? >>> > >>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <thomas.we...@gmail.com> >>> > wrote: >>> > >>> > > I run git bisect and the first commit that shows the regression is: >>> > > >>> > > >>> > > >>> > >>> https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90 >>> > > >>> > > >>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <ykt...@gmail.com> wrote: >>> > > >>> > > > From my experience, java profilers are sometimes not accurate >>> enough to >>> > > > find out the performance regression >>> > > > root cause. In this case, I would suggest you try out intel vtune >>> > > amplifier >>> > > > to watch more detailed metrics. >>> > > > >>> > > > Best, >>> > > > Kurt >>> > > > >>> > > > >>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <t...@apache.org> >>> wrote: >>> > > > >>> > > > > The cause of the issue is all but clear. >>> > > > > >>> > > > > Previously I had mentioned that there is no suspect change to the >>> > > Kinesis >>> > > > > connector and that I had reverted the AWS SDK change to no >>> effect. >>> > > > > >>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed >>> > > another >>> > > > > regression in the previous release and is present before and >>> after. >>> > > > > >>> > > > > I repeated the run with 1.11.0 core and downgraded the entire >>> Kinesis >>> > > > > connector to 1.10.1: Nothing changes, i.e. the regression is >>> still >>> > > > present. >>> > > > > Therefore we will need to look elsewhere for the root cause. >>> > > > > >>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a >>> wide >>> > > > range >>> > > > > for both versions, 1.10 and 1.11. So again this is nothing >>> pointing >>> > to >>> > > a >>> > > > > root cause. >>> > > > > >>> > > > > At this point, I have no ideas remaining other than doing a >>> bisect to >>> > > > find >>> > > > > the culprit. Any other suggestions? >>> > > > > >>> > > > > Thomas >>> > > > > >>> > > > > >>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang < >>> wangzhijiang...@aliyun.com >>> > > > > .invalid> >>> > > > > wrote: >>> > > > > >>> > > > > > Hi Thomas, >>> > > > > > >>> > > > > > Thanks for your further profiling information and glad to see >>> we >>> > > > already >>> > > > > > finalized the location to cause the regression. >>> > > > > > Actually I was also suspicious of the point of #snapshotState >>> in >>> > > > previous >>> > > > > > discussions since it indeed cost much time to block normal >>> operator >>> > > > > > processing. >>> > > > > > >>> > > > > > Based on your below feedback, the sleep time during >>> #snapshotState >>> > > > might >>> > > > > > be the main concern, and I also digged into the implementation >>> of >>> > > > > > FlinkKinesisProducer#snapshotState. >>> > > > > > while (producer.getOutstandingRecordsCount() > 0) { >>> > > > > > producer.flush(); >>> > > > > > try { >>> > > > > > Thread.sleep(500); >>> > > > > > } catch (InterruptedException e) { >>> > > > > > LOG.warn("Flushing was interrupted."); >>> > > > > > break; >>> > > > > > } >>> > > > > > } >>> > > > > > It seems that the sleep time is mainly affected by the internal >>> > > > > operations >>> > > > > > inside KinesisProducer implementation provided by amazonaws, >>> which >>> > I >>> > > am >>> > > > > not >>> > > > > > quite familiar with. >>> > > > > > But I noticed there were two upgrades related to it in >>> > > release-1.11.0. 
>>> > > > > One >>> > > > > > is for upgrading amazon-kinesis-producer to 0.14.0 [1] and >>> another >>> > is >>> > > > for >>> > > > > > upgrading aws-sdk-version to 1.11.754 [2]. >>> > > > > > You mentioned that you already reverted the SDK upgrade to >>> verify >>> > no >>> > > > > > changes. Did you also revert the [1] to verify? >>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496 >>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881 >>> > > > > > >>> > > > > > Best, >>> > > > > > Zhijiang >>> > > > > > >>> ------------------------------------------------------------------ >>> > > > > > From:Thomas Weise <t...@apache.org> >>> > > > > > Send Time:2020年7月17日(星期五) 05:29 >>> > > > > > To:dev <dev@flink.apache.org> >>> > > > > > Cc:Zhijiang <wangzhijiang...@aliyun.com>; Stephan Ewen < >>> > > > se...@apache.org >>> > > > > >; >>> > > > > > Arvid Heise <ar...@ververica.com>; Aljoscha Krettek < >>> > > > aljos...@apache.org >>> > > > > > >>> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release >>> 1.11.0, >>> > > > release >>> > > > > > candidate #4) >>> > > > > > >>> > > > > > Sorry for the delay. >>> > > > > > >>> > > > > > I confirmed that the regression is due to the sink >>> (unsurprising, >>> > > since >>> > > > > > another job with the same consumer, but not the producer, runs >>> as >>> > > > > > expected). >>> > > > > > >>> > > > > > As promised I did CPU profiling on the problematic application, >>> > which >>> > > > > gives >>> > > > > > more insight into the regression [1] >>> > > > > > >>> > > > > > The screenshots show that the average time for snapshotState >>> > > increases >>> > > > > from >>> > > > > > ~9s to ~28s. The data also shows the increase in sleep time >>> during >>> > > > > > snapshotState. >>> > > > > > >>> > > > > > Does anyone, based on changes made in 1.11, have a theory why? >>> > > > > > >>> > > > > > I had previously looked at the changes to the Kinesis >>> connector and >>> > > > also >>> > > > > > reverted the SDK upgrade, which did not change the situation. >>> > > > > > >>> > > > > > It will likely be necessary to drill into the sink / >>> checkpointing >>> > > > > details >>> > > > > > to understand the cause of the problem. >>> > > > > > >>> > > > > > Let me know if anyone has specific questions that I can answer >>> from >>> > > the >>> > > > > > profiling results. >>> > > > > > >>> > > > > > Thomas >>> > > > > > >>> > > > > > [1] >>> > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing >>> > > > > > >>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <t...@apache.org> >>> > > wrote: >>> > > > > > >>> > > > > > > + dev@ for visibility >>> > > > > > > >>> > > > > > > I will investigate further today. >>> > > > > > > >>> > > > > > > >>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek < >>> > > aljos...@apache.org >>> > > > > >>> > > > > > > wrote: >>> > > > > > > >>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote: >>> > > > > > >> > - Did sink checkpoint notifications change in a >>> relevant >>> > way, >>> > > > for >>> > > > > > >> example >>> > > > > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha >>> > maybe?) >>> > > > > > >> >>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated in >>> Kafka >>> > > and >>> > > > > the >>> > > > > > >> one bug I discovered on the way was about the Task reaper. 
>>> > > > > > >> >>> > > > > > >> >>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote: >>> > > > > > >> > Sorry for my misunderstood of the previous information, >>> > Thomas. >>> > > I >>> > > > > was >>> > > > > > >> assuming that the sync checkpoint duration increased after >>> > upgrade >>> > > > as >>> > > > > it >>> > > > > > >> was mentioned before. >>> > > > > > >> > >>> > > > > > >> > If I remembered correctly, the memory state backend also >>> has >>> > the >>> > > > > same >>> > > > > > >> issue? If so, we can dismiss the rocksDB state changes. As >>> the >>> > > slot >>> > > > > > sharing >>> > > > > > >> enabled, the downstream and upstream should >>> > > > > > >> > probably deployed into the same slot, then no network >>> shuffle >>> > > > > effect. >>> > > > > > >> > >>> > > > > > >> > I think we need to find out whether it has other symptoms >>> > > changed >>> > > > > > >> besides the performance regression to further figure out the >>> > > scope. >>> > > > > > >> > E.g. any metrics changes, the number of TaskManager and >>> the >>> > > number >>> > > > > of >>> > > > > > >> slots per TaskManager from deployment changes. >>> > > > > > >> > 40% regression is really big, I guess the changes should >>> also >>> > be >>> > > > > > >> reflected in other places. >>> > > > > > >> > >>> > > > > > >> > I am not sure whether we can reproduce the regression in >>> our >>> > AWS >>> > > > > > >> environment by writing any Kinesis jobs, since there are >>> also >>> > > normal >>> > > > > > >> Kinesis jobs as Thomas mentioned after upgrade. >>> > > > > > >> > So it probably looks like to touch some corner case. I am >>> very >>> > > > > willing >>> > > > > > >> to provide any help for debugging if possible. >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > Best, >>> > > > > > >> > Zhijiang >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > ------------------------------------------------------------------ >>> > > > > > >> > From:Thomas Weise <t...@apache.org> >>> > > > > > >> > Send Time:2020年7月7日(星期二) 23:01 >>> > > > > > >> > To:Stephan Ewen <se...@apache.org> >>> > > > > > >> > Cc:Aljoscha Krettek <aljos...@apache.org>; Arvid Heise < >>> > > > > > >> ar...@ververica.com>; Zhijiang <wangzhijiang...@aliyun.com> >>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release >>> > > 1.11.0, >>> > > > > > >> release candidate #4) >>> > > > > > >> > >>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We have >>> one >>> > job >>> > > > > that >>> > > > > > >> works as expected after the upgrade and the one discussed >>> here >>> > > that >>> > > > > has >>> > > > > > the >>> > > > > > >> performance regression. >>> > > > > > >> > >>> > > > > > >> > "The performance regression is obvious caused by long >>> duration >>> > > of >>> > > > > sync >>> > > > > > >> checkpoint process in Kinesis sink operator, which would >>> block >>> > the >>> > > > > > normal >>> > > > > > >> data processing until back pressure the source." >>> > > > > > >> > >>> > > > > > >> > That's a constant. Before (1.10) and upgrade have the same >>> > sync >>> > > > > > >> checkpointing time. The question is what change came in >>> with the >>> > > > > > upgrade. 
>>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen < >>> se...@apache.org >>> > > >>> > > > > wrote: >>> > > > > > >> > >>> > > > > > >> > @Thomas Just one thing real quick: Are you using the >>> > standalone >>> > > > > setup >>> > > > > > >> scripts (like start-cluster.sh, and the former "slaves" >>> file) ? >>> > > > > > >> > Be aware that this is now called "workers" because of >>> avoiding >>> > > > > > >> sensitive names. >>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown >>> > > > initially, >>> > > > > > >> before seeing that the cluster was not a distributed >>> cluster any >>> > > > more >>> > > > > > ;-) >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang < >>> > > > wangzhijiang...@aliyun.com >>> > > > > > >>> > > > > > >> wrote: >>> > > > > > >> > Thanks for this kickoff and help analysis, Stephan! >>> > > > > > >> > Thanks for the further feedback and investigation, Thomas! >>> > > > > > >> > >>> > > > > > >> > The performance regression is obvious caused by long >>> duration >>> > of >>> > > > > sync >>> > > > > > >> checkpoint process in Kinesis sink operator, which would >>> block >>> > the >>> > > > > > normal >>> > > > > > >> data processing until back pressure the source. >>> > > > > > >> > Maybe we could dig into the process of sync execution in >>> > > > checkpoint. >>> > > > > > >> E.g. break down the steps inside respective >>> > operator#snapshotState >>> > > > to >>> > > > > > >> statistic which operation cost most of the time, then >>> > > > > > >> > we might probably find the root cause to bring such cost. >>> > > > > > >> > >>> > > > > > >> > Look forward to the further progress. :) >>> > > > > > >> > >>> > > > > > >> > Best, >>> > > > > > >> > Zhijiang >>> > > > > > >> > >>> > > > > > >> > >>> > > ------------------------------------------------------------------ >>> > > > > > >> > From:Stephan Ewen <se...@apache.org> >>> > > > > > >> > Send Time:2020年7月7日(星期二) 14:52 >>> > > > > > >> > To:Thomas Weise <t...@apache.org> >>> > > > > > >> > Cc:Stephan Ewen <se...@apache.org>; Zhijiang < >>> > > > > > >> wangzhijiang...@aliyun.com>; Aljoscha Krettek < >>> > > aljos...@apache.org >>> > > > >; >>> > > > > > >> Arvid Heise <ar...@ververica.com> >>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release >>> > > 1.11.0, >>> > > > > > >> release candidate #4) >>> > > > > > >> > >>> > > > > > >> > Thank you for the digging so deeply. >>> > > > > > >> > Mysterious think this regression. >>> > > > > > >> > >>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <t...@apache.org> >>> > wrote: >>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is >>> > > unchanged >>> > > > > > >> between 1.10 and 1.11 for the specific pipeline). >>> > > > > > >> > >>> > > > > > >> > I verified that increasing the checkpointing interval >>> does not >>> > > > make >>> > > > > a >>> > > > > > >> difference. >>> > > > > > >> > >>> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1 and >>> > don't >>> > > > see >>> > > > > > >> anything that could cause this. >>> > > > > > >> > >>> > > > > > >> > Another pipeline that is using the Kinesis consumer (but >>> not >>> > the >>> > > > > > >> producer) performs as expected. 
>>> > > > > > >> > >>> > > > > > >> > I tried reverting the AWS SDK version change, symptoms >>> remain >>> > > > > > unchanged: >>> > > > > > >> > >>> > > > > > >> > diff --git >>> a/flink-connectors/flink-connector-kinesis/pom.xml >>> > > > > > >> b/flink-connectors/flink-connector-kinesis/pom.xml >>> > > > > > >> > index a6abce23ba..741743a05e 100644 >>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml >>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml >>> > > > > > >> > @@ -33,7 +33,7 @@ under the License. >>> > > > > > >> > >>> > > > > > >> >>> > > > > >>> > > >>> <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId> >>> > > > > > >> > <name>flink-connector-kinesis</name> >>> > > > > > >> > <properties> >>> > > > > > >> > - >>> <aws.sdk.version>1.11.754</aws.sdk.version> >>> > > > > > >> > + >>> <aws.sdk.version>1.11.603</aws.sdk.version> >>> > > > > > >> > >>> > > > > > >> <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version> >>> > > > > > >> > >>> > > > > > >> <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version> >>> > > > > > >> > >>> > > > > > >> >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version> >>> > > > > > >> > >>> > > > > > >> > I'm planning to take a look with a profiler next. >>> > > > > > >> > >>> > > > > > >> > Thomas >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen < >>> > se...@apache.org> >>> > > > > > wrote: >>> > > > > > >> > Hi all! >>> > > > > > >> > >>> > > > > > >> > Forking this thread out of the release vote thread. >>> > > > > > >> > From what Thomas describes, it really sounds like a >>> > > sink-specific >>> > > > > > >> issue. >>> > > > > > >> > >>> > > > > > >> > @Thomas: When you say sink has a long synchronous >>> checkpoint >>> > > time, >>> > > > > you >>> > > > > > >> mean the time that is shown as "sync time" on the metrics >>> and >>> > web >>> > > > UI? >>> > > > > > That >>> > > > > > >> is not including any network buffer related operations. It >>> is >>> > > purely >>> > > > > the >>> > > > > > >> operator's time. >>> > > > > > >> > >>> > > > > > >> > Can we dig into the changes we did in sinks: >>> > > > > > >> > - Kinesis version upgrade, AWS library updates >>> > > > > > >> > >>> > > > > > >> > - Could it be that some call (checkpoint complete) >>> that was >>> > > > > > >> previously (1.10) in a separate thread is not in the >>> mailbox and >>> > > > this >>> > > > > > >> simply reduces the number of threads that do the work? >>> > > > > > >> > >>> > > > > > >> > - Did sink checkpoint notifications change in a >>> relevant >>> > way, >>> > > > for >>> > > > > > >> example due to some Kafka issues we addressed in 1.11 >>> (@Aljoscha >>> > > > > maybe?) >>> > > > > > >> > >>> > > > > > >> > Best, >>> > > > > > >> > Stephan >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang < >>> > > > wangzhijiang...@aliyun.com >>> > > > > > .invalid> >>> > > > > > >> wrote: >>> > > > > > >> > Hi Thomas, >>> > > > > > >> > >>> > > > > > >> > Regarding [2], it has more detail infos in the Jira >>> > > description >>> > > > ( >>> > > > > > >> https://issues.apache.org/jira/browse/FLINK-16404). >>> > > > > > >> > >>> > > > > > >> > I can also give some basic explanations here to dismiss >>> the >>> > > > > concern. >>> > > > > > >> > 1. 
In the past, the following buffers after the barrier >>> will >>> > > be >>> > > > > > >> cached on downstream side before alignment. >>> > > > > > >> > 2. In 1.11, the upstream would not send the buffers >>> after >>> > the >>> > > > > > >> barrier. When the downstream finishes the alignment, it will >>> > > notify >>> > > > > the >>> > > > > > >> downstream of continuing sending following buffers, since >>> it can >>> > > > > process >>> > > > > > >> them after alignment. >>> > > > > > >> > 3. The only difference is that the temporary blocked >>> buffers >>> > > are >>> > > > > > >> cached either on downstream side or on upstream side before >>> > > > alignment. >>> > > > > > >> > 4. The side effect would be the additional notification >>> cost >>> > > for >>> > > > > > >> every barrier alignment. If the downstream and upstream are >>> > > deployed >>> > > > > in >>> > > > > > >> separate TaskManager, the cost is network transport delay >>> (the >>> > > > effect >>> > > > > > can >>> > > > > > >> be ignored based on our testing with 1s checkpoint >>> interval). >>> > For >>> > > > > > sharing >>> > > > > > >> slot in your case, the cost is only one method call in >>> > processor, >>> > > > can >>> > > > > be >>> > > > > > >> ignored also. >>> > > > > > >> > >>> > > > > > >> > You mentioned "In this case, the downstream task has a >>> high >>> > > > > average >>> > > > > > >> checkpoint duration(~30s, sync part)." This duration is not >>> > > > reflecting >>> > > > > > the >>> > > > > > >> changes above, and it is only indicating the duration for >>> > calling >>> > > > > > >> `Operation.snapshotState`. >>> > > > > > >> > If this duration is beyond your expectation, you can >>> check >>> > or >>> > > > > debug >>> > > > > > >> whether the source/sink operations might take more time to >>> > finish >>> > > > > > >> `snapshotState` in practice. E.g. you can >>> > > > > > >> > make the implementation of this method as empty to >>> further >>> > > > verify >>> > > > > > the >>> > > > > > >> effect. >>> > > > > > >> > >>> > > > > > >> > Best, >>> > > > > > >> > Zhijiang >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > ------------------------------------------------------------------ >>> > > > > > >> > From:Thomas Weise <t...@apache.org> >>> > > > > > >> > Send Time:2020年7月5日(星期日) 12:22 >>> > > > > > >> > To:dev <dev@flink.apache.org>; Zhijiang < >>> > > > > wangzhijiang...@aliyun.com >>> > > > > > > >>> > > > > > >> > Cc:Yingjie Cao <kevin.ying...@gmail.com> >>> > > > > > >> > Subject:Re: [VOTE] Release 1.11.0, release candidate #4 >>> > > > > > >> > >>> > > > > > >> > Hi Zhijiang, >>> > > > > > >> > >>> > > > > > >> > Could you please point me to more details regarding: >>> "[2]: >>> > > Delay >>> > > > > > send >>> > > > > > >> the >>> > > > > > >> > following buffers after checkpoint barrier on upstream >>> side >>> > > > until >>> > > > > > >> barrier >>> > > > > > >> > alignment on downstream side." >>> > > > > > >> > >>> > > > > > >> > In this case, the downstream task has a high average >>> > > checkpoint >>> > > > > > >> duration >>> > > > > > >> > (~30s, sync part). If there was a change to hold buffers >>> > > > depending >>> > > > > > on >>> > > > > > >> > downstream performance, could this possibly apply to >>> this >>> > case >>> > > > > (even >>> > > > > > >> when >>> > > > > > >> > there is no shuffle that would require alignment)? 
>>> > > > > > >> > >>> > > > > > >> > Thanks, >>> > > > > > >> > Thomas >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > On Sat, Jul 4, 2020 at 7:39 AM Zhijiang < >>> > > > > wangzhijiang...@aliyun.com >>> > > > > > >> .invalid> >>> > > > > > >> > wrote: >>> > > > > > >> > >>> > > > > > >> > > Hi Thomas, >>> > > > > > >> > > >>> > > > > > >> > > Thanks for the further update information. >>> > > > > > >> > > >>> > > > > > >> > > I guess we can dismiss the network stack changes, >>> since in >>> > > > your >>> > > > > > >> case the >>> > > > > > >> > > downstream and upstream would probably be deployed in >>> the >>> > > same >>> > > > > > slot >>> > > > > > >> > > bypassing the network data shuffle. >>> > > > > > >> > > Also I guess release-1.11 will not bring general >>> > performance >>> > > > > > >> regression in >>> > > > > > >> > > runtime engine, as we also did the performance >>> testing for >>> > > all >>> > > > > > >> general >>> > > > > > >> > > cases by [1] in real cluster before and the testing >>> > results >>> > > > > should >>> > > > > > >> fit the >>> > > > > > >> > > expectation. But we indeed did not test the specific >>> > source >>> > > > and >>> > > > > > sink >>> > > > > > >> > > connectors yet as I known. >>> > > > > > >> > > >>> > > > > > >> > > Regarding your performance regression with 40%, I >>> wonder >>> > it >>> > > is >>> > > > > > >> probably >>> > > > > > >> > > related to specific source/sink changes (e.g. >>> kinesis) or >>> > > > > > >> environment >>> > > > > > >> > > issues with corner case. >>> > > > > > >> > > If possible, it would be helpful to further locate >>> whether >>> > > the >>> > > > > > >> regression >>> > > > > > >> > > is caused by kinesis, by replacing the kinesis source >>> & >>> > sink >>> > > > and >>> > > > > > >> keeping >>> > > > > > >> > > the others same. >>> > > > > > >> > > >>> > > > > > >> > > As you said, it would be efficient to contact with you >>> > > > directly >>> > > > > > >> next week >>> > > > > > >> > > to further discuss this issue. And we are >>> willing/eager to >>> > > > > provide >>> > > > > > >> any help >>> > > > > > >> > > to resolve this issue soon. >>> > > > > > >> > > >>> > > > > > >> > > Besides that, I guess this issue should not be the >>> blocker >>> > > for >>> > > > > the >>> > > > > > >> > > release, since it is probably a corner case based on >>> the >>> > > > current >>> > > > > > >> analysis. >>> > > > > > >> > > If we really conclude anything need to be resolved >>> after >>> > the >>> > > > > final >>> > > > > > >> > > release, then we can also make the next minor >>> > release-1.11.1 >>> > > > > come >>> > > > > > >> soon. 
>>> > > > > > >> > > >>> > > > > > >> > > [1] https://issues.apache.org/jira/browse/FLINK-18433 >>> > > > > > >> > > >>> > > > > > >> > > Best, >>> > > > > > >> > > Zhijiang >>> > > > > > >> > > >>> > > > > > >> > > >>> > > > > > >> > > >>> > > > > >>> ------------------------------------------------------------------ >>> > > > > > >> > > From:Thomas Weise <t...@apache.org> >>> > > > > > >> > > Send Time:2020年7月4日(星期六) 12:26 >>> > > > > > >> > > To:dev <dev@flink.apache.org>; Zhijiang < >>> > > > > > wangzhijiang...@aliyun.com >>> > > > > > >> > >>> > > > > > >> > > Cc:Yingjie Cao <kevin.ying...@gmail.com> >>> > > > > > >> > > Subject:Re: [VOTE] Release 1.11.0, release candidate >>> #4 >>> > > > > > >> > > >>> > > > > > >> > > Hi Zhijiang, >>> > > > > > >> > > >>> > > > > > >> > > It will probably be best if we connect next week and >>> > discuss >>> > > > the >>> > > > > > >> issue >>> > > > > > >> > > directly since this could be quite difficult to >>> reproduce. >>> > > > > > >> > > >>> > > > > > >> > > Before the testing result on our side comes out for >>> your >>> > > > > > respective >>> > > > > > >> job >>> > > > > > >> > > case, I have some other questions to confirm for >>> further >>> > > > > analysis: >>> > > > > > >> > > - How much percentage regression you found after >>> > > > switching >>> > > > > to >>> > > > > > >> 1.11? >>> > > > > > >> > > >>> > > > > > >> > > ~40% throughput decline >>> > > > > > >> > > >>> > > > > > >> > > - Are there any network bottleneck in your >>> cluster? >>> > > E.g. >>> > > > > the >>> > > > > > >> network >>> > > > > > >> > > bandwidth is full caused by other jobs? If so, it >>> might >>> > have >>> > > > > more >>> > > > > > >> effects >>> > > > > > >> > > by above [2] >>> > > > > > >> > > >>> > > > > > >> > > The test runs on a k8s cluster that is also used for >>> other >>> > > > > > >> production jobs. >>> > > > > > >> > > There is no reason be believe network is the >>> bottleneck. >>> > > > > > >> > > >>> > > > > > >> > > - Did you adjust the default network buffer >>> setting? >>> > > E.g. >>> > > > > > >> > > >>> "taskmanager.network.memory.floating-buffers-per-gate" or >>> > > > > > >> > > "taskmanager.network.memory.buffers-per-channel" >>> > > > > > >> > > >>> > > > > > >> > > The job is using the defaults, i.e we don't configure >>> the >>> > > > > > settings. >>> > > > > > >> If you >>> > > > > > >> > > want me to try specific settings in the hope that it >>> will >>> > > help >>> > > > > to >>> > > > > > >> isolate >>> > > > > > >> > > the issue please let me know. >>> > > > > > >> > > >>> > > > > > >> > > - I guess the topology has three vertexes >>> > > > "KinesisConsumer >>> > > > > -> >>> > > > > > >> Chained >>> > > > > > >> > > FlatMap -> KinesisProducer", and the partition mode >>> for >>> > > > > > >> "KinesisConsumer -> >>> > > > > > >> > > FlatMap" and "FlatMap->KinesisProducer" are both >>> > "forward"? >>> > > If >>> > > > > so, >>> > > > > > >> the edge >>> > > > > > >> > > connection is one-to-one, not all-to-all, then the >>> above >>> > > > [1][2] >>> > > > > > >> should no >>> > > > > > >> > > effects in theory with default network buffer setting. >>> > > > > > >> > > >>> > > > > > >> > > There are only 2 vertices and the edge is "forward". 
>>> > > > > > >> > > >>> > > > > > >> > > - By slot sharing, I guess these three vertex >>> > > parallelism >>> > > > > task >>> > > > > > >> would >>> > > > > > >> > > probably be deployed into the same slot, then the data >>> > > shuffle >>> > > > > is >>> > > > > > >> by memory >>> > > > > > >> > > queue, not network stack. If so, the above [2] should >>> no >>> > > > effect. >>> > > > > > >> > > >>> > > > > > >> > > Yes, vertices share slots. >>> > > > > > >> > > >>> > > > > > >> > > - I also saw some Jira changes for kinesis in this >>> > > > release, >>> > > > > > >> could you >>> > > > > > >> > > confirm that these changes would not effect the >>> > performance? >>> > > > > > >> > > >>> > > > > > >> > > I will need to take a look. 1.10 already had a >>> regression >>> > > > > > >> introduced by the >>> > > > > > >> > > Kinesis producer update. >>> > > > > > >> > > >>> > > > > > >> > > >>> > > > > > >> > > Thanks, >>> > > > > > >> > > Thomas >>> > > > > > >> > > >>> > > > > > >> > > >>> > > > > > >> > > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang < >>> > > > > > >> wangzhijiang...@aliyun.com >>> > > > > > >> > > .invalid> >>> > > > > > >> > > wrote: >>> > > > > > >> > > >>> > > > > > >> > > > Hi Thomas, >>> > > > > > >> > > > >>> > > > > > >> > > > Thanks for your reply with rich information! >>> > > > > > >> > > > >>> > > > > > >> > > > We are trying to reproduce your case in our cluster >>> to >>> > > > further >>> > > > > > >> verify it, >>> > > > > > >> > > > and @Yingjie Cao is working on it now. >>> > > > > > >> > > > As we have not kinesis consumer and producer >>> > internally, >>> > > so >>> > > > > we >>> > > > > > >> will >>> > > > > > >> > > > construct the common source and sink instead in the >>> case >>> > > of >>> > > > > > >> backpressure. >>> > > > > > >> > > > >>> > > > > > >> > > > Firstly, we can dismiss the rockdb factor in this >>> > release, >>> > > > > since >>> > > > > > >> you also >>> > > > > > >> > > > mentioned that "filesystem leads to same symptoms". >>> > > > > > >> > > > >>> > > > > > >> > > > Secondly, if my understanding is right, you emphasis >>> > that >>> > > > the >>> > > > > > >> regression >>> > > > > > >> > > > only exists for the jobs with low checkpoint >>> interval >>> > > (10s). >>> > > > > > >> > > > Based on that, I have two suspicions with the >>> network >>> > > > related >>> > > > > > >> changes in >>> > > > > > >> > > > this release: >>> > > > > > >> > > > - [1]: Limited the maximum backlog value >>> (default >>> > 10) >>> > > in >>> > > > > > >> subpartition >>> > > > > > >> > > > queue. >>> > > > > > >> > > > - [2]: Delay send the following buffers after >>> > > checkpoint >>> > > > > > >> barrier on >>> > > > > > >> > > > upstream side until barrier alignment on downstream >>> > side. >>> > > > > > >> > > > >>> > > > > > >> > > > These changes are motivated for reducing the >>> in-flight >>> > > > buffers >>> > > > > > to >>> > > > > > >> speedup >>> > > > > > >> > > > checkpoint especially in the case of backpressure. >>> > > > > > >> > > > In theory they should have very minor performance >>> effect >>> > > and >>> > > > > > >> actually we >>> > > > > > >> > > > also tested in cluster to verify within expectation >>> > before >>> > > > > > >> merging them, >>> > > > > > >> > > > but maybe there are other corner cases we have not >>> > > thought >>> > > > of >>> > > > > > >> before. 
>>> > > > > > >> > > > >>> > > > > > >> > > > Before the testing result on our side comes out for >>> your >>> > > > > > >> respective job >>> > > > > > >> > > > case, I have some other questions to confirm for >>> further >>> > > > > > analysis: >>> > > > > > >> > > > - How much percentage regression you found >>> after >>> > > > > switching >>> > > > > > >> to 1.11? >>> > > > > > >> > > > - Are there any network bottleneck in your >>> cluster? >>> > > > E.g. >>> > > > > > the >>> > > > > > >> network >>> > > > > > >> > > > bandwidth is full caused by other jobs? If so, it >>> might >>> > > have >>> > > > > > more >>> > > > > > >> effects >>> > > > > > >> > > > by above [2] >>> > > > > > >> > > > - Did you adjust the default network buffer >>> > setting? >>> > > > E.g. >>> > > > > > >> > > > >>> "taskmanager.network.memory.floating-buffers-per-gate" >>> > or >>> > > > > > >> > > > "taskmanager.network.memory.buffers-per-channel" >>> > > > > > >> > > > - I guess the topology has three vertexes >>> > > > > "KinesisConsumer >>> > > > > > -> >>> > > > > > >> > > Chained >>> > > > > > >> > > > FlatMap -> KinesisProducer", and the partition mode >>> for >>> > > > > > >> "KinesisConsumer >>> > > > > > >> > > -> >>> > > > > > >> > > > FlatMap" and "FlatMap->KinesisProducer" are both >>> > > "forward"? >>> > > > If >>> > > > > > >> so, the >>> > > > > > >> > > edge >>> > > > > > >> > > > connection is one-to-one, not all-to-all, then the >>> above >>> > > > > [1][2] >>> > > > > > >> should no >>> > > > > > >> > > > effects in theory with default network buffer >>> setting. >>> > > > > > >> > > > - By slot sharing, I guess these three vertex >>> > > > parallelism >>> > > > > > >> task would >>> > > > > > >> > > > probably be deployed into the same slot, then the >>> data >>> > > > shuffle >>> > > > > > is >>> > > > > > >> by >>> > > > > > >> > > memory >>> > > > > > >> > > > queue, not network stack. If so, the above [2] >>> should no >>> > > > > effect. >>> > > > > > >> > > > - I also saw some Jira changes for kinesis in >>> this >>> > > > > release, >>> > > > > > >> could you >>> > > > > > >> > > > confirm that these changes would not effect the >>> > > performance? >>> > > > > > >> > > > >>> > > > > > >> > > > Best, >>> > > > > > >> > > > Zhijiang >>> > > > > > >> > > > >>> > > > > > >> > > > >>> > > > > > >> > > > >>> > > > > > >>> ------------------------------------------------------------------ >>> > > > > > >> > > > From:Thomas Weise <t...@apache.org> >>> > > > > > >> > > > Send Time:2020年7月3日(星期五) 01:07 >>> > > > > > >> > > > To:dev <dev@flink.apache.org>; Zhijiang < >>> > > > > > >> wangzhijiang...@aliyun.com> >>> > > > > > >> > > > Subject:Re: [VOTE] Release 1.11.0, release >>> candidate #4 >>> > > > > > >> > > > >>> > > > > > >> > > > Hi Zhijiang, >>> > > > > > >> > > > >>> > > > > > >> > > > The performance degradation manifests in >>> backpressure >>> > > which >>> > > > > > leads >>> > > > > > >> to >>> > > > > > >> > > > growing backlog in the source. I switched a few >>> times >>> > > > between >>> > > > > > >> 1.10 and >>> > > > > > >> > > 1.11 >>> > > > > > >> > > > and the behavior is consistent. >>> > > > > > >> > > > >>> > > > > > >> > > > The DAG is: >>> > > > > > >> > > > >>> > > > > > >> > > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) >>> > > -------- >>> > > > > > >> forward >>> > > > > > >> > > > ---------> KinesisProducer >>> > > > > > >> > > > >>> > > > > > >> > > > Parallelism: 160 >>> > > > > > >> > > > No shuffle/rebalance. 
>>> > > > > > >> > > > >>> > > > > > >> > > > Checkpointing config: >>> > > > > > >> > > > >>> > > > > > >> > > > Checkpointing Mode Exactly Once >>> > > > > > >> > > > Interval 10s >>> > > > > > >> > > > Timeout 10m 0s >>> > > > > > >> > > > Minimum Pause Between Checkpoints 10s >>> > > > > > >> > > > Maximum Concurrent Checkpoints 1 >>> > > > > > >> > > > Persist Checkpoints Externally Enabled (delete on >>> > > > > cancellation) >>> > > > > > >> > > > >>> > > > > > >> > > > State backend: rocksdb (filesystem leads to same >>> > > symptoms) >>> > > > > > >> > > > Checkpoint size is tiny (500KB) >>> > > > > > >> > > > >>> > > > > > >> > > > An interesting difference to another job that I had >>> > > upgraded >>> > > > > > >> successfully >>> > > > > > >> > > > is the low checkpointing interval. >>> > > > > > >> > > > >>> > > > > > >> > > > Thanks, >>> > > > > > >> > > > Thomas >>> > > > > > >> > > > >>> > > > > > >> > > > >>> > > > > > >> > > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang < >>> > > > > > >> wangzhijiang...@aliyun.com >>> > > > > > >> > > > .invalid> >>> > > > > > >> > > > wrote: >>> > > > > > >> > > > >>> > > > > > >> > > > > Hi Thomas, >>> > > > > > >> > > > > >>> > > > > > >> > > > > Thanks for the efficient feedback. >>> > > > > > >> > > > > >>> > > > > > >> > > > > Regarding the suggestion of adding the release >>> notes >>> > > > > document, >>> > > > > > >> I agree >>> > > > > > >> > > > > with your point. Maybe we should adjust the vote >>> > > template >>> > > > > > >> accordingly >>> > > > > > >> > > in >>> > > > > > >> > > > > the respective wiki to guide the following release >>> > > > > processes. >>> > > > > > >> > > > > >>> > > > > > >> > > > > Regarding the performance regression, could you >>> > provide >>> > > > some >>> > > > > > >> more >>> > > > > > >> > > details >>> > > > > > >> > > > > for our better measurement or reproducing on our >>> > sides? >>> > > > > > >> > > > > E.g. I guess the topology only includes two >>> vertexes >>> > > > source >>> > > > > > and >>> > > > > > >> sink? >>> > > > > > >> > > > > What is the parallelism for every vertex? >>> > > > > > >> > > > > The upstream shuffles data to the downstream via >>> > > rebalance >>> > > > > > >> partitioner >>> > > > > > >> > > or >>> > > > > > >> > > > > other? >>> > > > > > >> > > > > The checkpoint mode is exactly-once with rocksDB >>> state >>> > > > > > backend? >>> > > > > > >> > > > > The backpressure happened in this case? >>> > > > > > >> > > > > How much percentage regression in this case? >>> > > > > > >> > > > > >>> > > > > > >> > > > > Best, >>> > > > > > >> > > > > Zhijiang >>> > > > > > >> > > > > >>> > > > > > >> > > > > >>> > > > > > >> > > > > >>> > > > > > >> > > > > >>> > > > > > >> >>> > ------------------------------------------------------------------ >>> > > > > > >> > > > > From:Thomas Weise <t...@apache.org> >>> > > > > > >> > > > > Send Time:2020年7月2日(星期四) 09:54 >>> > > > > > >> > > > > To:dev <dev@flink.apache.org> >>> > > > > > >> > > > > Subject:Re: [VOTE] Release 1.11.0, release >>> candidate >>> > #4 >>> > > > > > >> > > > > >>> > > > > > >> > > > > Hi Till, >>> > > > > > >> > > > > >>> > > > > > >> > > > > Yes, we don't have the setting in flink-conf.yaml. >>> > > > > > >> > > > > >>> > > > > > >> > > > > Generally, we carry forward the existing >>> configuration >>> > > and >>> > > > > any >>> > > > > > >> change >>> > > > > > >> > > to >>> > > > > > >> > > > > default configuration values would impact the >>> upgrade. 
>>> > > > > > >> > > > > >>> > > > > > >> > > > > Yes, since it is an incompatible change I would >>> state >>> > it >>> > > > in >>> > > > > > the >>> > > > > > >> release >>> > > > > > >> > > > > notes. >>> > > > > > >> > > > > >>> > > > > > >> > > > > Thanks, >>> > > > > > >> > > > > Thomas >>> > > > > > >> > > > > >>> > > > > > >> > > > > BTW I found a performance regression while trying >>> to >>> > > > upgrade >>> > > > > > >> another >>> > > > > > >> > > > > pipeline with this RC. It is a simple Kinesis to >>> > Kinesis >>> > > > > job. >>> > > > > > >> Wasn't >>> > > > > > >> > > able >>> > > > > > >> > > > > to pin it down yet, symptoms include increased >>> > > checkpoint >>> > > > > > >> alignment >>> > > > > > >> > > time. >>> > > > > > >> > > > > >>> > > > > > >> > > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann < >>> > > > > > >> trohrm...@apache.org> >>> > > > > > >> > > > > wrote: >>> > > > > > >> > > > > >>> > > > > > >> > > > > > Hi Thomas, >>> > > > > > >> > > > > > >>> > > > > > >> > > > > > just to confirm: When starting the image in >>> local >>> > > mode, >>> > > > > then >>> > > > > > >> you >>> > > > > > >> > > don't >>> > > > > > >> > > > > have >>> > > > > > >> > > > > > any of the JobManager memory configuration >>> settings >>> > > > > > >> configured in the >>> > > > > > >> > > > > > effective flink-conf.yaml, right? Does this mean >>> > that >>> > > > you >>> > > > > > have >>> > > > > > >> > > > explicitly >>> > > > > > >> > > > > > removed `jobmanager.heap.size: 1024m` from the >>> > default >>> > > > > > >> configuration? >>> > > > > > >> > > > If >>> > > > > > >> > > > > > this is the case, then I believe it was more of >>> an >>> > > > > > >> unintentional >>> > > > > > >> > > > artifact >>> > > > > > >> > > > > > that it worked before and it has been corrected >>> now >>> > so >>> > > > > that >>> > > > > > >> one needs >>> > > > > > >> > > > to >>> > > > > > >> > > > > > specify the memory of the JM process >>> explicitly. Do >>> > > you >>> > > > > > think >>> > > > > > >> it >>> > > > > > >> > > would >>> > > > > > >> > > > > help >>> > > > > > >> > > > > > to explicitly state this in the release notes? >>> > > > > > >> > > > > > >>> > > > > > >> > > > > > Cheers, >>> > > > > > >> > > > > > Till >>> > > > > > >> > > > > > >>> > > > > > >> > > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise < >>> > > > > t...@apache.org >>> > > > > > > >>> > > > > > >> wrote: >>> > > > > > >> > > > > > >>> > > > > > >> > > > > > > Thanks for preparing another RC! >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > As mentioned in the previous RC thread, it >>> would >>> > be >>> > > > > super >>> > > > > > >> helpful >>> > > > > > >> > > if >>> > > > > > >> > > > > the >>> > > > > > >> > > > > > > release notes that are part of the >>> documentation >>> > can >>> > > > be >>> > > > > > >> included >>> > > > > > >> > > [1]. >>> > > > > > >> > > > > > It's >>> > > > > > >> > > > > > > a significant time-saver to have read those >>> first. >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > I found one more non-backward compatible >>> change >>> > that >>> > > > > would >>> > > > > > >> be worth >>> > > > > > >> > > > > > > addressing/mentioning: >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > It is now necessary to configure the >>> jobmanager >>> > heap >>> > > > > size >>> > > > > > in >>> > > > > > >> > > > > > > flink-conf.yaml (with either >>> jobmanager.heap.size >>> > > > > > >> > > > > > > or jobmanager.memory.heap.size). 
Why would I >>> not >>> > > want >>> > > > to >>> > > > > > do >>> > > > > > >> that >>> > > > > > >> > > > > anyways? >>> > > > > > >> > > > > > > Well, we set it dynamically for a cluster >>> > deployment >>> > > > via >>> > > > > > the >>> > > > > > >> > > > > > > flinkk8soperator, but the container image can >>> also >>> > > be >>> > > > > used >>> > > > > > >> for >>> > > > > > >> > > > testing >>> > > > > > >> > > > > > with >>> > > > > > >> > > > > > > local mode (./bin/jobmanager.sh >>> start-foreground >>> > > > local). >>> > > > > > >> That will >>> > > > > > >> > > > fail >>> > > > > > >> > > > > > if >>> > > > > > >> > > > > > > the heap wasn't configured and that's how I >>> > noticed >>> > > > it. >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > Thanks, >>> > > > > > >> > > > > > > Thomas >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > [1] >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > >>> > > > > > >> > > > > >>> > > > > > >> > > > >>> > > > > > >> > > >>> > > > > > >> >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang < >>> > > > > > >> > > wangzhijiang...@aliyun.com >>> > > > > > >> > > > > > > .invalid> >>> > > > > > >> > > > > > > wrote: >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > > > Hi everyone, >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > > Please review and vote on the release >>> candidate >>> > #4 >>> > > > for >>> > > > > > the >>> > > > > > >> > > version >>> > > > > > >> > > > > > > 1.11.0, >>> > > > > > >> > > > > > > > as follows: >>> > > > > > >> > > > > > > > [ ] +1, Approve the release >>> > > > > > >> > > > > > > > [ ] -1, Do not approve the release (please >>> > provide >>> > > > > > >> specific >>> > > > > > >> > > > comments) >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > > The complete staging area is available for >>> your >>> > > > > review, >>> > > > > > >> which >>> > > > > > >> > > > > includes: >>> > > > > > >> > > > > > > > * JIRA release notes [1], >>> > > > > > >> > > > > > > > * the official Apache source release and >>> binary >>> > > > > > >> convenience >>> > > > > > >> > > > releases >>> > > > > > >> > > > > to >>> > > > > > >> > > > > > > be >>> > > > > > >> > > > > > > > deployed to dist.apache.org [2], which are >>> > signed >>> > > > > with >>> > > > > > >> the key >>> > > > > > >> > > > with >>> > > > > > >> > > > > > > > fingerprint >>> > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E >>> > > > > > [3], >>> > > > > > >> > > > > > > > * all artifacts to be deployed to the Maven >>> > > Central >>> > > > > > >> Repository >>> > > > > > >> > > [4], >>> > > > > > >> > > > > > > > * source code tag "release-1.11.0-rc4" [5], >>> > > > > > >> > > > > > > > * website pull request listing the new >>> release >>> > and >>> > > > > > adding >>> > > > > > >> > > > > announcement >>> > > > > > >> > > > > > > > blog post [6]. >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > > The vote will be open for at least 72 >>> hours. It >>> > is >>> > > > > > >> adopted by >>> > > > > > >> > > > > majority >>> > > > > > >> > > > > > > > approval, with at least 3 PMC affirmative >>> votes. 
>>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > > Thanks, >>> > > > > > >> > > > > > > > Release Manager >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > > [1] >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > >>> > > > > > >> > > > > >>> > > > > > >> > > > >>> > > > > > >> > > >>> > > > > > >> >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364 >>> > > > > > >> > > > > > > > [2] >>> > > > > > >> > > >>> > > > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/ >>> > > > > > >> > > > > > > > [3] >>> > > > > > https://dist.apache.org/repos/dist/release/flink/KEYS >>> > > > > > >> > > > > > > > [4] >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > >>> > > > > > >> > > > >>> > > > > > >> >>> > > > > >>> > > >>> https://repository.apache.org/content/repositories/orgapacheflink-1377/ >>> > > > > > >> > > > > > > > [5] >>> > > > > > >> > > > >>> > > > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4 >>> > > > > > >> > > > > > > > [6] >>> > https://github.com/apache/flink-web/pull/352 >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > > >>> > > > > > >> > > > > > > >>> > > > > > >> > > > > > >>> > > > > > >> > > > > >>> > > > > > >> > > > > >>> > > > > > >> > > > >>> > > > > > >> > > > >>> > > > > > >> > > >>> > > > > > >> > > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> >>> > > > > > >> >>> > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> > >>> > -- >>> > Regards, >>> > Roman >>> > >>> >> >> >> -- >> Regards, >> Roman >> >