Re: Performance issue associated with managed RocksDB memory

2020-09-15 Thread Yu Li
Thanks for the follow up Juha, I've just assigned FLINK-19238 to you. Let's
further track this on JIRA.

Best Regards,
Yu


On Tue, 15 Sep 2020 at 15:04, Juha Mynttinen 
wrote:

> Hey
>
> I created this one https://issues.apache.org/jira/browse/FLINK-19238.
>
> Regards,
> Juha
> --
> *From:* Yun Tang 
> *Sent:* Tuesday, September 15, 2020 8:06 AM
> *To:* Juha Mynttinen ; Stephan Ewen <
> se...@apache.org>
> *Cc:* user@flink.apache.org 
> *Subject:* Re: Performance issue associated with managed RocksDB memory
>
> Hi Juha
>
> Would you please consider contributing this back to the community? If you
> agree, please open a JIRA ticket and we can help review your PR.
>
> Best
> Yun Tang
> --
> *From:* Juha Mynttinen 
> *Sent:* Thursday, September 10, 2020 19:05
> *To:* Stephan Ewen 
> *Cc:* Yun Tang ; user@flink.apache.org <
> user@flink.apache.org>
> *Subject:* Re: Performance issue associated with managed RocksDB memory
>
> Hey
>
> I've fixed the code
> (https://github.com/juha-mynttinen-king/flink/commits/arena_block_sanity_check)
> slightly. Now it logs a WARN if the memory configuration issue is present. I
> also think there was a bug in the way the check calculated the mutable memory;
> I fixed that and wrote some tests.
>
> I tried the code, and in my setup I get a bunch of WARNs when the memory
> configuration issue is happening:
>
> 20200910T140320.516+0300  WARN RocksDBStateBackend performance will be
> poor because of the current Flink memory configuration! RocksDB will flush
> memtable constantly, causing high IO and CPU. Typically the easiest fix is
> to increase task manager managed memory size. If running locally, see the
> parameter taskmanager.memory.managed.size. Details: arenaBlockSize 8388608
> < mutableLimit 7829367 (writeBufferSize 67108864 arenaBlockSizeConfigured 0
> defaultArenaBlockSize 8388608 writeBufferManagerCapacity 8947848)
>  
> [org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.sanityCheckArenaBlockSize()
> @ 189]
>
> Regards,
> Juha
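The arithmetic behind the WARN above can be reproduced with a small, self-contained sketch. This is only an illustration of the check the message describes, not the actual Flink code; the constants mirror the values printed in the log.

// Illustrative sketch of the arena-block-size sanity check described in the
// WARN message above; the 7/8 mutable limit follows the write buffer manager
// behavior referenced in this thread.
public class ArenaBlockSanityCheckSketch {

    public static void main(String[] args) {
        long writeBufferSize = 67108864L;            // 64 MB per memtable
        long arenaBlockSizeConfigured = 0L;          // 0 means "use the default"
        long writeBufferManagerCapacity = 8947848L;  // this operator's share of managed memory

        // RocksDB defaults the arena block size to writeBufferSize / 8.
        long defaultArenaBlockSize = writeBufferSize / 8;
        long arenaBlockSize =
                arenaBlockSizeConfigured == 0 ? defaultArenaBlockSize : arenaBlockSizeConfigured;

        // The write buffer manager starts flushing once mutable memtable memory
        // exceeds 7/8 of its capacity.
        long mutableLimit = writeBufferManagerCapacity * 7 / 8;

        if (arenaBlockSize > mutableLimit) {
            System.out.println("WARN: a single arena block (" + arenaBlockSize
                    + " bytes) already exceeds the mutable limit (" + mutableLimit
                    + " bytes); RocksDB will flush memtables constantly.");
        }
    }
}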
>
> --
> *From:* Stephan Ewen 
> *Sent:* Wednesday, September 9, 2020 1:56 PM
> *To:* Juha Mynttinen 
> *Cc:* Yun Tang ; user@flink.apache.org <
> user@flink.apache.org>
> *Subject:* Re: Performance issue associated with managed RocksDB memory
>
> Hey Juha!
>
> I agree that we cannot reasonably expect the majority of users to
> understand block sizes, arena sizes, etc. to get their application running.
> So the default should be "inform when there is a problem and suggest using
> more memory." Block/arena size tuning is for the absolute experts, the
> 5% super power users.
>
> The managed memory is 128 MB by default in the mini cluster. In a
> standalone session cluster setup with default config, it is 512 MB.
>
> Best,
> Stephan
>
>
>
> On Wed, Sep 9, 2020 at 11:10 AM Juha Mynttinen 
> wrote:
>
> Hey Yun,
>
> About the docs. I saw in the docs
> (https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html)
> this:
>
> "An advanced option (expert mode) to reduce the number of MemTable flushes
> in setups with many states, is to tune RocksDB’s ColumnFamily options
> (arena block size, max background flush threads, etc.) via a
> RocksDBOptionsFactory".
>
> Only after debugging the issue we're talking about did I figure out that this
> snippet in the docs is probably referring to the problem I'm witnessing. I
> think there are two issues here:
>
> 1) it's hard/impossible to know what kind of performance one can expect
> from a Flink application. Thus, it's hard to know if one is suffering e.g.
> from this performance issue, or if the system is performing normally
> (and inherently being slow).
> 2) even if one suspects a performance issue, it's very hard to find the
> root cause of the performance issue (memtable flush happening frequently).
> To find out this one would need to know what's the normal flush frequency.
>
> Also, the doc says "in setups with many states". The same problem is hit
> when using just one state, but with "high" parallelism (5).
>
> If the arena block size _ever_ needs to be configured only to "fix" this
> issue, it'd be best if there _never_ was a need to modify the arena block
> size. What if we forget even mentioning arena block size in the docs and
> focus o

Re: Disable WAL in RocksDB recovery

2020-09-18 Thread Yu Li
Thanks for bringing this up Juha, and good catch.

We actually disable WAL for routine writes by default when using RocksDB
and have never encountered segmentation fault issues. However, from the
history in FLINK-8922, the segmentation fault occurs during restore if WAL
is disabled, so I guess the root cause lies in RocksDB batch writes
(org.rocksdb.WriteBatch). And IMHO this is a RocksDB bug (it should work
well when WAL is disabled, whether under single or batch writes).

+1 for opening a new JIRA to figure the root cause out, fix it, and disable
WAL during restore by default (maybe checking the fixes around WriteBatch
in later RocksDB versions could help locate the issue more quickly), and
thanks for volunteering to take on the effort. I will follow up and help
review any findings / PR submissions.

Best Regards,
Yu


On Wed, 16 Sep 2020 at 13:58, Juha Mynttinen 
wrote:

> Hello there,
>
> I'd like to bring to discussion a previously discussed topic - disabling
> WAL in RocksDB recovery.
>
> It's clear that WAL is not needed during the process, the reason being
> that the WAL is never read, so there's no need to write it.
>
> AFAIK the last thing that was done with WAL during recovery was an attempt
> to remove it, and later a revert of that removal (
> https://issues.apache.org/jira/browse/FLINK-8922). If I interpret the
> comments in the ticket correctly, what happened was that a) WAL was kept in
> the recovery, and b) it's unknown why removing WAL causes a segfault.
>
> What can be seen in the ticket is that having WAL causes a significant
> performance penalty. Thus, getting rid of WAL would be a very nice
> performance improvement. I think it'd be worth creating a new JIRA
> ticket, at least as a reminder that WAL should be removed?
>
> I'm planning to add an experimental flag to remove WAL in the environment
> where I'm using Flink and to try it out. If the flag is made configurable, WAL
> can always be re-enabled if removing it causes issues.
>
> Thoughts?
>
> Regards,
> Juha
>
>
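For anyone who wants to experiment along the lines Juha proposes, RocksDB's JNI API already exposes per-write WAL control. Below is a minimal, hedged sketch against a standalone RocksDB instance (not Flink's actual restore code path); the database path and key/value are made up for illustration.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

// Minimal sketch: batch writes with the WAL disabled. Outside of a restore
// scenario this trades durability for speed, which is exactly why it is safe
// during restore (the data can always be re-loaded from the snapshot).
public class DisableWalSketch {

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/wal-test");
             WriteOptions writeOptions = new WriteOptions().setDisableWAL(true);
             WriteBatch batch = new WriteBatch()) {

            batch.put("key".getBytes(), "value".getBytes());
            // With the WAL disabled, the batch is only persisted via memtable flushes.
            db.write(writeOptions, batch);
        }
    }
}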


Re: [ANNOUNCE] Apache Flink 1.11.2 released

2020-09-20 Thread Yu Li
Thanks Zhu Zhu for being our release manager and everyone else who made the
release possible!

Best Regards,
Yu


On Thu, 17 Sep 2020 at 13:29, Zhu Zhu  wrote:

> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.11.2, which is the second bugfix release for the Apache Flink 1.11
> series.
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Please check out the release blog post for an overview of the improvements
> for this bugfix release:
> https://flink.apache.org/news/2020/09/17/release-1.11.2.html
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12348575
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Thanks,
> Zhu
>


Re: Disable WAL in RocksDB recovery

2020-09-21 Thread Yu Li
Great, thanks for the follow up.

Best Regards,
Yu


On Mon, 21 Sep 2020 at 15:04, Juha Mynttinen 
wrote:

> Good,
>
> I opened this JIRA for the issue
> https://issues.apache.org/jira/browse/FLINK-19303. The discussion can be
> moved there.
>
> Regards,
> Juha
> ----------
> *From:* Yu Li 
> *Sent:* Friday, September 18, 2020 3:58 PM
> *To:* Juha Mynttinen 
> *Cc:* user@flink.apache.org 
> *Subject:* Re: Disable WAL in RocksDB recovery
>
> Thanks for bringing this up Juha, and good catch.
>
> We actually disable WAL for routine writes by default when using RocksDB
> and have never encountered segmentation fault issues. However, from the
> history in FLINK-8922, the segmentation fault occurs during restore if WAL
> is disabled, so I guess the root cause lies in RocksDB batch writes
> (org.rocksdb.WriteBatch). And IMHO this is a RocksDB bug (it should work
> well when WAL is disabled, whether under single or batch writes).
>
> +1 for opening a new JIRA to figure the root cause out, fix it, and disable
> WAL during restore by default (maybe checking the fixes around WriteBatch
> in later RocksDB versions could help locate the issue more quickly), and
> thanks for volunteering to take on the effort. I will follow up and help
> review any findings / PR submissions.
>
> Best Regards,
> Yu
>
>
> On Wed, 16 Sep 2020 at 13:58, Juha Mynttinen 
> wrote:
>
> Hello there,
>
> I'd like to bring to discussion a previously discussed topic - disabling
> WAL in RocksDB recovery.
>
> It's clear that WAL is not needed during the process, the reason being
> that the WAL is never read, so there's no need to write it.
>
> AFAIK the last thing that was done with WAL during recovery is an attempt
> to remove it and later reverting that removal
> (https://issues.apache.org/jira/browse/FLINK-8922).
> If I interpret the comments in the ticket correctly, what happened was that
> a) WAL was kept in the recovery, and b) it's unknown why removing WAL
> causes a segfault.
>
> What can be seen in the ticket is that having WAL causes a significant
> performance penalty. Thus, getting rid of WAL would be a very nice
> performance improvement. I think it'd be worth creating a new JIRA
> ticket, at least as a reminder that WAL should be removed?
>
> I'm planning to add an experimental flag to remove WAL in the environment
> where I'm using Flink and to try it out. If the flag is made configurable, WAL
> can always be re-enabled if removing it causes issues.
>
> Thoughts?
>
> Regards,
> Juha
>
>


Re: flink checkpoint timeout

2020-10-05 Thread Yu Li
I'm not 100% sure but from the given information this might be related to
FLINK-14498 [1] and partially relieved by FLINK-16645 [2].

@Omkar Could you try the 1.11.0 release out and see whether the issue
disappeared?

@zhijiang  @yingjie could you also take a look
here? Thanks.

Best Regards,
Yu

[1] https://issues.apache.org/jira/browse/FLINK-14498
[2] https://issues.apache.org/jira/browse/FLINK-16645
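While testing (for example the 1.11.0 upgrade suggested above), it can also help to rule out checkpoint configuration as a factor in the growing checkpoint times. The sketch below shows the settings commonly adjusted during such an investigation; the concrete values are illustrative assumptions, not recommendations from this thread.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hedged sketch of checkpoint settings often adjusted while debugging growing
// checkpoint durations; the values are examples only.
public class CheckpointTuningSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE); // once a minute
        env.getCheckpointConfig().setCheckpointTimeout(20 * 60_000L);     // raise the 10 min timeout
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L); // give busy tasks some air
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);         // avoid piling checkpoints up

        // ... build the actual pipeline here before calling env.execute(...)
    }
}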


On Fri, 18 Sep 2020 at 09:28, Deshpande, Omkar 
wrote:

> These are the hotspot methods. Any pointers on debugging this? The
> checkpoints keep timing out since migrating to 1.10 from 1.9.
> --
> *From:* Deshpande, Omkar 
> *Sent:* Wednesday, September 16, 2020 5:27 PM
> *To:* Congxian Qiu 
> *Cc:* user@flink.apache.org ; Yun Tang <
> myas...@live.com>
> *Subject:* Re: flink checkpoint timeout
>
> This email is from an external sender.
>
> This thread seems to be stuck in an awaiting-notification state:
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)
>
> --
> *From:* Congxian Qiu 
> *Sent:* Monday, September 14, 2020 10:57 PM
> *To:* Deshpande, Omkar 
> *Cc:* user@flink.apache.org 
> *Subject:* Re: flink checkpoint timeout
>
> This email is from an external sender.
>
> Hi
> You can try to find out whether there is some hot method, or whether the
> snapshot stack is waiting for some lock.
> Best,
> Congxian
>
>
> Deshpande, Omkar wrote on Tue, Sep 15, 2020 at 12:30 PM:
>
> Few of the subtasks fail. I cannot upgrade to 1.11 yet. But I can still
> get the thread dump. What should I be looking for in the thread dump?
>
> --
> *From:* Yun Tang 
> *Sent:* Monday, September 14, 2020 8:52 PM
> *To:* Deshpande, Omkar ; user@flink.apache.org
> 
> *Subject:* Re: flink checkpoint timeout
>
> This email is from an external sender.
>
> Hi Omkar
>
> First of all, you should check the checkpoint web UI [1] to see whether
> many subtasks fail to complete in time or just a few of them. The former
> might mean your checkpoint timeout is not enough for the current case. The
> latter might mean some task is stuck on a slow machine or cannot grab the
> checkpoint lock to process the sync phase of checkpointing; you can use the
> thread dump [2] (needs a bump to Flink 1.11) or jstack to see what is
> happening in the Java process.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/checkpoint_monitoring.html
> [2] https://issues.apache.org/jira/browse/FLINK-14816
>
> Best
> Yun Tang
> --
> *From:* Deshpande, Omkar 
> *Sent:* Tuesday, September 15, 2020 10:25
> *To:* user@flink.apache.org 
> *Subject:* Re: flink checkpoint timeout
>
> I have followed this
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
> 
> and I am using taskmanager.memory.flink.size now instead of
> taskmanager.heap.size
> --
> *From:* Deshpande, Omkar 
> *Sent:* Monday, September 14, 2020 6:23 PM
> *To:* user@flink.apache.org 
> *Subject:* flink checkpoint timeout
>
> This email is from an external sender.
>
> Hello,
>
> I recently upgraded from Flink 1.9 to 1.10. The checkpointing succeeds the
> first couple of times and then starts failing because of timeouts. The
> checkpoint time grows with every checkpoint and starts exceeding 10
> minutes. I do not see any exceptions in the logs. I have enabled debug
> logging at "org.apache.flink" level. How do I investigate this? The garbage
> collection seems fine. There is no backpressure. This used to work as is
> with flink 1.9 without any issue.
>
> Any pointers on how to investigate long time taken to complete checkpoint?
>
> Omkar
>
>


Re: The question about the FLIP-45

2020-03-23 Thread Yu Li
Hi LakeShen,

Sorry for the late response.

For the first question, literally, the stop command should be used if one
means to stop the job instead of canceling it.

For the second one, since FLIP-45 is still under discussion [1] [2]
(although a little bit stalled due to priority), we still don't support
stop with (retained) checkpoint yet. Accordingly, there's no implementation
in our code base.

Best Regards,
Yu

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-45-Reinforce-Job-Stop-Semantic-td30161.html
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic


On Thu, 19 Mar 2020 at 20:50, LakeShen  wrote:

> Hi community,
>
> Now I am reading FLIP-45 (Reinforce Job Stop Semantic), and I have two
> questions about it:
> 1. Which command should be used to stop the Flink task, stop or cancel?
>
> 2. If we use the stop command to stop a Flink task: from the Flink source
> code I see that the stop command lets us set the savepoint dir, and if we
> don't set it, the default savepoint dir is used. If both the target
> savepoint dir and the default savepoint dir are null, Flink will throw an
> exception. But in FLIP-45, if retained checkpoints are enabled, we should
> always take a checkpoint when stopping the job. I can't find this code.
>
> Thanks for your reply.
>
> Best regards,
> LakeShen
>


Re: Streaming Job eventually begins failing during checkpointing

2020-04-27 Thread Yu Li
Sorry, just noticed this thread...

@Stephan I cannot remember the discussion but I think it's an interesting
topic, will find some time to consider it (unregister states).

@Eleanore Glad to know that the Beam community has fixed it, and thanks for the
reference.

Best Regards,
Yu


On Sun, 26 Apr 2020 at 03:10, Eleanore Jin  wrote:

> Hi All,
>
> I think the Beam Community fixed this issue:
> https://github.com/apache/beam/pull/11478
>
> Thanks!
> Eleanore
>
> On Thu, Apr 23, 2020 at 4:24 AM Stephan Ewen  wrote:
>
>> If something requires Beam to register a new state each time, then this
>> is tricky, because currently you cannot unregister states from Flink.
>>
>> @Yu @Yun I remember chatting about this (allowing to explicitly
>> unregister states so they get dropped from successive checkpoints) at some
>> point, but I could not find a jira ticket for this. Do you remember what
>> the status of that discussion is?
>>
>> On Thu, Apr 16, 2020 at 6:37 PM Stephen Patel  wrote:
>>
>>> I posted to the beam mailing list:
>>> https://lists.apache.org/thread.html/rb2ebfad16d85bcf668978b3defd442feda0903c20db29c323497a672%40%3Cuser.beam.apache.org%3E
>>>
>>> I think this is related to a Beam feature called RequiresStableInput
>>> (which my pipeline is using).  It will create a new operator (or keyed)
>>> state per checkpoint.  I'm not sure that there are any parameters that I
>>> have control over to tweak its behavior (apart from increasing the
>>> checkpoint interval to let the pipeline run longer before building up that
>>> many states).
>>>
>>> Perhaps this is something that can be fixed (maybe by unregistering
>>> Operator States after they aren't used any more in the RequiresStableInput
>>> code).  It seems to me that this isn't a Flink issue, but rather a Beam
>>> issue.
>>>
>>> Thanks for pointing me in the right direction.
>>>
>>> On Thu, Apr 16, 2020 at 11:29 AM Yun Tang  wrote:
>>>
 Hi Stephen

 I think the state name [1], which would be changed every time, might be the
 root cause. I am not familiar with the Beam code; would it be possible to
 create so many operator states? Did you configure some parameters wrongly?


 [1]
 https://github.com/apache/beam/blob/4fc924a8193bb9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/stableinput/BufferingDoFnRunner.java#L95

 Best
 Yun Tang
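To illustrate Yun Tang's point: the per-checkpoint growth only happens when a state is registered under a fresh name each time; registering it once under a fixed name keeps the count constant. A minimal, hedged sketch (class and state names are invented, this is not the Beam code):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// An operator state registered once, under a constant name, in initializeState().
// Registering under a new name per checkpoint creates a new state each time and
// eventually hits the ~32767 limit mentioned in this thread.
public class BufferingMapSketch implements MapFunction<String, String>, CheckpointedFunction {

    // One descriptor with a fixed name, reused across restores and checkpoints.
    private static final ListStateDescriptor<String> BUFFER_DESCRIPTOR =
            new ListStateDescriptor<>("buffered-elements", Types.STRING);

    private transient ListState<String> buffer;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        buffer = context.getOperatorStateStore().getListState(BUFFER_DESCRIPTOR);
    }

    @Override
    public String map(String value) throws Exception {
        buffer.add(value);
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Nothing to do: the registered list state is snapshotted automatically.
    }
}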
 --
 *From:* Stephen Patel 
 *Sent:* Thursday, April 16, 2020 22:30
 *To:* Yun Tang 
 *Cc:* user@flink.apache.org 
 *Subject:* Re: Streaming Job eventually begins failing during
 checkpointing

 Correction.  I've actually found a place where it potentially might be
 creating a new operator state per checkpoint:

 https://github.com/apache/beam/blob/4fc924a8193bb9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/stableinput/BufferingDoFnRunner.java#L91-L105
 https://github.com/apache/beam/blob/4fc924a8193bb9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/stableinput/BufferingDoFnRunner.java#L141-L149

 This gives me something I can investigate locally at least.

 On Thu, Apr 16, 2020 at 9:03 AM Stephen Patel 
 wrote:

 I can't say that I ever call that directly.  The beam library that I'm
 using does call it in a couple places:
 https://github.com/apache/beam/blob/v2.14.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L422-L429

 But it seems to be the same descriptor every time.  Is that limit per
 operator?  That is, can each operator host up to 32767 operator/broadcast
 states?  I assume that's by name?

 On Wed, Apr 15, 2020 at 10:46 PM Yun Tang  wrote:

 Hi  Stephen

 This is not related to RocksDB but to the default on-heap operator
 state backend. From your exception stack trace, you have created too many
 operator states (more than 32767).
 How do you call context.getOperatorStateStore().getListState or
 context.getOperatorStateStore().getBroadcastState? Did you pass a
 different operator state descriptor each time?

 Best
 Yun Tang
 --
 *From:* Stephen Patel 
 *Sent:* Thursday, April 16, 2020 2:09
 *To:* user@flink.apache.org 
 *Subject:* Streaming Job eventually begins failing during checkpointing

 I've got a flink (1.8.0, emr-5.26) streaming job running on yarn.  It's
 configured to use rocksdb, and checkpoint once a minute to hdfs.  This job
 operates just fine for around 20 days, and then begins failing with this
 exception (it fails, restarts, and fails again, repeatedly):

 2020-04-15 13:15:02,920 INFO
  org.apache.flink.runtime.checkpoint.Checkp

[ANNOUNCE] Apache Flink 1.10.1 released

2020-05-13 Thread Yu Li
The Apache Flink community is very happy to announce the release of Apache
Flink 1.10.1, which is the first bugfix release for the Apache Flink 1.10
series.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the improvements
for this bugfix release:
https://flink.apache.org/news/2020/05/12/release-1.10.1.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346891

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,
Yu


Re: [DISCUSS] Upgrade HBase connector to 2.2.x

2020-06-19 Thread Yu Li
+1 on upgrading the HBase version of the connector, and 1.4.3 is indeed an
old version.

OTOH, AFAIK there are still quite a few 1.x HBase clusters in production. We
can also see that the HBase community is still maintaining the 1.x release
lines (with the "stable-1 release" pointing to 1.4.13) [1].

Please also notice that HBase follows semantic versioning [2] [3] and thus
doesn't promise any kind of compatibility (source/binary/wire, etc.) between
major versions. So if we only maintain a 2.x connector, it would not be able
to work with 1.x HBase clusters.

I totally understand the additional effort of maintaining two modules, but
since we also keep multiple versions of the Kafka connector, and considering
the current HBase in-production status, I'd still suggest getting both
1.4.13 and 2.2.5 supported.

Best Regards,
Yu

[1] http://hbase.apache.org/downloads.html
[2] https://hbase.apache.org/book.html#hbase.versioning
[3] https://semver.org/


On Fri, 19 Jun 2020 at 14:58, Leonard Xu  wrote:

> +1 to support HBase 2.2.x, and +1 to retain HBase 1.4.3 until the
> deprecation is finished (maybe one version is enough).
>
> Currently we only support HBase 1.4.3, which is pretty old, and I’m making
> a flink-sql-connector-hbase [1] shaded jar for pure SQL users; the
> dependencies are a little more complex.
>
>
> On 19 Jun 2020, at 14:20, jackylau wrote:
>
> +1 to support HBase 2.x, and the HBase 2.x client dependencies are simple
> and clear. The HBase project shades them all.
>
>
> Best,
> Leonard Xu
> [1] https://github.com/apache/flink/pull/12687
>
>


Re: [DISCUSS] Upgrade HBase connector to 2.2.x

2020-06-19 Thread Yu Li
One supplement:

I noticed that there were discussions on the HBase ML this March about
removing the stable-1 pointer, which reached consensus [1], and I will follow
up with the HBase community about why no real action was taken. However,
this doesn't change my previous statement / stance, given the number of 1.x
usages in production.

Best Regards,
Yu

[1]
http://mail-archives.apache.org/mod_mbox/hbase-dev/202003.mbox/%3c30180be2-bd93-d414-a158-16c9c8d01...@apache.org%3E

On Fri, 19 Jun 2020 at 15:54, Yu Li  wrote:

> +1 on upgrading the HBase version of the connector, and 1.4.3 is indeed an
> old version.
>
> OTOH, AFAIK there are still quite a few 1.x HBase clusters in production. We
> can also see that the HBase community is still maintaining the 1.x release
> lines (with the "stable-1 release" pointing to 1.4.13) [1].
>
> Please also notice that HBase follows semantic versioning [2] [3] and thus
> doesn't promise any kind of compatibility (source/binary/wire, etc.) between
> major versions. So if we only maintain a 2.x connector, it would not be able
> to work with 1.x HBase clusters.
>
> I totally understand the additional effort of maintaining two modules,
> but since we also keep multiple versions of the Kafka connector, and
> considering the current HBase in-production status, I'd still suggest
> getting both 1.4.13 and 2.2.5 supported.
>
> Best Regards,
> Yu
>
> [1] http://hbase.apache.org/downloads.html
> [2] https://hbase.apache.org/book.html#hbase.versioning
> [3] https://semver.org/
>
>
> On Fri, 19 Jun 2020 at 14:58, Leonard Xu  wrote:
>
>> +1 to support HBase 2.2.x, and +1 to retain HBase 1.4.3 until the
>> deprecation is finished (maybe one version is enough).
>>
>> Currently we only support HBase 1.4.3, which is pretty old, and I’m making
>> a flink-sql-connector-hbase [1] shaded jar for pure SQL users; the
>> dependencies are a little more complex.
>>
>>
>> On 19 Jun 2020, at 14:20, jackylau wrote:
>>
>> +1 to support HBase 2.x, and the HBase 2.x client dependencies are simple
>> and clear. The HBase project shades them all.
>>
>>
>> Best,
>> Leonard Xu
>> [1] https://github.com/apache/flink/pull/12687
>>
>>


Re: Performance issue associated with managed RocksDB memory

2020-06-25 Thread Yu Li
Thanks for the ping Andrey.

Hi Juha,

Thanks for reporting the issue. I'd like to check the below things before
further digging into it:

1. Could you let us know your configurations (especially memory related
ones) when running the tests?

2. Did you watch the memory consumption before / after turning
`state.backend.rocksdb.memory.managed` off? If not, could you check it out
and let us know the result?
2.1 Furthermore, if the memory consumption is much higher when turning
managed memory off, could you try tuning up the managed memory fraction
accordingly through `taskmanager.memory.managed.fraction` [1] and check the
result?

3. With `state.backend.rocksdb.memory.managed` on and nothing else changed,
could you try to set `state.backend.rocksdb.timer-service.factory` to
`HEAP` and check out the result? (side note: starting from 1.10.0 release
timers are stored in RocksDB by default when using RocksDBStateBackend [2])

What's more, you may find these documents [3] [4] useful for memory tunings
of RocksDB backend.

Thanks.

Best Regards,
Yu

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html#taskmanager-memory-managed-fraction
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html#state
[3]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/large_state_tuning.html#tuning-rocksdb-memory
[4]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/state_backends.html#memory-management
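For reference, points 2, 2.1 and 3 above translate to configuration along the following lines when running locally, in the same style as the test program later in this thread. This is a hedged sketch; the sizes and fraction are placeholders, not recommendations.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hedged sketch of the memory-related settings referenced above; values are examples.
public class ManagedMemorySettingsSketch {

    public static void main(String[] args) {
        Configuration configuration = new Configuration();

        // Point 2: compare memory consumption with managed RocksDB memory on and off.
        configuration.setString("state.backend.rocksdb.memory.managed", "true");

        // Point 2.1: if unmanaged mode uses much more memory, increase the managed share,
        // e.g. via the fraction (or an absolute "taskmanager.memory.managed.size").
        configuration.setDouble("taskmanager.memory.managed.fraction", 0.6);

        // Point 3: move timers back to heap so they do not compete for the RocksDB budget.
        configuration.setString("state.backend.rocksdb.timer-service.factory", "HEAP");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, configuration);
        // ... build and execute the job under test with `env` here.
    }
}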


On Thu, 25 Jun 2020 at 15:37, Andrey Zagrebin  wrote:

> Hi Juha,
>
> Thanks for sharing the testing program to expose the problem.
> This indeed looks suboptimal if X does not leave space for the window
> operator.
> I am adding Yu and Yun who might have a better idea about what could be
> improved about sharing the RocksDB memory among operators.
>
> Best,
> Andrey
>
> On Thu, Jun 25, 2020 at 9:10 AM Juha Mynttinen 
> wrote:
>
>> Hey,
>>
>> Here's a simple test. It's basically the WordCount example from Flink, but
>> using RocksDB as the state backend and having a stateful operator. The
>> javadocs explain how to use it.
>>
>>
>> /*
>>  * Licensed to the Apache Software Foundation (ASF) under one or more
>>  * contributor license agreements.  See the NOTICE file distributed with
>>  * this work for additional information regarding copyright ownership.
>>  * The ASF licenses this file to You under the Apache License, Version 2.0
>>  * (the "License"); you may not use this file except in compliance with
>>  * the License.  You may obtain a copy of the License at
>>  *
>>  *http://www.apache.org/licenses/LICENSE-2.0
>>  *
>>  * Unless required by applicable law or agreed to in writing, software
>>  * distributed under the License is distributed on an "AS IS" BASIS,
>>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>> implied.
>>  * See the License for the specific language governing permissions and
>>  * limitations under the License.
>>  */
>>
>> package org.apache.flink.streaming.examples.wordcount;
>>
>> import org.apache.flink.api.common.functions.RichFlatMapFunction;
>> import org.apache.flink.api.common.state.ListState;
>> import org.apache.flink.api.common.state.ListStateDescriptor;
>> import org.apache.flink.api.common.state.ValueState;
>> import org.apache.flink.api.common.state.ValueStateDescriptor;
>> import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
>> import org.apache.flink.api.java.tuple.Tuple2;
>> import org.apache.flink.api.java.utils.MultipleParameterTool;
>> import org.apache.flink.configuration.Configuration;
>> import org.apache.flink.contrib.streaming.state.PredefinedOptions;
>> import org.apache.flink.contrib.streaming.state.RocksDBOptions;
>> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>> import org.apache.flink.runtime.state.FunctionInitializationContext;
>> import org.apache.flink.runtime.state.FunctionSnapshotContext;
>> import org.apache.flink.runtime.state.StateBackend;
>> import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
>> import org.apache.flink.streaming.api.datastream.DataStream;
>> import
>> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>> import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
>> import org.apache.flink.streaming.api.functions.source.SourceFunction;
>> import org.apache.flink.util.Collector;
>>
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>>
>> /**
>>  * Works fast in the following cases.
>>  * 
>>  * {@link #USE_MANAGED_MEMORY} is {@code false}
>>  * {@link #USE_MANAGED_MEMORY} is {@code true} and {@link
>> #PARALLELISM} is 1 to 4.
>>  * 
>>  * 
>>  * Some results:
>>  * 
>>  * 
>>  * USE_MANAGED_MEMORY false parallelism 3: 3088 ms
>>  * USE_MANAGED_MEMORY false parallelism 4: 2971 ms
>>  * USE_MANAGED_MEMORY false parallelism 5: 2994 ms
>>  * USE_MANAGED_MEMORY true parallelism 3: 4337 ms
>>  * USE_MANAGED_MEM

Re: Performance issue associated with managed RocksDB memory

2020-06-26 Thread Yu Li
To clarify, my questions were all about the very original issue rather than
the WordCount job. The timers come from the window operator you mentioned as
the source of the original issue:
===
bq. If I create a Flink job that has a single "heavy" operator (call it X)
that just keeps a simple state (per user) things work fast when testing how
many events / s sec the job can process. However, If I add downstream of X
a simplest possible window operator, things can get slow, especially when I
increase the parallelism
===

Regarding the WordCount job, as Andrey explained, it's kind of an expected
result and you could find more instructions from the document I posted in
the last reply [1], and let me quote some lines here for your convenience:
===

To tune memory-related performance issues, the following steps may be
helpful:

- The first step to try and increase performance should be to increase the
  amount of managed memory. This usually improves the situation a lot,
  without opening up the complexity of tuning low-level RocksDB options.
  Especially with large container/process sizes, much of the total memory
  can typically go to RocksDB, unless the application logic requires a lot
  of JVM heap itself. The default managed memory fraction (0.4) is
  conservative and can often be increased when using TaskManagers with
  multi-GB process sizes.

- The number of write buffers in RocksDB depends on the number of states
  you have in your application (states across all operators in the
  pipeline). Each state corresponds to one ColumnFamily, which needs its
  own write buffers. Hence, applications with many states typically need
  more memory for the same performance.

- You can try and compare the performance of RocksDB with managed memory
  to RocksDB with per-column-family memory by setting
  state.backend.rocksdb.memory.managed: false. Especially to test against
  a baseline (assuming no- or gracious container memory limits) or to test
  for regressions compared to earlier versions of Flink, this can be
  useful.

  Compared to the managed memory setup (constant memory pool), not using
  managed memory means that RocksDB allocates memory proportional to the
  number of states in the application (memory footprint changes with
  application changes). As a rule of thumb, the non-managed mode has
  (unless ColumnFamily options are applied) an upper bound of roughly
  "140MB * num-states-across-all-tasks * num-slots". Timers count as state
  as well!

===

Best Regards,
Yu

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/large_state_tuning.html#tuning-rocksdb-memory
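To make the rule of thumb quoted above concrete, here is a quick back-of-the-envelope calculation; the state and slot counts are invented examples, not numbers from this job.

// Rough upper bound of unmanaged RocksDB memory per TaskManager, per the quoted docs.
public class RocksDbFootprintEstimate {

    public static void main(String[] args) {
        long perStateMb = 140;          // rule-of-thumb bound per state (timers count too)
        int statesAcrossAllTasks = 4;   // e.g. operator X's state + window state + timers
        int slotsPerTaskManager = 2;

        long upperBoundMb = perStateMb * statesAcrossAllTasks * slotsPerTaskManager;
        System.out.println("Unmanaged RocksDB upper bound per TM: ~" + upperBoundMb + " MB");
    }
}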


On Fri, 26 Jun 2020 at 21:15, Andrey Zagrebin 
wrote:

> Hi Juha,
>
> I can also submit the more complex test with the bigger operator and a
> window operator. There's just gonna be more code to read. Can I attach a
> file here or how should I submit a larger chuck of code?
>
>
> You can just attach the file with the code.
>
> 2. I'm not sure what would / should I look for.
>
> For 'taskmanager.memory.managed.fraction' I tried
>
> configuration.setDouble("taskmanager.memory.managed.fraction", 0.8);
>
>
> I think Yu meant increasing the managed memory because it might not be
> enough to host both X and the window operator.
> You can do that by increasing this option: taskmanager.memory.managed.size
> [1], [2].
> Also, if you run Flink locally from your IDE, see the notes for local
> execution [3].
>
> When you enable ‘state.backend.rocksdb.memory.managed’, RocksDB does not
> use more memory than the configured or default size of managed memory.
> Therefore, it starts to spill to disk and performance degrades but the
> memory usage is deterministic and you do not risk that your container gets
> killed with out-of-memory error.
>
> If you disable ‘state.backend.rocksdb.memory.managed’, RocksDB makes its
> own decisions about how much memory to allocate, so it can allocate more to
> be more performant and spill to disk less frequently. So maybe it gives
> more memory to the window operator so it spills less.
>
> Therefore, it would be nice to compare memory consumption of Flink process
> with ‘state.backend.rocksdb.memory.managed’ to be true and false.
>
> Anyway, I do not know how we could control the splitting of the configured
> managed memory among operators in a more optimal way.
>
> Best,
> Andrey
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_setup.html#managed-memory
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html#taskmanager-memory-managed-size
> [3]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html#local-execution
>
> On 26 Jun 2020, at 08:45, Juha Mynttinen  wrote:
>
> Andrey,
>
> A small clarification. The twea

Re: [ANNOUNCE] Apache Flink 1.10.3 released

2021-01-28 Thread Yu Li
Thanks Xintong for being our release manager and everyone else who made the
release possible!

Best Regards,
Yu


On Fri, 29 Jan 2021 at 15:05, Xintong Song  wrote:

> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.10.3, which is the third bugfix release for the Apache Flink 1.10
> series.
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Please check out the release blog post for an overview of the improvements
> for this bugfix release:
> https://flink.apache.org/news/2021/01/29/release-1.10.3.html
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12348668
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Regards,
> Xintong Song
>


Re: Re: [ANNOUNCE] Hequn becomes a Flink committer

2019-08-07 Thread Yu Li
Congratulations Hequn! Well deserved!

Best Regards,
Yu


On Thu, 8 Aug 2019 at 03:53, Haibo Sun  wrote:

> Congratulations!
>
> Best,
> Haibo
>
> At 2019-08-08 02:08:21, "Yun Tang"  wrote:
> >Congratulations Hequn.
> >
> >Best
> >Yun Tang
> >
> >From: Rong Rong 
> >Sent: Thursday, August 8, 2019 0:41
> >Cc: dev ; user 
> >Subject: Re: [ANNOUNCE] Hequn becomes a Flink committer
> >
> >Congratulations Hequn, well deserved!
> >
> >--
> >Rong
> >
> >On Wed, Aug 7, 2019 at 8:30 AM xingc...@gmail.com wrote:
> >
> >Congratulations, Hequn!
> >
> >
> >
> >From: Xintong Song <tonysong...@gmail.com>
> >Sent: Wednesday, August 07, 2019 10:41 AM
> >To: d...@flink.apache.org
> >Cc: user <user@flink.apache.org>
> >Subject: Re: [ANNOUNCE] Hequn becomes a Flink committer
> >
> >
> >
> >Congratulations~!
> >
> >
> >Thank you~
> >
> >Xintong Song
> >
> >
> >
> >
> >
> >On Wed, Aug 7, 2019 at 4:00 PM vino yang <yanghua1...@gmail.com> wrote:
> >
> >Congratulations!
> >
> >highfei2...@126.com wrote on Wed, Aug 7, 2019 at 7:09 PM:
> >
> >> Congrats Hequn!
> >>
> >> Best,
> >> Jeff Yang
> >>
> >>
> >>  Original Message 
> >> Subject: Re: [ANNOUNCE] Hequn becomes a Flink committer
> >> From: Piotr Nowojski
> >> To: JingsongLee
> >> CC: Biao Liu ,Zhu Zhu ,Zili Chen ,Jeff Zhang ,Paul Lam ,jincheng sun ,dev 
> >> ,user
> >>
> >>
> >> Congratulations :)
> >>
> >> On 7 Aug 2019, at 12:09, JingsongLee <lzljs3620...@aliyun.com> wrote:
> >>
> >> Congrats Hequn!
> >>
> >> Best,
> >> Jingsong Lee
> >>
> >> --
> >> From: Biao Liu <mmyy1...@gmail.com>
> >> Send Time: Wed, Aug 7, 2019, 12:05
> >> To: Zhu Zhu <reed...@gmail.com>
> >> Cc: Zili Chen <wander4...@gmail.com>; Jeff Zhang <zjf...@gmail.com>;
> >> Paul Lam <paullin3...@gmail.com>; jincheng sun <sunjincheng...@gmail.com>;
> >> dev <d...@flink.apache.org>; user <user@flink.apache.org>
> >> Subject:Re: [ANNOUNCE] Hequn becomes a Flink committer
> >>
> >> Congrats Hequn!
> >>
> >> Thanks,
> >> Biao /'bɪ.aʊ/
> >>
> >>
> >>
> >> On Wed, Aug 7, 2019 at 6:00 PM Zhu Zhu <reed...@gmail.com> wrote:
> >> Congratulations to Hequn!
> >>
> >> Thanks,
> >> Zhu Zhu
> >>
> >> Zili Chen <wander4...@gmail.com> wrote on Wed, Aug 7, 2019 at 5:16 PM:
> >> Congrats Hequn!
> >>
> >> Best,
> >> tison.
> >>
> >>
> >> Jeff Zhang <zjf...@gmail.com> wrote on Wed, Aug 7, 2019 at 5:14 PM:
> >> Congrats Hequn!
> >>
> >> Paul Lam <paullin3...@gmail.com> wrote on Wed, Aug 7, 2019 at 5:08 PM:
> >> Congrats Hequn! Well deserved!
> >>
> >> Best,
> >> Paul Lam
> >>
> >> On 7 Aug 2019, at 16:28, jincheng sun <sunjincheng...@gmail.com> wrote:
> >>
> >> Hi everyone,
> >>
> >> I'm very happy to announce that Hequn accepted the offer of the Flink PMC
> >> to become a committer of the Flink project.
> >>
> >> Hequn has been contributing to Flink for many years, mainly working on
> >> SQL/Table API features. He's also frequently helping out on the user
> >> mailing lists and helping check/vote the release.
> >>
> >> Congratulations Hequn!
> >>
> >> Best, Jincheng
> >> (on behalf of the Flink PMC)
> >>
> >>
> >>
> >> --
> >> Best Regards
> >>
> >> Jeff Zhang
> >>
> >>
> >>
>
>


Re: Capping RocksDb memory usage

2019-08-09 Thread Yu Li
Hi Cam,

Which flink version are you using?

Actually I don't think any existing Flink release can take advantage of the
write buffer manager natively through some configuration magic; it requires
some "developing" effort, such as manually building Flink with a higher
RocksDB version to have the JNI interface for setting the write buffer
manager, and setting the write buffer manager into RocksDB's DBOptions with
a custom options factory. For more details please refer to this comment [1]
in FLINK-7289.

As mentioned in another thread [2], we are now working on removing all
these "manual steps" and making the state backend memory management "hands
free", which is also part of the FLIP-49 work. Hopefully we could get this
done in 1.10 release, let's see (smile).

[1] https://s.apache.org/5ay97
[2] https://s.apache.org/ej2zn

Best Regards,
Yu
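For readers landing on this thread, the "custom options factory" route looks roughly like the sketch below. It assumes a Flink build whose bundled RocksDB exposes the WriteBufferManager JNI binding (which, as discussed above and in FLINK-7289, the stock frocksdb of that era does not), so treat it as an illustration rather than something that works out of the box; the capacities are placeholders.

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.Cache;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;
import org.rocksdb.LRUCache;
import org.rocksdb.WriteBufferManager;

// Hedged sketch: share one cache and write buffer manager across all RocksDB
// instances in the same JVM. Register it via
// RocksDBStateBackend#setOptions(new CappedMemoryOptionsFactory()).
public class CappedMemoryOptionsFactory implements OptionsFactory {

    private static final long CACHE_CAPACITY = 256 * 1024 * 1024;       // assumed block cache budget
    private static final long WRITE_BUFFER_BUDGET = 128 * 1024 * 1024;  // assumed memtable budget

    // Static so that all backends in the same task manager share the same native objects.
    private static final Cache CACHE = new LRUCache(CACHE_CAPACITY);
    private static final WriteBufferManager WRITE_BUFFER_MANAGER =
            new WriteBufferManager(WRITE_BUFFER_BUDGET, CACHE);

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        // Charge memtable allocations against the shared manager.
        return currentOptions.setWriteBufferManager(WRITE_BUFFER_MANAGER);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
        return currentOptions;
    }
}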


On Fri, 9 Aug 2019 at 03:53, Congxian Qiu  wrote:

> Hi
> Maybe FLIP-49[1] "Unified Memory Configuration for TaskExecutors" can give
> some information here
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-49-Unified-Memory-Configuration-for-TaskExecutors-td31436.html
> Best,
> Congxian
>
>
> Cam Mach wrote on Fri, Aug 9, 2019 at 4:59 AM:
>
>> Hi Biao, Yun and Ning.
>>
>> Thanks for your response and pointers. Those are very helpful!
>>
>> So far, we have tried some of those parameters (WriteBufferManager,
>> write_buffer_size, write_buffer_count, ...), but we still continuously have
>> issues with memory.
>> Here are our cluster configurations:
>>
>>- 1 Job Controller (32 GB RAM and 8 cores)
>>- 10 Task Managers: (32 GB RAM, 8 cores CPU, and 300GB SSD configured
>>for RocksDB, and we set 10GB heap for each)
>>- Running under Kubernetes
>>
>> We have a pipeline that reads/transfers 500 million records (around 1 kb
>> each) and writes to our sink. Our total data is around 1.2 terabytes. Our
>> pipeline configuration is as follows:
>>
>>- 13 operators - some of them (around 6) are stateful
>>- Parallelism: 60
>>- Task slots: 6
>>
>> We have run several tests and observed that memory just keeps growing
>> while our TMs' CPU stays around 10-15% usage. We are now just focusing on
>> limiting the memory usage of Flink and RocksDB, so Kubernetes won't kill it.
>>
>> Any recommendations or advices are greatly appreciated!
>>
>> Thanks,
>>
>>
>>
>>
>> On Thu, Aug 8, 2019 at 6:57 AM Yun Tang  wrote:
>>
>>> Hi Cam
>>>
>>> I think FLINK-7289 [1] might offer you some insights into controlling
>>> RocksDB memory, especially the idea of using the write buffer manager [2]
>>> to control the total write buffer memory. If you do not have too many sst
>>> files, write buffer memory usage would consume much more space than index
>>> and filter usage. Since Flink uses one column family per state, the number
>>> of write buffers increases as more column families are created.
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-7289
>>> [2] https://github.com/dataArtisans/frocksdb/pull/4
>>>
>>> Best
>>> Yun Tang
>>>
>>>
>>> --
>>> *From:* Cam Mach 
>>> *Sent:* Thursday, August 8, 2019 21:39
>>> *To:* Biao Liu 
>>> *Cc:* miki haiat ; user 
>>> *Subject:* Re: Capping RocksDb memory usage
>>>
>>> Thanks for your response, Biao.
>>>
>>>
>>>
>>> On Wed, Aug 7, 2019 at 11:41 PM Biao Liu  wrote:
>>>
>>> Hi Cam,
>>>
>>> AFAIK, that's not an easy thing. Actually it's more like a Rocksdb
>>> issue. There is a document explaining the memory usage of Rocksdb [1]. It
>>> might be helpful.
>>>
>>> You could define your own option to tune Rocksdb through
>>> "state.backend.rocksdb.options-factory" [2]. However I would suggest not to
>>> do this unless you are fully experienced of Rocksdb. IMO it's quite
>>> complicated.
>>>
>>> Meanwhile I can share a bit experience of this. We have tried to put the
>>> cache and filter into block cache before. It's useful to control the memory
>>> usage. But the performance might be affected at the same time. Anyway you
>>> could try and tune it. Good luck!
>>>
>>> 1. https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
>>> 2.
>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/state/large_state_tuning.html#tuning-rocksdb
>>>
>>> Thanks,
>>> Biao /'bɪ.aʊ/
>>>
>>>
>>>
>>> On Thu, Aug 8, 2019 at 11:44 AM Cam Mach  wrote:
>>>
>>> Yes, that is correct.
>>> Cam Mach
>>> Software Engineer
>>> E-mail: cammac...@gmail.com
>>> Tel: 206 972 2768
>>>
>>>
>>>
>>> On Wed, Aug 7, 2019 at 8:33 PM Biao Liu  wrote:
>>>
>>> Hi Cam,
>>>
>>> Do you mean you want to limit the memory usage of RocksDB state backend?
>>>
>>> Thanks,
>>> Biao /'bɪ.aʊ/
>>>
>>>
>>>
>>> On Thu, Aug 8, 2019 at 2:23 AM miki haiat  wrote:
>>>
>>> I think using metrics exporter is the easiest way
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#rocksdb
>>>
>>>
>>> On Wed, Aug 7, 2019, 20:28 Cam Mach  wrote:
>>>
>>> Hello everyone,
>>>
>>> What is the most easy and efficiently way to cap RocksDb's 

Re: Capping RocksDb memory usage

2019-08-09 Thread Yu Li
bq. Yes, we recompiled Flink with rocksdb to have JNI, to enable the
write_buffer_manager after we read that Jira.
I see, then which way are you using to limit the rocksdb memory? Setting
write buffer and block cache size separately or with the "cost memory used
in memtable into block cache" [1] feature? If the latter one, please make
sure you also have this PR [2] in your customized rocksdb.

bq. I noticed that our disk usage (SSD) for RocksDB always stays around
2% (or 2.2 GB), which was not the case before we enabled the RocksDB state backend
As state data is ingested and checkpoints are executed, the RocksDB state
backend flushes sst files onto the local disk (and uploads files to HDFS when
checkpointing). For the heap backend, all data resides in memory and is
written directly to HDFS when a checkpoint is triggered, thus no local disk
space is used.

What's more, notice that if you enable local recovery (check whether
"state.backend.local-recovery" is set to true in your configuration; by
default it's false), there will be more disk space occupied, but in that
case both the heap and RocksDB backends pay the cost.

[1]
https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache
[2] https://github.com/facebook/rocksdb/pull/4695

Best Regards,
Yu


On Fri, 9 Aug 2019 at 15:10, Cam Mach  wrote:

> Hi Yu,
>
> Yes, we recompiled Flink with rocksdb to have JNI, to enable the
> write_buffer_manager after we read that Jira.
> One quick question: I noticed that our disk usage (SSD) for RocksDB
> always stays around 2% (or 2.2 GB), which was not the case before we enabled
> the RocksDB state backend. So I'm wondering what is stopping it?
>
> Thanks,
> Cam
>
>
>
> On Fri, Aug 9, 2019 at 12:21 AM Yu Li  wrote:
>
>> Hi Cam,
>>
>> Which flink version are you using?
>>
>> Actually I don't think any existing flink release could take usage of the
>> write buffer manager natively through some configuration magic, but
>> requires some "developing" efforts, such as manually building flink with a
>> higher version rocksdb to have the JNI interface to set write buffer
>> manager, and set the write buffer manager into rocksdb's DBOptions with a
>> custom options factory. More details please refer to this comment [1] in
>> FLINK-7289.
>>
>> As mentioned in another thread [2], we are now working on removing all
>> these "manual steps" and making the state backend memory management "hands
>> free", which is also part of the FLIP-49 work. Hopefully we could get this
>> done in 1.10 release, let's see (smile).
>>
>> [1] https://s.apache.org/5ay97
>> [2] https://s.apache.org/ej2zn
>>
>> Best Regards,
>> Yu
>>
>>
>> On Fri, 9 Aug 2019 at 03:53, Congxian Qiu  wrote:
>>
>>> Hi
>>> Maybe FLIP-49[1] "Unified Memory Configuration for TaskExecutors" can
>>> give some information here
>>>
>>> [1]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-49-Unified-Memory-Configuration-for-TaskExecutors-td31436.html
>>> Best,
>>> Congxian
>>>
>>>
>>> Cam Mach wrote on Fri, Aug 9, 2019 at 4:59 AM:
>>>
>>>> Hi Biao, Yun and Ning.
>>>>
>>>> Thanks for your response and pointers. Those are very helpful!
>>>>
>>>> So far, we have tried with some of those parameters
>>>> (WriterBufferManager, write_buffer_size, write_buffer_count, ...), but
>>>> still continuously having issues with memory.
>>>> Here are our cluster configurations:
>>>>
>>>>- 1 Job Controller (32 GB RAM and 8 cores)
>>>>- 10 Task Managers: (32 GB RAM, 8 cores CPU, and 300GB SSD
>>>>configured for RocksDB, and we set 10GB heap for each)
>>>>- Running under Kuberntes
>>>>
>>>> We have a pipeline that read/transfer 500 million records (around 1kb
>>>> each), and write to our sink. Our total data is around 1.2 Terabytes. Our
>>>> pipeline configurations are as follows:
>>>>
>>>>- 13 operators - some of them (around 6) are stateful
>>>>- Parallelism: 60
>>>>- Task slots: 6
>>>>
>>>> We have run several tests and observed that memory just keep growing
>>>> while our TM's CPU stay around 10 - 15% usage. We are now just focusing
>>>> limiting memory usage from Flink and RocksDB, so Kubernetes won't kill it.
>>>>
>>>> Any recommendations or advices are greatly appreciated

Re: Making broadcast state queryable?

2019-08-13 Thread Yu Li
Hi Oytun,

Sorry, but TBH such support will probably not be added in the foreseeable
future due to a lack of committer bandwidth (this applies not only to
queryable broadcast state but to the whole QueryableState module), as
pointed out in other threads [1] [2].

However, I think you could open a JIRA for this so when things changed it
could be taken care of. Thanks.

[1] https://s.apache.org/MaOl
[2] https://s.apache.org/r8k8a

Best Regards,
Yu
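As an aside for anyone hitting the same wall: while broadcast state cannot be made queryable today, regular keyed state can, so one workaround is to keep a keyed copy of the broadcast data and mark that copy queryable. A minimal, hedged sketch (the state and query names are illustrative):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keyed state exposed through the queryable state API: a keyed copy of whatever
// the broadcast state holds, marked as queryable so external clients can read it.
public class QueryableCopyFunction
        extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> latestValue;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("latest-value", Types.LONG);
        // External QueryableStateClient lookups use this name.
        descriptor.setQueryable("latest-value-query");
        latestValue = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Tuple2<String, Long> input, Collector<Tuple2<String, Long>> out)
            throws Exception {
        latestValue.update(input.f1);
        out.collect(input);
    }
}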


On Tue, 13 Aug 2019 at 20:34, Oytun Tez  wrote:

> Hi there,
>
> Can we set a broadcast state as queryable? I've looked around, not much to
> find about it. I am receiving UnknownKvStateLocation when I try to query
> with the descriptor/state name I give to the broadcast state.
>
> If it doesn't work, what could be the alternative? My mind goes around
> ctx.getBroadcastState and making it queryable in the operator level (I'd
> rather not).
>
> ---
> Oytun Tez
>
> *M O T A W O R D*
> The World's Fastest Human Translation Platform.
> oy...@motaword.com — www.motaword.com
>


Re: Making broadcast state queryable?

2019-08-14 Thread Yu Li
Good to know your thoughts and about the coming talk at Flink Forward Berlin,
Oytun - interesting topic and great job! And it's great to hear the voice
from the application perspective.

Could you share, if possible, the reason why you need to open the state to
the outside instead of writing the result to a sink and querying directly
there? In another thread there's a case where the sink writes to different
Kafka topics, so state is the only place to get a global view; is this the
same case you're facing? Or some different requirement? I believe more
attention will be drawn to QS if more and more user requirements emerge
(smile).

Thanks.

Best Regards,
Yu


On Wed, 14 Aug 2019 at 00:50, Oytun Tez  wrote:

> Thank you for the honest response, Yu!
>
> There is so much that comes to mind when we look at Flink as an
> "application framework" (my talk
> <https://europe-2019.flink-forward.org/conference-program#not-so-big-%E2%80%93-flink-as-a-true-application-framework>
> in Flink Forward in Berlin will be about this). QS is one of them
> (need-wise, not QS itself necessarily). I opened the design doc Vino Yang
> created, I'll try to contribute to that.
>
> Meanwhile, for the sake of opening our state to outside, we will put a
> stupid simple operator in between to keep a *duplicate* of the state...
>
> Thanks again!
>
>
>
>
>
> ---
> Oytun Tez
>
> *M O T A W O R D*
> The World's Fastest Human Translation Platform.
> oy...@motaword.com — www.motaword.com
>
>
> On Tue, Aug 13, 2019 at 6:29 PM Yu Li  wrote:
>
>> Hi Oytun,
>>
>> Sorry, but TBH such support will probably not be added in the foreseeable
>> future due to a lack of committer bandwidth (this applies not only to
>> queryable broadcast state but to the whole QueryableState module), as
>> pointed out in other threads [1] [2].
>>
>> However, I think you could open a JIRA for this so when things changed it
>> could be taken care of. Thanks.
>>
>> [1] https://s.apache.org/MaOl
>> [2] https://s.apache.org/r8k8a
>>
>> Best Regards,
>> Yu
>>
>>
>> On Tue, 13 Aug 2019 at 20:34, Oytun Tez  wrote:
>>
>>> Hi there,
>>>
>>> Can we set a broadcast state as queryable? I've looked around, not much
>>> to find about it. I am receiving UnknownKvStateLocation when I try to query
>>> with the descriptor/state name I give to the broadcast state.
>>>
>>> If it doesn't work, what could be the alternative? My mind goes around
>>> ctx.getBroadcastState and making it queryable in the operator level (I'd
>>> rather not).
>>>
>>> ---
>>> Oytun Tez
>>>
>>> *M O T A W O R D*
>>> The World's Fastest Human Translation Platform.
>>> oy...@motaword.com — www.motaword.com
>>>
>>


Re: [ANNOUNCE] Apache Flink 1.9.0 released

2019-08-22 Thread Yu Li
Thanks for the update Gordon, and congratulations!

Great thanks to all for making this release possible, especially to our
release managers!

Best Regards,
Yu


On Thu, 22 Aug 2019 at 14:55, Xintong Song  wrote:

> Congratulations!
> Thanks Gordon and Kurt for being the release managers, and thanks all the
> contributors.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Thu, Aug 22, 2019 at 2:39 PM Yun Gao  wrote:
>
>>  Congratulations !
>>
>>  Many thanks to Gordon and Kurt for managing the release, and many
>> thanks to everyone for the contributions!
>>
>>   Best,
>>   Yun
>>
>>
>>
>> --
>> From:Zhu Zhu 
>> Send Time:2019 Aug. 22 (Thu.) 20:18
>> To:Eliza 
>> Cc:user 
>> Subject:Re: [ANNOUNCE] Apache Flink 1.9.0 released
>>
>> Thanks Gordon for the update.
>> Congratulations that we have Flink 1.9.0 released!
>> Thanks to all the contributors.
>>
>> Thanks,
>> Zhu Zhu
>>
>>
>> Eliza wrote on Thu, Aug 22, 2019 at 8:10 PM:
>>
>>
>> On Thu, Aug 22, 2019 at 8:03 PM, Tzu-Li (Gordon) Tai wrote:
>> > The Apache Flink community is very happy to announce the release of
>> > Apache Flink 1.9.0, which is the latest major release.
>>
>> Congratulations and thanks~
>>
>> regards.
>>
>>
>>


Re: [SURVEY] Is the default restart delay of 0s causing problems?

2019-09-01 Thread Yu Li
-1 on increasing the default delay to none zero, with below reasons:

a) I could see some concerns about setting the delay to zero in the very
original JIRA (FLINK-2993), but later on in FLINK-9158 we still decided to
make the change, so I'm wondering whether that decision also came from a
customer requirement? If so, how could we judge whether one requirement
overrides the other?

b) There could be valid reasons for both default values depending on the
use case, as well as relative workarounds (e.g. with the current policy,
setting the config manually to 10 s resolves the problem mentioned), and
from earlier replies to this thread we can see users have already taken
action. Changing it back to non-zero again won't affect such users, but it
might cause surprises for those depending on 0 as the default.

Last but not least, no matter what decision we make this time, I'd suggest
making it final and documenting it explicitly in our release notes. Checking
the 1.5.0 release notes [1] [2], it seems we didn't mention the change to the
default restart delay, and we'd better learn from that this time. Thanks.

[1]
https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html

Best Regards,
Yu
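For context, the delay being discussed is the one users override as below; the values are only illustrative, showing where the knob lives rather than recommending a number.

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hedged sketch: overriding the fixed-delay restart strategy in code, equivalent
// to setting "restart-strategy.fixed-delay.delay: 10 s" in flink-conf.yaml.
public class RestartDelaySketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 3 attempts, 10 s between restarts - the kind of override mentioned in this thread.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
    }
}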


On Sun, 1 Sep 2019 at 04:33, Steven Wu  wrote:

> +1 on what Zhu Zhu said.
>
> We also override the default to 10 s.
>
> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu  wrote:
>
>> In our production, we usually override the restart delay to be 10 s.
>> We once encountered cases where external services were overwhelmed by
>> reconnections from frequently restarted tasks.
>> As a safer though not optimal option, a default delay larger than 0 s
>> is better in my opinion.
>>
>>
>> 未来阳光 <2217232...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>>
>>> Hi,
>>>
>>>
>>> I thinks it's better to increase the default value. +1
>>>
>>>
>>> Best.
>>>
>>>
>>>
>>>
>>> -- Original Message --
>>> From: "Till Rohrmann";
>>> Sent: Fri, Aug 30, 2019, 10:07 PM
>>> To: "dev"; "user";
>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>
>>>
>>>
>>> Hi everyone,
>>>
>>> I wanted to reach out to you and ask whether decreasing the default delay
>>> to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
>>> user reported that he would like to increase the default value because it
>>> can cause restart storms in case of systematic faults [2].
>>>
>>> The downside of increasing the default delay would be a slightly
>>> increased
>>> restart time if this config option is not explicitly set.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>
>>> Cheers,
>>> Till
>>
>>


Re: [ANNOUNCE] Kinesis connector becomes part of Flink releases

2019-09-01 Thread Yu Li
Great to know, thanks for the efforts Bowen!

And I believe it worth a release note in the original JIRA, wdyt? Thanks.

Best Regards,
Yu


On Sat, 31 Aug 2019 at 11:01, Bowen Li  wrote:

> Hi all,
>
> I'm glad to announce that, as #9494
> was merged today,
> flink-connector-kinesis is officially of Apache 2.0 license now in master
> branch and its artifact will be deployed to Maven central as part of Flink
> releases starting from Flink 1.10.0. Users can use the artifact out of
> shelf then and no longer have to build and maintain it on their own.
>
> It brings a much better user experience to our large AWS customer base by
> making their work simpler, smoother, and more productive!
>
> Thanks everyone who participated in coding and review to drive this
> initiative forward.
>
> Cheers,
> Bowen
>


Re: Making broadcast state queryable?

2019-09-15 Thread Yu Li
Thanks for the reply Oytun (and sorry for the late response, somehow just
noticed).

Requirement received, interesting one. Let's see whether this could draw
any attention from the committers (smile).

Best Regards,
Yu


On Fri, 6 Sep 2019 at 22:14, Oytun Tez  wrote:

> Hi Yu,
>
> Excuse my late reply... We simply want Flink to be our centralized stream
> analysis platform, where we 1) consume data, 2) generate analysis, 3)
> present analysis. I honestly don't want "stream analysis" to spill out to
> other components in our ecosystem (e.g., sinking insights into a DB-like
> place).
>
> So the case for QS for us is centralization of input, output,
> presentation. State Processor API for instance also counts as a
> presentation tool for us (on top of migration tool).
>
> This kind of all-in-one (in, out, ui) packaging helps with small teams.
>
> ---
> Oytun Tez
>
> *M O T A W O R D*
> The World's Fastest Human Translation Platform.
> oy...@motaword.com — www.motaword.com
>
>
> On Wed, Aug 14, 2019 at 3:13 AM Yu Li  wrote:
>
>> Good to know your thoughts and the coming talk in Flink Forward Berlin
>> Oytun, interesting topic and great job! And it's great to hear the voice
>> from application perspective.
>>
>> Could you share, if possible, the reason why you need to open the state
>> to outside instead of writing the result to sink and directly query there?
>> In another thread there's a case that sink writes to different kafka topics
>> so state is the only place to get a global view, is this the same case
>> you're facing? Or some different requirements? I believe more attention
>> will be drawn to QS if more and more user requirements emerge (smile).
>>
>> Thanks.
>>
>> Best Regards,
>> Yu
>>
>>
>> On Wed, 14 Aug 2019 at 00:50, Oytun Tez  wrote:
>>
>>> Thank you for the honest response, Yu!
>>>
>>> There is so much that comes to mind when we look at Flink as a
>>> "application framework" (my talk
>>> <https://europe-2019.flink-forward.org/conference-program#not-so-big-%E2%80%93-flink-as-a-true-application-framework>
>>> in Flink Forward in Berlin will be about this). QS is one of them
>>> (need-wise, not QS itself necessarily). I opened the design doc Vino Yang
>>> created, I'll try to contribute to that.
>>>
>>> Meanwhile, for the sake of opening our state to outside, we will put a
>>> stupid simple operator in between to keep a *duplicate* of the state...
>>>
>>> Thanks again!
>>>
>>>
>>>
>>>
>>>
>>> ---
>>> Oytun Tez
>>>
>>> *M O T A W O R D*
>>> The World's Fastest Human Translation Platform.
>>> oy...@motaword.com — www.motaword.com
>>>
>>>
>>> On Tue, Aug 13, 2019 at 6:29 PM Yu Li  wrote:
>>>
>>>> Hi Oytun,
>>>>
>>>> Sorry but TBH such support will probably not be added in the
>>>> foreseeable future due to lack of committer bandwidth (not only support
>>>> queryable broadcast state but all about QueryableState module) as pointed
>>>> out in other threads [1] [2].
>>>>
>>>> However, I think you could open a JIRA for this so when things changed
>>>> it could be taken care of. Thanks.
>>>>
>>>> [1] https://s.apache.org/MaOl
>>>> [2] https://s.apache.org/r8k8a
>>>>
>>>> Best Regards,
>>>> Yu
>>>>
>>>>
>>>> On Tue, 13 Aug 2019 at 20:34, Oytun Tez  wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> Can we set a broadcast state as queryable? I've looked around, not
>>>>> much to find about it. I am receiving UnknownKvStateLocation when I try to
>>>>> query with the descriptor/state name I give to the broadcast state.
>>>>>
>>>>> If it doesn't work, what could be the alternative? My mind goes around
>>>>> ctx.getBroadcastState and making it queryable in the operator level (I'd
>>>>> rather not).
>>>>>
>>>>> ---
>>>>> Oytun Tez
>>>>>
>>>>> *M O T A W O R D*
>>>>> The World's Fastest Human Translation Platform.
>>>>> oy...@motaword.com — www.motaword.com
>>>>>
>>>>


Re: [PROPOSAL] Contribute Stateful Functions to Apache Flink

2019-10-12 Thread Yu Li
Hi Stephan,

Big +1 for adding stateful functions to Flink. I believe a lot of user
would be interested to try this out and I could imagine how this could
contribute to reduce the TCO for business requiring both streaming
processing and stateful functions.

And my 2 cents is to put it into flink core repository since I could see a
tight connection between this library and flink state.

Best Regards,
Yu


On Sat, 12 Oct 2019 at 17:31, jincheng sun  wrote:

> Hi Stephan,
>
> bit +1 for adding this great features to Apache Flink.
>
> Regarding where we should place it, put it into Flink core repository or
> create a separate repository? I prefer put it into main repository and
> looking forward the more detail discussion for this decision.
>
> Best,
> Jincheng
>
>
> Jingsong Li  于2019年10月12日周六 上午11:32写道:
>
>> Hi Stephan,
>>
>> big +1 for this contribution. It provides another user interface that is
>> easy to use and popular at this time. these functions, It's hard for users
>> to write in SQL/TableApi, while using DataStream is too complex. (We've
>> done some stateFun kind jobs using DataStream before). With statefun, it is
>> very easy.
>>
>> I think it's also a good opportunity to exercise Flink's core
>> capabilities. I looked at stateful-functions-flink briefly, it is very
>> interesting. I think there are many other things Flink can improve. So I
>> think it's a better thing to put it into Flink, and the improvement for it
>> will be more natural in the future.
>>
>> Best,
>> Jingsong Lee
>>
>> On Fri, Oct 11, 2019 at 7:33 PM Dawid Wysakowicz 
>> wrote:
>>
>>> Hi Stephan,
>>>
>>> I think this is a nice library, but what I like more about it is that it
>>> suggests exploring different use-cases. I think it definitely makes sense
>>> for the Flink community to explore more lightweight applications that
>>> reuses resources. Therefore I definitely think it is a good idea for Flink
>>> community to accept this contribution and help maintaining it.
>>>
>>> Personally I'd prefer to have it in a separate repository. There were a
>>> few discussions before where different people were suggesting to extract
>>> connectors and other libraries to separate repositories. Moreover I think
>>> it could serve as an example for the Flink ecosystem website[1]. This could
>>> be the first project in there and give a good impression that the community
>>> sees potential in the ecosystem website.
>>>
>>> Lastly, I'm wondering if this should go through PMC vote according to
>>> our bylaws[2]. In the end the suggestion is to adopt an existing code base
>>> as is. It also proposes a new programs concept that could result in a shift
>>> of priorities for the community in a long run.
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>> [1]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Create-a-Flink-ecosystem-website-td27519.html
>>>
>>> [2] https://cwiki.apache.org/confluence/display/FLINK/Flink+Bylaws
>>> On 11/10/2019 13:12, Till Rohrmann wrote:
>>>
>>> Hi Stephan,
>>>
>>> +1 for adding stateful functions to Flink. I believe the new set of
>>> applications this feature will unlock will be super interesting for new and
>>> existing Flink users alike.
>>>
>>> One reason for not including it in the main repository would to not
>>> being bound to Flink's release cadence. This would allow to release faster
>>> and more often. However, I believe that having it eventually in Flink's
>>> main repository would be beneficial in the long run.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Oct 11, 2019 at 12:56 PM Trevor Grant 
>>> wrote:
>>>
 +1 non-binding on contribution.

 Separate repo, or feature branch to start maybe? I just feel like in
 the beginning this thing is going to have lots of breaking changes that
 maybe aren't going to fit well with tests / other "v1+" release code. Just
 my .02.



 On Fri, Oct 11, 2019 at 4:38 AM Stephan Ewen  wrote:

> Dear Flink Community!
>
> Some of you probably heard it already: On Tuesday, at Flink Forward
> Berlin, we announced **Stateful Functions**.
>
> Stateful Functions is a library on Flink to implement general purpose
> applications. It is built around stateful functions (who would have thunk)
> that can communicate arbitrarily through messages, have consistent
> state, and a small resource footprint. They are a bit like keyed
> ProcessFunctions
> that can send each other messages.
> As simple as this sounds, this means you can now communicate in
> non-DAG patterns, so it allows users to build programs they cannot build
> with Flink.
> It also has other neat properties, like multiplexing of functions,
> modular composition, tooling both container-based deployments and
> as-a-Flink-job deployments.
>
> You can find out more about it here
>   - Website: https://statefun.io/
>   - Code: https://github.com/ververica/stateful-functions
>   - Talk with moti

Re: Per Operator State Monitoring

2019-11-26 Thread Yu Li
Hi Aaron,

I don't think we have such fine grained metrics on per operation state
size, but from your description that "YARN kills containers who are
exceeding their memory limits", I think the root cause is not the state
size but related to the memory consumption of the state backend.

My guess is you are using RocksDB state backend because with heap backend
you won't exceed the Xmx limit of the JVM and could hardly get killed by
Yarn (unless you're spawning huge amount of threads in your operator logic,
in which case it has nothing to do with state). And if I'm correct, could
you carefully check your memory settings to make sure it could cover all
memory usage of RocksDB [1]? You may also find some good solution in
FLINK-7289 [2] to prevent RocksDB memory leak.

We're trying hard to supply a much easier way to control the total memory
of RocksDB backend in 1.10 release (target to be released in Jan. 2020),
and sorry for the trouble to understand some internals of RocksDB for the
time being.

Hope the information helps.

Best Regards,
Yu

[1] https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
[2] https://issues.apache.org/jira/browse/FLINK-7289


On Mon, 25 Nov 2019 at 21:28, Piotr Nowojski  wrote:

> Hi,
>
> I’m not sure if there is some simple way of doing that (maybe some other
> contributors will know more).
>
> There are two potential ideas worth exploring:
> - use periodically triggered save points for monitoring? If I remember
> correctly save points are never incremental
> - use save point input/output format to analyse the content of the save
> point? [1]
>
> I hope that someone else from the community will be able to help more here.
>
> Piotrek
>
> [1] https://flink.apache.org/feature/2019/09/13/state-processor-api.html
>
> On 22 Nov 2019, at 22:48, Aaron Langford 
> wrote:
>
> Hey Flink Community,
>
> I'm working on a Flink application where we are implementing operators
> that extend the RichFlatMap and RichCoFlatMap interfaces. As a result, we
> are working directly with Flink's state API (ValueState, ListState,
> MapState). Something that appears to be extremely valuable is having a way
> to monitor the state size for each operator. My team has already run into a
> few cases where our state has exploded and jobs fail because YARN kills
> containers who are exceeding their memory limits.
>
> It is my understanding that the way to best monitor this kind of thing by
> watching checkpoint size per operator instance. This gets a little
> confusing when doing incremental check-pointing because the numbers
> reported seem to be a delta in state size, not the actual state size at
> that point in time. For my teams application, the total state size is not
> the sum of those deltas. What is the best way to get the total size of a
> checkpoint per operator for each checkpoint?
>
> Additionally, monitoring de-serializing and serializing state in a Flink
> application is something that I haven't seen a great story for yet. It
> seems that some of the really badly written Flink operators tend to do most
> poorly when they demand lots of serde for each record. So giving visibility
> into how well an application is handling these types of operations seems to
> be a valuable guard rail for flink developers. Does anyone have existing
> solutions for this, or are there pointers to some work that can be done to
> improve this story?
>
> Aaron
>
>
>


Re: Status of FLINK-12692 (Support disk spilling in HeapKeyedStateBackend)

2020-01-30 Thread Yu Li
Hi Ken,

Thanks for watching this feature.

Unfortunately yes this didn't make into 1.10, and we will try our best to
complete the upstreaming work in 1.11.0. Meantime, we are actively
preparing a try-out version and will publish it onto flink-packages [1]
once ready. Will also get back here once the package is ready.

Best Regards,
Yu

[1] https://flink-packages.org/


On Thu, 30 Jan 2020 at 08:47, Ken Krugler 
wrote:

> Hi Yu Li,
>
> It looks like this stalled out a bit, from May of last year, and won’t
> make it into 1.10.
>
> I’m wondering if there’s a version in Blink (as a completely separate
> state backend?) that could be tried out?
>
> Thanks,
>
> — Ken
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>


Re: Status of FLINK-12692 (Support disk spilling in HeapKeyedStateBackend)

2020-01-30 Thread Yu Li
btw, the upstreaming work is lagged but not stalled, and the biggest PR
(with more than 6k lines of codes) [1] [2] was already merged on Nov. 2019.

We are sorry about the lagging, but we're still actively working on this
feature, JFYI (smile).

Best Regards,
Yu

[1] https://issues.apache.org/jira/browse/FLINK-12697
[2] https://github.com/apache/flink/pull/9501

On Fri, 31 Jan 2020 at 15:27, Yu Li  wrote:

> Hi Ken,
>
> Thanks for watching this feature.
>
> Unfortunately yes this didn't make into 1.10, and we will try our best to
> complete the upstreaming work in 1.11.0. Meantime, we are actively
> preparing a try-out version and will publish it onto flink-packages [1]
> once ready. Will also get back here once the package is ready.
>
> Best Regards,
> Yu
>
> [1] https://flink-packages.org/
>
>
> On Thu, 30 Jan 2020 at 08:47, Ken Krugler 
> wrote:
>
>> Hi Yu Li,
>>
>> It looks like this stalled out a bit, from May of last year, and won’t
>> make it into 1.10.
>>
>> I’m wondering if there’s a version in Blink (as a completely separate
>> state backend?) that could be tried out?
>>
>> Thanks,
>>
>> — Ken
>>
>> --
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>>
>>


[ANNOUNCE] Apache Flink 1.10.0 released

2020-02-12 Thread Yu Li
The Apache Flink community is very happy to announce the release of Apache
Flink 1.10.0, which is the latest major release.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the improvements
for this new major release:
https://flink.apache.org/news/2020/02/11/release-1.10.0.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Cheers,
Gary & Yu


Re: [ANNOUNCE] Apache Flink 1.10.0 released

2020-02-12 Thread Yu Li
The link for FLINK-14516 is fixed. Thanks for pointing it out Hailu!

Best Regards,
Yu


On Thu, 13 Feb 2020 at 07:19, Hailu, Andreas  wrote:

> Congrats all!
>
>
>
> P.S. I noticed in the release notes that the bullet:
>
>
>
> *[FLINK-14516 <https://issues.apache.org/jira/browse/FLINK-13884>] The
> non-credit-based network flow control code was removed, along with the
> configuration option taskmanager.network.credit.model. Moving forward,
> Flink will always use credit-based flow control.*
>
>
>
> Mistakenly links to FLINK-13884
> <https://issues.apache.org/jira/browse/FLINK-13884> J
>
>
>
>
>
> *From:* Yu Li 
> *Sent:* Wednesday, February 12, 2020 8:31 AM
> *To:* dev ; user ;
> annou...@apache.org
> *Subject:* [ANNOUNCE] Apache Flink 1.10.0 released
>
>
>
> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.10.0, which is the latest major release.
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__flink.apache.org_downloads.html&d=DwMFaQ&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=U5iy1Gwl6UtxXQWaezDnF9MCz1FM6mW1BobrJO9TAEk&s=5YXCcQyFvn9OfuoKI7QLJcK43J_8cU302ii2xNL4Uds&e=>
>
> Please check out the release blog post for an overview of the improvements
> for this new major release:
> https://flink.apache.org/news/2020/02/11/release-1.10.0.html
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__flink.apache.org_news_2020_02_11_release-2D1.10.0.html&d=DwMFaQ&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=U5iy1Gwl6UtxXQWaezDnF9MCz1FM6mW1BobrJO9TAEk&s=Crfd70Sbkff-3uGqQ8vNMfY03BSrzVdASF3U6y2nquc&e=>
>
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_secure_ReleaseNote.jspa-3FprojectId-3D12315522-26version-3D12345845&d=DwMFaQ&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=U5iy1Gwl6UtxXQWaezDnF9MCz1FM6mW1BobrJO9TAEk&s=yd8LNDY1UClmlfV5qhcH8uGFQYA9mf1wvY0TiKzY5OY&e=>
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Cheers,
> Gary & Yu
>
> --
>
> Your Personal Data: We may collect and process information about you that
> may be subject to data protection laws. For more information about how we
> use and disclose your personal data, how we protect your information, our
> legal basis to use your information, your rights and who you can contact,
> please refer to: www.gs.com/privacy-notices
>


Re: [ANNOUNCE] Apache Flink 1.10.0 released

2020-02-12 Thread Yu Li
Hi Kristoff,

Thanks for the question.

About Java 11 support, please allow me to quote from our release note [1]:

Lastly, note that the connectors for Cassandra, Hive, HBase, and Kafka
0.8–0.11
have not been tested with Java 11 because the respective projects did not
provide
Java 11 support at the time of the Flink 1.10.0 release

Which is the main reason for us to still make our docker image based on JDK
8.

Hope this answers your question.

Best Regards,
Yu

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/release-notes/flink-1.10.html


On Wed, 12 Feb 2020 at 23:43, KristoffSC 
wrote:

> Hi all,
> I have a small question regarding 1.10
>
> Correct me if I'm wrong, but 1.10 should support Java 11 right?
>
> If so, then I noticed that docker images [1] referenced in [2] are still
> based on openjdk8 not Java 11.
>
> Whats up with that?
>
> P.S.
> Congrats on releasing 1.10 ;)
>
> [1]
>
> https://github.com/apache/flink/blob/release-1.10/flink-container/docker/Dockerfile
> [2]
>
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer

2020-02-23 Thread Yu Li
Congratulations Jingsong! Well deserved.

Best Regards,
Yu


On Mon, 24 Feb 2020 at 14:10, Congxian Qiu  wrote:

> Congratulations Jingsong!
>
> Best,
> Congxian
>
>
> jincheng sun  于2020年2月24日周一 下午1:38写道:
>
>> Congratulations Jingsong!
>>
>> Best,
>> Jincheng
>>
>>
>> Zhu Zhu  于2020年2月24日周一 上午11:55写道:
>>
>>> Congratulations Jingsong!
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> Fabian Hueske  于2020年2月22日周六 上午1:30写道:
>>>
 Congrats Jingsong!

 Cheers, Fabian

 Am Fr., 21. Feb. 2020 um 17:49 Uhr schrieb Rong Rong <
 walter...@gmail.com>:

 > Congratulations Jingsong!!
 >
 > Cheers,
 > Rong
 >
 > On Fri, Feb 21, 2020 at 8:45 AM Bowen Li  wrote:
 >
 > > Congrats, Jingsong!
 > >
 > > On Fri, Feb 21, 2020 at 7:28 AM Till Rohrmann >>> >
 > > wrote:
 > >
 > >> Congratulations Jingsong!
 > >>
 > >> Cheers,
 > >> Till
 > >>
 > >> On Fri, Feb 21, 2020 at 4:03 PM Yun Gao 
 wrote:
 > >>
 > >>>   Congratulations Jingsong!
 > >>>
 > >>>Best,
 > >>>Yun
 > >>>
 > >>> --
 > >>> From:Jingsong Li 
 > >>> Send Time:2020 Feb. 21 (Fri.) 21:42
 > >>> To:Hequn Cheng 
 > >>> Cc:Yang Wang ; Zhijiang <
 > >>> wangzhijiang...@aliyun.com>; Zhenghua Gao ;
 godfrey
 > >>> he ; dev ; user <
 > >>> user@flink.apache.org>
 > >>> Subject:Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer
 > >>>
 > >>> Thanks everyone~
 > >>>
 > >>> It's my pleasure to be part of the community. I hope I can make a
 > better
 > >>> contribution in future.
 > >>>
 > >>> Best,
 > >>> Jingsong Lee
 > >>>
 > >>> On Fri, Feb 21, 2020 at 2:48 PM Hequn Cheng 
 wrote:
 > >>> Congratulations Jingsong! Well deserved.
 > >>>
 > >>> Best,
 > >>> Hequn
 > >>>
 > >>> On Fri, Feb 21, 2020 at 2:42 PM Yang Wang 
 > wrote:
 > >>> Congratulations!Jingsong. Well deserved.
 > >>>
 > >>>
 > >>> Best,
 > >>> Yang
 > >>>
 > >>> Zhijiang  于2020年2月21日周五 下午1:18写道:
 > >>> Congrats Jingsong! Welcome on board!
 > >>>
 > >>> Best,
 > >>> Zhijiang
 > >>>
 > >>> --
 > >>> From:Zhenghua Gao 
 > >>> Send Time:2020 Feb. 21 (Fri.) 12:49
 > >>> To:godfrey he 
 > >>> Cc:dev ; user 
 > >>> Subject:Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer
 > >>>
 > >>> Congrats Jingsong!
 > >>>
 > >>>
 > >>> *Best Regards,*
 > >>> *Zhenghua Gao*
 > >>>
 > >>>
 > >>> On Fri, Feb 21, 2020 at 11:59 AM godfrey he 
 > wrote:
 > >>> Congrats Jingsong! Well deserved.
 > >>>
 > >>> Best,
 > >>> godfrey
 > >>>
 > >>> Jeff Zhang  于2020年2月21日周五 上午11:49写道:
 > >>> Congratulations!Jingsong. You deserve it
 > >>>
 > >>> wenlong.lwl  于2020年2月21日周五 上午11:43写道:
 > >>> Congrats Jingsong!
 > >>>
 > >>> On Fri, 21 Feb 2020 at 11:41, Dian Fu 
 wrote:
 > >>>
 > >>> > Congrats Jingsong!
 > >>> >
 > >>> > > 在 2020年2月21日,上午11:39,Jark Wu  写道:
 > >>> > >
 > >>> > > Congratulations Jingsong! Well deserved.
 > >>> > >
 > >>> > > Best,
 > >>> > > Jark
 > >>> > >
 > >>> > > On Fri, 21 Feb 2020 at 11:32, zoudan  wrote:
 > >>> > >
 > >>> > >> Congratulations! Jingsong
 > >>> > >>
 > >>> > >>
 > >>> > >> Best,
 > >>> > >> Dan Zou
 > >>> > >>
 > >>> >
 > >>> >
 > >>>
 > >>>
 > >>> --
 > >>> Best Regards
 > >>>
 > >>> Jeff Zhang
 > >>>
 > >>>
 > >>>
 > >>> --
 > >>> Best, Jingsong Lee
 > >>>
 > >>>
 > >>>
 >

>>>


Re: FsStateBackend vs RocksDBStateBackend

2020-02-23 Thread Yu Li
Yes FsStateBackend would be the best fit for state access performance in
this case. Just a reminder that FsStateBackend will upload the full dataset
to DFS during checkpointing, so please watch the network bandwidth usage
and make sure it won't become a new bottleneck.

Best Regards,
Yu


On Fri, 21 Feb 2020 at 20:56, Robert Metzger  wrote:

> I would try the FsStateBackend in this scenario, as you have enough memory
> available.
>
> On Thu, Jan 30, 2020 at 5:26 PM Ran Zhang  wrote:
>
>> Hi Gordon,
>>
>> Thanks for your reply! Regarding state size - we are at 200-300gb but we
>> have 120 parallelism which will make each task handle ~2 - 3 gb state.
>> (when we submit the job we are setting tm memory to 15g.) In this scenario
>> what will be the best fit for statebackend?
>>
>> Thanks,
>> Ran
>>
>> On Wed, Jan 29, 2020 at 6:37 PM Tzu-Li (Gordon) Tai 
>> wrote:
>>
>>> Hi Ran,
>>>
>>> On Thu, Jan 30, 2020 at 9:39 AM Ran Zhang 
>>> wrote:
>>>
 Hi all,

 We have a Flink app that uses a KeyedProcessFunction, and in the
 function it requires a ValueState(of TreeSet) and the processElement method
 needs to access and update it. We tried to use RocksDB as our stateBackend
 but the performance is not good, and intuitively we think it was because of
 the serialization / deserialization on each processElement call.

>>>
>>> As you have already pointed out, serialization behaviour is a major
>>> difference between the 2 state backends, and will directly impact
>>> performance due to the extra runtime overhead in RocksDB.
>>> If you plan to continue using the RocksDB state backend, make sure to
>>> use MapState instead of ValueState where possible, since every access to
>>> the ValueState in the RocksDB backend requires serializing / deserializing
>>> the whole value.
>>> For MapState, de-/serialization happens per K-V access. Whether or not
>>> this makes sense would of course depend on your state access pattern.
>>>
>>>
 Then we tried to switch to use FsStateBackend (which keeps the
 in-flight data in the TaskManager’s memory according to doc), and it could
 resolve the performance issue. *So we want to understand better what
 are the tradeoffs in choosing between these 2 stateBackend.* Our
 checkpoint size is 200 - 300 GB in stable state. For now we know one
 benefits of RocksDB is it supports incremental checkpoint, but would love
 to know what else we are losing in choosing FsStateBackend.

>>>
>>> As of now, feature-wise both backends support asynchronous snapshotting,
>>> state schema evolution, and access via the State Processor API.
>>> In the end, the major factor for deciding between the two state backends
>>> would be your expected state size.
>>> That being said, it could be possible in the future that savepoint
>>> formats for the backends are changed to be compatible, meaning that you
>>> will be able to switch between different backends upon restore [1].
>>>
>>>

 Thanks a lot!
 Ran Zhang

>>>
>>> Cheers,
>>> Gordon
>>>
>>>  [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-41%3A+Unify+Binary+format+for+Keyed+State
>>>
>>


Re: [DISCUSS] Drop Savepoint Compatibility with Flink 1.2

2020-02-23 Thread Yu Li
+1 for dropping savepoint compatibility with Flink 1.2.

Best Regards,
Yu


On Sat, 22 Feb 2020 at 22:05, Ufuk Celebi  wrote:

> Hey Stephan,
>
> +1.
>
> Reading over the linked ticket and your description here, I think it makes
> a lot of sense to go ahead with this. Since it's possible to upgrade via
> intermediate Flink releases as a fail-safe I don't have any concerns.
>
> – Ufuk
>
>
> On Fri, Feb 21, 2020 at 4:34 PM Till Rohrmann 
> wrote:
> >
> > +1 for dropping savepoint compatibility with Flink 1.2.
> >
> > Cheers,
> > Till
> >
> > On Thu, Feb 20, 2020 at 6:55 PM Stephan Ewen  wrote:
> >>
> >> Thank you for the feedback.
> >>
> >> Here is the JIRA issue with some more explanation also about the
> background and implications:
> >> https://jira.apache.org/jira/browse/FLINK-16192
> >>
> >> Best,
> >> Stephan
> >>
> >>
> >> On Thu, Feb 20, 2020 at 2:26 PM vino yang 
> wrote:
> >>>
> >>> +1 for dropping Savepoint compatibility with Flink 1.2
> >>>
> >>> Flink 1.2 is quite far away from the latest 1.10. Especially after the
> release of Flink 1.9 and 1.10, the code and architecture have undergone
> major changes.
> >>>
> >>> Currently, I am updating state migration tests for Flink 1.10. I can
> still see some binary snapshot files of version 1.2. If we agree on this
> topic, we may be able to alleviate some of the burdens(remove those binary
> files) when the migration tests would be updated later.
> >>>
> >>> Best,
> >>> Vino
> >>>
> >>> Theo Diefenthal  于2020年2月20日周四
> 下午9:04写道:
> 
>  +1 for dropping compatibility.
> 
>  I personally think that it is very important for a project to keep a
> good pace in developing that old legacy stuff must be dropped from time to
> time. As long as there is an upgrade routine (via going to another flink
> release) that's fine.
> 
>  
>  Von: "Stephan Ewen" 
>  An: "dev" , "user" 
>  Gesendet: Donnerstag, 20. Februar 2020 11:11:43
>  Betreff: [DISCUSS] Drop Savepoint Compatibility with Flink 1.2
> 
>  Hi all!
>  For some cleanup and simplifications, it would be helpful to drop
> Savepoint compatibility with Flink version 1.2. That version was released
> almost three years ago.
> 
>  I would expect that no one uses that old version any more in a way
> that they actively want to upgrade directly to 1.11.
> 
>  Even if, there is still the way to first upgrade to another version
> (like 1.9) and then upgrade to 1.11 from there.
> 
>  Any concerns to drop that support?
> 
>  Best,
>  Stephan
> 
> 
>  --
>  SCOOP Software GmbH - Gut Maarhausen - Eiler Straße 3 P - D-51107 Köln
>  Theo Diefenthal
> 
>  T +49 221 801916-196 - F +49 221 801916-17 - M +49 160 90506575
>  theo.diefent...@scoop-software.de - www.scoop-software.de
>  Sitz der Gesellschaft: Köln, Handelsregister: Köln,
>  Handelsregisternummer: HRB 36625
>  Geschäftsführung: Dr. Oleg Balovnev, Frank Heinen,
>  Martin Müller-Rohde, Dr. Wolfgang Reddig, Roland Scheel
>
>>


Re: [ANNOUNCE] Apache Flink 1.13.2 released

2021-08-09 Thread Yu Li
Thanks Yun Tang for being our release manager and everyone else who made
the release possible!

Best Regards,
Yu


On Fri, 6 Aug 2021 at 13:52, Yun Tang  wrote:

>
> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.13.2, which is the second bugfix release for the Apache Flink 1.13
> series.
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Please check out the release blog post for an overview of the improvements
> for this bugfix release:
> https://flink.apache.org/news/2021/08/06/release-1.13.2.html
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350218&styleName=&projectId=12315522
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Regards,
> Yun Tang
>


Re: [VOTE] Release 1.8.0, release candidate #3

2019-03-20 Thread Yu Li
-1, observed stably failure on streaming bucketing end-to-end test case in
two different environments (Linux/MacOS) when running with both shaded
hadoop-2.8.3 jar file

and hadoop-2.8.5 dist
, while both env
could pass with hadoop 2.6.5. More details please refer to this comment

in FLINK-11972.

Best Regards,
Yu


On Thu, 21 Mar 2019 at 04:25, jincheng sun  wrote:

> Thanks for the quick fix Aljoscha! The FLINK-11971
>  has been merged.
>
> Cheers,
> Jincheng
>
> Piotr Nowojski  于2019年3月21日周四 上午12:29写道:
>
>> -1 from my side due to performance regression found in the master branch
>> since Jan 29th.
>>
>> In 10% JVM forks it was causing huge performance drop in some of the
>> benchmarks (up to 30-50% reduced throughput), which could mean that one out
>> of 10 task managers could be affected by it. Today we have merged a fix for
>> it [1]. First benchmark run was promising [2], but we have to wait until
>> tomorrow to make sure that the problem was definitely resolved. If that’s
>> the case, I would recommend including it in 1.8.0, because we really do not
>> know how big of performance regression this issue can be in the real world
>> scenarios.
>>
>> Regarding the second regression from mid February. We have found the
>> responsible commit and this one is probably just a false positive. Because
>> of the nature some of the benchmarks, they are running with low number of
>> records (300k). The apparent performance regression was caused by higher
>> initialisation time. When I temporarily increased the number of records to
>> 2M, the regression was gone. Together with Till and Stefan Richter we
>> discussed the potential impact of this longer initialisation time (in the
>> case of said benchmarks initialisation time increased from 70ms to 120ms)
>> and we think that it’s not a critical issue, that doesn’t have to block the
>> release. Nevertheless there might some follow up work for this.
>>
>> [1] https://github.com/apache/flink/pull/8020
>> [2] http://codespeed.dak8s.net:8000/timeline/?ben=tumblingWindow&env=2
>>
>> Piotr Nowojski
>>
>> On 20 Mar 2019, at 10:09, Aljoscha Krettek  wrote:
>>
>> Thanks Jincheng! It would be very good to fix those but as you said, I
>> would say they are not blockers.
>>
>> On 20. Mar 2019, at 09:47, Kurt Young  wrote:
>>
>> +1 (non-binding)
>>
>> Checked items:
>> - checked checksums and GPG files
>> - verified that the source archives do not contains any binaries
>> - checked that all POM files point to the same version
>> - build from source successfully
>>
>> Best,
>> Kurt
>>
>>
>> On Wed, Mar 20, 2019 at 2:12 PM jincheng sun 
>> wrote:
>>
>>> Hi Aljoscha&All,
>>>
>>> When I did the `end-to-end` test for RC3 under Mac OS, I found the
>>> following two problems:
>>>
>>> 1. The verification returned for different `minikube status` is is not
>>> enough for the robustness. The strings returned by different versions of
>>> different platforms are different. the following misjudgment is caused:
>>> When the `Command: start_kubernetes_if_not_ruunning failed` error
>>> occurs, the minikube has actually started successfully. The core reason is
>>> that there is a bug in the `test_kubernetes_embedded_job.sh` script. See
>>> FLINK-11971  for
>>> details.
>>>
>>> 2. Since the difference between 1.8.0 and 1.7.x is that 1.8.x does not
>>> put the `hadoop-shaded` JAR integrated into the dist.  It will cause an
>>> error when the end-to-end test cannot be found with `Hadoop` Related
>>> classes,  such as: `java.lang.NoClassDefFoundError:
>>> Lorg/apache/hadoop/fs/FileSystem`. So we need to improve the end-to-end
>>> test script, or explicitly stated in the README, i.e. end-to-end test need
>>> to add `flink-shaded-hadoop2-uber-.jar` to the classpath. See
>>> FLINK-11972  for
>>> details.
>>>
>>> I think this is not a blocker for release-1.8.0, but I think it would be
>>> better to include those commits in release-1.8 If we still have performance
>>> related bugs should be fixed.
>>>
>>> What do you think?
>>>
>>> Best,
>>> Jincheng
>>>
>>>
>>> Aljoscha Krettek  于2019年3月19日周二 下午7:58写道:
>>>
 Hi All,

 The release process for Flink 1.8.0 is currently ongoing. Please have a
 look at the thread, in case you’re interested in checking your applications
 against this next release of Apache Flink and participate in the process.

 Best,
 Aljoscha

 Begin forwarded message:

 *From: *Aljoscha Krettek 
 *Subjec

Re: [VOTE] Release 1.8.0, release candidate #3

2019-03-21 Thread Yu Li
Thanks @jincheng

@Aljoscha I've just opened FLINK-11990
<https://issues.apache.org/jira/browse/FLINK-11990> for the HDFS
BucketingSink issue with hadoop 2.8. IMHO it might be a blocker for 1.8.0
and need your confirmation. Thanks.

Best Regards,
Yu


On Thu, 21 Mar 2019 at 15:57, jincheng sun  wrote:

> Thanks for the quick fix, Yu. the PR of FLINK-11972
> <https://issues.apache.org/jira/browse/FLINK-11972> has been merged.
>
> Cheers,
> Jincheng
>
> Yu Li  于2019年3月21日周四 上午7:23写道:
>
>> -1, observed stably failure on streaming bucketing end-to-end test case
>> in two different environments (Linux/MacOS) when running with both shaded
>> hadoop-2.8.3 jar file
>> <https://repository.apache.org/content/repositories/orgapacheflink-1213/org/apache/flink/flink-shaded-hadoop2-uber/2.8.3-1.8.0/flink-shaded-hadoop2-uber-2.8.3-1.8.0.jar>
>> and hadoop-2.8.5 dist
>> <http://archive.apache.org/dist/hadoop/core/hadoop-2.8.5/>, while both
>> env could pass with hadoop 2.6.5. More details please refer to this
>> comment
>> <https://issues.apache.org/jira/browse/FLINK-11972?focusedCommentId=16797614&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16797614>
>> in FLINK-11972.
>>
>> Best Regards,
>> Yu
>>
>>
>> On Thu, 21 Mar 2019 at 04:25, jincheng sun 
>> wrote:
>>
>>> Thanks for the quick fix Aljoscha! The FLINK-11971
>>> <https://issues.apache.org/jira/browse/FLINK-11971> has been merged.
>>>
>>> Cheers,
>>> Jincheng
>>>
>>> Piotr Nowojski  于2019年3月21日周四 上午12:29写道:
>>>
>>>> -1 from my side due to performance regression found in the master
>>>> branch since Jan 29th.
>>>>
>>>> In 10% JVM forks it was causing huge performance drop in some of the
>>>> benchmarks (up to 30-50% reduced throughput), which could mean that one out
>>>> of 10 task managers could be affected by it. Today we have merged a fix for
>>>> it [1]. First benchmark run was promising [2], but we have to wait until
>>>> tomorrow to make sure that the problem was definitely resolved. If that’s
>>>> the case, I would recommend including it in 1.8.0, because we really do not
>>>> know how big of performance regression this issue can be in the real world
>>>> scenarios.
>>>>
>>>> Regarding the second regression from mid February. We have found the
>>>> responsible commit and this one is probably just a false positive. Because
>>>> of the nature some of the benchmarks, they are running with low number of
>>>> records (300k). The apparent performance regression was caused by higher
>>>> initialisation time. When I temporarily increased the number of records to
>>>> 2M, the regression was gone. Together with Till and Stefan Richter we
>>>> discussed the potential impact of this longer initialisation time (in the
>>>> case of said benchmarks initialisation time increased from 70ms to 120ms)
>>>> and we think that it’s not a critical issue, that doesn’t have to block the
>>>> release. Nevertheless there might some follow up work for this.
>>>>
>>>> [1] https://github.com/apache/flink/pull/8020
>>>> [2] http://codespeed.dak8s.net:8000/timeline/?ben=tumblingWindow&env=2
>>>>
>>>> Piotr Nowojski
>>>>
>>>> On 20 Mar 2019, at 10:09, Aljoscha Krettek  wrote:
>>>>
>>>> Thanks Jincheng! It would be very good to fix those but as you said, I
>>>> would say they are not blockers.
>>>>
>>>> On 20. Mar 2019, at 09:47, Kurt Young  wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> Checked items:
>>>> - checked checksums and GPG files
>>>> - verified that the source archives do not contains any binaries
>>>> - checked that all POM files point to the same version
>>>> - build from source successfully
>>>>
>>>> Best,
>>>> Kurt
>>>>
>>>>
>>>> On Wed, Mar 20, 2019 at 2:12 PM jincheng sun 
>>>> wrote:
>>>>
>>>>> Hi Aljoscha&All,
>>>>>
>>>>> When I did the `end-to-end` test for RC3 under Mac OS, I found the
>>>>> following two problems:
>>>>>
>>>>> 1. The verification returned for different `minikube status` is is not
>>>>> enough for the robustness. The strings returned by different versions of
>>>>> dif

Re: [VOTE] Release 1.8.0, release candidate #3

2019-03-21 Thread Yu Li
Thanks for the message Aljoscha, let's discuss in JIRA (just replied there).

Best Regards,
Yu


On Thu, 21 Mar 2019 at 21:15, Aljoscha Krettek  wrote:

> Hi Yu,
>
> I commented on the issue. For me both Hadoop 2.8.3 and Hadoop 2.4.1 seem
> to work. Could you have a look at my comment?
>
> I will also cancel this RC because of various issues.
>
> Best,
> Aljoscha
>
> On 21. Mar 2019, at 12:23, Yu Li  wrote:
>
> Thanks @jincheng
>
> @Aljoscha I've just opened FLINK-11990
> <https://issues.apache.org/jira/browse/FLINK-11990> for the HDFS
> BucketingSink issue with hadoop 2.8. IMHO it might be a blocker for 1.8.0
> and need your confirmation. Thanks.
>
> Best Regards,
> Yu
>
>
> On Thu, 21 Mar 2019 at 15:57, jincheng sun 
> wrote:
>
>> Thanks for the quick fix, Yu. the PR of FLINK-11972
>> <https://issues.apache.org/jira/browse/FLINK-11972> has been merged.
>>
>> Cheers,
>> Jincheng
>>
>> Yu Li  于2019年3月21日周四 上午7:23写道:
>>
>>> -1, observed stably failure on streaming bucketing end-to-end test case
>>> in two different environments (Linux/MacOS) when running with both shaded
>>> hadoop-2.8.3 jar file
>>> <https://repository.apache.org/content/repositories/orgapacheflink-1213/org/apache/flink/flink-shaded-hadoop2-uber/2.8.3-1.8.0/flink-shaded-hadoop2-uber-2.8.3-1.8.0.jar>
>>> and hadoop-2.8.5 dist
>>> <http://archive.apache.org/dist/hadoop/core/hadoop-2.8.5/>, while both
>>> env could pass with hadoop 2.6.5. More details please refer to this
>>> comment
>>> <https://issues.apache.org/jira/browse/FLINK-11972?focusedCommentId=16797614&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16797614>
>>> in FLINK-11972.
>>>
>>> Best Regards,
>>> Yu
>>>
>>>
>>> On Thu, 21 Mar 2019 at 04:25, jincheng sun 
>>> wrote:
>>>
>>>> Thanks for the quick fix Aljoscha! The FLINK-11971
>>>> <https://issues.apache.org/jira/browse/FLINK-11971> has been merged.
>>>>
>>>> Cheers,
>>>> Jincheng
>>>>
>>>> Piotr Nowojski  于2019年3月21日周四 上午12:29写道:
>>>>
>>>>> -1 from my side due to performance regression found in the master
>>>>> branch since Jan 29th.
>>>>>
>>>>> In 10% JVM forks it was causing huge performance drop in some of the
>>>>> benchmarks (up to 30-50% reduced throughput), which could mean that one 
>>>>> out
>>>>> of 10 task managers could be affected by it. Today we have merged a fix 
>>>>> for
>>>>> it [1]. First benchmark run was promising [2], but we have to wait until
>>>>> tomorrow to make sure that the problem was definitely resolved. If that’s
>>>>> the case, I would recommend including it in 1.8.0, because we really do 
>>>>> not
>>>>> know how big of performance regression this issue can be in the real world
>>>>> scenarios.
>>>>>
>>>>> Regarding the second regression from mid February. We have found the
>>>>> responsible commit and this one is probably just a false positive. Because
>>>>> of the nature some of the benchmarks, they are running with low number of
>>>>> records (300k). The apparent performance regression was caused by higher
>>>>> initialisation time. When I temporarily increased the number of records to
>>>>> 2M, the regression was gone. Together with Till and Stefan Richter we
>>>>> discussed the potential impact of this longer initialisation time (in the
>>>>> case of said benchmarks initialisation time increased from 70ms to 120ms)
>>>>> and we think that it’s not a critical issue, that doesn’t have to block 
>>>>> the
>>>>> release. Nevertheless there might some follow up work for this.
>>>>>
>>>>> [1] https://github.com/apache/flink/pull/8020
>>>>> [2] http://codespeed.dak8s.net:8000/timeline/?ben=tumblingWindow&env=2
>>>>>
>>>>> Piotr Nowojski
>>>>>
>>>>> On 20 Mar 2019, at 10:09, Aljoscha Krettek 
>>>>> wrote:
>>>>>
>>>>> Thanks Jincheng! It would be very good to fix those but as you said, I
>>>>> would say they are not blockers.
>>>>>
>>>>> On 20. Mar 2019, at 09:47, Kurt Young  wrote:
>>>>>
>>>>> +1 (non-binding)
>>>>&

Re: RocksDB local snapshot sliently disappears and cause checkpoint to fail

2019-03-28 Thread Yu Li
Hi Paul,

Regarding "mistakenly uses the default filesystem scheme, which is
specified to hdfs in the new cluster in my case", could you further clarify
the configuration property and value you're using? Do you mean you're using
an HDFS directory to store the local snapshot data? Thanks.

Best Regards,
Yu


On Thu, 28 Mar 2019 at 14:34, Paul Lam  wrote:

> Hi,
>
> It turns out that under certain circumstances rocksdb statebackend
> mistakenly uses the default filesystem scheme, which is specified to hdfs
> in the new cluster in my case.
>
> I’ve filed a Jira to track this[1].
>
> [1] https://issues.apache.org/jira/browse/FLINK-12042
>
> Best,
> Paul Lam
>
> 在 2019年3月27日,19:06,Paul Lam  写道:
>
> Hi,
>
> I’m using Flink 1.6.4 and recently I ran into a weird issue of rocksdb
> statebackend. A job that runs fine on a YARN cluster keeps failing on
> checkpoint after migrated to a new one
> (with almost everything the same but better machines), and even a clean
> restart doesn’t help.
>
> The root cause is IllegalStateException but with no error message. The
> stack trace shows that when the rocksdb statebackend is doing the async
> part of snapshots (runSnapshot),
> it finds that the local snapshot directory that is created by rocksdb
> earlier (takeSnapshot) does not exist.
>
> I tried to log more informations in RocksDBKeyedStateBackend (see
> attachment), and found that the local snapshot performed as expected and
> the .sst files were written,
> but when the async task accessed the directory, the whole snapshot
> directory was gone.
>
> What could possibly be the cause? Thanks a lot.
>
> Best,
> Paul Lam
>
> 
>
>
>


Re: RocksDB local snapshot sliently disappears and cause checkpoint to fail

2019-03-28 Thread Yu Li
Ok, much clearer now. Thanks.

Best Regards,
Yu


On Thu, 28 Mar 2019 at 15:59, Paul Lam  wrote:

> Hi Yu,
>
> I’ve set `fs.default-scheme` to hdfs, and it's mainly used for simplifying
> checkpoint / savepoint / HA paths.
>
> And I leave the rocksdb local dir empty, so the local snapshot still goes
> to YARN local cache dirs.
>
> Hope that answers your question.
>
> Best,
> Paul Lam
>
> 在 2019年3月28日,15:34,Yu Li  写道:
>
> Hi Paul,
>
> Regarding "mistakenly uses the default filesystem scheme, which is
> specified to hdfs in the new cluster in my case", could you further clarify
> the configuration property and value you're using? Do you mean you're using
> an HDFS directory to store the local snapshot data? Thanks.
> Best Regards,
> Yu
>
>
>
> On Thu, 28 Mar 2019 at 14:34, Paul Lam  wrote:
>
>> Hi,
>>
>> It turns out that under certain circumstances rocksdb statebackend
>> mistakenly uses the default filesystem scheme, which is specified to hdfs
>> in the new cluster in my case.
>>
>> I’ve filed a Jira to track this[1].
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-12042
>>
>> Best,
>> Paul Lam
>>
>> 在 2019年3月27日,19:06,Paul Lam  写道:
>>
>> Hi,
>>
>> I’m using Flink 1.6.4 and recently I ran into a weird issue of rocksdb
>> statebackend. A job that runs fine on a YARN cluster keeps failing on
>> checkpoint after migrated to a new one
>> (with almost everything the same but better machines), and even a clean
>> restart doesn’t help.
>>
>> The root cause is IllegalStateException but with no error message. The
>> stack trace shows that when the rocksdb statebackend is doing the async
>> part of snapshots (runSnapshot),
>> it finds that the local snapshot directory that is created by rocksdb
>> earlier (takeSnapshot) does not exist.
>>
>> I tried to log more informations in RocksDBKeyedStateBackend (see
>> attachment), and found that the local snapshot performed as expected and
>> the .sst files were written,
>> but when the async task accessed the directory, the whole snapshot
>> directory was gone.
>>
>> What could possibly be the cause? Thanks a lot.
>>
>> Best,
>> Paul Lam
>>
>> 
>>
>>
>>


Re: End to End Performance Testing

2019-04-03 Thread Yu Li
A short answer is no, the test platform is not included in the blink branch.

FWIW (to make it clear, I'm not the decision maker so just some of my own
opinions), the test platform is for production usage and includes
simulation of online jobs and probably some confidential stuff, so I don't
think it could be rawly put out. However, there're indeed ongoing plan to
extract test cases, benchmarks, etc. from the platform and contribute to
Flink community, to enrich the Flink test suite. Please watch the news and
discussions in flink mailing list for the progress (smile).

Best Regards,
Yu


On Tue, 2 Apr 2019 at 18:49, WILSON Frank 
wrote:

> Hi,
>
>
>
> I am looking for resources to help me test the performance of my Flink
> pipelines end-to-end. I want to verify that my pipelines meet throughput
> and latency requirements (so for a given number of events per second the
> latency of the output is under so many seconds). I read that Alibaba had
> developed a Blink test platform that offers performance and stability
> testing [1] that seem to be relevant to this problem.
>
>
>
> I read also that Blink has been contributed back to Flink [2] and I was
> wondering if the test platform is also included in the branch [3]?
>
>
>
> Thanks,
>
>
>
>
>
> Frank Wilson
>
>
>
>
>
> [1]
> https://hackernoon.com/from-code-quality-to-integration-optimizing-alibabas-blink-testing-framework-dc9c357319de#190e
>
>
>
> [2]
> https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html
>
>
>
> [3] https://github.com/apache/flink/tree/blink
>
>
>


Re: [ANNOUNCE] Apache Flink 1.8.0 released

2019-04-12 Thread Yu Li
Thanks Aljoscha and all for making this happen, I believe 1.8.0 will be a
great and successful release.

Best Regards,
Yu


On Fri, 12 Apr 2019 at 21:23, Patrick Lucas  wrote:

> The Docker images for 1.8.0 are now available.
>
> Note that like Flink itself, starting with Flink 1.8 we are no longer
> providing images with bundled Hadoop. Users who need an image with bundled
> Hadoop should create their own, starting from the images published on
> Docker Hub and fetching the appropriate shaded Hadoop jar as listed on the
> Flink website's download page.
>
> If there is demand, we can also look into creating a series of "downstream"
> images that include Hadoop.
>
> --
> Patrick
>
> On Thu, Apr 11, 2019 at 2:05 PM Richard Deurwaarder 
> wrote:
>
> > Very nice! Thanks Aljoscha and all contributors!
> >
> > I have one question, will the docker image for 1.8.0 be released soon as
> > well? https://hub.docker.com/_/flink has the versions up to 1.7.2.
> >
> > Regards,
> >
> > Richard
> >
> > On Wed, Apr 10, 2019 at 4:54 PM Rong Rong  wrote:
> >
> >> Congrats! Thanks Aljoscha for being the release manager and all for
> >> making the release possible.
> >>
> >> --
> >> Rong
> >>
> >>
> >> On Wed, Apr 10, 2019 at 4:23 AM Stefan Richter  >
> >> wrote:
> >>
> >>> Congrats and thanks to Aljoscha for managing the release!
> >>>
> >>> Best,
> >>> Stefan
> >>>
> >>> > On 10. Apr 2019, at 13:01, Biao Liu  wrote:
> >>> >
> >>> > Great news! Thanks Aljoscha and all the contributors.
> >>> >
> >>> > Till Rohrmann mailto:trohrm...@apache.org>>
> >>> 于2019年4月10日周三 下午6:11写道:
> >>> > Thanks a lot to Aljoscha for being our release manager and to the
> >>> community making this release possible!
> >>> >
> >>> > Cheers,
> >>> > Till
> >>> >
> >>> > On Wed, Apr 10, 2019 at 12:09 PM Hequn Cheng  >>> > wrote:
> >>> > Thanks a lot for the great release Aljoscha!
> >>> > Also thanks for the work by the whole community. :-)
> >>> >
> >>> > Best, Hequn
> >>> >
> >>> > On Wed, Apr 10, 2019 at 6:03 PM Fabian Hueske  >>> > wrote:
> >>> > Congrats to everyone!
> >>> >
> >>> > Thanks Aljoscha and all contributors.
> >>> >
> >>> > Cheers, Fabian
> >>> >
> >>> > Am Mi., 10. Apr. 2019 um 11:54 Uhr schrieb Congxian Qiu <
> >>> qcx978132...@gmail.com >:
> >>> > Cool!
> >>> >
> >>> > Thanks Aljoscha a lot for being our release manager, and all the
> >>> others who make this release possible.
> >>> >
> >>> > Best, Congxian
> >>> > On Apr 10, 2019, 17:47 +0800, Jark Wu  >>> imj...@gmail.com>>, wrote:
> >>> > > Cheers!
> >>> > >
> >>> > > Thanks Aljoscha and all others who make 1.8.0 possible.
> >>> > >
> >>> > > On Wed, 10 Apr 2019 at 17:33, vino yang  >>> > wrote:
> >>> > >
> >>> > > > Great news!
> >>> > > >
> >>> > > > Thanks Aljoscha for being the release manager and thanks to all
> the
> >>> > > > contributors!
> >>> > > >
> >>> > > > Best,
> >>> > > > Vino
> >>> > > >
> >>> > > > Driesprong, Fokko  于2019年4月10日周三 下午4:54写道:
> >>> > > >
> >>> > > > > Great news! Great effort by the community to make this happen.
> >>> Thanks all!
> >>> > > > >
> >>> > > > > Cheers, Fokko
> >>> > > > >
> >>> > > > > Op wo 10 apr. 2019 om 10:50 schreef Shaoxuan Wang <
> >>> wshaox...@gmail.com >:
> >>> > > > >
> >>> > > > > > Thanks Aljoscha and all others who made contributions to
> FLINK
> >>> 1.8.0.
> >>> > > > > > Looking forward to FLINK 1.9.0.
> >>> > > > > >
> >>> > > > > > Regards,
> >>> > > > > > Shaoxuan
> >>> > > > > >
> >>> > > > > > On Wed, Apr 10, 2019 at 4:31 PM Aljoscha Krettek <
> >>> aljos...@apache.org >
> >>> > > > > > wrote:
> >>> > > > > >
> >>> > > > > > > The Apache Flink community is very happy to announce the
> >>> release of
> >>> > > > > > Apache
> >>> > > > > > > Flink 1.8.0, which is the next major release.
> >>> > > > > > >
> >>> > > > > > > Apache Flink® is an open-source stream processing framework
> >>> for
> >>> > > > > > > distributed, high-performing, always-available, and
> accurate
> >>> data
> >>> > > > > > streaming
> >>> > > > > > > applications.
> >>> > > > > > >
> >>> > > > > > > The release is available for download at:
> >>> > > > > > > https://flink.apache.org/downloads.html <
> >>> https://flink.apache.org/downloads.html>
> >>> > > > > > >
> >>> > > > > > > Please check out the release blog post for an overview of
> the
> >>> > > > > > improvements
> >>> > > > > > > for this bugfix release:
> >>> > > > > > >
> https://flink.apache.org/news/2019/04/09/release-1.8.0.html
> >>> 
> >>> > > > > > >
> >>> > > > > > > The full release notes are available in Jira:
> >>> > > > > > >
> >>> > > > > > >
> >>> > > > > >
> >>> > > > >
> >>>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12344274
> >>> <
> >>>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?proj

Re: [Discuss] Semantics of event time for state TTL

2019-04-15 Thread Yu Li
Thanks for initiating the discussion and wrap-up the conclusion Andrey, and
thanks all for participating.

Just to confirm, that for the out-of-order case, the conclusion is to
update the data and timestamp with the currently-being-processed record w/o
checking whether it's an old data, right? In this way we could save the
comparison-on-each-record cost while may delete data earlier than its
time-to-live, seems a fair trade-off for me but not sure whether this could
satisfy all real-world demand. Anyway I think it's fine to keep-it-as-is
and discuss/improve if any user requirement emerges later.

Will start/continue the development/implementation following the conclusion
if no objections. Thanks.

Best Regards,
Yu


On Mon, 15 Apr 2019 at 21:58, Andrey Zagrebin  wrote:

> Hi everybody,
>
> Thanks a lot for your detailed feedback on this topic.
> It looks like we can already do some preliminary wrap-up for this
> discussion.
>
> As far as I see we have the following trends:
>
> *Last access timestamp: **Event timestamp of currently being processed
> record*
>
> *Current timestamp to check expiration, *two options:
> - *Last emitted watermark*
> *- **Current processing time*
>
> From the implementation perspective, it does not seem to be a big deal to
> make it configurable as we already have processing time provider. Although,
> it looks like our TtlTimeProvider would need two methods from now on:
> getAccessTimestamp and getCurrentTimestamp.
>
> The biggest concern is out-of-orderness problem. In general, from Flink
> side it does not look that we can do a lot about it except putting again a
> caveat into the user docs about it. It depends on the use case whether the
> out-of-orderness can be tolerated or not and whether an additional stream
> ordering operator needs to be put before TTL state access.
>
> I would still consider TTL event time feature to be implemented as we have
> some user requests for it. Any further feedback is appreciated.
>
> Best,
> Andrey
>
> On Tue, Apr 9, 2019 at 5:26 PM aitozi  wrote:
>
>> Hi, Andrey
>>
>> I think TTL state has another scenario: simulating a sliding window with a
>> process function. A user can define a state that stores the data of the
>> latest 1 day and trigger a computation on that state every 5 minutes. It is
>> an operator similar to a sliding window, but I think it is more efficient
>> than the sliding window because it does not have to store redundant data,
>> and the expired data can be deleted automatically.
>>
>> However, with TTL state based on processing time we can only implement the
>> processing-time sliding window. If we can support TTL based on event time,
>> I think we can expand this capability.
>>
>> So in this scenario, event-time access time plus a watermark-based
>> expiration check would be the proper combination.
>>
>> I think it would be flexible if we could add an interface that allows
>> users to customize this.
>>
>> Thanks,
>> Aitozi
>>
>>
>>
>> --
>> Sent from:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>
>


[DISCUSS] Proposal to support disk spilling in HeapKeyedStateBackend

2019-05-24 Thread Yu Li
Hi All,

As mentioned in our talk [1] given at Flink Forward China 2018, we have
improved HeapKeyedStateBackend to support disk spilling and put it into
production here at Alibaba for last year's Singles' Day. Now we're ready to
upstream our work, and the design doc is up for review [2]. Please let us
know your thoughts on the feature; any comment is welcome and appreciated.

We plan to keep the discussion open for at least 72 hours, and will create
an umbrella JIRA and subtasks if there are no objections. Thanks.

Below is a brief description of the motivation for this work, FYI:


HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink. Since
state lives as Java objects on the heap in HeapKeyedStateBackend and
de/serialization only happens during state snapshot and restore, it
outperforms RocksDBKeyedStateBackend when all data can reside in memory.
However, along with this advantage, HeapKeyedStateBackend also has its
shortcomings, and the most painful one is the difficulty of estimating the
maximum heap size (Xmx) to set; we suffer from GC impact once the heap memory
is not enough to hold all state data. There are several (inevitable) causes
for such a scenario, including (but not limited to):

* Memory overhead of the Java object representation (tens of times the
  serialized data size).
* Data flood caused by burst traffic.
* Data accumulation caused by source malfunction.

To resolve this problem, we propose a solution that supports spilling state
data to disk before heap memory is exhausted. We will monitor the heap usage
and choose the coldest data to spill, and reload it automatically when heap
memory is regained after data removal or TTL expiration.
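
For context, this is roughly how a job selects the heap-based backend today
(a minimal sketch only; the checkpoint path is just a placeholder):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Keyed state lives as objects on the JVM heap (HeapKeyedStateBackend);
    // snapshots are written to the configured file system path.
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));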

[1] https://files.alicdn.com/tpsservice/1df9ccb8a7b6b2782a558d3c32d40c19.pdf
[2]
https://docs.google.com/document/d/1rtWQjIQ-tYWt0lTkZYdqTM6gQUleV8AXrfTOyWUZMf4/edit?usp=sharing

Best Regards,
Yu


Re: [DISCUSS] Deprecate previous Python APIs

2019-06-14 Thread Yu Li
+1 on removing, plus an explicit NOTE thread to prevent this from being
overlooked due to the current thread title (deprecation).

Best Regards,
Yu


On Fri, 14 Jun 2019 at 18:09, Stephan Ewen  wrote:

> Okay, so we seem to have consensus for at least deprecating them, with a
> suggestion to even directly remove them.
>
> A previous survey also brought no users of that Python API to light [1],
> so I am inclined to go with removing them.
> Typically, deprecation is the way to go, but we could make an exception
> and expedite things here.
>
> [1]
> https://lists.apache.org/thread.html/348366080d6b87bf390efb98e5bf268620ab04a0451f8459e2f466cd@%3Cdev.flink.apache.org%3E
>
>
> On Wed, Jun 12, 2019 at 2:37 PM Chesnay Schepler 
> wrote:
>
>> I would just remove them. As you said, they are very limited as to what
>> features they support, and haven't been under active development for
>> several releases.
>>
>> Existing users (if there even are any) could continue to use an older
>> version against newer releases. It is slightly more involved than for,
>> say, flink-ml, as you also have to copy the start scripts (or figure out
>> how to use the jars yourself), but it is still feasible and can be
>> documented in the release notes.
>>
>> On 11/06/2019 15:30, Stephan Ewen wrote:
>> > Hi all!
>> >
>> > I would suggest deprecating the existing Python APIs for the DataSet and
>> > DataStream APIs with the 1.9 release.
>> >
>> > Background is that there is a new Python API under development.
>> > The new Python API is initially against the Table API. Flink 1.9 will
>> > support Table API programs without UDFs; 1.10 is planned to support
>> > UDFs.
>> > Future versions would support also the DataStream API.
>> >
>> > In the long term, Flink should have one Python API for DataStream and
>> Table
>> > APIs. We should not maintain multiple different implementations and
>> confuse
>> > users that way.
>> > Given that the existing Python APIs are a bit limited and not under
>> > active development, I would suggest deprecating them in favor of the new
>> > API.
>> >
>> > Best,
>> > Stephan
>> >
>>
>>


Re: Queryable state and TTL

2019-07-03 Thread Yu Li
Thanks for the ping Andrey.

For me the general answer is yes, but TBH it will probably not be added in
the foreseeable future due to lack of committer bandwidth (not only
QueryableState with TTL but everything about the QueryableState module), as
Aljoscha pointed out in another thread [1].

Although we have seen emerging requirements and proposals on QueryableState
recently, prioritizing is important for each open source project. And
personally I think it would help if we could gather more of, and clearly
describe, the other-than-debugging use cases of QueryableState in production
[2]. Could you share your case with us and why QueryableState is necessary
rather than querying the data from the sink, @Avi? Thanks.

[1] https://s.apache.org/MaOl
[2] https://s.apache.org/hJDA

Best Regards,
Yu


On Wed, 3 Jul 2019 at 23:13, Andrey Zagrebin  wrote:

> Hi Avi,
>
> It is on the roadmap, but I am not aware of plans by any contributor to
> work on it for the next releases.
> I think the community will first work on event time support for TTL.
> I will loop Yu in; maybe he has some plans to work on TTL for the
> queryable state.
>
> Best,
> Andrey
>
> On Wed, Jul 3, 2019 at 3:17 PM Avi Levi  wrote:
>
>> Hi,
>> Adding queryable state to state with TTL is not supported in 1.8.0
>> (it throws java.lang.IllegalArgumentException: Queryable state is currently
>> not supported with TTL).
>>
>> I saw in a previous mailing thread that it is on the roadmap. Is it still
>> on the roadmap?
>>
>> * There is a workaround, which is using timers to clear the state, but in
>> our case it means firing billions of timers on a daily basis all at the
>> same time, which does not seem very efficient and might cause some
>> resource issues
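>>
>> (For illustration, a rough sketch of that timer-based workaround -- the
>> class name, the state name and the one-day TTL are just placeholders:)
>>
>>     import org.apache.flink.api.common.state.ValueState;
>>     import org.apache.flink.api.common.state.ValueStateDescriptor;
>>     import org.apache.flink.configuration.Configuration;
>>     import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
>>     import org.apache.flink.util.Collector;
>>
>>     public class TimerTtlFunction extends KeyedProcessFunction<String, String, String> {
>>         private static final long TTL_MS = 24 * 60 * 60 * 1000L; // one day
>>         private transient ValueState<String> lastValue;
>>
>>         @Override
>>         public void open(Configuration parameters) {
>>             lastValue = getRuntimeContext().getState(
>>                 new ValueStateDescriptor<>("lastValue", String.class));
>>         }
>>
>>         @Override
>>         public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
>>             lastValue.update(value);
>>             // one timer per element and key -- exactly the "billions of timers" concern
>>             ctx.timerService().registerProcessingTimeTimer(
>>                 ctx.timerService().currentProcessingTime() + TTL_MS);
>>             out.collect(value);
>>         }
>>
>>         @Override
>>         public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
>>             lastValue.clear(); // the state stays queryable until the timer fires
>>         }
>>     }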
>>
>> Cheers
>> Avi
>>
>>
>>


Re: [ANNOUNCE] Rong Rong becomes a Flink committer

2019-07-11 Thread Yu Li
Congratulations Rong!

Best Regards,
Yu


On Thu, 11 Jul 2019 at 22:54, zhijiang  wrote:

> Congratulations Rong!
>
> Best,
> Zhijiang
>
> --
> From:Kurt Young 
> Send Time:2019年7月11日(星期四) 22:54
> To:Kostas Kloudas 
> Cc:Jark Wu ; Fabian Hueske ; dev <
> d...@flink.apache.org>; user 
> Subject:Re: [ANNOUNCE] Rong Rong becomes a Flink committer
>
> Congratulations Rong!
>
> Best,
> Kurt
>
>
> On Thu, Jul 11, 2019 at 10:53 PM Kostas Kloudas 
> wrote:
> Congratulations Rong!
>
> On Thu, Jul 11, 2019 at 4:40 PM Jark Wu  wrote:
> Congratulations Rong Rong!
> Welcome on board!
>
> On Thu, 11 Jul 2019 at 22:25, Fabian Hueske  wrote:
> Hi everyone,
>
> I'm very happy to announce that Rong Rong accepted the offer of the Flink
> PMC to become a committer of the Flink project.
>
> Rong has been contributing to Flink for many years, mainly working on SQL
> and Yarn security features. He's also frequently helping out on the
> user@f.a.o mailing lists.
>
> Congratulations Rong!
>
> Best, Fabian
> (on behalf of the Flink PMC)
>
>
>


Re: Bandwidth throttling of checkpoints uploading to s3

2019-07-12 Thread Yu Li
Hi Pavel,

Currently there's no such throttling functionality in Flink and I think
it's a valid requirement. But before opening a JIRA for this, please allow
me to ask for more details to better understand your scenario:
1. What kind of state backend are you using? Since you observe high load on
the network, I guess the state is large and you are using the RocksDB backend?
2. If you're using the RocksDB backend, have you configured it to use
incremental checkpoints? To be more specific, have you set the
"state.backend.incremental" property to true? (by default it's false)
3. If you're using the RocksDB backend with full checkpoints, what would the
incremental size of a checkpoint be (within one checkpoint interval)?
4. What's the max bandwidth you'd like to throttle to for S3?

I ask because if you're using RocksDB with full checkpoints and the
incremental checkpoint size is small enough not to exceed your expected
throttle for S3, you could directly try incremental checkpoints to resolve
the current problem.
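
For reference, a minimal sketch of enabling incremental checkpoints in code
(the bucket path and the interval are placeholders; setting
"state.backend.incremental: true" in flink-conf.yaml achieves the same):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IncrementalCheckpointJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // The second constructor argument turns on incremental checkpoints,
            // so only new/changed SST files are uploaded to S3 each time.
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));
            env.enableCheckpointing(60_000); // checkpoint every 60s
            // ... define sources/operators/sinks and call env.execute() as usual
        }
    }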

Thanks.

Best Regards,
Yu


On Fri, 12 Jul 2019 at 20:39, Pavel Potseluev 
wrote:

> Hello!
>
> We use Flink with periodic checkpointing to an S3 file system, and when
> Flink uploads a checkpoint to S3 it puts a high load on the network. We have
> found an AWS CLI S3 configuration option called max_bandwidth
> (https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-bandwidth)
> which allows limiting the rate in bytes per second. Is there a way to have
> the same functionality with Flink?
>
> --
> Best regards,
> Pavel Potseluev
> Software developer, Yandex.Classifieds LLC
>
>


Re: Bandwidth throttling of checkpoints uploading to s3

2019-07-12 Thread Yu Li
Thanks for the information Pavel, good to know.

And I've created FLINK-13251
<https://issues.apache.org/jira/browse/FLINK-13251> to introduce the
checkpoint bandwidth throttling feature, FYI.

Best Regards,
Yu


On Sat, 13 Jul 2019 at 00:11, Павел Поцелуев 
wrote:

>
>    1. We use FsStateBackend and the state snapshot size is about 700 MB.
>    2. We are thinking about migrating to RocksDBStateBackend and turning
>    on incremental checkpoints.
>    3. I think the incremental size would be small in our current use case,
>    so incremental checkpoints could solve the problem.
>    4. I think it is about 50 Mbit/s.
>
> Thanks.
>
> 12.07.2019, 17:27, "Yu Li" :
>
> Hi Pavel,
>
> Currently there's no such throttling functionality in Flink and I think
> it's a valid requirement. But before opening a JIRA for this, please allow
> me to ask for more details to better understand your scenario:
> 1. What kind of state backend are you using? Since you observe high load on
> the network, I guess the state is large and you are using the RocksDB
> backend?
> 2. If you're using the RocksDB backend, have you configured it to use
> incremental checkpoints? To be more specific, have you set the
> "state.backend.incremental" property to true? (by default it's false)
> 3. If you're using the RocksDB backend with full checkpoints, what would the
> incremental size of a checkpoint be (within one checkpoint interval)?
> 4. What's the max bandwidth you'd like to throttle to for S3?
>
> I ask because if you're using RocksDB with full checkpoints and the
> incremental checkpoint size is small enough not to exceed your expected
> throttle for S3, you could directly try incremental checkpoints to resolve
> the current problem.
>
> Thanks.
>
> Best Regards,
> Yu
>
> On Fri, 12 Jul 2019 at 20:39, Pavel Potseluev 
> wrote:
>
> Hello!
>
> We use Flink with periodic checkpointing to an S3 file system, and when
> Flink uploads a checkpoint to S3 it puts a high load on the network. We have
> found an AWS CLI S3 configuration option called max_bandwidth
> <https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-bandwidth>
> which allows limiting the rate in bytes per second. Is there a way to have
> the same functionality with Flink?
>
> --
> Best regards,
> Pavel Potseluev
> Software developer, Yandex.Classifieds LLC
>
>
>
>
> --
> Best regards,
> Pavel Potseluev
> Software developer, Yandex.Classifieds LLC
>
>


Re: Memory constraints running Flink on Kubernetes

2019-07-29 Thread Yu Li
For the memory usage of RocksDB, there's already some discussion in
FLINK-7289 and a good suggestion
from Mike to use the WriteBufferManager to limit the total memory usage,
FYI.

We will drive to make the memory management of the state backends more
"hands free" in a later release (probably 1.10), so please watch the
release plan and/or the weekly community update [1] threads.

[1] https://s.apache.org/ix7iv

Best Regards,
Yu


On Thu, 25 Jul 2019 at 15:12, Yun Tang  wrote:

> Hi
>
> It's definitely not easy to calculate the exact memory usage of RocksDB,
> but the formula "block-cache-memory + column-family-number *
> write-buffer-memory * write-buffer-number + index&filter memory" should
> give a sufficiently good estimate.
> The column-family number equals the number of your states, i.e. the
> declared state descriptors in one operator plus potentially one window
> state (if you're using windows).
> The default write-buffer number is at most 2 for each column family, and
> the default write-buffer memory size is 4MB. Note that if you configure
> the RocksDB options yourself, the memory usage will differ from these
> default values.
> The last part, the index & filter memory, is not easy to estimate, but in
> my experience it does not occupy too much memory unless you have many open
> files.
>
> Last but not least, Flink enables slot sharing by default, so even if you
> have only one slot per TaskManager, there might be many RocksDB instances
> within that TM because many operators with keyed state are running.
>
> Apart from the theoretical analysis, you'd better enable the RocksDB native
> metrics or track the memory usage of the pods through Prometheus on k8s.
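>
> As an illustration only (the concrete sizes below are arbitrary placeholders,
> not recommendations), these knobs can be tuned through an OptionsFactory
> registered on the RocksDB backend:
>
>     import org.apache.flink.contrib.streaming.state.OptionsFactory;
>     import org.rocksdb.ColumnFamilyOptions;
>     import org.rocksdb.DBOptions;
>
>     public class BoundedMemoryOptionsFactory implements OptionsFactory {
>         @Override
>         public DBOptions createDBOptions(DBOptions currentOptions) {
>             // fewer open files also bounds the index & filter blocks held in memory
>             return currentOptions.setMaxOpenFiles(256);
>         }
>
>         @Override
>         public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
>             return currentOptions
>                 .setWriteBufferSize(32 * 1024 * 1024) // 32MB per memtable
>                 .setMaxWriteBufferNumber(2);          // at most 2 memtables per column family
>         }
>     }
>
>     // applied via: rocksDbStateBackend.setOptions(new BoundedMemoryOptionsFactory());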
>
> Best
> Yun Tang
> --
> *From:* wvl 
> *Sent:* Thursday, July 25, 2019 17:50
> *To:* Yang Wang 
> *Cc:* Yun Tang ; Xintong Song ;
> user 
> *Subject:* Re: Memory constrains running Flink on Kubernetes
>
> Thanks for all the answers so far.
>
> Especially clarifying was that RocksDB memory usage isn't accounted for in
> the Flink memory metrics. It's clear that we need to experiment to
> understand its memory usage, and knowing that we should be looking at the
> container memory usage minus all the JVM-managed memory helps.
>
> In the meanwhile, we've set MaxMetaspaceSize to 200M based on our metrics.
> Sadly the resulting OOM does not result in a better-behaved job, because it
> would seem that the (TaskManager) JVM itself is not restarted - which makes
> sense in a multi-job environment.
> So we're looking into ways to simply prevent this metaspace growth (job
> library jars in /lib on the TM).
>
> Going back to RocksDB, the given formula "block-cache-memory +
> column-family-number * write-buffer-memory * write-buffer-number +
> index&filter memory." isn't completely clear to me.
>
> Block Cache: "Out of box, RocksDB will use LRU-based block cache
> implementation with 8MB capacity"
> Index & Filter Cache: "By default index and filter blocks are cached
> outside of block cache, and users won't be able to control how much memory
> should be use to cache these blocks, other than setting max_open_files.".
> The default settings don't set max_open_files and the RocksDB default
> seems to be 1000 (
> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/leveldb_options.h#L89)
> .. not completely sure about this.
> Write Buffer Memory: "The default is 64 MB. You need to budget for 2 x
> your worst case memory use."
>
> May I presume a unique ValueStateDescriptor equals a Column Family?
> If so, say I have 10 of those.
> 8MB + (10 * 64 * 2) + $Index&FilterBlocks
>
> So is that correct, and how would one calculate $Index&FilterBlocks? The
> docs suggest a relationship between max_open_files (1000) and the number
> of index/filter blocks that can be cached, but is this a 1-to-1
> relationship? Anyway, this concept of blocks is very unclear to me.
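>
> (Working that example through with the numbers above: 8MB block cache + 10
> column families * 2 write buffers * 64MB per buffer = 8 + 1280 = 1288MB,
> plus whatever the index & filter blocks take.)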
>
> > Have you ever set the memory limit of your taskmanager pod when
> launching it in k8s?
>
> Definitely. We settled on 8GB pods with taskmanager.heap.size: 5000m and 1
> slot and were looking into downsizing a bit to improve our pod to VM ratio.
>
> On Wed, Jul 24, 2019 at 11:07 AM Yang Wang  wrote:
>
> Hi,
>
>
> The memory of a Flink TaskManager k8s pod includes the following parts:
>
>- jvm heap, limited by -Xmx
>- jvm non-heap, limited by -XX:MaxMetaspaceSize
>- jvm direct memory, limited by -XX:MaxDirectMemorySize
>- native memory, used by rocksdb, just as Yun Tang said, could be
>limited by rocksdb configurations
>
>
> So if your k8s pod is terminated by OOMKilled, the cause may be the
> non-heap memory or native memory. I suggest you add an environment
> FLINK_ENV_JAVA_OPTS_TM="-XX:MaxMetaspaceSize

Re: Graceful Task Manager Termination and Replacement

2019-07-29 Thread Yu Li
Belated but FWIW, besides the region failover and best-effort failover work,
I believe stop-with-checkpoint as proposed in FLINK-12619 [1] and FLIP-45 [2]
could also help here, FYI.

W.r.t. k8s, there have also been some offline discussions about supporting
local recovery with persistent volumes even when tasks are assigned to other
TMs during job failover.

[1] https://issues.apache.org/jira/browse/FLINK-12619
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

Best Regards,
Yu


On Wed, 24 Jul 2019 at 17:00, Aaron Levin  wrote:

> I was on vacation but wanted to thank Biao for summarizing the current
> state! Thanks!
>
> On Mon, Jul 15, 2019 at 2:00 AM Biao Liu  wrote:
>
>> Hi Aaron,
>>
>> From my understanding, you want to shut down a Task Manager without
>> restarting the job that has tasks running on it?
>>
>> Based on the current implementation, if a Task Manager goes down, the
>> tasks on it are treated as failed. The behavior on task failure is defined
>> via the `FailoverStrategy`, which is `RestartAllStrategy` by default.
>> That's the reason why the whole job restarts when a Task Manager is gone.
>> As Paul said, you could try the "region restart failover strategy" when
>> 1.9 is released. It might be helpful, but it depends on your job
>> topology.
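>>
>> (A minimal sketch of opting into it, assuming the configuration key as
>> documented for 1.9 -- in flink-conf.yaml:)
>>
>>     jobmanager.execution.failover-strategy: region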
>>
>> The deeper reason for this issue is Flink's consistency semantics,
>> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics, so
>> there is not much choice of `FailoverStrategy`.
>> It might be improved in the future. There are some discussions on the
>> mailing list about providing weaker consistency semantics to improve the
>> `FailoverStrategy`. We are pushing this improvement forward, and I hope it
>> can be included in 1.10.
>>
>> Regarding your question, I guess the answer is no for now. More frequent
>> checkpoints or a manually triggered savepoint might help by enabling a
>> quicker recovery.
>>
>>
>> Paul Lam  于2019年7月12日周五 上午10:25写道:
>>
>>> Hi,
>>>
>>> Maybe the region restart strategy can help. It restarts only the minimum
>>> required set of tasks. Note that it's recommended to use it only after
>>> the 1.9 release, see [1], unless you're running a stateless job.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>>
>>> Best,
>>> Paul Lam
>>>
>>> 在 2019年7月12日,03:38,Aaron Levin  写道:
>>>
>>> Hello,
>>>
>>> Is there a way to gracefully terminate a Task Manager beyond just
>>> killing it (this seems to be what `./taskmanager.sh stop` does)?
>>> Specifically I'm interested in a way to replace a Task Manager that has
>>> currently-running tasks. It would be great if it was possible to terminate
>>> a Task Manager without restarting the job, though I'm not sure if this is
>>> possible.
>>>
>>> Context: at my work we regularly cycle our hosts for maintenance and
>>> security. Each time we do this we stop the task manager running on the host
>>> being cycled. This causes the entire job to restart, resulting in downtime
>>> for the job. I'd love to decrease this downtime if at all possible.
>>>
>>> Thanks! Any insight is appreciated!
>>>
>>> Best,
>>>
>>> Aaron Levin
>>>
>>>
>>>


[ANNOUNCE] Call for Presentations for ApacheCon Asia 2022 streaming track

2022-05-18 Thread Yu Li
Hi everyone,

ApacheCon Asia [1] will feature the Streaming track for the second year.
Please don't hesitate to submit your proposal if there is an interesting
project or Flink experience you would like to share with us!

The conference will be online (virtual) and the talks will be pre-recorded.
The deadline for proposal submission is the end of this month (May 31st).

See you all there :)

Best Regards,
Yu

[1] https://apachecon.com/acasia2022/cfp.html


[ANNOUNCE] Flink Table Store Joins Apache Incubator as Apache Paimon(incubating)

2023-03-27 Thread Yu Li
Dear Flinkers,


As you may have noticed, we are pleased to announce that Flink Table
Store has joined the Apache Incubator as a separate project called
Apache Paimon(incubating) [1] [2] [3]. The new project still aims at
building a streaming data lake platform for high-speed data ingestion,
change data tracking and efficient real-time analytics, with the
vision of supporting a larger ecosystem and establishing a vibrant and
neutral open source community.


We would like to thank everyone for their great support and efforts
for the Flink Table Store project, and warmly welcome everyone to join
the development and activities of the new project. Apache Flink will
continue to be one of the first-class citizens supported by Paimon,
and we believe that the Flink and Paimon communities will maintain
close cooperation.


亲爱的Flinkers,


正如您可能已经注意到的,我们很高兴地宣布,Flink Table Store 已经正式加入 Apache
孵化器独立孵化 [1] [2] [3]。新项目的名字是
Apache 
Paimon(incubating),仍致力于打造一个支持高速数据摄入、流式数据订阅和高效实时分析的新一代流式湖仓平台。此外,新项目将支持更加丰富的生态,并建立一个充满活力和中立的开源社区。


在这里我们要感谢大家对 Flink Table Store 项目的大力支持和投入,并热烈欢迎大家加入新项目的开发和社区活动。Apache Flink
将继续作为 Paimon 支持的主力计算引擎之一,我们也相信 Flink 和 Paimon 社区将继续保持密切合作。


Best Regards,

Yu (on behalf of the Apache Flink PMC and Apache Paimon PPMC)


致礼,

李钰(谨代表 Apache Flink PMC 和 Apache Paimon PPMC)


[1] https://paimon.apache.org/

[2] https://github.com/apache/incubator-paimon

[3] https://cwiki.apache.org/confluence/display/INCUBATOR/PaimonProposal


Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Yu Li
Congrats and thanks all for the efforts!

Best Regards,
Yu

On Tue, 19 Mar 2024 at 11:51, gongzhongqiang  wrote:
>
> Congrats! Thanks to everyone involved!
>
> Best,
> Zhongqiang Gong
>
> Lincoln Lee  于2024年3月18日周一 16:27写道:
>>
>> The Apache Flink community is very happy to announce the release of Apache
>> Flink 1.19.0, which is the first release of the Apache Flink 1.19 series.
>>
>> Apache Flink® is an open-source stream processing framework for
>> distributed, high-performing, always-available, and accurate data streaming
>> applications.
>>
>> The release is available for download at:
>> https://flink.apache.org/downloads.html
>>
>> Please check out the release blog post for an overview of the improvements
>> in this release:
>> https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19/
>>
>> The full release notes are available in Jira:
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
>>
>> We would like to thank all contributors of the Apache Flink community who
>> made this release possible!
>>
>>
>> Best,
>> Yun, Jing, Martijn and Lincoln


Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-03-28 Thread Yu Li
CC the Flink user and dev mailing list.

Paimon originated within the Flink community, initially known as Flink
Table Store, and all our incubating mentors are members of the Flink
Project Management Committee. I am confident that the bonds of
enduring friendship and close collaboration will continue to unite the
two communities.

And congratulations all!

Best Regards,
Yu

On Wed, 27 Mar 2024 at 20:35, Guojun Li  wrote:
>
> Congratulations!
>
> Best,
> Guojun
>
> On Wed, Mar 27, 2024 at 5:24 PM wulin  wrote:
>
> > Congratulations~
> >
> > > 2024年3月27日 15:54,王刚  写道:
> > >
> > > Congratulations~
> > >
> > >> 2024年3月26日 10:25,Jingsong Li  写道:
> > >>
> > >> Hi Paimon community,
> > >>
> > >> I’m glad to announce that the ASF board has approved a resolution to
> > >> graduate Paimon into a full Top Level Project. Thanks to everyone for
> > >> your help to get to this point.
> > >>
> > >> I just created an issue to track the things we need to modify [2];
> > >> please comment on it if you feel something is missing. You can refer
> > >> to the Apache documentation [1] too.
> > >>
> > >> Also, we have already completed the GitHub repo migration [3], so
> > >> please update your local git repo to track the new repo [4].
> > >>
> > >> You can run the following command to complete the remote repo tracking
> > >> migration.
> > >>
> > >> git remote set-url origin https://github.com/apache/paimon.git
> > >>
> > >> If your remote has a different name, please replace 'origin' with
> > >> that name.
> > >>
> > >> Please join me in celebrating!
> > >>
> > >> [1]
> > https://incubator.apache.org/guides/transferring.html#life_after_graduation
> > >> [2] https://github.com/apache/paimon/issues/3091
> > >> [3] https://issues.apache.org/jira/browse/INFRA-25630
> > >> [4] https://github.com/apache/paimon
> > >>
> > >> Best,
> > >> Jingsong Lee
> >
> >