There are two things that you may want to try:
- Set 'state.backend.rocksdb.memory.managed' to true. That tells the
RocksDBStateBackend to try to keep its memory consumption within the
managed memory size.
- Increase JVM Overhead Memory Size (via
'taskmanager.memory.jvm-overhead.[min|max|fractio
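As a sketch, those two settings might look like this in flink-conf.yaml (the sizes are illustrative, not recommendations):

```
state.backend.rocksdb.memory.managed: true
taskmanager.memory.jvm-overhead.min: 256m
taskmanager.memory.jvm-overhead.max: 1g
taskmanager.memory.jvm-overhead.fraction: 0.15
```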
Hi Xintong,
Thanks very much for the response. Let me check out the new UI on flink
1.12.
The reason I asked this question is that our Flink cluster on k8s shows
container_working_set_bytes (used by the OOM killer) to be > 3 GB. I assume
that the used (heap, non-heap) values on the UI are correct. If
Thanks for the discussion, JING ZHANG. I like the first proposal since it
is simple and consistent with the DataStream API. It would be helpful to add
more docs about the special late case in WindowAggregate. Also, I look
forward to the more flexible emit strategies later.
Jark Wu wrote on Friday, July 2, 2021 at 10:33 AM:
> Sorry,
Hi Xiuming,
+1 on your idea.
BTW, Flink also provides a debug tool to track the latency of records
travelling through the system [1]. But you should note the following issues
if you enable latency tracking.
(1) It's a tool for debugging purposes because enabling latency metrics can
significantly impa
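As a sketch, latency tracking is typically enabled like this in flink-conf.yaml (the interval value here is illustrative):

```
metrics.latency.interval: 60000        # ms between latency markers; 0 disables tracking
metrics.latency.granularity: operator  # default granularity
```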
Sorry, I made a typo above. I mean I prefer proposal (1), which
only needs `table.exec.emit.allow-lateness` to be set to handle late events.
`table.exec.emit.late-fire.delay` can be optional, with 0s as the default.
`table.exec.state.ttl` will not affect window state anymore, so window state
is still cle
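For context, proposal (1) would look roughly like this in the SQL client (a sketch of the experimental options being discussed; the values are illustrative):

```
-- Sketch: experimental emit options under proposal (1); values illustrative.
SET 'table.exec.emit.allow-lateness' = '1 h';
SET 'table.exec.emit.late-fire.delay' = '0 s';  -- optional, 0s by default
```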
Hi Sudharsan,
The non-heap max is decided automatically by the JVM and is not controlled by
Flink. Moreover, it doesn't mean Flink will use up to that size of non-heap
memory.
These metrics are fetched directly from the JVM and do not correspond well with
Flink's memory configurations, which very often l
Another side question: shall we add a metric to cover the complete restarting
time (phase 1 + phase 2)? The current metric jm.restartingTime only covers
phase 1. Thanks!
Best
Lu
On Thu, Jul 1, 2021 at 12:09 PM Lu Niu wrote:
> Thanks Till and Yang for the help! Also thanks Till for a quick fix!
>
> I did
Thanks Till and Yang for the help! Also thanks Till for a quick fix!
I did another test yesterday. In this test, I intentionally threw an exception
from the source operator:
```
if (runtimeContext.getIndexOfThisSubtask() == 1
&& errorFrenquecyInMin > 0
&& System.currentTimeMillis() - last
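For what it's worth, the truncated check above probably looks something like the sketch below. The names (`errorFrequencyInMin`, `lastErrorTimeMs`, `shouldFail`) and the once-per-interval throttling logic are my assumptions, not the actual test code:

```java
// Hedged sketch of the error-injection check in the truncated snippet above.
// errorFrequencyInMin and lastErrorTimeMs are assumed names, not the original code.
public class ErrorInjector {
    private final long errorFrequencyInMin;
    private long lastErrorTimeMs = 0L;

    public ErrorInjector(long errorFrequencyInMin) {
        this.errorFrequencyInMin = errorFrequencyInMin;
    }

    /** Returns true when subtask 1 should throw, at most once per interval. */
    public boolean shouldFail(int subtaskIndex, long nowMs) {
        if (subtaskIndex == 1
                && errorFrequencyInMin > 0
                && nowMs - lastErrorTimeMs > errorFrequencyInMin * 60_000L) {
            lastErrorTimeMs = nowMs;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ErrorInjector injector = new ErrorInjector(5);
        System.out.println(injector.shouldFail(1, 10 * 60_000L)); // interval elapsed -> true
    }
}
```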
Hi,
On my Flink setup, I have taskmanager.memory.process.size set to 2536M. I
expect all the memory components shown on the UI to add up to this number.
However, I don't see this.
I have Flink managed memory: 811 MB
JVM heap max: 886 MB
JVM non-heap max: 744 MB
Direct memory: 204 MB
This adds
Hello everyone,
I'm testing a custom Kubernetes operator that should fulfill some specific
requirements I have for Flink. I know of this WIP project:
https://github.com/wangyang0918/flink-native-k8s-operator
I can see that it uses some classes that aren't publicly documented, and I
believe it
Hi Shilpa,
I've confirmed that "recovered" jobs are not compatible between minor
versions of Flink (e.g., between 1.12 and 1.13). I believe the issue is
that the session cluster was upgraded to 1.13 without first stopping the
jobs running on it.
If this is the case, the workaround is to stop each
Hi Stephan,
Thanks for the detailed explanation! This really helps understand all this
better. Appreciate your help!
Regards,
Sonam
From: Stephan Ewen
Sent: Wednesday, June 30, 2021 3:56:22 AM
To: Sonam Mandal ; user@flink.apache.org
Cc: matth...@ververica.com
A quick addition: I think with FLINK-23202 it should now also be possible
to improve the heartbeat mechanism in the general case. We can leverage the
unreachability exception thrown if a remote target is no longer reachable
to mark a heartbeat target as no longer reachable [1]. This can then be
co
Thanks Jing for bringing up this topic.
The emit strategy configs are annotated as Experimental and are not public in
the documentation.
However, I see this is a very useful feature which many users are looking
for.
I have posted these configs for many questions like "how to handle late
events in SQL".
T
Hi Zhu,
Does this mean our upgrades are going to fail and the jobs are not backward
compatible?
I did verify the job itself is built using 1.13.0.
Is there a workaround for this?
Thanks,
Shilpa
On Wed, Jun 30, 2021 at 11:14 PM Zhu Zhu wrote:
> Hi Shilpa,
>
> JobType was introduced in 1.13. So
Since you are deploying Flink workloads on Yarn, the Flink ResourceManager
should get the container
completion event after the heartbeat of Yarn NM->Yarn RM->Flink RM, which
is 8 seconds by default.
And the Flink ResourceManager will release the dead TaskManager container once
it receives the completion e
Thanks for the pointer. Let me try upgrading Flink.
On Thu, Jul 1, 2021 at 5:29 PM Yun Tang wrote:
> Hi Tao,
>
> I ran your program with Flink 1.12.1 and found that the problem you described
> really exists. And things go back to normal when switching to the Flink 1.12.2
> version.
>
> After digging into the
Hi Tao,
I ran your program with Flink 1.12.1 and found that the problem you described really
exists. And things go back to normal when switching to the Flink 1.12.2 version.
After digging into the root cause, I found this is caused by a bug that has since
been fixed [1]: if a legacy source task fails outside of the legacy thread,
The analysis of Gen is correct. Flink currently uses its heartbeat as the
primary means to detect dead TaskManagers. This means that Flink will take
at least `heartbeat.timeout` time before the system recovers. Even if the
cancellation happens fast (e.g. by having configured a low
akka.ask.timeout)
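For reference, the heartbeat behavior is governed by the following settings in flink-conf.yaml (the defaults shown are to the best of my knowledge):

```
heartbeat.interval: 10000   # ms between heartbeats
heartbeat.timeout: 50000    # ms after which a TaskManager is considered lost
```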
Hi Tao,
it looks like it should work, considering that you have a sleep of 1 second
before each emission. I'm going to add Roman to this thread. Maybe he
sees something obvious that I'm missing.
Could you run the job with the log level set to debug and provide the logs
once more? Additionally,
> 1. how to specify the number of TaskManagers?
> In batch mode, I tried to use (Max Parallelism / (cores per tm)), but it
> does not work. The number of TaskManagers is much larger than (Max
> Parallelism / cores per tm).
It's not the cores per TM, but the number of slots per TM. Please refer
to ta
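To illustrate with my own arithmetic (the numbers are illustrative): the TaskManager count is roughly the job's max parallelism divided by `taskmanager.numberOfTaskSlots`, rounded up:

```java
// Hedged sketch: TaskManager count is driven by slots per TM, not cores per TM.
public class TmCount {
    static int taskManagers(int maxParallelism, int slotsPerTm) {
        return (maxParallelism + slotsPerTm - 1) / slotsPerTm; // ceiling division
    }

    public static void main(String[] args) {
        // e.g. max parallelism 128 with 4 slots per TM -> 32 TaskManagers
        System.out.println(taskManagers(128, 4)); // prints 32
    }
}
```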