Re: Flink upgrade 1.14.6 to 1.15.4 -> StackOverflowError

2023-09-25 Thread Alexey Trenikhun
: Thursday, September 14, 2023 6:42:29 PM To: Flink User Mail List Subject: Re: Flink upgrade 1.14.6 to 1.15.4 -> StackOverflowError Hi, Alexey Have you tried other versions like 1.15.3 or 1.16.1? Best, Ron Alexey Trenikhun mailto:yen...@msn.com>> 于2023年9月15日周五 04:24写道: Hello, After upgrad

Re: Flink 1.15 KubernetesHaServicesFactory

2023-09-18 Thread Alexey Trenikhun
Thank you for information Alexey From: Chen Zhanghao Sent: Friday, September 15, 2023 9:16:00 PM To: Alexey Trenikhun ; Flink User Mail List Subject: Re: Flink 1.15 KubernetesHaServicesFactory Hi Alexey, This is expected as Flink 1.15 introduced a new multiple

Flink 1.15 KubernetesHaServicesFactory

2023-09-15 Thread Alexey Trenikhun
Hello, After upgrading Flink to 1.15.4 from 1.14.6, I've noticed that there are no "{clusterId}-{componentName}-leader" config maps anymore, but instead there are "{clusterId}-cluster-config-map" and "{clusterId}--cluster-config-map". Is it expected ? Thanks, Alexey

Flink upgrade 1.14.6 to 1.15.4 -> StackOverflowError

2023-09-14 Thread Alexey Trenikhun
Hello, After upgrading Flink 1.14.6 to 1.15.4 (Kubernetes, Application mode) job started to failed due to StackOverflowError. The application uses okhttp3 4.9.2, but it is shaded as well as okio. Any ideas what is causing the problem? "java.lang.StackOverflowError: null   at java.base/java.s

Flink at aks 1.25.x - OOMKilled

2023-06-11 Thread Alexey Trenikhun
Hello, We are using Flink 1.13.6 with Azure Kubernetes Service (AKS). Job worked fine for many months, but right after recent upgrade of AKS to 1.25.5, taskmanager started to get OOM killed every day. I suspected that this is maybe because AKS 1.25.x uses cgroups v2, which is not supported by JD

KafkaSource watermarkLag metrics per topic per partition

2022-09-08 Thread Alexey Trenikhun
Hello, Is there way to configure Flink to expose watermarLag metric per topic per partition? I think it could be useful to detect data skew between partitions Thanks, Alexey

Re: Slow Tests in Flink 1.15

2022-09-07 Thread Alexey Trenikhun
ecord deserializer it is hit but very slow, roughly 3 records per 5 minute (the topic was pre-populated) No table/sql API, only stream API From: Chesnay Schepler Sent: Wednesday, September 7, 2022 5:20 AM To: Alexey Trenikhun ; David Jost ; Matthias Pohl Cc:

Re: Slow Tests in Flink 1.15

2022-09-07 Thread Alexey Trenikhun
We are also observing extreme slow down (5+ minutes vs 15 seconds) in 1 of 2 integration tests . Both tests use Kafka. The slow test uses org.apache.flink.runtime.minicluster.TestingMiniCluster, this test tests complete job, which consumes and produces Kafka messages. Not affected test extends

Flink upgrade path

2022-09-06 Thread Alexey Trenikhun
Hello, Can I upgade from 1.13.6 directly to 1.15.2 skipping 1.14.x ? Thanks, Alexey

SourceFunction

2022-06-08 Thread Alexey Trenikhun
Hello, Is there plan to deprecate SourceFunction in favor of Source API? We have custom SourceFunction based source, do we need to plan to rewrite it using new Source API ? Thanks, Alexey

Source without persistent state

2022-05-11 Thread Alexey Trenikhun
Hello, I'm working on custom Source, something like heartbeat generator using new Source API, HeartSource is constructed with list of Kafka topics, SplitEnumerator for each topic queries number of partitions, and either creates a split per topic-partition or single split for all topic-partition

Re: Integration Test for Kafka Streaming job

2022-04-21 Thread Alexey Trenikhun
Thank you for information ! From: Farouk Sent: Thursday, April 21, 2022 1:14:00 AM To: Aeden Jameson Cc: Alexey Trenikhun ; Flink User Mail List Subject: Re: Integration Test for Kafka Streaming job Hi I would recommend to use kafka-junit5 from salesforce

Integration Test for Kafka Streaming job

2022-04-20 Thread Alexey Trenikhun
Hello, We have Flink job that read data from multiple Kafka topics, transforms data and write in output Kafka topics. We want write integration test for it. I've looked at KafkaTableITCase, we can do similar setup of Kafka topics, prepopulate data but since in our case it is endless stream, we n

New JM pod tries to connect to failed JM pod

2022-04-18 Thread Alexey Trenikhun
Hello, We are running Flink 1.13.6 in Kubernetes with k8s HA, the setup includes 1 JM and TM. Recently In jobmanager log I started to see: 2022-04-19T00:11:33.102Z Association with remote system [akka.tcp://flink@10.204.0.126:6123] has failed, address is now gated for [50] ms. Reason: [Associa

Re: Broadcast state corrupted ?

2022-04-13 Thread Alexey Trenikhun
I think I’ve found the cause : dataInputView.read(data) could read partial data, it returns number of bytes stored in buffer. If I use dataInputView.readFully(data) instead, the problem disappears. Thanks, Alexey From: Alexey Trenikhun Sent: Wednesday, April 13

Re: Broadcast state corrupted ?

2022-04-13 Thread Alexey Trenikhun
t;,"thread_name":"jobmanager-future-thread-1","level":"INFO","level_value":2} * I tried cancel with savepoint and cancel from UI. It seems doesn't depend on shutdown, log below is for savepoint without shutdown. And I can't re

Re: Broadcast state corrupted ?

2022-04-13 Thread Alexey Trenikhun
Any suggestions how to troubleshoot the issue? I still can reproduce the problem in environment A Thanks, Alexey From: Alexey Trenikhun Sent: Tuesday, April 12, 2022 7:10:17 AM To: Chesnay Schepler ; Flink User Mail List Subject: Re: Broadcast state corrupted

Re: Broadcast state corrupted ?

2022-04-12 Thread Alexey Trenikhun
I’ve tried to restore job in environment A (where we observe problem) from savepoint taken in environment B - restored fine. So looks something in environment A corrupts savepoint. From: Alexey Trenikhun Sent: Monday, April 11, 2022 7:10:51 AM To: Chesnay

Re: Broadcast state corrupted ?

2022-04-11 Thread Alexey Trenikhun
: Monday, April 11, 2022 2:28:48 AM To: Alexey Trenikhun ; Flink User Mail List Subject: Re: Broadcast state corrupted ? Am I understanding things correctly in that the same savepoint cannot be restored from in 1 environment, while it works fine in 3 others? If so, are they all relying on the same

Broadcast state corrupted ?

2022-04-10 Thread Alexey Trenikhun
Hello, We have KeyedBroadcastProcessFunction with broadcast state MapStateDescriptor, where PbCfgTenantDictionary is Protobuf type, for which we custom TypeInformation/TypeSerializer. In one of environment, we can't restore job from savepoint because seems state data is corrupted. I've added to

Re: Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook

2022-04-08 Thread Alexey Trenikhun
Hi Roman, Currently rest.async.store-duration is not set. Are you suggesting to try to decrease value from default or vice-versa? Thanks, Alexey From: Roman Khachatryan Sent: Friday, April 8, 2022 5:32:45 AM To: Alexey Trenikhun Cc: Flink User Mail List

Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook

2022-04-06 Thread Alexey Trenikhun
Hello, We are using Flink 1.13.6, Application Mode, k8s HA. To upgrade job, we use POST, url=http://gsp-jm:8081/jobs//savepoints, then we wait for up to 5 minutes for completion, periodically pulling status (GET, url=http://gsp-jm:8081/jobs/0

Re: MapState.entries()

2022-03-08 Thread Alexey Trenikhun
Thank you ! From: Schwalbe Matthias Sent: Monday, March 7, 2022 11:36:22 PM To: Alexey Trenikhun ; Flink User Mail List Subject: RE: MapState.entries() Hi Alexey, To my best knowledge it’s lazy with RocksDBStateBackend, using the Java iterator you could

MapState.entries()

2022-03-07 Thread Alexey Trenikhun
Hello, We are using RocksDBStateBackend, is MapState.entries() call in this case "lazy" - deserializes single entry while next(), or MapState.entries() returns collection, which is fully loaded into memory? Thanks, Alexey

Re: TM OOMKilled

2022-02-15 Thread Alexey Trenikhun
onday, February 14, 2022 10:06 PM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: TM OOMKilled Hi Alexey, You may want to double check if `state.backend.rocksdb.memory.managed` is configured to `true`. (This should be `true` by default.) Another question that may or may not be relat

Re: TM OOMKilled

2022-02-14 Thread Alexey Trenikhun
Sent: Monday, February 14, 2022 6:42:05 PM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: TM OOMKilled Hi! Heap memory usage depends heavily on your job and your state backend. Which state backend are you using and if possible could you share your user code or explain what operations

TM OOMKilled

2022-02-14 Thread Alexey Trenikhun
Hello, We run Flink 1.13.5 job in app mode in Kubernetes, 1 JM and 1 TM, we also have Kubernetes cron job which takes savepoint every 2 hour (14 */2 * * *), once in while (~1 per 2 days) TM is OOMKilled, suspiciously it happens on even hours ~4 minutes after savepoint start (e.g. 12:18, 4:18) bu

Re: FlinkKafkaConsumer and FlinkKafkaProducer and Kafka Cluster Migration

2022-01-17 Thread Alexey Trenikhun
s.apache.org From: Fabian Paul Sent: Friday, January 14, 2022 4:02 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: FlinkKafkaConsumer and FlinkKafkaProducer and Kafka Cluster Migration Hi Alexey, The bootstrap servers are not part of the state so

FlinkKafkaConsumer and FlinkKafkaProducer and Kafka Cluster Migration

2022-01-13 Thread Alexey Trenikhun
Hello, Currently we are using FlinkKafkaConsumer and FlinkKafkaProducer and planning to migrate to different Kafka cluster. Are boostrap servers, username and passwords part of FlinkKafkaConsumer and FlinkKafkaProducer ? So if we take savepoint change boostrap server and credentials and start

Flink 1.13.3, k8s HA - ResourceManager was revoked leadership

2021-12-10 Thread Alexey Trenikhun
Hello, I'm running Flink 1.13.3 with Kubernetes HA. JM periodically restarts after some time, in log below job runs ~8 minutes, then suddenly leadership was revoked, job reaches terminal state and K8s restarts failed JM: {"timestamp":"2021-12-11T04:51:53.697Z","message":"Agent Info (1/1) (47e67

Re: broadcast() without arguments

2021-12-10 Thread Alexey Trenikhun
Thank you Roman From: Roman Khachatryan Sent: Friday, December 10, 2021 1:48:08 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: broadcast() without arguments Hello, The broadcast() without arguments can be used the same way as a regular data

broadcast() without arguments

2021-12-09 Thread Alexey Trenikhun
Hello, How broadcast()​ method without arguments should be used ? Thanks, Alexey

Re: Order of events in Broadcast State

2021-12-06 Thread Alexey Trenikhun
Thank you David From: David Anderson Sent: Monday, December 6, 2021 1:36:20 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Order of events in Broadcast State Event ordering in Flink is only maintained between pairs of events that take exactly

Re: Order of events in Broadcast State

2021-12-03 Thread Alexey Trenikhun
nality. As our running example, we will use the case where we have a ... nightlies.apache.org From: Alexey Trenikhun Sent: Friday, December 3, 2021 4:33 PM To: Flink User Mail List Subject: Order of events in Broadcast State Hello, Trying to understand what sta

Order of events in Broadcast State

2021-12-03 Thread Alexey Trenikhun
Hello, Trying to understand what statement "Order of events in Broadcast State may differ across tasks" in [1] means. Let's say I have keyed function "A" which broadcasting stream of rules, KeyedBroadcastProcessFunction "B" receives rules and updates broadcast state, like example in [1]. Let's

Re: Could not retrieve submitted JobGraph from state handle

2021-11-18 Thread Alexey Trenikhun
: Alexey Trenikhun ; Flink User Mail List Subject: Re: Could not retrieve submitted JobGraph from state handle Hi Alexey, If you delete the HA data stored in the S3 manually or maybe you configured an automatic clean-up rule, then it could happen that the ConfigMap has the pointers while the

Could not retrieve submitted JobGraph from state handle

2021-11-15 Thread Alexey Trenikhun
Hello, We are using Kubernetes HA and Azure Blob storage and in rare cases I see following error: Could not retrieve submitted JobGraph from state handle under jobGraph-. This indicates that the retrieved state handle is broken. Try cleaning the state handle stor

Re: Question on BoundedOutOfOrderness

2021-11-02 Thread Alexey Trenikhun
Hi Oliver, I believe you also need to do sort, out of order ness watermark strategy only “postpone” watermark for given expected maximum of out of orderness. Check Ververica example - https://github.com/ververica/flink-training-exercises/blob/master/src/main/java/com/ververica/flinktraining/exam

Re: checkpoints/.../shared cleanup

2021-10-09 Thread Alexey Trenikhun
From: Roman Khachatryan Sent: Friday, October 1, 2021 8:51:31 AM To: Alexey Trenikhun Cc: Matthias Pohl ; Flink User Mail List ; sjwies...@gmail.com Subject: Re: checkpoints/.../shared cleanup Hi Alexey, Thanks for sharing this information. I also don't see anything suspicious i

Re: checkpoints/.../shared cleanup

2021-08-31 Thread Alexey Trenikhun
I'm running Flink in Application Mode and set jobId explicitly From: Khachatryan Roman Sent: Monday, August 30, 2021 7:16 AM To: Alexey Trenikhun Cc: Matthias Pohl ; Flink User Mail List ; sjwies...@gmail.com Subject: Re: checkpoints/.../shared cleanup H

Re: checkpoints/.../shared cleanup

2021-08-26 Thread Alexey Trenikhun
lder I see older files from previous version of job. This upgrade process repeated again, as result the shared subfolder grows and grows Thanks, Alexey ____ From: Alexey Trenikhun Sent: Thursday, August 26, 2021 6:37:27 PM To: Matthias Pohl Cc: Flink User Mail Lis

Re: checkpoints/.../shared cleanup

2021-08-26 Thread Alexey Trenikhun
pointing is enabled, managed state is persisted to ensure ... ci.apache.org Thanks, Alexey From: Matthias Pohl Sent: Thursday, August 26, 2021 5:42 AM To: Alexey Trenikhun Cc: Flink User Mail List ; sjwies...@gmail.com Subject: Re: checkpoints/.../shared cleanup

checkpoints/.../shared cleanup

2021-08-24 Thread Alexey Trenikhun
Hello, I use incremental checkpoints, not externalized, should content of checkpoint/.../shared be removed when I cancel job (or cancel with savepoint). Looks like in our case shared continutes to grow... Thanks, Alexey

Re: High DirectByteBuffer Usage

2021-07-15 Thread Alexey Trenikhun
Just in case, make sure that you are not using Kafka SSL port without setting security protocol, see [1] [1] https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-4090 From: bat man Sent: Wednesday, July 14, 2021 10:55:54 AM To: Timo Walther Cc: user

Re: How to check is Kafka partition "idle" in emitRecordsWithTimestamps

2021-06-01 Thread Alexey Trenikhun
eldName Comment rendererType atlassian-wiki-renderer issueKey FLINK-18450 Preview comment issues.apache.org Thanks, Alexey From: Till Rohrmann Sent: Tuesday, June 1, 2021 6:24 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: How to check is Kafka part

How to check is Kafka partition "idle" in emitRecordsWithTimestamps

2021-05-28 Thread Alexey Trenikhun
Hello, I'm thinking about implementing custom Kafka connector which provides event alignment (similar to FLINK-10921, which seems abandoned). What is the way to determine is partition is idle from override of AbstractFetcher.emitRecordsWithTimestamps()? Does KafkaTopicPartitionState has this in

Re: KafkaSource metrics

2021-05-26 Thread Alexey Trenikhun
Found https://issues.apache.org/jira/browse/FLINK-22766 From: Alexey Trenikhun Sent: Tuesday, May 25, 2021 3:25 PM To: Ardhani Narasimha ; 陳樺威 ; Flink User Mail List Subject: Re: KafkaSource metrics Looks like when KafkaSource is used instead of

Re: KafkaSource metrics

2021-05-25 Thread Alexey Trenikhun
Looks like when KafkaSource is used instead of FlinkKafkaConsumer, metrics listed below are not available. Bug? Work in progress? Thanks, Alexey From: Ardhani Narasimha Sent: Monday, May 24, 2021 9:08 AM To: 陳樺威 Cc: user Subject: Re: KafkaSource metrics Use b

Re: After upgrade from 1.11.2 to 1.13.0 parameter taskmanager.numberOfTaskSlots set to 1.

2021-05-18 Thread Alexey Trenikhun
If flink-conf.yaml is readonly, flink will complain but work fine? From: Chesnay Schepler Sent: Wednesday, May 12, 2021 5:38 AM To: Alex Drobinsky Cc: user@flink.apache.org Subject: Re: After upgrade from 1.11.2 to 1.13.0 parameter taskmanager.numberOfTaskSlots

KafkaSource

2021-05-17 Thread Alexey Trenikhun
Hello, Is new KafkaSource/KafkaSourceBuilder ready to be used ? If so, is KafkaSource state compatible with legacy FlinkKafkaConsumer, for example if I replace FlinkKafkaConsumer by KafkaSource, will offsets continue from what we had in FlinkKafkaConsumer ? Thanks, Alexey

Re: Helm chart for Flink

2021-05-17 Thread Alexey Trenikhun
I think it should be possible to use Helm pre-upgrade hook to take savepoint and stop currently running job and then Helm will upgrade image tags. The problem is that if you hit timeout while taking savepoint, it is not clear how to recover from this situation Alexey ___

Re: reactive mode and back pressure

2021-05-16 Thread Alexey Trenikhun
modes. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/state/checkpoints/#unaligned-checkpoints On Fri, May 14, 2021 at 11:29 PM Alexey Trenikhun mailto:yen...@msn.com>> wrote: Hello, Is new reactive mode can operate under back pr

The heartbeat of JobManager timed out

2021-05-15 Thread Alexey Trenikhun
Hello, I periodically see in JM log (Flink 12.2): {"ts":"2021-05-15T21:10:36.325Z","message":"The heartbeat of JobManager with id be8225ebae1d6422b7f268c801044b05 timed out.","logger_name":"org.apache.flink.runtime.resourcemanager.StandaloneResourceManager","thread_name":"flink-akka.actor.defau

reactive mode and back pressure

2021-05-14 Thread Alexey Trenikhun
Hello, Is new reactive mode can operate under back pressure? Old manual rescaling via taking savepoint didn't work with system under back pressure, since it was practically impossible to take savepoint, so wondering is reactive mode expected to be better in this regards ? Thanks, Alexey

Unexpected key-group in restore

2021-05-11 Thread Alexey Trenikhun
Hello, Suddenly our job (Flink 1.12.2) started to failing to restore from savepoint due to unexpected key group, it looks like stack is for memory state backend, while in fact actually in flink-conf.yaml state.backend: rocksdb {"ts":"2021-05-11T15:44:09.004Z","message":"Loading configuration prop

Re: idleTimeMsPerSecond exceeds 1000

2021-04-21 Thread Alexey Trenikhun
Great. Thank ypu From: Chesnay Schepler Sent: Wednesday, April 21, 2021 1:02:59 AM To: Alexey Trenikhun ; Flink User Mail List Subject: Re: idleTimeMsPerSecond exceeds 1000 This ticket seems related; the issue was fixed in 1.13: https://issues.apache.org/jira

idleTimeMsPerSecond exceeds 1000

2021-04-20 Thread Alexey Trenikhun
Hello, When Flink job mostly idle, idleTimeMsPerSecond for given task_name and subtask_index sometimes exceeds 1000, I saw values up to 1350, but usually not higher than 1020. Is it due to accuracy of nanoTime/currentTimeMillis or there is bug in calculations ? Thanks, Alexey

Re: Checkpoint fail due to timeout

2021-03-30 Thread Alexey Trenikhun
wojski Sent: Tuesday, March 23, 2021 5:31 AM To: Alexey Trenikhun Cc: Arvid Heise ; ChangZhuo Chen (陳昌倬) ; ro...@apache.org ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hi Alexey, You should definitely investigate why the job is stuck. 1. First of all, is it completely s

Re: Checkpoint fail due to timeout

2021-03-30 Thread Alexey Trenikhun
uring next performance run. Thanks, Alexey From: Roman Khachatryan Sent: Tuesday, March 23, 2021 12:17 AM To: Alexey Trenikhun Cc: ChangZhuo Chen (陳昌倬) ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Unfortunately, the lock can't be chang

Re: EOFException on attempt to scale up job with RocksDB state backend

2021-03-24 Thread Alexey Trenikhun
Hi Yun, Finally I was able to try to rescale with block blobs configured - rescaled from 6 to 8 w/o problem. So loos like indeed there is problem with page blob. Thank you for help, Alexey From: Alexey Trenikhun Sent: Thursday, March 18, 2021 11:31 PM To: Yun

Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint

2021-03-24 Thread Alexey Trenikhun
From: Yang Wang Sent: Tuesday, March 23, 2021 11:17:18 PM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint Hi Alexey, >From your attached logs, I do not think the new start JobManager will recover &g

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
hread.run(SourceStreamTask.java:263) Thanks, Alexey From: Roman Khachatryan Sent: Monday, March 22, 2021 1:36 AM To: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Thanks for sharing the thread dump. It

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Thanks for sharing the thread dump. It shows that the source thread is indeed back-pressured (checkpoint lock is held by a thread which is trying to emit but unable to acquire any free buffers

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
checkpoint still times out after 3hr. From: Arvid Heise Sent: Monday, March 22, 2021 6:58:20 AM To: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; ro...@apache.org ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hi Alexey, rescaling from

Histogram

2021-03-19 Thread Alexey Trenikhun
Hello, Is any way to expose from Flink Histogram metric, not Summary ? Thanks, Alexey

Re: inputFloatingBuffersUsage=0?

2021-03-19 Thread Alexey Trenikhun
, 2021 5:01 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: inputFloatingBuffersUsage=0? Hi, clarification about the 2nd part. Required memory is one single exclusive buffer per channel, so if you are running low on memory, floating buffers are one of the first to go, hence you could

Re: EOFException on attempt to scale up job with RocksDB state backend

2021-03-18 Thread Alexey Trenikhun
March 18, 2021 5:08 AM To: Alexey Trenikhun ; Tzu-Li (Gordon) Tai ; user@flink.apache.org Subject: Re: EOFException on attempt to scale up job with RocksDB state backend Hi Alexey, Flink would only write once for checkpointed files. Could you try to write checkpointed files as block blob format

inputFloatingBuffersUsage=0?

2021-03-18 Thread Alexey Trenikhun
Hello, While trying to investigate back pressure using hints from [1], I've noticed that flink_taskmanager_job_task_buffers_inPoolUsage and flink_taskmanager_job_task_buffers_inputFloatingBuffersUsage are always 0, which looks suspicious, are these metrics still populated ? Thanks, Alexey [1]

Re: EOFException on attempt to scale up job with RocksDB state backend

2021-03-17 Thread Alexey Trenikhun
ed in [1] or without compaction? Thanks, Alexey From: Yun Tang Sent: Wednesday, March 17, 2021 9:31 PM To: Alexey Trenikhun ; Tzu-Li (Gordon) Tai ; user@flink.apache.org Subject: Re: EOFException on attempt to scale up job with RocksDB state backend Hi Alexey,

Re: EOFException on attempt to scale up job with RocksDB state backend

2021-03-17 Thread Alexey Trenikhun
, not sure does it have effect on savepoint or not) Thanks, Alexey From: Yun Tang Sent: Wednesday, March 17, 2021 12:33 AM To: Alexey Trenikhun ; Tzu-Li (Gordon) Tai ; user@flink.apache.org Subject: Re: EOFException on attempt to scale up job with RocksDB state b

Re: Checkpoint fail due to timeout

2021-03-17 Thread Alexey Trenikhun
("hdfs:///checkpoints-data/")); Difference to Savepoints ci.apache.org From: ChangZhuo Chen (陳昌倬) Sent: Wednesday, March 17, 2021 12:29 AM To: Alexey Trenikhun Cc: ro...@apache.org; Flink User Mail List Subject: Re: Checkpoint fail due to timeout On Wed, Ma

Re: Checkpoint fail due to timeout

2021-03-16 Thread Alexey Trenikhun
From: ChangZhuo Chen (陳昌倬) Sent: Tuesday, March 16, 2021 6:59 AM To: Alexey Trenikhun Cc: ro...@apache.org; Flink User Mail List Subject: Re: Checkpoint fail due to timeout On Tue, Mar 16, 2021 at 02:32:54AM +0000, Alexey Trenikhun wrote: > Hi Roman, > I took thread dump: > "Source:

Re: EOFException on attempt to scale up job with RocksDB state backend

2021-03-16 Thread Alexey Trenikhun
Also restore from same savepoint without change in parallelism works fine. From: Alexey Trenikhun Sent: Monday, March 15, 2021 9:51 PM To: Yun Tang ; Tzu-Li (Gordon) Tai ; user@flink.apache.org Subject: Re: EOFException on attempt to scale up job with RocksDB

Re: EOFException on attempt to scale up job with RocksDB state backend

2021-03-15 Thread Alexey Trenikhun
No, I believe original exception was from 1.12.1 to 1.12.1 Thanks, Alexey From: Yun Tang Sent: Monday, March 15, 2021 8:07:07 PM To: Alexey Trenikhun ; Tzu-Li (Gordon) Tai ; user@flink.apache.org Subject: Re: EOFException on attempt to scale up job with

Re: Checkpoint fail due to timeout

2021-03-15 Thread Alexey Trenikhun
k or per TM? I see multiple threads in SynchronizedStreamTaskActionExecutor.runThrowing blocked on different Objects. Thanks, Alexey From: Roman Khachatryan Sent: Monday, March 15, 2021 2:16 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hello Alexey,

Re: EOFException on attempt to scale up job with RocksDB state backend

2021-03-15 Thread Alexey Trenikhun
Savepoint was taken with 1.12.1, I've tried to scale up using same version and 1.12.2 From: Tzu-Li (Gordon) Tai Sent: Monday, March 15, 2021 12:06 AM To: user@flink.apache.org Subject: Re: EOFException on attempt to scale up job with RocksDB state backend Hi,

Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint

2021-03-14 Thread Alexey Trenikhun
From: Yang Wang Sent: Sunday, March 14, 2021 7:50:21 PM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint If the HA related ConfigMaps still exists, then I am afraid the data located on the distributed storage

Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint

2021-03-11 Thread Alexey Trenikhun
deleting Kubernetes Job and HA configmaps enough, or something in persisted storage should be deleted as well? Thanks, Alexey From: Yang Wang Sent: Thursday, March 11, 2021 2:59 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Kubernetes HA

Re: User metrics outside tasks

2021-03-11 Thread Alexey Trenikhun
-OOM counter From: Arvid Heise Sent: Thursday, March 11, 2021 5:22:30 AM To: Alexey Trenikhun Cc: Flink User Mail List ; Chesnay Schepler Subject: Re: User metrics outside tasks Hi Alexey, could you describe what you want to achieve? Most metrics are bound to a

EOFException on attempt to scale up job with RocksDB state backend

2021-03-10 Thread Alexey Trenikhun
Hello, I was trying to scale job up, took save point, changed parallelism setting from 6 to 8 and started job from savepoint: switched from RUNNING to FAILED on 10.204.2.98:6122-2946e1 @ gsp-tm-0.gsp-headless.gsp.svc.cluster.local (dataPort=41409). java.lang.Exception: Exception while creating S

User metrics outside tasks

2021-03-10 Thread Alexey Trenikhun
Hello, Is it possible to register user metric outside task/operator (not in RichMapFunction#open) Thanks, Alexey

Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint

2021-03-09 Thread Alexey Trenikhun
Hi Yang, The problem is re-occurred, full JM log is attached Thanks, Alexey From: Yang Wang Sent: Sunday, February 28, 2021 10:04 PM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Kubernetes HA - attempting to restore from wrong (non-existing

Re: Re: How to check checkpointing mode

2021-03-09 Thread Alexey Trenikhun
u for your help Alexey From: Yun Gao Sent: Monday, March 8, 2021 7:14 PM To: Alexey Trenikhun ; Flink User Mail List Subject: Re: Re: How to check checkpointing mode Hi Alexey, Sorry I also do not see problems in the attached code. Could you add a breakpoint

Re: Trigger and completed Checkpointing do not appeared

2021-03-08 Thread Alexey Trenikhun
The picture in first e-mail shows that job was completed in 93ms From: Abdullah bin Omar Sent: Monday, March 8, 2021 3:53 PM To: user@flink.apache.org Subject: Re: Trigger and completed Checkpointing do not appeared Hi, Please read the previous email (and also

Re: Job downgrade

2021-03-04 Thread Alexey Trenikhun
Hi Gordon, I was using RocksDB backend Alexey From: Tzu-Li (Gordon) Tai Sent: Thursday, March 4, 2021 12:58:01 AM To: Alexey Trenikhun Cc: Piotr Nowojski ; Flink User Mail List Subject: Re: Job downgrade Hi Alexey, Are you using the heap backend? If that&#

Re: Job downgrade

2021-03-03 Thread Alexey Trenikhun
expected “rollback” to version 1+. From: Piotr Nowojski Sent: Wednesday, March 3, 2021 11:47:45 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Job downgrade Hi, I'm not sure what's the reason behind this. Probably classes are somehow a

Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint

2021-03-01 Thread Alexey Trenikhun
From: Yang Wang Sent: Sunday, February 28, 2021 10:04 PM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Kubernetes HA - attempting to restore from wrong (non-existing) savepoint Hi Alexey, It seems that the KubernetesHAService works well since all the checkpoints have been cleaned u

Re: Producer Configuration

2021-02-27 Thread Alexey Trenikhun
, 2021 4:00:23 PM To: Alexey Trenikhun Cc: user Subject: Re: Producer Configuration Yes, the flink job also works in producing messages. It's just that after a short period of time, it fails w/ a timeout. That is why I'm trying to set a longer timeout period but it doesn't seem like

Re: Producer Configuration

2021-02-27 Thread Alexey Trenikhun
Can you produce messages using Kafka console producer connect using same properties ? From: Claude M Sent: Saturday, February 27, 2021 8:05 AM To: Alexey Trenikhun Cc: user Subject: Re: Producer Configuration Thanks for your reply, yes it was specified

Job downgrade

2021-02-27 Thread Alexey Trenikhun
Hello, Let's have version 1 of my job uses keyed state with name "a" and type A, which some Avro generated class. Then I upgrade to version 2, which in addition uses keyed state "b" and type B (another concrete Avro generated class), I take savepoint with version 2 and decided to downgrade to ve

Re: Producer Configuration

2021-02-26 Thread Alexey Trenikhun
I believe bootstrap.servers is mandatory Kafka property, but it looks like you didn’t set it From: Claude M Sent: Friday, February 26, 2021 12:02:10 PM To: user Subject: Producer Configuration Hello, I created a simple Producer and when the job ran, it was get

Kubernetes HA - attempting to restore from wrong (non-existing) savepoint

2021-02-26 Thread Alexey Trenikhun
Hello, We have Flink job running in Kubernetes with Kuberenetes HA enabled (JM is deployed as Job, single TM as StatefulSet). We taken savepoint with cancel=true. Now when we are trying to start job using --fromSavepoint A, where is A path we got from taking savepoint (ClusterEntrypoint reports

Re: stop job with Savepoint

2021-02-20 Thread Alexey Trenikhun
Adding "list" to verbs helps, do I need to add anything else ? ____ From: Alexey Trenikhun Sent: Saturday, February 20, 2021 2:10 PM To: Flink User Mail List Subject: stop job with Savepoint Hello, I'm running per job Flink cluster, JM is deployed as

stop job with Savepoint

2021-02-20 Thread Alexey Trenikhun
Hello, I'm running per job Flink cluster, JM is deployed as Kubernetes Job with restartPolicy: Never, highavailability is KubernetesHaServicesFactory. Job runs fine for some time, configmaps are created etc. Now in order to upgrade Flink job, I'm trying to stop job with savepoint (flink stop $J

Publish heartbeat messages in all Kafka partitions

2021-01-28 Thread Alexey Trenikhun
Hello, We need to publish heartbeat messages in all topic partitions. Is possible to produce single message and then somehow broadcast it to all partitions from FlinkKafkaProducer? Or only way that message source knows number of existing partitions and sends 1 message to 1 partition? Thanks, Al

Re: flink-python_2.12-1.12.0.jar

2021-01-18 Thread Alexey Trenikhun
Thank you! From: Chesnay Schepler Sent: Monday, January 18, 2021 3:14 PM To: Alexey Trenikhun ; Flink User Mail List Subject: Re: flink-python_2.12-1.12.0.jar The flink-python jar is only required for running python jobs. If you don't use such jobs yo

flink-python_2.12-1.12.0.jar

2021-01-18 Thread Alexey Trenikhun
Hello, Is flink-python_2.12-1.12.0.jar in docker image needed to run Java based stream processor? Will Flink work if we will remove this jar from docker image and run Java job? Thanks, Alexey

Re: Flink Application cluster/standalone job: some JVM Options added to Program Arguments

2021-01-18 Thread Alexey Trenikhun
JOB args? Thanks, Alexey From: Matthias Pohl Sent: Sunday, January 17, 2021 11:53:29 PM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Flink Application cluster/standalone job: some JVM Options added to Program Arguments Hi Alexey, thanks for

Flink Application cluster/standalone job: some JVM Options added to Program Arguments

2021-01-15 Thread Alexey Trenikhun
Hello, I was trying to deploy Flink 1.12.0 Application cluster on k8s, I have following job manager arguments: standalone-job --job-classname com.x.App --job-id @/opt/flink/conf/fsp.conf However, when I print args from App.main(): [@/opt/flink/conf/ssp.conf, -D,

Re: state reset(lost) on TM recovery

2021-01-13 Thread Alexey Trenikhun
Ok, thanks. From: Chesnay Schepler Sent: Wednesday, January 13, 2021 11:46:15 AM To: Alexey Trenikhun ; Flink User Mail List Subject: Re: state reset(lost) on TM recovery The FsStateBackend makes heavy use of hashcodes, so it must be stable. On 1/13/2021 7:13

  1   2   >