Re: checkpoint upload thread

2024-08-01 Thread Yanfei Lei
d but with applied probability? > > Thanks. > > > -- 原始邮件 -- > 发件人: "Yanfei Lei" ; > 发送时间: 2024年7月30日(星期二) 下午5:15 > 收件人: "Enric Ott"<243816...@qq.com>; > 抄送: "user"; > 主题: Re: checkpoint upload thread > > Hi Enric

Re: checkpoint upload thread

2024-07-30 Thread Yanfei Lei
Hi Enric, If I understand correctly, one subtask would use one `asyncOperationsThreadPool`[1,2], it is possible to use the same connection for an operator chain. [1] https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask

Re: State leak in tumbling windows

2024-06-06 Thread Yanfei Lei
Hi Adam, Is your job a datastream job or a sql job? After I looked through the window-related code(I'm not particularly familiar with this part of the code), this problem should only exist in datastream. Adam Domanski 于2024年6月3日周一 16:54写道: > > Dear Flink users, > > I spotted the ever growing che

Re: TTL issue with large RocksDB keyed state

2024-06-03 Thread Yanfei Lei
Hi, > 1. After multiple full checkpoints and a NATIVE savepoint the size was > unchanged. I'm wondering if RocksDb compaction is because we never update > key values? The state is nearly fully composed of keys' space. Do keys not > get freed using RocksDb compaction filter for TTL? Regarding

Re: Flink 1.18: Unable to resume from a savepoint with error InvalidPidMappingException

2024-04-23 Thread Yanfei Lei
t; like 7 days, as a 7 days old savepoints is effectively worthless, and > probably adjust "transaction.timeout.ms" to be close to this. > > But can you explain how "transactional.id.expiration.ms" influences the > InvalidPidMappingException, or why having "tran

Re: Flink 1.18: Unable to resume from a savepoint with error InvalidPidMappingException

2024-04-21 Thread Yanfei Lei
Hi JM, Yes, `InvalidPidMappingException` occurs because the transaction is lost in most cases. For short-term, " transaction.timeout.ms" > "transactional.id.expiration.ms" can ignore the `InvalidPidMappingException`[1]. For long-term, FLIP-319[2] provides a solution. [1] https://speakerdeck.com

Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-03-28 Thread Yanfei Lei
Congratulations! Best, Yanfei Zhanghao Chen 于2024年3月28日周四 19:59写道: > > Congratulations! > > Best, > Zhanghao Chen > > From: Yu Li > Sent: Thursday, March 28, 2024 15:55 > To: d...@paimon.apache.org > Cc: dev ; user > Subject: Re: [ANNOUNCE] Apache Paimon is gr

Re: Flink job unable to restore from savepoint

2024-03-27 Thread Yanfei Lei
Hi Prashant, Compared to the job that generated savepoint, are there any changes in the new job? For example, data fields were added or deleted, or the type serializer was changed? More detailed job manager logs may help. prashant parbhane 于2024年3月27日周三 14:20写道: > > Hello, > > We have been facin

Re: There is no savepoint operation with triggerId

2024-03-25 Thread Yanfei Lei
Hi Lars, It looks like the relevant logs when retrieving savepoint. Have you frequently retrieved savepoints through the REST interface? Lars Skjærven 于2024年3月26日周二 07:17写道: > > Hello, > My job manager is constantly complaining with the following error: > > "Exception occurred in REST handler: T

Re: Is there any options to control the file names in file sink

2024-03-20 Thread Yanfei Lei
Hi Lasse, If the datastream job is used, you can try setting `OutputFileConfig` for file sink, something like[1]: ``` OutputFileConfig config = OutputFileConfig .builder() .withPartPrefix("prefix") .withPartSuffix(".ext") .build(); FileSink> sink = FileSink .forRowFormat((new Path(outputPath)

Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Yanfei Lei
Congrats, thanks for the great work! Sergey Nuyanzin 于2024年3月18日周一 19:30写道: > > Congratulations, thanks release managers and everyone involved for the great > work! > > On Mon, Mar 18, 2024 at 12:15 PM Benchao Li wrote: >> >> Congratulations! And thanks to all release managers and everyone >> i

Re: Flink Checkpoint & Offset Commit

2024-03-07 Thread Yanfei Lei
Hi Jacob, > I have multiple upstream sources to connect to depending on the business > model which are not Kafka. Based on criticality of the system and publisher > dependencies, we cannot switch to Kafka for these. Sounds like you want to implement some custom connectors, [1][2] may be helpful

Re: SecurityManager in Flink

2024-03-06 Thread Yanfei Lei
Hi Kirti Dhar, What is your java version? I guess this problem may be related to FLINK-33309[1]. Maybe you can try adding "-Djava.security.manager" to the java options. [1] https://issues.apache.org/jira/browse/FLINK-33309 Kirti Dhar Upadhyay K via user 于2024年3月6日周三 18:10写道: > > Hi Team, > > > >

Re: What to do about local disks with RocksDB with Kubernetes Operator

2023-10-18 Thread Yanfei Lei
Hi Alex, AFAIK, the emptyDir[1] can be used directly as local disks, and emptyDir can be defined by referring to this pod template[2]. If you want to use local disks through PV, you can first create a statefulSet and mount the PV through volume claim templates[3], the example “Local Recovery Enab

Re: Failure to restore from last completed checkpoint

2023-09-07 Thread Yanfei Lei
Hey Jacqlyn, According to the stack trace, it seems that there is a problem when the checkpoint is triggered. Is this the problem after the restore? would you like to share some logs related to restoring? Best, Yanfei Jacqlyn Bender via user 于2023年9月8日周五 05:11写道: > > Hey folks, > > > We experien

Re: Query around Rocksdb

2023-07-05 Thread Yanfei Lei
> size. Do you suggest adding cleanupInRocksdbCompactFilter(1000) as well? What > will be the impact of this configuration? > > On Tue, Jul 4, 2023 at 8:13 AM Yanfei Lei wrote: >> >> Hi neha, >> >> Due to the limitation of RocksDB, we cannot create a >> str

Re: Query around Rocksdb

2023-07-03 Thread Yanfei Lei
Hi neha, Due to the limitation of RocksDB, we cannot create a strict-capacity-limit LRUCache which shared among rocksDB instance(s), FLINK-15532[1] is created to track this. BTW, have you set TTL for this job[2], TTL can help control the state size. [1] https://issues.apache.org/jira/browse/FLIN

Re: Checkpointed data size is zero

2023-07-03 Thread Yanfei Lei
Hi Kamal, Is the Full Checkpoint Data Size[1] also zero? If not, it may be that no data is processed during this checkpoint. [1] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/ops/monitoring/checkpoint_monitoring/ Shammon FY 于2023年7月4日周二 09:10写道: > > Hi Kamal, > > You can che

Re: RocksdbStateBackend.enableTtlCompactionFilter

2023-06-20 Thread Yanfei Lei
Hi patricia, The TTL compaction filter in RocksDB has been enabled in 1.10 by default and it is always enabled in 1.11+[1], I think there is no need to explicitly enable the ttl compaction filter. [1] https://nightlies.apache.org/flink/flink-docs-release-1.17/release-notes/flink-1.11/#removal-of-

Re: Changelog fail leads to job fail regardless of tolerable-failed-checkpoints config

2023-06-20 Thread Yanfei Lei
Hi Dongwoo, State changelogs are continuously uploaded to the durable storage when Changelog state backend is enabled. In other words, it will also persist data **outside the checkpoint phase**, and the exception at this time will directly cause the job to fail. And only exceptions in the checkpo

Re: The JobManager is taking minutes to complete and finalize checkpoints despite the Task Managers seem to complete them in a few seconds

2023-05-03 Thread Yanfei Lei
Hi Francesco, The overall checkpoint duration in Flink UI is EndToEndDuration[1], which is the time from Jobmanager triggering checkpoint to collecting the last ack message sent from task manager, depending on the slowest task manager. > "-_message__:__"Completed checkpoint 2515895 for job > fdc

Re: Flink SQL State

2023-04-26 Thread Yanfei Lei
Hi Giannis, Except “default” Colume Family(CF), all other CFs represent the state in rocksdb state backend, the name of a CF is the name of a StateDescriptor. - deduplicate-state is a value state, you can find it in DeduplicateFunctionBase.java and MiniBatchDeduplicateFunctionBase.java, they are

Re: [ANNOUNCE] Flink Table Store Joins Apache Incubator as Apache Paimon(incubating)

2023-03-27 Thread Yanfei Lei
Congratulations! Best Regards, Yanfei ramkrishna vasudevan 于2023年3月27日周一 21:46写道: > > Congratulations !!! > > On Mon, Mar 27, 2023 at 2:54 PM Yu Li wrote: >> >> Dear Flinkers, >> >> >> As you may have noticed, we are pleased to announce that Flink Table Store >> has joined the Apache Incubator

[ANNOUNCE] FRocksDB 6.20.3-ververica-2.0 released

2023-01-30 Thread Yanfei Lei
It is very happy to announce the release of FRocksDB 6.20.3-ververica-2.0. Compiled files for Linux x86, Linux arm, Linux ppc64le, MacOS x86, MacOS arm, and Windows are included in FRocksDB 6.20.3-ververica-2.0 jar, and the FRocksDB in Flink 1.17 would be updated to 6.20.3-ververica-2.0. Release

Re: How does Flink plugin system work?

2023-01-02 Thread Yanfei Lei
Hi Ruibin, "metrics.reporter.prom.class" is deprecated in 1.16, maybe " metrics.reporter.prom.factory.class"[1] can solve your problem. After reading the related code[2], I think the root cause is that " metrics.reporter.prom.class" would load the code via flink's classpath instead of MetricRepor

Re: ZLIB Vulnerability Exposure in Flink statebackend RocksDB

2022-12-12 Thread Yanfei Lei
e? I >> would like to understand the impact if we make changes in our local Flink >> code with regards to testing efforts and any other affected modules? >> >> Can you please clarify this? >> >> Thanks, >> Vidya Sagar. >> >> >> On Wed, D

Re: SNI issue

2022-12-08 Thread Yanfei Lei
Hi, I didn't face this issue, and I'm guessing it might have something to do with the configuration of SSL[1], have you configured the "security.ssl.rest.enabled" option? [1] https://cnightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-ssl/#configuring-ssl Jean-Damien

Re: Exceeded Checkpoint tolerable failure

2022-12-08 Thread Yanfei Lei
Hi Madan, Maybe you can check the value of " *execution.checkpointing.tolerable-failed-checkpoints"*[1] in your application configuration, and try to increase this value? [1] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-tolerable-fail

Re: ZLIB Vulnerability Exposure in Flink statebackend RocksDB

2022-12-07 Thread Yanfei Lei
Hi Vidya Sagar, Thanks for bringing this up. The RocksDB state backend defaults to Snappy[1]. If the compression option is not specifically configured, this vulnerability of ZLIB has no effect on the Flink application for the time being. *> is there any plan in the coming days to address this? *

Re: Task Manager restart and RocksDB incremental checkpoints issue.

2022-11-14 Thread Yanfei Lei
;>> Questions: >>>> - What do you think about the older files that are pulled from the >>>> hostpath to mount path should be deleted first and then create the new >>>> instanceBasepath? >>>> Otherwise, we are going to be ended with the GBs of

Re: Task Manager restart and RocksDB incremental checkpoints issue.

2022-11-10 Thread Yanfei Lei
Hi Vidya Sagar, Could you please share the reason for TaskManager restart? If the machine or JVM process of TaskManager crashes, the `RocksDBKeyedStateBackend` can't be disposed/closed normally, so the existing rocksdb instance directory would remain. BTW, if you use Application Mode on k8s, if

Re: Does reduce function on keyed window gives any guarantee on the order of elements?

2022-11-04 Thread Yanfei Lei
duce function instance will only receive elements from the > same key in order. > > > > *From:* Yanfei Lei > *Sent:* 03 November 2022 03:06 > *To:* Qing Lim > *Cc:* User > *Subject:* Re: Does reduce function on keyed window gives any guarantee > on the order of elements?

Re: Does reduce function on keyed window gives any guarantee on the order of elements?

2022-11-02 Thread Yanfei Lei
Hi Qing, > Does it guarantee that it will be called in the same order of elements in the stream, where value2 is always 1 element after value1? Order is maintained within each parallel stream partition. If the reduce operator only has one sending- sub-task, the answer is YES, but if reduce operato

Re: Broadcast state and OutOfMemoryError: Direct buffer memory

2022-10-21 Thread yanfei lei
Hi Dan, Usually broadcast state needs more network buffers, the network buffer used to exchange data records between tasks would request a portion of direct memory[1], I think it is possible to get the “Direct buffer memory” OOM errors in this scenarios. Maybe you can try to increase taskmanager.m

Re: Presto S3 filesystem access issue - checkpointing - EKS

2022-10-17 Thread yanfei lei
Hi Vignesh, 403 status code makes this look like an authorization issue. > * Some digging into the presto configs and I had this one turned off topresto.s3.use-instance-credentials: "false". (Is this right?)* >From the document[1], it is recommended that set hive. *s3.use-instance-credentials* to

Re: In-flight data within aligned checkpoints/savepoints

2022-08-22 Thread yanfei lei
Hi Darin, > I often see my checkpoints contain "Processed (persisted) in-flight data". The values outside the parentheses represent Processed in-flight data[1], and the values inside the parentheses represent persisted in-flight data[1] , what kind of case did you see in your WEB UI?If the >0 val

Re: get state from window

2022-08-17 Thread yanfei lei
Hi, there are two methods on the Context object that a process() invocation receives that allows access to the two types of state: - globalState(), which allows access to keyed state that is not scoped to a window - windowState(), which allows access to keyed state that is also scoped

Re: How to clean up RocksDB local directory in K8s statefulset

2022-06-27 Thread yanfei lei
Hi Allen, what volumes do you use for your TM pod? If you want your data to be deleted when the pod restarts, you can use an ephemeral volume like EmptyDir. And Flink should remove temporary files automatically when they are not needed anymore(see this discussion