[ANNOUNCE] Apache Flink 2.0 Preview released

2024-10-23 Thread Xintong Song
The Apache Flink community is very happy to announce the release of Apache Flink 2.0 Preview. Apache Flink® is an open-source unified stream and batch data processing framework for distributed, high-performing, always-available, and accurate data applications. This release is a preview of the upc

Join Us for Flink Forward Asia 2024 in Shanghai (Nov 29-30) & Jakarta (Dec 5)!

2024-09-27 Thread Xintong Song
Dear Flink Community, We are excited to share some important news with you! Flink Forward Asia 2024 is coming up with two major events: the first in Shanghai on November 29-30, and the second in Jakarta on December 5. These gatherings will focus on the latest developments, future plans, and pro

[ANNOUNCE] Apache Flink has won the 2023 SIGMOD Systems Award

2023-07-03 Thread Xintong Song
Dear Community, I'm pleased to share this good news with everyone. As some of you may have already heard, Apache Flink has won the 2023 SIGMOD Systems Award [1]. "Apache Flink greatly expanded the use of stream data-processing." -- SIGMOD Awards Committee SIGMOD is one of the most influential da

Re: [DISCUSS] Issue tracking workflow

2022-10-24 Thread Xintong Song
> require all PRs that are merged to exist as a Github Issue? > 3. There's no longer one central administration, which is especially > valuable to track all issues across projects like the different connectors, > Flink ML, Table Store etc. > 4. Our current CI labeling works on the Jir

[DISCUSS] Issue tracking workflow

2022-10-23 Thread Xintong Song
Hi devs and users, As many of you may have already noticed, Infra announced that they will soon disable public Jira account signups [1]. That means, in order for someone who is not yet a Jira user to open or comment on an issue, he/she has to first reach out to a PMC member to create an account fo

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

2022-10-13 Thread Xintong Song
>> operator (usually considered as "numRecordsOut" of tasks). >>>>> > > The original issue was that the numRecordsOut of the sink counted >>>>> both (which is completely wrong). >>>>> > > >>>>> > > A new met

Re: Is Flink 1.15.3 planned in foreseeable future?

2022-10-13 Thread Xintong Song
Actually, this is an on-going discussion related to 1.15.3. The community discovered a breaking change in 1.15.x and is discussing how to resolve this right now [1]. There is very likely a 1.15.3 release after this is resolved. Best, Xintong [1] https://lists.apache.org/thread/vxhty3q97s7pw2zn0

Re: Job Manager getting restarted while restarting task manager

2022-10-12 Thread Xintong Song
g TaskManagers won't make the JobManager restart. You can > provide the whole log as an attachment to investigate. > > On Wed, 12 Oct 2022 at 6:01 PM, Puneet Duggal > wrote: > >> Hi Xintong Song, >> >> Thanks for your immediate reply. Yes, I do restart task man

Re: Job Manager getting restarted while restarting task manager

2022-10-11 Thread Xintong Song
The log shows that the jobmanager received a SIGTERM signal from external. Depending on how you deploy Flink, that could be a 'kill ' command, or a kubernetes pod removal / eviction, etc. You may want to check where the signal came from. Best, Xintong On Wed, Oct 12, 2022 at 6:26 AM Puneet Dug

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

2022-10-10 Thread Xintong Song
+1 for reverting these changes in Flink 1.16. For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric

Re: Flink TaskManager memory configuration failed

2022-06-22 Thread Xintong Song
512mb is just too small for a TaskManager. You would need to either increase it, or decrease the other memory components (which currently use default values). The 64mb Total Flink Memory comes from the 512mb Total Process Memory minus 192mb minimum JVM Overhead and 256mb default JVM Metaspace. Be
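The arithmetic in the reply above can be checked with a short sketch. This is only an illustration, assuming the defaults mentioned in the message (256mb JVM Metaspace, 192mb minimum JVM Overhead) plus an overhead fraction of 0.1 and a 1gb overhead maximum; the function name `total_flink_memory_mb` is hypothetical and not part of Flink.

```python
def total_flink_memory_mb(process_mb, metaspace_mb=256,
                          overhead_fraction=0.1,
                          overhead_min_mb=192, overhead_max_mb=1024):
    """Rough estimate of Total Flink Memory derived from Total Process Memory.

    Assumption: JVM Overhead is a fraction of Total Process Memory,
    clamped to the [min, max] range; the defaults here are assumed values.
    """
    overhead_mb = min(max(process_mb * overhead_fraction, overhead_min_mb),
                      overhead_max_mb)
    return process_mb - metaspace_mb - overhead_mb


# 512mb process memory: 10% would be 51.2mb, so the 192mb minimum applies.
print(total_flink_memory_mb(512))  # 512 - 256 - 192
```

With only 64mb left for heap, managed, and network memory, the configuration fails, which is why the reply suggests raising the process size or lowering the other components.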

[ANNOUNCE] Welcome to join the Apache Flink community on Slack

2022-06-02 Thread Xintong Song
Hi everyone, I'm very happy to announce that the Apache Flink community has created a dedicated Slack workspace [1]. Welcome to join us on Slack. ## Join the Slack workspace You can join the Slack workspace by either of the following two ways: 1. Click the invitation link posted on the project w

Re: [Discuss] Creating an Apache Flink slack workspace

2022-05-11 Thread Xintong Song
iative is more about making communication more efficient, rather than making information easier to find. Thank you~ Xintong Song On Wed, May 11, 2022 at 5:39 PM Konstantin Knauf wrote: > I don't think we can maintain two additional channels. Some people have > already concerns about

Re: [Discuss] Creating an Apache Flink slack workspace

2022-05-10 Thread Xintong Song
I'm not very familiar with Discourse or Reddit. My impression is that they are not as easy to set up and maintain as Slack. Thank you~ Xintong Song [1] https://asktug.com/ On Tue, May 10, 2022 at 4:50 PM Konstantin Knauf wrote: > Thanks for starting this discussion again. I am pretty

Re: [Discuss] Creating an Apache Flink slack workspace

2022-05-07 Thread Xintong Song
he global English-speaking community. Concerning Stack Overflow, it definitely deserves more attention from the community. Thanks for the suggestion / reminder, Piotr & David. I think Slack and Stack Overflow are probably not mutually exclusive. Thank you~ Xintong Song [1] https://zapier.com/ On Sat

Fwd: [Discuss] Creating an Apache Flink slack workspace

2022-05-06 Thread Xintong Song
Thank you~ Xintong Song -- Forwarded message - From: Xintong Song Date: Fri, May 6, 2022 at 5:07 PM Subject: Re: [Discuss] Creating an Apache Flink slack workspace To: private Cc: Chesnay Schepler Hi Chesnay, Correct me if I'm wrong, I don't find this is *

Re: Missing metrics in Flink v 1.15.0 rc-0

2022-04-06 Thread Xintong Song
ading / sink writing data from / to external systems, are not counted. In your case, there's only 1 vertex in the DAG, thus no internal data exchanges. Thank you~ Xintong Song On Wed, Apr 6, 2022 at 11:21 PM Peter Schrott wrote: > Hi there, > > I just successfully upgraded our Flink

Re: TM OOMKilled

2022-02-15 Thread Xintong Song
ix the problem. If the problem is not fixed, but the job runs longer before the OOM happens, then it's likely the 3rd case. Moreover, you can monitor the pod memory footprint changes if such metrics are available. Thank you~ Xintong Song On Tue, Feb 15, 2022 at 11:56 PM Alexey Trenikhun w

Re: TM OOMKilled

2022-02-14 Thread Xintong Song
you share what that is for? Thank you~ Xintong Song On Tue, Feb 15, 2022 at 12:10 PM Alexey Trenikhun wrote: > Hello, > We use RocksDB, but there is no problem with Java heap, which is limited > by 3.523gb, the problem with total container memory. The pod is killed > not due OutO

Re: [DISCUSS] Future of Per-Job Mode

2022-01-24 Thread Xintong Song
mode support shipping local dependencies. - I'm not sure about dropping the per-job mode soonish, as many users are still working with it. We'd better not force these users to migrate to the application mode when upgrading the Flink version. Thank you~ Xintong Song On Fri, Jan 21,

Re: Flink native k8s integration vs. operator

2022-01-13 Thread Xintong Song
Thanks for volunteering to drive this effort, Marton, Thomas and Gyula. Looking forward to the public discussion. Please feel free to reach out if there's anything you need from us. Thank you~ Xintong Song On Fri, Jan 14, 2022 at 8:27 AM Chenya Zhang wrote: > Thanks Thomas, Gy

Re: Flink native k8s integration vs. operator

2022-01-06 Thread Xintong Song
f deployment's resource requirements. In this way, users are free to choose between active and reactive (e.g., HPA) rescaling, while always benefiting from the beyond-deployment lifecycle (upgrades, savepoint management, etc.) and alignment with the K8s ecosystem (Flink client free, operating via

Re: How to handle java.lang.OutOfMemoryError: Metaspace

2021-12-26 Thread Xintong Song
`taskmanager.numberOfTaskSlots`. If you have multiple jobs submitted to a shared Flink cluster, decreasing the number of slots in a task manager should also reduce the amount of classes loaded by the JVM, thus requiring less metaspace. Thank you~ Xintong Song On Mon, Dec 27, 2021 at 9:08 AM John Smith

Re: [DISCUSS] Changing the minimal supported version of Hadoop

2021-12-21 Thread Xintong Song
o longer support hadoop versions < 2.8 at all. And if that is not permitted by our users, we may consider to keep the codebase as is and wait for a bit longer. WDYT? Thank you~ Xintong Song [1] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_co

Re: Direct buffer memory in job with hbase client

2021-12-15 Thread Xintong Song
job needs, which probably depends on your hbase client configurations. Thank you~ Xintong Song On Wed, Dec 15, 2021 at 1:40 PM Anton wrote: > Hi, from time to time my job is stopping to process messages with warn > message listed below. Tried to increase jobmanager.memory.process.si

Re: High Availability on Kubernetes

2021-10-25 Thread Xintong Song
only pod evictions, but also other problems (jvm out-of-memory, remote storage connection downtime, etc.). Thank you~ Xintong Song On Tue, Oct 26, 2021 at 7:39 AM Deshpande, Omkar wrote: > Hello, > > We are running flink on Kubernetes(Standalone) in application cluster > mode. The

[ANNOUNCE] Release 1.14.0, release candidate #0

2021-08-29 Thread Xintong Song
Hi everyone, The RC0 for Apache Flink 1.14.0 has been created. This is still a preview-only release candidate to drive the current testing efforts and so no official votes will take place. It has all the artifacts that we would typically have for a release, except for the release note and the webs

Re: [ANNOUNCE] Apache Flink 1.13.2 released

2021-08-09 Thread Xintong Song
Thanks Yun and everyone~! Thank you~ Xintong Song On Mon, Aug 9, 2021 at 10:14 PM Till Rohrmann wrote: > Thanks Yun Tang for being our release manager and the great work! Also > thanks a lot to everyone who contributed to this release. > > Cheers, > Till > > On Mon, A

Re: Memory usage UI

2021-07-01 Thread Xintong Song
erhead.[min|max|fraction]'). That helps reserve more native memory in the Kubernetes pod. Thank you~ Xintong Song On Fri, Jul 2, 2021 at 11:51 AM Sudharsan R wrote: > Hi Xintong, > Thanks very much for the response. Let me check out the new UI on flink > 1.12. > > The reason I

Re: Memory usage UI

2021-07-01 Thread Xintong Song
very often lead to confusions. Since Flink-1.12, we have introduced a new web ui for the memory metrics, where the legacy metrics are preserved only for backward compatibility and are placed in an `Advanced` pane. I'd recommend ignoring them in 99% of the cases. Thank you~ Xintong Song

Re: Flink v1.12.2 Kubernetes Session Mode cannot mount log4j.properties in configMap

2021-06-21 Thread Xintong Song
ies in a session cluster [3]. Please be aware that in standalone Kubernetes deployment, Flink looks for log4j-console.properties instead of log4j.properties. By default, this will write the logs to stdout, so that the logs can be viewed by the `kubectl logs` command. Thank you~ Xintong Song [

Re: Resource Planning

2021-06-15 Thread Xintong Song
Hi Thomas, It would be helpful if you can provide the jobmanager/taskmanager logs, and gc logs if possible. Additionally, you may consider to monitor the cpu/memory related metrics [1], see if there's anything abnormal when the problem is observed. Thank you~ Xintong Song [1]

Re: Add control mode for flink

2021-06-08 Thread Xintong Song
modeled as a special case of general control messages. - Watermarks are probably similar to the other control messages. However, it's already exposed to users as public APIs. If we want to migrate it to the new control flow, we'd be very careful not to break any compatibility. Thank you~

Re: Re: Add control mode for flink

2021-06-07 Thread Xintong Song
events from JobMaster 3. Consume control events from arbitrary operators downstream where the events are produced Thank you~ Xintong Song On Tue, Jun 8, 2021 at 1:37 PM Yun Gao wrote: > Very thanks Jiangang for bringing this up and very thanks for the > discussion! > >

Re: Add control mode for flink

2021-06-06 Thread Xintong Song
ntrolling feature, but potentially other future features as well. - AFAICS, it's non-trivial to make a 3rd-party dynamic configuration framework work together with Flink's consistency mechanism. Thank you~ Xintong Song On Mon, Jun 7, 2021 at 11:05 AM 刘建刚 wrote: > Thank you for t

Re: In native k8s application mode, how can I know whether the job is failed or finished?

2021-06-03 Thread Xintong Song
session cluster. Thus, status of historical jobs can be accessed via the JM. 2. You can try setting up a history server [1], where information of finished jobs can be archived. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/advanced

Re: yarn ship from s3

2021-05-26 Thread Xintong Song
straightforward. Unfortunately, these efforts are still in progress, and have more or less stalled recently. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-20681 [2] https://issues.apache.org/jira/browse/FLINK-20811 [3] https://issues.apache.org/jira/browse/FLINK-20867 On

Re: reactive mode and back pressure

2021-05-17 Thread Xintong Song
Yes, it does. Internally, each re-scheduling is performed by stopping and resuming the job, similar to a failover. Without checkpoints, the job will always restore from the very beginning. Thank you~ Xintong Song On Mon, May 17, 2021 at 2:54 PM Alexey Trenikhun wrote: > Hi Xintong, >

Re: The heartbeat of JobManager timed out

2021-05-16 Thread Xintong Song
s. This is usually observed for large scale jobs (in terms of number of vertices and parallelism). In that case, we would have to increase the heartbeat timeout. Thank you~ Xintong Song On Mon, May 17, 2021 at 11:12 AM Smile wrote: > JM log shows this: > > INFO org.apache.fli

Re: reactive mode and back pressure

2021-05-16 Thread Xintong Song
work with both the default and the new reactive modes. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/state/checkpoints/#unaligned-checkpoints On Fri, May 14, 2021 at 11:29 PM Alexey Trenikhun wrote: > Hello, > > Is new rea

Re: How does JobManager terminate dangling task manager

2021-05-13 Thread Xintong Song
eed by the checkpointing mechanism. The new task does not resume from the exact position where the old task is stopped. Instead, it resumes from the last successful checkpoint. Thank you~ Xintong Song On Thu, May 13, 2021 at 5:38 PM Guowei Ma wrote: > Hi, > In fact, not only JobManager

Re: Question regarding cpu limit config in Flink standalone mode

2021-05-06 Thread Xintong Song
guration option `kubernetes.taskmanager.cpu` controls the cpu resource of pods Flink requests from Kubernetes. Thank you~ Xintong Song On Fri, May 7, 2021 at 10:35 AM Fan Xie wrote: > Hi Flink Community, > > Recently I am working on an auto-scaling project that needs to dynamically > adjust the cpu

Re: [ANNOUNCE] Apache Flink 1.13.0 released

2021-05-05 Thread Xintong Song
Thanks Dawid & Guowei as the release managers, and everyone who has contributed to this release. Thank you~ Xintong Song On Thu, May 6, 2021 at 9:51 AM Leonard Xu wrote: > Thanks Dawid & Guowei for the great work, thanks everyone involved. > > Best, > Leonard > &

Re: [ANNOUNCE] Flink Jira Bot fully live (& Useful Filters to Work with the Bot)

2021-04-22 Thread Xintong Song
Thanks for driving this, Konstantin. Great job~! Thank you~ Xintong Song On Thu, Apr 22, 2021 at 11:57 PM Matthias Pohl wrote: > Thanks for setting this up, Konstantin. +1 > > On Thu, Apr 22, 2021 at 11:16 AM Konstantin Knauf > wrote: > >> Hi everyone, >> >&

Re: Clarification about Flink's managed memory and metric monitoring

2021-04-13 Thread Xintong Song
These metrics should also be available via REST. You can check the original design doc [1] for which metrics the UI is using. Thank you~ Xintong Song [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager On Tue, Apr 13, 2021 at 9:08 PM Alexis Sarda

Re: Clarification about Flink's managed memory and metric monitoring

2021-04-13 Thread Xintong Song
-XX:MaxDirectMemorySize' and is not controlled by the garbage collectors. Thank you~ Xintong Song On Tue, Apr 13, 2021 at 7:53 PM Alexis Sarda-Espinosa < alexis.sarda-espin...@microfocus.com> wrote: > Hello, > > > > I have a Flink TM configured with taskmanager.memory.mana

Re: [BULK]Re: [SURVEY] Remove Mesos support

2021-03-28 Thread Xintong Song
+1 It's already a matter of fact for a while that we no longer port new features to the Mesos deployment. Thank you~ Xintong Song On Fri, Mar 26, 2021 at 10:37 PM Till Rohrmann wrote: > +1 for officially deprecating this component for the 1.13 release. > > Cheers, > Till &

Re: Evenly Spreading Out Source Tasks

2021-03-15 Thread Xintong Song
If all the tasks have the same parallelism 36, your job should only allocate 36 slots. The evenly-spread-out-slots option should help in your case. Is it possible for you to share the complete jobmanager logs? Thank you~ Xintong Song On Tue, Mar 16, 2021 at 12:46 AM Aeden Jameson wrote

Re: Evenly Spreading Out Source Tasks

2021-03-14 Thread Xintong Song
s containing a subtask of it, and there's no guarantee which 36 out of the 54 contain it. Thank you~ Xintong Song On Mon, Mar 15, 2021 at 3:54 AM Chesnay Schepler wrote: > Is this a brand-new job, with the cluster having all 18 TMs at the time > of submission? (or did you add more TMs

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2021-03-07 Thread Xintong Song
Hi Hemant, I don't see any problem in your settings. Any exceptions suggesting why TM containers are not coming up? Thank you~ Xintong Song On Sat, Mar 6, 2021 at 3:53 PM bat man wrote: > Hi Xintong Song, > I tried using the java options to generate heap dump referring to docs[1]

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2021-03-05 Thread Xintong Song
memory leak is suspected, to further understand where the memory is consumed, you may need to dump the heap on OOMs and look for unexpected memory usage with profiling tools. Thank you~ Xintong Song [1] https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/memleaks002.html

Re: Scaling Higher than 10k Nodes

2021-03-04 Thread Xintong Song
cess, such as tremendous memory consumption, busy rpc main thread, etc. To make that case work, we did many optimizations on our internal flink version, which we are trying to contribute to the community. See FLINK-21110 [1] for the details. Thank you~ Xintong Song [1] https://issues.apache.org/j

Re: Flink problem

2021-02-19 Thread Xintong Song
What you're looking for might be Session Window[1]. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#session-windows On Fri, Feb 19, 2021 at 7:35 PM ゞ野蠻遊戲χ wrote: > hi all > > For example, if A

Re: Memory usage increases on every job restart resulting in eventual OOMKill

2021-02-02 Thread Xintong Song
0 bytes > INFO [] - Network: 128.000mb (134217730 bytes) > INFO [] - JVM Metaspace: 256.000mb (268435456 bytes) > INFO [] - JVM Overhead: 192.000mb (201326592 bytes) Thank you~ Xintong Song On Tue, Feb 2, 2021

Re: Memory usage increases on every job restart resulting in eventual OOMKill

2021-02-02 Thread Xintong Song
Hi Randal, The image is too blurred to be clearly seen. I have a few questions. - IIUC, you are using the standalone K8s deployment [1], not the native K8s deployment [2]. Could you confirm that? - How is the memory measured? Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-31 Thread Xintong Song
re of any issue related to the upgrading of the ZK version that may cause the leadership loss. Thank you~ Xintong Song On Sun, Jan 31, 2021 at 4:14 AM Colletta, Edward wrote: > “but I'm not aware of any similar issue reported since the upgrading” > > For the record, we experienced th

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-29 Thread Xintong Song
Thank you~ Xintong Song On Sat, Jan 30, 2021 at 8:27 AM Xintong Song wrote: > There's indeed a ZK version upgrading during 1.9 and 1.11, but I'm not > aware of any similar issue reported since the upgrading. > I would suggest the following: > - Turn on the DEBUG

[ANNOUNCE] Apache Flink 1.10.3 released

2021-01-28 Thread Xintong Song
Jira: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12348668 We would like to thank all contributors of the Apache Flink community who made this release possible! Regards, Xintong Song

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-28 Thread Xintong Song
r complain about timeout, and there's no gc issue spotted, I would consider a network instability. Thank you~ Xintong Song On Fri, Jan 29, 2021 at 3:15 AM Lu Niu wrote: > After checking the log I found the root cause is zk client timeout on TM: > ``` > 2021

[ANNOUNCE] Apache Flink 1.12.1 released

2021-01-18 Thread Xintong Song
The Apache Flink community is very happy to announce the release of Apache Flink 1.12.1, which is the first bugfix release for the Apache Flink 1.12 series. Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming a

Re: Resource changed on src filesystem after upgrade

2021-01-18 Thread Xintong Song
ed as `yarn.ship-files`, `yarn.ship-archives` or `yarn.provided.lib.dirs`? This helps us to locate the code path that this file went through. Thank you~ Xintong Song On Sun, Jan 17, 2021 at 10:32 PM Mark Davis wrote: > Hi all, > I am upgrading my DataSet jobs from Flink 1.8 to 1.12. > Aft

Re: How does Flink handle shorted lived keyed streams

2020-12-24 Thread Xintong Song
I believe what you are looking for is the State TTL [1][2]. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#state-time-to-live-ttl [2] https://ci.apache.org/projects/flink/flink-docs-stabledev/table/config.html#table-exec-state

[ANNOUNCE] Apache Flink 1.11.3 released

2020-12-18 Thread Xintong Song
The Apache Flink community is very happy to announce the release of Apache Flink 1.11.3, which is the third bugfix release for the Apache Flink 1.11 series. Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming a

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-17 Thread Xintong Song
I'm not aware of any significant changes to the HA components between 1.9/1.11. Would you mind sharing the complete jobmanager/taskmanager logs? Thank you~ Xintong Song On Fri, Dec 18, 2020 at 8:53 AM Lu Niu wrote: > Hi, Xintong > > Thanks for replying and your suggestion. I

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-16 Thread Xintong Song
into the ZooKeeper logs checking why RM's leadership is revoked. Thank you~ Xintong Song On Thu, Dec 17, 2020 at 8:42 AM Lu Niu wrote: > Hi, Flink users > > Recently we migrated to flink 1.11 and see exceptions like: > ``` > 2020-12-

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Xintong Song
FYI, I've opened FLINK-20503 for this. https://issues.apache.org/jira/browse/FLINK-20503 Thank you~ Xintong Song On Mon, Dec 7, 2020 at 11:10 AM Xintong Song wrote: > I forgot to mention that it is designed that task managers always have > `Double#MAX_VALUE` cpu cores in loca
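For reference, the value in the subject line, 1.7976931348623157E308, is the largest finite 64-bit floating-point number, which is what Java's `Double.MAX_VALUE` holds. A quick check, written in Python purely for illustration (Flink itself is Java, and nothing here is Flink code):

```python
import sys

# Largest finite IEEE 754 double-precision value; Java's Double.MAX_VALUE
# is the same number, which is why it shows up as taskmanager.cpu.cores.
max_double = sys.float_info.max

# The literal from the thread subject parses to exactly this value.
reported = float("1.7976931348623157E308")
print(max_double == reported)
```

So the reported value is not a misread metric but a placeholder meaning "unlimited" cpu cores.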

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Xintong Song
sers. Will fire an issue on that. Thank you~ Xintong Song On Mon, Dec 7, 2020 at 11:03 AM Xintong Song wrote: > Hi Rex, > > We're running this in a local environment so that may be contributing to >> what we're seeing. >> > Just to double check on this. By `

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Xintong Song
requested in such cases. - kubernetes.jobmanager.cpu - kubernetes.taskmanager.cpu - yarn.appmaster.vcores - yarn.containers.vcores - mesos.resourcemanager.tasks.cpus Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memo

Re: Flink AutoScaling EMR

2020-11-15 Thread Xintong Song
on the decommissioning node will be killed. Thank you~ Xintong Song On Fri, Nov 13, 2020 at 2:57 PM Robert Metzger wrote: > Hi, > it seems that YARN has a feature for targeting specific hardware: > https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.htm

Re: Insufficient number of network buffers for simple last_value aggregate

2020-10-30 Thread Xintong Song
Hi Schneider, The error message suggests that your task managers are not configured with enough network memory. You would need to increase the network memory configuration. See this doc [1] for more details. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-29 Thread Xintong Song
, you might want to look into this comment [1] in FLINK-18712. - If neither of the above actions helps, we might need to leverage tools (e.g., JVM NMT [2]) to track the native memory usages and see where exactly the leak comes from. Thank you~ Xintong Song [1] https://issues.apache.org/jira/b

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-29 Thread Xintong Song
upgrade to 1.10.2, to include the latest bug fixes on the 1.10 release. Thank you~ Xintong Song On Thu, Oct 29, 2020 at 4:41 PM Ori Popowski wrote: > Hi, > > PID 20331 is indeed the Flink process, specifically the TaskManager > process. > > - Workload is a streaming workload

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-29 Thread Xintong Song
can also try increasing the `jvm-overhead`, simply to leave more native memory in the container in case there are other significant native memory usages. Thank you~ Xintong Song On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski wrote: > Hi Xintong, > > See here: > > # Top me

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-28 Thread Xintong Song
n the `top` command - Look into the `/proc/meminfo` file - Any container memory usage metrics that are available to your Yarn cluster Thank you~ Xintong Song On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski wrote: > After the job is running for 10 days in production, TaskManagers start > f

Re: [SURVEY] Remove Mesos support

2020-10-26 Thread Xintong Song
early next month. It would be greatly appreciated if you folks, as experienced Flink on Mesos users, can help with verifying the release candidates. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-17402?jql=project%20%3D%20FLINK%20AND%20component%20%3D%20%22Deployment%20%2F

Re: [SURVEY] Remove Mesos support

2020-10-25 Thread Xintong Song
resource management improvements may not be ported to Mesos), while keeping other components up-to-date (e.g., improvements from programming APIs, operators, state backends, etc.)? Thank you~ Xintong Song On Sat, Oct 24, 2020 at 2:48 AM Lasse Nedergaard < lassenedergaardfl...@gmail.com> wrote:

Re: Trying to run Flink tests

2020-10-23 Thread Xintong Song
think it should be fine. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-19665 On Sat, Oct 24, 2020 at 5:56 AM Dan Hill wrote: > Changing down to maven 3.2 shows an error. It seems like I'm hitting > flaky tests. I hit one error and then a different error

Re: [SURVEY] Remove Mesos support

2020-10-23 Thread Xintong Song
oices definitely matter a lot for this community. Either way, it would be good to draw users attention to this discussion early. Thank you~ Xintong Song On Fri, Oct 23, 2020 at 7:53 PM Konstantin Knauf wrote: > Hi Robert, > > +1 to the plan you outlined. If we were to drop support in F

Re: Trying to run Flink tests

2020-10-22 Thread Xintong Song
3.6.3. I'm not sure whether the maven version is related, but maybe you can try it out with 3.2.5. And if that turns out to work, we may file an issue with the Apache Maven community. Thank you~ Xintong Song On Thu, Oct 22, 2020 at 12:31 PM Dan Hill wrote: > 1) I don't see anything use

Re: Trying to run Flink tests

2020-10-21 Thread Xintong Song
n logs. - Quick question: which PR are you working on? By any chance you called `System.exit()` in your codes? Thank you~ Xintong Song On Thu, Oct 22, 2020 at 5:59 AM Dan Hill wrote: > Sure, here's a link > <https://drive.google.com/file/d/13Q7h77zG-2vp7gJOke8QAzLtKLKIPuTf/view?usp=sh

Re: Trying to run Flink tests

2020-10-21 Thread Xintong Song
Would you be able to share the complete maven logs and the command? And what is the maven version? Thank you~ Xintong Song On Wed, Oct 21, 2020 at 1:37 AM Dan Hill wrote: > Hi Xintong! > > No changes. I tried -X and no additional log information is logged. > -DfailIfNoTests=fa

Re: Trying to run Flink tests

2020-10-20 Thread Xintong Song
intended to execute the tests locally, you can try the following actions. I'm not sure whether that helps though. - Try to add '-DfailIfNoTests=false' to your maven command. - Execute the maven command with '-X' to print all the debug logs. Thank you~ Xintong Song On Tu

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
No worries :) Thank you~ Xintong Song On Mon, Oct 12, 2020 at 2:48 PM Paul Lam wrote: > Sorry for the misspelled name, Xintong > > Best, > Paul Lam > > 2020年10月12日 14:46,Paul Lam 写道: > > Hi Xingtong, > > Thanks a lot for the pointer! > > It’s good to

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
FYI, I just created FLINK-19568 for tracking this issue. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-19568 On Mon, Oct 12, 2020 at 2:18 PM Xintong Song wrote: > Hi Paul, > > Thanks for reporting this. > > Indeed, Flink's RM currently p

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
al states in the rpc main thread. With FLINK-19241, this can be achieved easily by delegating the work to the io executor. Thank you~ Xintong Song On Mon, Oct 12, 2020 at 12:44 PM Paul Lam wrote: > Hi, > > After FLINK-13184 is implemented (even with Flink 1.11), occasionally >

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-24 Thread Xintong Song
g released, see if we can do something about it. Thank you~ Xintong Song On Thu, Sep 24, 2020 at 6:35 PM Claude M wrote: > I have 35 task managers, 1 slot on each. I'm running a total of 7 jobs in > the cluster. All the slots are occupied. When you say that

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-23 Thread Xintong Song
] and build your custom image (from the 1.0.2 image and replace the flink distribution with the one you built). Thank you~ Xintong Song [1] https://github.com/apache/flink/tree/release-1.10 [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/flinkDev/building.html On Wed, S

Re: Debugging "Container is running beyond physical memory limits" on YARN for a long running streaming job

2020-09-22 Thread Xintong Song
that fixes your problem. Given that it could take weeks to reproduce your problem, I would suggest to keep track of the native memory usage with jemalloc and jeprof. This should provide direct information about which component is using extra memory. Thank you~ Xintong Song On Tue, Sep 22

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-21 Thread Xintong Song
Thanks for the input, Brian. This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches the fact that this problem occurred in 1.10.2. Maybe Claude can further confirm it. Thank you~ Xintong Song On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian wrote: > Hi Xintong and Cla

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-21 Thread Xintong Song
dump, we can look into it later. Thank you~ Xintong Song On Mon, Sep 21, 2020 at 9:37 PM Claude M wrote: > Hi Xintong, > > Thanks for your reply. Here is the command output w/ the java.opts: > > /usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC > -XX:+Hea

Re: Debugging "Container is running beyond physical memory limits" on YARN for a long running streaming job

2020-09-20 Thread Xintong Song
t trust Flink's "Non-Heap" metrics. It is practically unhelpful and misleading. The "Non-Heap" metric accounts for SOME of the non-heap memory usage, but NOT ALL of it. The community is working on a new set of metrics and Web UI for the task manager memory tuning. Thank you~ Xinton

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-20 Thread Xintong Song
do. > - Which of Flink's Kubernetes deployments are you using? Standalone or native Kubernetes? - Which cluster mode are you using? Job cluster, session cluster, or the application mode? Thank you~ Xintong Song On Sat, Sep 19, 2020 at 1:22 AM Claude M wrote: > Hello, > > I upgrad

Re: [DISCUSS] FLIP-144: Native Kubernetes HA for Flink

2020-09-16 Thread Xintong Song
rs can write/remove the stored object. What if the previous owner failed to release the lock (e.g., dead before releasing)? Would there be any problem? ## HA storage > HA data clean up If the ConfigMap is destroyed on `kubectl delete deploy `, how is the HA data retained? Thank you~ Xintong So

Re: Use of slot sharing groups causing workflow to hang

2020-09-09 Thread Xintong Song
, thus separating the pipeline into several slot sharing groups will not bring any benefit. If you are just trying out with the slot sharing groups or preparing for later deploying the execution to a distributed cluster, then there should be no problem. Thank you~ Xintong Song On Thu, Sep 10, 20

Re: runtime memory management

2020-08-31 Thread Xintong Song
] the cluster to allocate slots evenly across task managers. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html#tasks-and-operator-chains [2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/internals/job_scheduling

Re: runtime memory management

2020-08-30 Thread Xintong Song
you~ Xintong Song On Mon, Aug 31, 2020 at 1:33 PM lec ssmi wrote: > Hi: > Generally speaking, when we submit a Flink program, the number of > task managers and the memory of each TM are specified. And the smallest > real execution unit of Flink should be the operator.

Re: [ANNOUNCE] New PMC member: Dian Fu

2020-08-27 Thread Xintong Song
Congratulations Dian~! Thank you~ Xintong Song On Thu, Aug 27, 2020 at 7:42 PM Jark Wu wrote: > Congratulations Dian! > > Best, > Jark > > On Thu, 27 Aug 2020 at 19:37, Leonard Xu wrote: > > > Congrats, Dian! Well deserved. > > > > Best > > Le

Re: OOM error for heap state backend.

2020-08-23 Thread Xintong Song
Hi Vishwas, According to the log, heap space is 13+GB, which looks fine. Several reasons might lead to a heap space OOM: - Memory leak - Not enough GC threads - Concurrent GC starts too late - ... I would suggest taking a look at the GC logs. Thank you~ Xintong Song On Fri

Re: Hostname for taskmanagers when running in docker

2020-08-13 Thread Xintong Song
ption `taskmanager.host` for your task managers, see if that is reflected in the metrics. Thank you~ Xintong Song On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov wrote: > Hello, > > After upgrading the docker image for flink to 1.11.1 from 1.9 the hostname > of the taskmanagers reported to

Re: Flink CPU load metrics in K8s

2020-08-12 Thread Xintong Song
I did a simple test on my laptop, launching a docker container with cpu limit configured. Inside the container, I can still see all my machine's cpus. Thank you~ Xintong Song On Wed, Aug 12, 2020 at 1:19 AM Bajaj, Abhinav wrote: > Hi, > > > > Reaching out to folks running Fl
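
A rough sketch of the kind of check described in the snippet above, not the test that was actually run: it compares the CPU count the process can see with the limit the container runtime records in the cgroup. It assumes a Linux container using cgroup v2, where the limit lives in `/sys/fs/cgroup/cpu.max`; the helper names `parse_cgroup_v2_cpu_max` and `container_cpu_limit` are made up for illustration.

```python
import os
from pathlib import Path
from typing import Optional


def parse_cgroup_v2_cpu_max(text: str) -> Optional[float]:
    """Parse the contents of a cgroup v2 cpu.max file.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set. Returns the limit in CPUs, or None if unlimited.
    """
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None
    quota, period = int(parts[0]), int(parts[1])
    return quota / period


def container_cpu_limit(path: str = "/sys/fs/cgroup/cpu.max") -> Optional[float]:
    """Read the CPU limit from the cgroup v2 file (assumed path), if present."""
    p = Path(path)
    if not p.is_file():
        return None  # not cgroup v2, or not running on Linux
    return parse_cgroup_v2_cpu_max(p.read_text())


if __name__ == "__main__":
    # os.cpu_count() typically reports the host's CPUs, even inside a
    # container that was started with a CPU limit.
    print("CPUs visible to the process:", os.cpu_count())
    print("cgroup CPU limit:", container_cpu_limit())
```

Run inside a container started with a CPU limit, the two printed values would be expected to differ, which is consistent with the observation that the container still sees all of the machine's CPUs.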
