Re: Flink CPU load metrics in K8s

2020-08-12 Thread Xintong Song
I did a simple test on my laptop, launching a docker container with cpu limit configured. Inside the container, I can still see all my machine's cpus. Thank you~ Xintong Song On Wed, Aug 12, 2020 at 1:19 AM Bajaj, Abhinav wrote: > Hi, > > > > Reaching out to folks running Fl
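The observation in the reply above fits how container CPU limits usually work: the limit is enforced as a scheduler quota and does not hide cores, so core-counting calls still report the host's CPUs. Below is a minimal Python sketch, assuming a cgroup v2 `cpu.max` value of the form `<quota> <period>`. The helper name `effective_cpus` is hypothetical, and cgroup v1 setups expose the quota through different files.

```python
from typing import Optional


def effective_cpus(cpu_max: str) -> Optional[float]:
    """Parse a cgroup v2 cpu.max value ("<quota> <period>" or "max <period>").

    Returns the CPU limit in cores, or None when no limit is set.
    """
    quota, period = cpu_max.split()
    if quota == "max":  # "max" means the container is not CPU-limited
        return None
    return int(quota) / int(period)


# Illustrative input: a quota of 200000us per 100000us period is 2 cores.
print(effective_cpus("200000 100000"))
```

Inside a real container the input string would come from the cgroup filesystem; the exact path depends on the host setup and is not assumed here.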

Re: Hostname for taskmanagers when running in docker

2020-08-13 Thread Xintong Song
ption `taskmanager.host` for your task managers, see if that is reflected in the metrics. Thank you~ Xintong Song On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov wrote: > Hello, > > After upgrading the docker image for flink to 1.11.1 from 1.9 the hostname > of the taskmanagers reported to

Re: OOM error for heap state backend.

2020-08-23 Thread Xintong Song
Hi Vishwas, According to the log, heap space is 13+GB, which looks fine. Several reasons might lead to the heap space OOM: - Memory leak - Not enough GC threads - Concurrent GC starts too late - ... I would suggest taking a look at the GC logs. Thank you~ Xintong Song On Fri

Re: [ANNOUNCE] New PMC member: Dian Fu

2020-08-27 Thread Xintong Song
Congratulations Dian~! Thank you~ Xintong Song On Thu, Aug 27, 2020 at 7:42 PM Jark Wu wrote: > Congratulations Dian! > > Best, > Jark > > On Thu, 27 Aug 2020 at 19:37, Leonard Xu wrote: > > > Congrats, Dian! Well deserved. > > > > Best > > Le

Re: runtime memory management

2020-08-30 Thread Xintong Song
you~ Xintong Song On Mon, Aug 31, 2020 at 1:33 PM lec ssmi wrote: > Hi: > Generally speaking, when we submit a Flink program, the number of > taskmanagers and the memory of each TM will be specified. And the smallest > real execution unit of Flink should be the operator.

Re: runtime memory management

2020-08-31 Thread Xintong Song
] the cluster to allocate slots evenly across task managers. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html#tasks-and-operator-chains [2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/internals/job_scheduling

Re: Use of slot sharing groups causing workflow to hang

2020-09-09 Thread Xintong Song
, thus separating the pipeline into several slot sharing groups will not bring any benefit. If you are just trying out with the slot sharing groups or preparing for later deploying the execution to a distributed cluster, then there should be no problem. Thank you~ Xintong Song On Thu, Sep 10, 20

Re: [DISCUSS] FLIP-144: Native Kubernetes HA for Flink

2020-09-16 Thread Xintong Song
rs can write/remove the stored object. What if the previous owner failed to release the lock (e.g., it died before releasing)? Would there be any problem? ## HA storage > HA data clean up If the ConfigMap is destroyed on `kubectl delete deploy `, how is the HA data retained? Thank you~ Xintong So

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-20 Thread Xintong Song
do. > - Which Flink Kubernetes deployment are you using? Standalone or native Kubernetes? - Which cluster mode are you using? Job cluster, session cluster, or application mode? Thank you~ Xintong Song On Sat, Sep 19, 2020 at 1:22 AM Claude M wrote: > Hello, > > I upgrad

Re: Debugging "Container is running beyond physical memory limits" on YARN for a long running streaming job

2020-09-20 Thread Xintong Song
t trust Flink's "Non-Heap" metrics. They are practically unhelpful and misleading. "Non-Heap" accounts for SOME of the non-heap memory usage, but NOT ALL of it. The community is working on a new set of metrics and Web UI for task manager memory tuning. Thank you~ Xinton

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-21 Thread Xintong Song
dump, we can look into it later. Thank you~ Xintong Song On Mon, Sep 21, 2020 at 9:37 PM Claude M wrote: > Hi Xintong, > > Thanks for your reply. Here is the command output w/ the java.opts: > > /usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC > -XX:+Hea

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-21 Thread Xintong Song
Thanks for the input, Brian. This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches this problem occurring in 1.10.2. Maybe Claude can further confirm it. Thank you~ Xintong Song On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian wrote: > Hi Xintong and Cla

Re: Debugging "Container is running beyond physical memory limits" on YARN for a long running streaming job

2020-09-22 Thread Xintong Song
that fixes your problem. Given that it could take weeks to reproduce your problem, I would suggest keeping track of the native memory usage with jemalloc and jeprof. This should provide direct information about which component is using extra memory. Thank you~ Xintong Song On Tue, Sep 22

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-23 Thread Xintong Song
] and build your custom image (start from the 1.10.2 image and replace the Flink distribution with the one you built). Thank you~ Xintong Song [1] https://github.com/apache/flink/tree/release-1.10 [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/flinkDev/building.html On Wed, S

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-24 Thread Xintong Song
g released, see if we can do something about it. Thank you~ Xintong Song On Thu, Sep 24, 2020 at 6:35 PM Claude M wrote: > I have 35 task managers, 1 slot on each. I'm running a total of 7 jobs in > the cluster. All the slots are occupied. When you say that

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
al states in the rpc main thread. With FLINK-19241, this can be achieved easily by delegating the work to the io executor. Thank you~ Xintong Song On Mon, Oct 12, 2020 at 12:44 PM Paul Lam wrote: > Hi, > > After FLINK-13184 is implemented (even with Flink 1.11), occasionally >

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
FYI, I just created FLINK-19568 for tracking this issue. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-19568 On Mon, Oct 12, 2020 at 2:18 PM Xintong Song wrote: > Hi Paul, > > Thanks for reporting this. > > Indeed, Flink's RM currently p

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
No worries :) Thank you~ Xintong Song On Mon, Oct 12, 2020 at 2:48 PM Paul Lam wrote: > Sorry for the misspelled name, Xintong > > Best, > Paul Lam > > On Oct 12, 2020, at 14:46, Paul Lam wrote: > > Hi Xingtong, > > Thanks a lot for the pointer! > > It’s good to

Re: Trying to run Flink tests

2020-10-20 Thread Xintong Song
intended to execute the tests locally, you can try the following actions. I'm not sure whether that helps though. - Try to add '-DfailIfNoTests=false' to your maven command. - Execute the maven command with '-X' to print all the debug logs. Thank you~ Xintong Song On Tu

Re: Trying to run Flink tests

2020-10-21 Thread Xintong Song
Would you be able to share the complete maven logs and the command? And what is the maven version? Thank you~ Xintong Song On Wed, Oct 21, 2020 at 1:37 AM Dan Hill wrote: > Hi Xintong! > > No changes. I tried -X and no additional log information is logged. > -DfailIfNoTests=fa

Re: Trying to run Flink tests

2020-10-21 Thread Xintong Song
n logs. - Quick question: which PR are you working on? Did you by any chance call `System.exit()` in your code? Thank you~ Xintong Song On Thu, Oct 22, 2020 at 5:59 AM Dan Hill wrote: > Sure, here's a link > <https://drive.google.com/file/d/13Q7h77zG-2vp7gJOke8QAzLtKLKIPuTf/view?usp=sh

Re: Trying to run Flink tests

2020-10-22 Thread Xintong Song
3.6.3. I'm not sure whether the maven version is related, but maybe you can try it out with 3.2.5. If that turns out to work, we may file an issue with the Apache Maven community. Thank you~ Xintong Song On Thu, Oct 22, 2020 at 12:31 PM Dan Hill wrote: > 1) I don't see anything use

Re: [SURVEY] Remove Mesos support

2020-10-23 Thread Xintong Song
oices definitely matter a lot for this community. Either way, it would be good to draw users' attention to this discussion early. Thank you~ Xintong Song On Fri, Oct 23, 2020 at 7:53 PM Konstantin Knauf wrote: > Hi Robert, > > +1 to the plan you outlined. If we were to drop support in F

Re: Trying to run Flink tests

2020-10-23 Thread Xintong Song
think it should be fine. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-19665 On Sat, Oct 24, 2020 at 5:56 AM Dan Hill wrote: > Changing down to maven 3.2 shows an error. It seems like I'm hitting > flaky tests. I hit one error and then a different error

Re: [SURVEY] Remove Mesos support

2020-10-25 Thread Xintong Song
resource management improvements may not be ported to Mesos), while keeping other components up-to-date (e.g., improvements from programming APIs, operators, state backends, etc.)? Thank you~ Xintong Song On Sat, Oct 24, 2020 at 2:48 AM Lasse Nedergaard < lassenedergaardfl...@gmail.com> wrote:

Re: [SURVEY] Remove Mesos support

2020-10-26 Thread Xintong Song
early next month. It would be greatly appreciated if you folks, as experienced Flink on Mesos users, could help verify the release candidates. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-17402?jql=project%20%3D%20FLINK%20AND%20component%20%3D%20%22Deployment%20%2F

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-28 Thread Xintong Song
n the `top` command - Look into the `/proc/meminfo` file - Any container memory usage metrics that are available to your Yarn cluster Thank you~ Xintong Song On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski wrote: > After the job is running for 10 days in production, TaskManagers start > f

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-29 Thread Xintong Song
can also try increasing the `jvm-overhead`, simply to leave more native memory in the container in case there are other significant native memory usages. Thank you~ Xintong Song On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski wrote: > Hi Xintong, > > See here: > > # Top me

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-29 Thread Xintong Song
upgrade to 1.10.2, to include the latest bug fixes on the 1.10 release. Thank you~ Xintong Song On Thu, Oct 29, 2020 at 4:41 PM Ori Popowski wrote: > Hi, > > PID 20331 is indeed the Flink process, specifically the TaskManager > process. > > - Workload is a streaming workload

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

2020-10-29 Thread Xintong Song
, you might want to look into this comment [1] in FLINK-18712. - If neither of the above actions helps, we might need to leverage tools (e.g., JVM NMT [2]) to track the native memory usages and see where exactly the leak comes from. Thank you~ Xintong Song [1] https://issues.apache.org/jira/b

Re: Insufficient number of network buffers for simple last_value aggregate

2020-10-30 Thread Xintong Song
Hi Schneider, The error message suggests that your task managers are not configured with enough network memory. You would need to increase the network memory configuration. See this doc [1] for more details. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release

Re: Flink AutoScaling EMR

2020-11-15 Thread Xintong Song
on the decommissioning node will be killed. Thank you~ Xintong Song On Fri, Nov 13, 2020 at 2:57 PM Robert Metzger wrote: > Hi, > it seems that YARN has a feature for targeting specific hardware: > https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.htm

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Xintong Song
requested in such cases. - kubernetes.jobmanager.cpu - kubernetes.taskmanager.cpu - yarn.appmaster.vcores - yarn.containers.vcores - mesos.resourcemanager.tasks.cpus Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memo

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Xintong Song
sers. Will file an issue for that. Thank you~ Xintong Song On Mon, Dec 7, 2020 at 11:03 AM Xintong Song wrote: > Hi Rex, > > We're running this in a local environment so that may be contributing to >> what we're seeing. >> > Just to double check on this. By `

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Xintong Song
FYI, I've opened FLINK-20503 for this. https://issues.apache.org/jira/browse/FLINK-20503 Thank you~ Xintong Song On Mon, Dec 7, 2020 at 11:10 AM Xintong Song wrote: > I forgot to mention that it is designed that task managers always have > `Double#MAX_VALUE` cpu cores in loca

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-16 Thread Xintong Song
into the ZooKeeper logs checking why RM's leadership is revoked. Thank you~ Xintong Song On Thu, Dec 17, 2020 at 8:42 AM Lu Niu wrote: > Hi, Flink users > > Recently we migrated to flink 1.11 and see exceptions like: > ``` > 2020-12-

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-17 Thread Xintong Song
I'm not aware of any significant changes to the HA components between 1.9/1.11. Would you mind sharing the complete jobmanager/taskmanager logs? Thank you~ Xintong Song On Fri, Dec 18, 2020 at 8:53 AM Lu Niu wrote: > Hi, Xintong > > Thanks for replying and your suggestion. I

[ANNOUNCE] Apache Flink 1.11.3 released

2020-12-18 Thread Xintong Song
The Apache Flink community is very happy to announce the release of Apache Flink 1.11.3, which is the third bugfix release for the Apache Flink 1.11 series. Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming a

Re: How does Flink handle shorted lived keyed streams

2020-12-24 Thread Xintong Song
I believe what you are looking for is the State TTL [1][2]. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#state-time-to-live-ttl [2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/config.html#table-exec-state

Re: Resource changed on src filesystem after upgrade

2021-01-18 Thread Xintong Song
ed as `yarn.ship-files`, `yarn.ship-archives` or `yarn.provided.lib.dirs`? This helps us to locate the code path that this file went through. Thank you~ Xintong Song On Sun, Jan 17, 2021 at 10:32 PM Mark Davis wrote: > Hi all, > I am upgrading my DataSet jobs from Flink 1.8 to 1.12. > Aft

[ANNOUNCE] Apache Flink 1.12.1 released

2021-01-18 Thread Xintong Song
The Apache Flink community is very happy to announce the release of Apache Flink 1.12.1, which is the first bugfix release for the Apache Flink 1.12 series. Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming a

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-03 Thread Xintong Song
ime.highavailability org.apache.flink.runtime.leaderretrieval org.apache.zookeeper Thank you~ Xintong Song On Wed, Mar 4, 2020 at 5:42 AM Bajaj, Abhinav wrote: > Hi, > > > > We recently came across an issue where JobMaster does not register with > ResourceManager in Flink high availability set

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-04 Thread Xintong Song
hose from the job restart to the NoResourceAvailableException) to find out which is the case. Thank you~ Xintong Song On Thu, Mar 5, 2020 at 7:30 AM Bajaj, Abhinav wrote: > While I set up to reproduce the issue with debug logs, I would like to > share more information I noticed in INFO logs. >

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-05 Thread Xintong Song
the rest of the log (from where the current one ends to the NoResourceAvailableException) to tell what happened during scheduling. Also, could you confirm how many TMs you use? Thank you~ Xintong Song On Fri, Mar 6, 2020 at 5:55 AM Bajaj, Abhinav wrote: > Hi Xintong, >

Re: scaling issue Running Flink on Kubernetes

2020-03-10 Thread Xintong Song
skew ease? I suspect the performance difference might be an outcome of some warm-up issues. E.g., the existing TMs might have some files already localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not. Thank you~ Xintong Song On Wed, Mar 11

Re: scaling issue Running Flink on Kubernetes

2020-03-10 Thread Xintong Song
rea. Thank you~ Xintong Song On Wed, Mar 11, 2020 at 10:37 AM Eleanore Jin wrote: > Hi Xintong, > > Thanks for the prompt reply! To answer your question: > >- Which Flink version are you using? > >v1.8.2 > >- Is this skew observed on

Re: how to specify yarnqueue when starting a new job programmatically?

2020-03-11 Thread Xintong Song
Hi Vitaliy, You can specify a yarn queue by either setting the configuration option 'yarn.application.queue' [1], or using the command line option '-qu' (or '--queue') [2]. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-rel

Re: Flink 1.10 container memory configuration with Mesos.

2020-03-11 Thread Xintong Song
size' is missing. You can take a look at the launching command, see if there's anything unexpected before the memory dynamic configurations. Thank you~ Xintong Song On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo wrote: > Hi, Alexander > > I could not reproduce it in my local

Re: how to specify yarnqueue when starting a new job programmatically?

2020-03-12 Thread Xintong Song
e.g., in a Flink YARN Session.[1] Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/yarn_setup.html#flink-yarn-session On Thu, Mar 12, 2020 at 6:20 PM Vitaliy Semochkin wrote: > Thank you Xintong Song, > > is there any way to queue pr

Re: Flink on Kubernetes Vs Flink Natively on Kubernetes

2020-03-16 Thread Xintong Song
link Master will interact with the Kubernetes Master and actively request pods/containers, as on Yarn/Mesos. Thank you~ Xintong Song On Mon, Mar 16, 2020 at 4:03 PM Pankaj Chand wrote: > Hi all, > > I want to run Flink, Spark and other processing engines on a single > K

Re: Flink on Kubernetes Vs Flink Natively on Kubernetes

2020-03-16 Thread Xintong Song
Forgot to mention that "running Flink natively on Kubernetes" is newly introduced and is only available for Flink 1.10 and above. Thank you~ Xintong Song On Mon, Mar 16, 2020 at 5:40 PM Xintong Song wrote: > Hi Pankaj, > > "Running Flink on Kubernetes" refers

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-16 Thread Xintong Song
Flink 1.7 till the latest 1.10, and I'm not aware of any reported issue that the JM may not try to connect RM once the address is received. Thank you~ Xintong Song On Tue, Mar 17, 2020 at 7:45 AM Bajaj, Abhinav wrote: > Hi Xintong, > > > > Apologies for delayed response

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-17 Thread Xintong Song
I'm not familiar with ZK either. I've copied Yang Wang, who might be able to provide some suggestions. Alternatively, you can try to post your question to the Apache ZooKeeper community, see if they have any clue. Thank you~ Xintong Song On Wed, Mar 18, 2020 at 8:12 AM Bajaj, Abhi

Re: How can i set the value of taskmanager.network.numberOfBuffers ?

2020-03-20 Thread Xintong Song
Hi Forideal, Do you mean you have 700 slots per TM or in total? How many TMs do you have? And how many slots do you have per TM? Also, when was the screenshot taken? Is it after the job is fully initiated? It seems you only need 1k+ network buffers. Thank you~ Xintong Song On Fri, Mar 20

Re: usae of ClusterSpecificationBuilder.taskManagerMemoryMB

2020-03-23 Thread Xintong Song
16 [1], which replaces masterMemoryMB with `jobmanager.memory.process.size`. That would also involve refactoring YarnClusterDescriptor, which is not in good shape (e.g. the method startAppMaster has more than 400 lines) and is closely coupled with ClusterSpecification. Thank you~ Xintong Song O

Re: ClusterSpecification and Configuration questions

2020-03-23 Thread Xintong Song
helpful to that end. In addition, would you be able to check the Yarn logs? See if the container requests are received and containers are allocated. Thank you~ Xintong Song On Tue, Mar 24, 2020 at 6:45 AM Vitaliy Semochkin wrote: > Hi, > > I create a job with following p

Re: [Third-party Tool] Flink memory calculator

2020-03-29 Thread Xintong Song
Thanks Yangze, I've tried the tool and I think it's very helpful. Thank you~ Xintong Song On Mon, Mar 30, 2020 at 9:40 AM Yangze Guo wrote: > Hi, Yun, > > I'm sorry that it currently could not handle it. But I think it is a > really good idea and that feature woul

Re: [Third-party Tool] Flink memory calculator

2020-03-29 Thread Xintong Song
for a job cluster, but does not cover the scenarios of session clusters. Thank you~ Xintong Song On Mon, Mar 30, 2020 at 12:03 PM Yangze Guo wrote: > Thanks for your feedback, @Xintong and @Jeff. > > @Jeff > I think it would always be good to leverage existing logic in Flink, such >

Re: Question about the flink 1.6 memory config

2020-03-31 Thread Xintong Song
environment and workloads. For standalone clusters, the cut-off will not take any effect. For containerized environments, depending on Yarn/Mesos configurations your container may or may not get killed due to exceeding the container memory. Thank you~ Xintong Song On Tue, Mar 31, 2020 at 5:34 PM

Re: on YARN question

2020-04-10 Thread Xintong Song
d, including "-d". As a result, you're running the session cluster in attached mode, and the client will not exit until the session is shut down. Thank you~ Xintong Song On Fri, Apr 10, 2020 at 1:10 PM Yangze Guo wrote: > Do you mean to run it in detached mode? If so, you could add

Re: Flink job consuming all available memory on host

2020-04-12 Thread Xintong Song
ny native memory? E.g., launch another process, calling a JNI library or so? Thank you~ Xintong Song On Sat, Apr 11, 2020 at 3:56 AM Mitch Lloyd wrote: > We are having an issue with a Flink Job that gradually consumes all > available memory on a Docker host machine, crashing the machin

Re: Flink On Yarn , ResourceManager is HA , if active ResourceManager changed,what is flink task status ?

2020-04-15 Thread Xintong Song
Normally, Yarn RM switch should not cause any problem to the running Flink instance. Unless the RM switch takes too long and Flink happens to request new containers during that time, it might lead to resource allocation timeout. Thank you~ Xintong Song On Wed, Apr 15, 2020 at 3:49 PM LakeShen

Re: Flink 1.10 Out of memory

2020-04-19 Thread Xintong Song
heap / direct memory. My suggestion is to try increasing the JVM overhead configuration. You can leverage the configuration options 'taskmanager.memory.jvm-overhead.[min|max|fraction]'. See more details in the documentation[1]. Thank you~ Xintong Song [1] https://ci.apache.org/pr
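As a rough illustration of how the three `taskmanager.memory.jvm-overhead.[min|max|fraction]` options mentioned above interact, here is a minimal Python sketch. It assumes the overhead is derived as `fraction` times the total process memory, clamped to the `[min, max]` range. The function name is hypothetical, and the default values used (0.1, 192 MB, 1024 MB) are assumptions to verify against the linked documentation.

```python
def jvm_overhead_mb(process_mb: float,
                    fraction: float = 0.1,
                    min_mb: float = 192,
                    max_mb: float = 1024) -> float:
    """Derive the JVM overhead size, assuming fraction-of-total clamped to [min, max].

    The default arguments are assumed defaults, not values taken from this thread.
    """
    return min(max(process_mb * fraction, min_mb), max_mb)


# Illustrative process sizes in MB (not taken from the thread).
for total in (1024, 4096, 20480):
    print(total, jvm_overhead_mb(total))

# Raising the fraction only helps while the result stays below the max.
print(jvm_overhead_mb(4096, fraction=0.2))
```

Under these assumptions, a small container is governed by `min`, a very large one by `max`, and only the range in between responds to `fraction`.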

Re: A Strategy for Capacity Testing

2020-04-23 Thread Xintong Song
performance to get stabilized. Depending on your workload, this could take up to tens of minutes. Please also be careful with aggregations over large windows. Emitting windows might introduce large bursts of processing work, causing the measured throughput to fluctuate. Thank you~ Xintong Song On Thu, Apr 23

Re: Flink 1.10 Out of memory

2020-04-23 Thread Xintong Song
@Stephan, I don't think so. If JVM hits the direct memory limit, you should see the error message "OutOfMemoryError: Direct buffer memory". Thank you~ Xintong Song On Thu, Apr 23, 2020 at 6:11 PM Stephan Ewen wrote: > @Xintong and @Lasse could it be that the JVM hits

Re: IntelliJ java formatter

2020-04-23 Thread Xintong Song
Hi Flavio, I'm not aware of any way to automatically format the code. The only thing I've found that might help is to enable a checkstyle plugin in your IDE. https://ci.apache.org/projects/flink/flink-docs-stable/flinkDev/ide_setup.html#checkstyle-for-java Thank you~ Xintong Song On Thu

Re: Flink 1.10 Out of memory

2020-04-24 Thread Xintong Song
ative method, I think the problem is that not enough native memory can be allocated for executing the native method. Thank you~ Xintong Song On Fri, Apr 24, 2020 at 3:40 PM Stephan Ewen wrote: > @Xintong - out of curiosity, where do you see that this tries to fork a > process?

Re: Flink 1.10 Out of memory

2020-04-24 Thread Xintong Song
True. Thanks for the clarification. Thank you~ Xintong Song On Fri, Apr 24, 2020 at 5:21 PM Stephan Ewen wrote: > I think native methods are not in a forked process. It is just a malloc() > call that failed, probably an I/O buffer or so. > This might mean that there really is

Re: Configuring taskmanager.memory.task.off-heap.size in Flink 1.10

2020-04-28 Thread Xintong Song
'task.off-heap.size' being 0 only means that in most cases user code / operators do not use off-heap memory. Users would need to explicitly increase this configuration if UDFs or libraries of the job use off-heap memory. Thank you~ Xintong Song On Wed, Apr 29, 2020 at 11:07 AM

Re: Flink Task Manager GC overhead limit exceeded

2020-04-29 Thread Xintong Song
tions look good to me. Is the configured path '/dumps/oom.bin' a local path of the pod or a path of the host mounted onto the pod? The restarted pod is a completely new pod. Everything you write to the old pod goes away when the pod terminates, unless it is written to the host

Re: Configuring taskmanager.memory.task.off-heap.size in Flink 1.10

2020-04-29 Thread Xintong Song
led by JVM. In Flink, managed memory and jvm-overhead are using native memory. That means, if you see a JVM OOM, increasing jvm-overhead should not help. Thank you~ Xintong Song On Thu, Apr 30, 2020 at 11:06 AM Jiahui Jiang wrote: > Hey Xintong, Steven, thanks for replies! > > @Steven W

Re: Flink Task Manager GC overhead limit exceeded

2020-04-29 Thread Xintong Song
ner". I suspect there might be some argument passing problem regarding the spaces and double quotation marks. Thank you~ Xintong Song On Thu, Apr 30, 2020 at 11:39 AM Eleanore Jin wrote: > Hi Xintong, > > Thanks for the detailed explanation! > > as for the 2nd question: I mou

Re: 1.11 snapshot: Name or service not knownname localhost and taskMgr not started

2020-04-29 Thread Xintong Song
Hi Lei, Could you check whether the hostname 'localhost' is available on your CentOS machine? This is usually defined in "/etc/hosts". You can also try to modify the slaves file, replacing 'localhost' with '127.0.0.1'. The path is: /conf/slaves Thank you~

Re: Configuring taskmanager.memory.task.off-heap.size in Flink 1.10

2020-04-29 Thread Xintong Song
se a little direct memory. But that's quite opportunistic. So it would be better to configure a non-zero task.off-heap if you know your tasks/operators use some direct memory. Thank you~ Xintong Song On Thu, Apr 30, 2020 at 12:14 PM Jiahui Jiang wrote: > Hey Xintong, thanks for the explanat

Re: No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-08 Thread Xintong Song
Linking to the jira ticket, for the record. https://issues.apache.org/jira/browse/FLINK-17560 Thank you~ Xintong Song On Sat, May 9, 2020 at 2:14 AM Josson Paul wrote: > Set up > -- > Flink version 1.8.3 > > Zookeeper HA cluster > > 1 ResourceManager/Dispa

Re: Flink Memory analyze on AWS EMR

2020-05-11 Thread Xintong Song
Hi Jacky, Could you search for "Application Master start command:" in the debug log and post the result and a few lines before & after that? This is not included in the attached log excerpt. Thank you~ Xintong Song On Tue, May 12, 2020 at 5:33 AM Jacky D wrote: > hi,

Re: Flink Memory analyze on AWS EMR

2020-05-12 Thread Xintong Song
PREFIX}" with "/your-file-name.jit". The token "" should be replaced with proper log directory path by Yarn automatically. I noticed that the usage of ${FLINK_LOG_PREFIX} is recommended by Flink's documentation [1]. This is IMO a bit misleading. I'll try to file

Re: Flink BLOB server port exposed externally

2020-05-18 Thread Xintong Song
The feature freeze for 1.11.0 is today. The final release date depends on the progress of release testing / bug fixing. Thank you~ Xintong Song On Mon, May 18, 2020 at 6:36 PM Omar Gawi wrote: > Thanks Till! > Do you know what the 1.11.0 release date is? > > > On Mon, May 18, 2020 a

Re: Flink Dashboard UI Tasks hard limit

2020-05-22 Thread Xintong Song
lower parallelism. Could you share some more information about your use case? - What kind of job are you executing? Is it a streaming or batch processing job? - Which Flink deployment do you use? Standalone? Yarn? - It would be helpful if you can share the Flink logs. Thank you~ Xintong

Re: Flink Dashboard UI Tasks hard limit

2020-05-24 Thread Xintong Song
an argument for the `flink run` command, to set parallelism for all operators. - Set `parallelism.default` in your `flink-conf.yaml`, to set a default parallelism for your jobs. This will be used for jobs that have not set parallelism with either of the above methods. Thank you~ Xintong So

Re: Flink Dashboard UI Tasks hard limit

2020-05-26 Thread Xintong Song
t the execution plan only shows 5. Thank you~ Xintong Song On Wed, May 27, 2020 at 3:16 AM Vijay Balakrishnan wrote: > Hi Xintong, > Thanks for the excellent clarification for tasks. > > I attached a sample screenshot above and didn't reflect the slots used and > the tasks li

Re: Flink Dashboard UI Tasks hard limit

2020-05-27 Thread Xintong Song
etwork_fraction, network_min), network_max)`. According to the error message, your current network memory size is `85922 buffers * 32KB/buffer = 2685MB`, smaller than your "max" (4gb). That means increasing the "max" does not help in your case. It is the "fraction" that you
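The arithmetic in the message above can be checked with a short Python sketch. The beginning of the formula is cut off in this excerpt, so the sketch assumes the first argument is the total memory multiplied by the network fraction. The function names and the sample total and fraction are hypothetical, chosen only so that the derived size matches the 2685 MB in the message.

```python
def buffers_to_mb(num_buffers: int, buffer_kb: int = 32) -> float:
    """Convert a number of network buffers to megabytes (32 KB per buffer by default)."""
    return num_buffers * buffer_kb / 1024


def network_memory_mb(total_mb: float, fraction: float,
                      min_mb: float, max_mb: float) -> float:
    """Assumed form of the formula: min(max(total * fraction, min), max)."""
    return min(max(total_mb * fraction, min_mb), max_mb)


current = buffers_to_mb(85922)
print(round(current, 2))  # size of 85922 buffers in MB

# Hypothetical total and fraction that reproduce the current size.
total_mb, fraction = 26850.625, 0.1
with_4g_max = network_memory_mb(total_mb, fraction, 64, 4096)
with_8g_max = network_memory_mb(total_mb, fraction, 64, 8192)
with_more_fraction = network_memory_mb(total_mb, 0.15, 64, 4096)
print(with_4g_max, with_8g_max, with_more_fraction)
```

With these inputs, doubling the max leaves the result unchanged because the fraction term is already below it, while raising the fraction does increase the network memory.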

Re: Flink Dashboard UI Tasks hard limit

2020-05-31 Thread Xintong Song
.NioEventLoop >> .processSelectedKeys(NioEventLoop.java:508) >> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop >> .run(NioEventLoop.java:470) >> at org.apache.flink.shaded.netty4.io.netty.util.concurrent. >> SingleThreadEventExecutor$5.run(SingleThre

Re: Flink Dashboard UI Tasks hard limit

2020-06-04 Thread Xintong Song
ould need to look into the *log of the task manager that is not responding* to understand what's wrong with it. Thank you~ Xintong Song On Fri, Jun 5, 2020 at 6:06 AM Vijay Balakrishnan wrote: > Thx a ton, Xintong. > I am using this configuration now: > taskman

Re: Flink on yarn : yarn-session understanding

2020-06-08 Thread Xintong Song
ing only one job. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/cluster_setup.html [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/glossary.html#flink-application-cluster [3] https://ci.apache.org/projects/flink

Re: Dynamic rescaling in Flink

2020-06-09 Thread Xintong Song
dynamically adapt to the available resources (e.g., add/reduce pods on kubernetes). AFAIK, this is still in the design discussion. Thank you~ Xintong Song On Wed, Jun 10, 2020 at 2:44 AM Prasanna kumar < prasannakumarram...@gmail.com> wrote: > Hi all, > > Does flink support dynamic s

Re: The network memory min (64 mb) and max (1 gb) mismatch

2020-06-11 Thread Xintong Song
igurations will be read by the Flink task manager so that memory will be managed accordingly. The Flink task manager expects all the memory configurations to be already set (thus network min/max should have the same value) before it's started. In your case, it seems such configurations are missin

Re: Insufficient number of network buffers- what does Total mean on the Flink Dashboard

2020-06-11 Thread Xintong Song
jvmHeap = (total - Max(cutoff-min, total * cutoff-ratio)) * (1 - networkFraction) = (102GB - Max(600MB, 102GB * 0.25)) * (1 - 0.48) = 40.6GB Have you specified a custom "-Xmx" parameter? Thank you~ Xintong Song On Fri, Jun 12, 2020 at 7:50 AM Vijay Balakrishnan wrote: > Hi
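The heap formula quoted above can be written as a small sketch, using only the figures shown in the snippet. The function and parameter names are hypothetical, and treating 600MB as 600/1024 GB is an assumption about units.

```python
def legacy_jvm_heap_gb(total_gb, cutoff_min_gb, cutoff_ratio, network_fraction):
    """jvmHeap = (total - max(cutoff_min, total * cutoff_ratio)) * (1 - network_fraction)."""
    cutoff_gb = max(cutoff_min_gb, total_gb * cutoff_ratio)
    return (total_gb - cutoff_gb) * (1 - network_fraction)


# Inputs as quoted in the message: 102GB total, 600MB cutoff-min,
# cutoff-ratio 0.25, network fraction 0.48.
heap_gb = legacy_jvm_heap_gb(102, 600 / 1024, 0.25, 0.48)
print(f"derived JVM heap: {heap_gb:.2f}GB")
```

With these rounded inputs the expression evaluates to about 39.8GB, slightly below the 40.6GB stated in the message; the gap presumably comes from the inputs having been rounded for the email.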

Re: Flink 1.10.1 not using FLINK_TM_HEAP for TaskManager JVM Heap size correctly?

2020-06-12 Thread Xintong Song
he configuration option but not for the environment variable) > The previous options which were responsible for the total memory used by > Flink are taskmanager.heap.size or taskmanager.heap.mb. Despite their > naming, they included not only JVM heap but also other off-heap memory > compon

Re: Flink 1.10.1 not using FLINK_TM_HEAP for TaskManager JVM Heap size correctly?

2020-06-12 Thread Xintong Song
you~ Xintong Song On Fri, Jun 12, 2020 at 4:27 PM Xintong Song wrote: > Hi Li, > > FLINK_TM_HEAP corresponds to the legacy configuration option > "taskmanager.heap.size". It is supported for backwards compatibility. I > strongly recommend you to use "

Re: Insufficient number of network buffers- what does Total mean on the Flink Dashboard

2020-06-12 Thread Xintong Song
-Xmx on Mesos. BTW, from your screenshot the physical memory is 123GB, so 1/4 of that is much closer to 29GB if we consider there are some rounding errors and accuracy loss. Thank you~ Xintong Song On Fri, Jun 12, 2020 at 4:33 PM Vijay Balakrishnan wrote: > Thx, Xintong for a great

Re: The network memory min (64 mb) and max (1 gb) mismatch

2020-06-12 Thread Xintong Song
leverage the configuration option "taskmanager.memory.task.heap.size", and an additional constant framework overhead will be added to this value for -Xmx. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html#jvm-parameters O

Re: Insufficient number of network buffers- what does Total mean on the Flink Dashboard

2020-06-12 Thread Xintong Song
l whether "env.java.opts" works for you. Thank you~ Xintong Song On Fri, Jun 12, 2020 at 5:33 PM Vijay Balakrishnan wrote: > Hi Xintong, > Just to be clear. I haven't set any -Xmx -i will check our scripts again. > Assuming no -Xmx is set, the doc above says 1/4 of

Re: The network memory min (64 mb) and max (1 gb) mismatch

2020-06-12 Thread Xintong Song
Yes, that is correct. 'taskmanager.memory.process.size' is the most recommended. Thank you~ Xintong Song On Fri, Jun 12, 2020 at 10:59 PM Clay Teeter wrote: > Ok, this is great to know. So in my case; I have a k8 pod that has a > limit of 4Gb. I should remove the -Xmx and

Re: Dynamic rescaling in Flink

2020-06-14 Thread Xintong Song
single job mode. The session mode is not supported. But I haven't checked this for quite a while. It could have been changed. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/yarn_setup.html#run-a-single-flink-job-on-yarn [2] https://ci.apach

Re: [ANNOUNCE] Yu Li is now part of the Flink PMC

2020-06-16 Thread Xintong Song
Congratulations Yu, well deserved~! Thank you~ Xintong Song On Wed, Jun 17, 2020 at 9:15 AM jincheng sun wrote: > Hi all, > > On behalf of the Flink PMC, I'm happy to announce that Yu Li is now > part of the Apache Flink Project Management Committee (PMC). > > Yu Li

Re: Heartbeat of TaskManager timed out.

2020-06-27 Thread Xintong Song
not timely handled before the timeout check. - Are there any metrics monitoring the network condition between the JM and the timed-out TM? Possibly any jitter? Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html#heartbeat-timeout On Thu

Re: Optimal Flink configuration for Standalone cluster.

2020-06-27 Thread Xintong Song
o set `.task.heap.size` and `managed.size`. 2. If you don't know how much heap/managed memory to configure, you can look for the configuration options at the beginning of the TM logs (`-Dkey=value`). Those are the values derived from your current configuration. Thank you~ Xi

Re: Heartbeat of TaskManager timed out.

2020-06-28 Thread Xintong Song
n guide [1]. Thank you~ Xintong Song [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html On Sun, Jun 28, 2020 at 10:12 PM Ori Popowski wrote: > Thanks for the suggestions! > > > i recently tried 1.10 and see this error frequently. and

Re: Optimal Flink configuration for Standalone cluster.

2020-06-28 Thread Xintong Song
sk managers (say tens of GBs) unless absolutely necessary. Alternatively, you can try to launch multiple TMs on one physical machine, to reduce the memory size of each TM process. BTW, what kind of workload are you running? Is it streaming or batch? Thank you~ Xintong Song On Mon, Jun 29, 20
