Re: Task Managers having trouble registering after restart

2021-08-24 Thread Chesnay Schepler
There's a super rough guide in the wiki: https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks The gist of it is that you first want to verify that a ChildFirstClassLoader is being leaked (i.e., run a few jobs, cancel them, trigger garbage collection, get heap dump, che
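The "trigger garbage collection, get heap dump" steps mentioned above could be sketched from inside a JVM roughly as follows. This is only an illustration, not the procedure from the wiki page: it assumes a HotSpot-based JVM (for `com.sun.management.HotSpotDiagnosticMXBean`), and the class name `HeapDumpSketch` and the output file name are made up for this example. Against a running TaskManager, an external tool such as `jmap` is probably the more practical route.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Flink.
class HeapDumpSketch {

    // Writes a heap dump of the current JVM to `target` and returns the path.
    // The target file must not exist yet and should end in ".hprof".
    static Path dumpLiveHeap(Path target) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // live = true restricts the dump to reachable objects, so class
            // loaders that are merely waiting to be collected do not show up.
            bean.dumpHeap(target.toString(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        // Illustrative file name in the system temp directory.
        Path target = Paths.get(
                System.getProperty("java.io.tmpdir"),
                "taskmanager-" + System.nanoTime() + ".hprof");
        System.out.println("Heap dump written to " + dumpLiveHeap(target));
    }
}
```

The resulting `.hprof` file can then be opened in a heap analyzer to look for lingering `ChildFirstClassLoader` instances.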

Re: Task Managers having trouble registering after restart

2021-08-24 Thread Kevin Lam
Thank you for pulling in Chesnay. I haven't yet been able to confirm that the issue no longer occurs, as I've found it difficult to reproduce. I did have follow-up questions: 1/ If Kafka metrics are indeed the cause of the leak, is there a workaround? We'd be interested in having these metrics av

Re: Task Managers having trouble registering after restart

2021-08-24 Thread Arvid Heise
Hi Kevin, The metrics are exposed similarly, so I expect the same issues as they come from Kafka's Consumer API itself. I'll pull in @Chesnay Schepler who afaik debugged the leak a while ago. On Mon, Aug 23, 2021 at 9:24 PM Kevin Lam wrote: > Actually, we are using the `FlinkKafkaConsumer` [0

Re: Task Managers having trouble registering after restart

2021-08-23 Thread Kevin Lam
Actually, we are using the `FlinkKafkaConsumer` [0] rather than `KafkaSource`. Is there a way to disable the consumer metrics using `FlinkKafkaConsumer`? Do you expect that to have the same Metaspace issue? [0] https://ci.apache.org/projects/flink/flink-docs-release-1.13/api/java/org/apache/flink
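For reference, a minimal sketch of how the consumer properties for `FlinkKafkaConsumer` might carry a metrics switch. It assumes the `flink.disable-metrics` key, which to my knowledge is read by `FlinkKafkaConsumerBase` in Flink 1.13 and should be checked against the version in use. The helper name, broker address, and group id are placeholders, and nothing in this thread confirms that the setting avoids the Metaspace issue.

```java
import java.util.Properties;

// Hypothetical helper, not part of Flink.
class KafkaConsumerProps {

    // Builds consumer properties with Flink's forwarding of Kafka consumer
    // metrics switched off (assumed key: "flink.disable-metrics").
    static Properties withoutKafkaMetrics(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        props.setProperty("flink.disable-metrics", "true");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder broker address and group id.
        Properties props = withoutKafkaMetrics("localhost:9092", "example-group");
        // In a Flink job these would be handed to the consumer, e.g.
        //   new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), props)
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Keeping the switch in the `Properties` object means the job code itself stays unchanged and the setting can be toggled while testing whether the leak still occurs.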

Re: Task Managers having trouble registering after restart

2021-08-23 Thread Kevin Lam
Thanks Arvid! I will give this a try and report back. On Mon, Aug 23, 2021 at 11:07 AM Arvid Heise wrote: > Hi Kevin, > > "java.lang.OutOfMemoryError: Metaspace" indicates that too many classes > have been loaded. [1] > If you only see that after a while, it's indicating that there is a > classl

Re: Task Managers having trouble registering after restart

2021-08-23 Thread Arvid Heise
Hi Kevin, "java.lang.OutOfMemoryError: Metaspace" indicates that too many classes have been loaded. [1] If you only see that after a while, it points to a classloader leak. I suspect that this is because of Kafka metrics. There have been some reports in the past. You can try to se

Task Managers having trouble registering after restart

2021-08-17 Thread Kevin Lam
Hi all, I'm occasionally seeing an issue, which has been hard to reproduce, where task managers are unable to register with the Flink cluster. We provision only the number of task managers required to run a given application, so the absence of any task manager causes the job to enter