I’m seeing sporadic issues where it appears that curator (or other) user threads are left running after a stream shutdown, and then the user class loader goes away and I get spammed with ClassNotFoundExceptions… I’m wondering if this might have something to do with perhaps the UserClassLoader being shut down before close is invoked on all operators?
Here’s a stack trace I see from an attempt at closing an elastic search sink: java.lang.ClassNotFoundException: com.intellify.flink.shaded.curator.org.apache.curator.utils.CloseableUtils at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:128) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at com.intellify.flink.shaded.curator.org.apache.curator.ConnectionState.close(ConnectionState.java:119) at com.intellify.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.close(CuratorZookeeperClient.java:227) at com.intellify.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.close(CuratorFrameworkImpl.java:381) at com.intellify.config.ManagedCuratorFramework.reallyClose(ManagedCuratorFramework.java:48) at com.intellify.config.ArchaiusInitializer.close(ArchaiusInitializer.java:75) at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:303) at com.intellify.flink.shared.config.SharedIntellifyConfigProvider.close(SharedIntellifyConfigProvider.java:56) at com.intellify.flink.shared.config.SerializableLiveProperty.close(SerializableLiveProperty.java:68) at com.intellify.flink.shared.elasticsearch.LiveResolvingEs1ApiCallBridge.cleanup(LiveResolvingEs1ApiCallBridge.java:105) at org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase.close(ElasticsearchSinkBase.java:323) at com.intellify.flink.shared.tracer.TracingSink.close(TracingSink.java:50) at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117) at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:446) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:351) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) at java.lang.Thread.run(Thread.java:748) I’m using a curator connection for archaius, and closing it in the call bridge’s cleanup method. I’m ensuring that I’m not reaching up into the parent class loader by shading curator and zookeeper. I also see the following on repeat in my task manager log: 2018-01-11 14:53:13.313 [heartbeat-filter -> input-trace-filter -> filter-inactive-ds -> filter-duplicates (ip-10-80-53-99.us-west-2.compute.internal:2181)] WARN c.i.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session 0x3c00002d8a7603de for server ip-10-80-53-99.us-west-2.compute.internal/10.80.53.99:2181, unexpected error, closing socket connection and attempting reconnect java.lang.NoClassDefFoundError: com/intellify/flink/shaded/zookeeper/org/apache/zookeeper/proto/SetWatches at com.intellify.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn$SendThread.primeConnection(ClientCnxn.java:926) at com.intellify.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363) at com.intellify.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) Does anyone have any insight into what might be happening here? Does this seem like I’m not closing a thread properly, or something else entirely? -- Jared Stehler Chief Architect - Intellify Learning o: 617.701.6330 x703