Nico Kruber created FLINK-25023:
-----------------------------------

             Summary: ClassLoader leak on JM/TM through indirectly-started 
Hadoop threads out of user code
                 Key: FLINK-25023
                 URL: https://issues.apache.org/jira/browse/FLINK-25023
             Project: Flink
          Issue Type: Bug
          Components: Connectors / FileSystem, Connectors / Hadoop 
Compatibility, FileSystems
    Affects Versions: 1.13.3, 1.12.5, 1.14.0
            Reporter: Nico Kruber


If a Flink job uses HDFS through Flink's filesystem abstraction (either on 
the JM or TM), the Hadoop code may spawn a few threads, e.g. from static 
class members:
 * {{org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner}}
 * {{IPC Parameter Sending Thread#*}}

These threads are started as soon as the classes are loaded, which may happen 
in the context of the user code. In that case, the created threads may hold 
references to the context class loader (I did not see that, though) or, as 
happened here, they may inherit thread contexts such as the 
{{ProtectionDomain}} (from an {{AccessController}}).
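
To illustrate the mechanism, here is a minimal, self-contained sketch in plain Java. It is not Hadoop or Flink code: {{LibraryWithStaticThread}} and {{FakeUserClassLoader}} are hypothetical stand-ins for a library class with a static thread member and for a user code class loader. It only shows the context class loader variant (which was not observed in this issue); the inherited {{ProtectionDomain}} is captured at thread creation in a similar way but is not shown here.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.CountDownLatch;

public class StaticThreadLeakDemo {

    /** Hypothetical stand-in for a user code class loader. */
    public static class FakeUserClassLoader extends URLClassLoader {
        public FakeUserClassLoader(ClassLoader parent) {
            super(new URL[0], parent);
        }
    }

    /** Hypothetical stand-in for a library class that starts a thread from a static member. */
    public static class LibraryWithStaticThread {
        public static final Thread CLEANER = startCleaner();

        private static Thread startCleaner() {
            // The new thread copies the context class loader of the thread that
            // happens to trigger class initialization.
            Thread t = new Thread(() -> {
                try {
                    new CountDownLatch(1).await(); // park forever, like a long-running cleaner
                } catch (InterruptedException ignored) {
                    // exit quietly
                }
            }, "demo-cleaner");
            t.setDaemon(true);
            t.start();
            return t;
        }
    }

    /** Returns true if the static thread holds on to a (fake) user class loader. */
    public static boolean staticThreadSeesUserLoader() {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        current.setContextClassLoader(new FakeUserClassLoader(original));
        try {
            // The first access to CLEANER initializes the class "inside user code".
            return LibraryWithStaticThread.CLEANER.getContextClassLoader()
                    instanceof FakeUserClassLoader;
        } finally {
            current.setContextClassLoader(original);
        }
    }

    public static void main(String[] args) {
        System.out.println("static thread references user loader: "
                + staticThreadSeesUserLoader());
    }
}
```

Even after the calling thread restores its own context class loader, the long-running thread keeps the reference it captured when it was created, which is what prevents the user class loader from being collected.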

Hence user contexts and user class loaders are leaked into long-running threads 
that are run in Flink's (parent) classloader.

Fortunately, it seems to only *leak a single* {{ChildFirstClassLoader}} in this 
concrete example, but that may depend on which code paths each client execution 
walks.

 

A *proper solution* doesn't seem so simple:
 * We could try to proactively initialize the available file systems in the 
hope of starting all threads in the parent classloader with the parent context.
 * We could create a default {{ProtectionDomain}} for spawned threads as 
discussed at [https://dzone.com/articles/javalangoutofmemory-permgen]. However, 
the {{StatisticsDataReferenceCleaner}} isn't actively spawned from any 
callback; it is started from a static variable, i.e. during class loading 
itself (but maybe this is still possible somehow).
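
A rough sketch of the first idea, under stated assumptions: {{initializeInParentContext}} is a hypothetical helper (not an existing Flink API) that forces static initialization of a list of classes while the context class loader is set to the parent loader. It only addresses the context class loader; to also avoid inheriting a user {{ProtectionDomain}}, it would presumably have to be called from a call stack that contains no user code, e.g. during process startup.

```java
import java.util.ArrayList;
import java.util.List;

public class EagerInitSketch {

    /**
     * Hypothetical helper: forces static initialization of the given classes while
     * the context class loader is set to {@code parent}. Returns the names of the
     * classes that could not be loaded or initialized.
     */
    public static List<String> initializeInParentContext(
            ClassLoader parent, String... classNames) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        List<String> failed = new ArrayList<>();
        current.setContextClassLoader(parent);
        try {
            for (String name : classNames) {
                try {
                    // initialize = true runs static initializers, so threads started
                    // from static members are created now, not later in user code.
                    Class.forName(name, true, parent);
                } catch (ClassNotFoundException | LinkageError e) {
                    failed.add(name);
                }
            }
        } finally {
            current.setContextClassLoader(previous);
        }
        return failed;
    }

    public static void main(String[] args) {
        ClassLoader parent = EagerInitSketch.class.getClassLoader();
        // Only resolvable if Hadoop is on the classpath; otherwise it is reported as failed.
        List<String> failed =
                initializeInParentContext(parent, "org.apache.hadoop.fs.FileSystem");
        System.out.println("could not initialize: " + failed);
    }
}
```

Which classes would need to be listed is an open question; the sketch does not know which Hadoop classes start threads, and threads started lazily from later code paths would not be covered.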



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
