Hi Yingjie,

Thanks for tracking down this impressive bug and starting this discussion.
I'm afraid there is no silver bullet for isolating Flink from third-party
libraries. However, I agree that resource-checking utilities might help. You
and Till have already raised some feasible ideas.

Resource leaks seem to be quite common, so it would be great if someone could
share their experience. I will keep an eye on this discussion.

Thanks,
Biao /'bɪ.aʊ/

On Tue, 17 Dec 2019 at 20:27, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Yingjie,
>
> thanks for reporting this issue and starting this discussion. If we are
> dealing with third-party libraries, I believe there is always the risk
> that one overlooks closing resources. Ideally we make it as hard as
> possible from Flink's perspective, but realistically it is hard to avoid
> completely. Hence, I believe that it would be beneficial to have some
> tooling (e.g. stress tests) which could help to surface these kinds of
> problems. Maybe one could automate it so that a dev only needs to provide
> a user jar, and then this jar is executed several times and the cluster
> is checked for anomalies.
>
> Cheers,
> Till
>
> On Tue, Dec 17, 2019 at 8:43 AM Yingjie Cao <kevin.ying...@gmail.com>
> wrote:
>
> > Hi community,
> >
> > After running the TPC-DS test suite for several days on a session
> > cluster, we found a resource leak in OrcInputFormat, which was
> > reported in FLINK-15239. The problem comes from a third-party
> > dependency which creates a new internal thread (pool) and never
> > releases it. As a result, the user class loader referenced by these
> > threads will never be garbage collected, nor will the other classes
> > loaded by the user class loader. This eventually leads to continual
> > growth of the metaspace size for the JM (AM), whose metaspace size is
> > currently not limited. For the TM, whose metaspace size is limited, it
> > will eventually result in a metaspace OOM. I am not sure whether any
> > other connectors/input formats have a similar problem.
> >
> > In general, it is hard for Flink to restrict the behavior of
> > third-party dependencies, especially the dependencies of connectors.
> > However, it would be better if we could supply some mechanism, like
> > stronger isolation or some test facilities, to find potential
> > problems. For example, we could run jobs on a cluster and
> > automatically check whether the user class loader can be garbage
> > collected, whether there is a thread leak, whether any shutdown hooks
> > have been registered, and so on.
> >
> > What do you think? Or should we treat it as a problem?
> >
> > Best,
> > Yingjie
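
To make the checks suggested in this thread a bit more concrete, here is a
minimal sketch in plain Java. It is only an illustration under stated
assumptions, not Flink code: the class LeakCheck and both of its methods are
hypothetical names. It shows two of the checks mentioned above: whether
threads started during a run are still alive afterwards, and whether a
class loader becomes garbage collectable. Note that System.gc() is only a
hint to the JVM, so the second check can report false negatives.

```java
import java.lang.ref.WeakReference;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; none of these names exist in Flink.
final class LeakCheck {

    private LeakCheck() {}

    /**
     * Runs the job and returns the threads that appeared during the run
     * and are still alive once the job has returned.
     */
    static Set<Thread> leakedThreads(Runnable job) {
        // Thread.getAllStackTraces() is keyed by every live thread in the JVM.
        Set<Thread> before = new HashSet<>(Thread.getAllStackTraces().keySet());
        job.run();
        Set<Thread> after = new HashSet<>(Thread.getAllStackTraces().keySet());
        after.removeAll(before);            // keep only threads created by the job
        after.removeIf(t -> !t.isAlive());  // drop any that finished in the meantime
        return after;
    }

    /**
     * Returns true if the weakly referenced class loader is cleared within the
     * given number of GC attempts. The caller must not hold a strong reference
     * to the class loader. A false result is not proof of a leak, because
     * System.gc() is only a request.
     */
    static boolean becomesCollectable(WeakReference<ClassLoader> ref, int attempts)
            throws InterruptedException {
        for (int i = 0; i < attempts && ref.get() != null; i++) {
            System.gc();
            Thread.sleep(100);
        }
        return ref.get() == null;
    }
}
```

A stress test along the lines Till describes could wrap each job execution in
leakedThreads(...) and fail if the returned set is non-empty, ignoring
threads that are known to be long-lived by design.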