Thanks Yingjie for driving this. It is very useful to have this checklist. I think we can also list all the problematic third-party libraries, including the Hadoop jar: org.apache.hadoop.fs.FileSystem.StatisticsDataReferenceCleaner.
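As a rough illustration of how leftover threads from such libraries could be spotted (a sketch under assumptions, not something that exists in Flink): the hypothetical helper below, ThreadLeakCheck, lists the live threads whose context class loader is a given class loader or one of its children. It only looks at the context class loader, so a thread that pins user classes in some other way would be missed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Flink: finds live threads that still reference a class loader. */
final class ThreadLeakCheck {

    private ThreadLeakCheck() {}

    /** Returns the live threads whose context class loader is {@code loader} or a child of it. */
    static List<Thread> findThreadsUsing(ClassLoader loader) {
        List<Thread> found = new ArrayList<>();
        // Thread.getAllStackTraces() gives a snapshot of all live threads in the JVM.
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            if (thread.isAlive() && isSameOrChild(thread.getContextClassLoader(), loader)) {
                found.add(thread);
            }
        }
        return found;
    }

    /** Walks the parent chain of {@code candidate} and looks for {@code loader}. */
    private static boolean isSameOrChild(ClassLoader candidate, ClassLoader loader) {
        for (ClassLoader cl = candidate; cl != null; cl = cl.getParent()) {
            if (cl == loader) {
                return true;
            }
        }
        return false;
    }
}
```

Calling ThreadLeakCheck.findThreadsUsing(userClassLoader) after a job has finished and printing the thread names would show any thread that still carries the user class loader as its context class loader.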
Because there are too many libraries with this problem, and our Yarn per-job mode can alleviate it, I think we should only make this a suggestion. There is no need to force users to avoid writing such code or using these third-party libraries.

Best,
Jingsong Lee

On Wed, Dec 18, 2019 at 4:55 PM Yingjie Cao <kevin.ying...@gmail.com> wrote:

> I'd like to do that.
>
> Best,
> Yingjie
>
> Till Rohrmann <trohrm...@apache.org> wrote on Wed, Dec 18, 2019 at 4:48 PM:
>
> > I think we should add this checklist to the coding guidelines and
> > continue extending it there. Do you want to update the coding
> > guidelines accordingly, Yingjie?
> >
> > Cheers,
> > Till
> >
> > On Wed, Dec 18, 2019 at 8:21 AM Yingjie Cao <kevin.ying...@gmail.com>
> > wrote:
> >
> > > Hi Till & Biao,
> > >
> > > Thanks for the reply.
> > >
> > > I agree that supplying some stress or stability tests can really
> > > help. Besides the JVM resource leak mentioned above, there may be
> > > other types of resource leaks, such as slot or network buffer
> > > leaks. In addition, other tests can also help: triggering failover
> > > in various ways, stressing the system with high-parallelism and
> > > heavy-load jobs, and running jobs or triggering failover over and
> > > over again. I think stress and stability testing is a big topic,
> > > and resource leak checking can be a good start.
> > >
> > > As a start for resource leak checking, we may need to collect a
> > > checklist, which can also help to troubleshoot resource leak
> > > problems manually. From my previous experience, I can think of the
> > > following:
> > > 1. File#deleteOnExit hook leaks the string of the file path. The
> > >    Flink REST server once suffered from this problem, and it has
> > >    since been fixed.
> > > 2. Thread leak. OrcInputFormat suffers from this.
> > > 3. ApplicationShutDownHook references user classes.
> > > 4. ClassLoader#parallelLockMap may leak because of too many
> > >    generated classes.
> > >    Flink also suffers from this problem; the issue is reported in
> > >    FLINK-15024 and needs to be resolved.
> > > 5. Some other static fields (like caches implemented by maps) of
> > >    classes loaded by the system class loader also have the
> > >    potential to leak resources.
> > >
> > > Any other additions to this checklist are welcome. Even with this
> > > checklist, it may not be trivial to do the check; dumping and
> > > analysing the heap may be an option. I will do a further survey of
> > > that.
> > >
> > > Best,
> > > Yingjie
> > >
> > > Biao Liu <mmyy1...@gmail.com> wrote on Tue, Dec 17, 2019 at 9:02 PM:
> > >
> > > > Hi Yingjie,
> > > >
> > > > Thanks for figuring out this impressive bug and bringing up this
> > > > discussion.
> > > >
> > > > I'm afraid there is no silver bullet for isolation from
> > > > third-party libraries. However, I agree that resource checking
> > > > utils might help.
> > > > It seems that you and Till have already raised some feasible
> > > > ideas. Resource leaks look quite common. It would be great if
> > > > someone could share some experience. I will keep an eye on this
> > > > discussion.
> > > >
> > > > Thanks,
> > > > Biao /'bɪ.aʊ/
> > > >
> > > > On Tue, 17 Dec 2019 at 20:27, Till Rohrmann <trohrm...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Yingjie,
> > > > >
> > > > > thanks for reporting this issue and starting this discussion.
> > > > > If we are dealing with third-party libraries, I believe there
> > > > > is always the risk that one overlooks closing resources.
> > > > > Ideally we make it as hard as possible from Flink's
> > > > > perspective, but realistically it is hard to avoid completely.
> > > > > Hence, I believe that it would be beneficial to have some
> > > > > tooling (e.g. stress tests) which could help to surface these
> > > > > kinds of problems.
> > > > > Maybe one could automate it so that a dev only needs to
> > > > > provide a user jar, and then this jar is executed several
> > > > > times and the cluster is checked for anomalies.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Dec 17, 2019 at 8:43 AM Yingjie Cao
> > > > > <kevin.ying...@gmail.com> wrote:
> > > > >
> > > > > > Hi community,
> > > > > >
> > > > > > After running the TPC-DS test suite for several days on a
> > > > > > session cluster, we found a resource leak problem in
> > > > > > OrcInputFormat, which was reported in FLINK-15239. The
> > > > > > problem comes from a third-party dependency which creates a
> > > > > > new internal thread (pool) and never releases it. As a
> > > > > > result, the user class loader referenced by these threads
> > > > > > will never be garbage collected, nor will the other classes
> > > > > > loaded by the user class loader. This finally leads to
> > > > > > continual growth of the metaspace size for the JM (AM),
> > > > > > whose metaspace size is currently not limited. For the TM,
> > > > > > whose metaspace size is limited, it will eventually result
> > > > > > in a metaspace OOM. I am not sure whether any other
> > > > > > connectors/input formats have a similar problem.
> > > > > > In general, it is hard for Flink to restrict the behavior of
> > > > > > third-party dependencies, especially the dependencies of
> > > > > > connectors.
> > > > > > However, it would be better if we could supply some
> > > > > > mechanism like stronger isolation, or some test facilities
> > > > > > to find potential problems. For example, we could run jobs
> > > > > > on a cluster and automatically check things like whether the
> > > > > > user class loader can be garbage collected, whether there is
> > > > > > a thread leak, whether some shutdown hooks have been
> > > > > > registered, and so on.
> > > > > > What do you think? Or should we treat it as a problem?
> > > > > >
> > > > > > Best,
> > > > > > Yingjie

--
Best, Jingsong Lee
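The check mentioned in the first mail of the thread, whether the user class loader can be garbage collected, could be sketched roughly as follows. This is only an illustration under assumptions: ClassLoaderGcCheck and isCollected are hypothetical names, not Flink APIs, and System.gc() is just a hint to the JVM, so a false result means "not collected yet" rather than proof of a leak.

```java
import java.lang.ref.WeakReference;

/** Hypothetical helper, not part of Flink: probes whether a class loader has been collected. */
final class ClassLoaderGcCheck {

    private ClassLoaderGcCheck() {}

    /**
     * Requests a GC up to maxAttempts times and reports whether the weakly
     * referenced class loader has been cleared by then.
     */
    static boolean isCollected(WeakReference<ClassLoader> ref, int maxAttempts) {
        for (int i = 0; i < maxAttempts && ref.get() != null; i++) {
            System.gc(); // only a hint; the JVM is free to ignore it
            try {
                Thread.sleep(50); // give the collector a moment to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }
}
```

The caller would keep only a WeakReference to the user class loader, drop every strong reference once the job has finished, and then call isCollected. If it keeps returning false, a heap dump can show which thread, shutdown hook, or static field still holds the loader.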