Thanks, Yingjie, for driving this.

It is very useful to have this checklist.
I think we can list all known problematic third-party libraries,
including the one in the Hadoop jar:
org.apache.hadoop.fs.FileSystem.StatisticsDataReferenceCleaner.

There are too many libraries with this problem, and our YARN per-job mode
can alleviate it. So I think we should only make this a recommendation.
There is no need to forbid users from writing such code or using these
third-party libraries.
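
To illustrate how one item of such a checklist could be checked by hand,
here is a minimal Java sketch. It is not part of Flink or Hadoop; the class
LeakedThreadCheck and its method names are hypothetical. It assumes the
leaked thread can be recognised either by a fragment of its name or by its
context class loader, which is only one of several ways a thread can keep a
class loader reachable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for manual leak checks; not a Flink or Hadoop API. */
public final class LeakedThreadCheck {

    private LeakedThreadCheck() {}

    /** Returns the names of live threads whose name contains the given fragment. */
    public static List<String> liveThreadsNamed(String fragment) {
        List<String> matches = new ArrayList<>();
        // Thread.getAllStackTraces() returns a map keyed by all live threads.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().contains(fragment)) {
                matches.add(t.getName());
            }
        }
        return matches;
    }

    /** Returns the names of live threads whose context class loader is the given loader. */
    public static List<String> liveThreadsPinning(ClassLoader loader) {
        List<String> matches = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Identity comparison: we care about this exact loader instance.
            if (t.isAlive() && t.getContextClassLoader() == loader) {
                matches.add(t.getName());
            }
        }
        return matches;
    }
}
```

After a job has finished, calling
LeakedThreadCheck.liveThreadsNamed("StatisticsDataReferenceCleaner") would
show whether such a thread is still around, assuming the thread name
contains the class name. A non-empty result is only a hint, not a proof of
a leak.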

Best,
Jingsong Lee

On Wed, Dec 18, 2019 at 4:55 PM Yingjie Cao <kevin.ying...@gmail.com> wrote:

> I'd like to do that.
>
> Best,
> Yingjie
>
> On Wed, Dec 18, 2019 at 4:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
> > I think we should add this checklist to the coding guidelines and
> > continue extending it there. Do you want to update the coding guidelines
> > accordingly, Yingjie?
> >
> > Cheers,
> > Till
> >
> > On Wed, Dec 18, 2019 at 8:21 AM Yingjie Cao <kevin.ying...@gmail.com> wrote:
> >
> > > Hi Till & Biao,
> > >
> > > Thanks for the reply.
> > >
> > > I agree that supplying some stress or stability tests can really help.
> > > Besides the JVM resource leak mentioned above, there may be other types
> > > of resource leak, such as slot or network buffer leaks. In addition,
> > > other tests can also help: triggering failover in various ways,
> > > stressing the system with high-parallelism, heavy-load jobs, and running
> > > jobs or triggering failover over and over again. I think stress or
> > > stability testing is a big topic, and resource leak checking can be a
> > > good start.
> > >
> > > As the start of resource leak checking, we may need to collect a
> > > checklist, which can also help to troubleshoot resource leak problems
> > > manually. From my previous experience, I can think of the following:
> > > 1. The File#deleteOnExit hook leaks the file path string. The Flink REST
> > > server once suffered from this problem, and it has since been fixed.
> > > 2. Thread leaks. OrcInputFormat suffers from this.
> > > 3. ApplicationShutDownHook references user classes.
> > > 4. ClassLoader#parallelLockMap may leak because of too many generated
> > > classes. Flink also suffers from this problem; the issue is reported in
> > > FLINK-15024 and needs to be resolved.
> > > 5. Some other static fields (like caches implemented as maps) of classes
> > > loaded by the system class loader can also leak resources.
> > >
> > > Any other additions to this checklist are welcome. Even with this
> > > checklist, it may not be trivial to do the check; dumping and analysing
> > > the heap may be an option. I will do some further investigation into
> > > that.
> > >
> > > Best,
> > > Yingjie
> > >
> > > On Tue, Dec 17, 2019 at 9:02 PM Biao Liu <mmyy1...@gmail.com> wrote:
> > >
> > > > Hi Yingjie,
> > > >
> > > > Thanks for tracking down this impressive bug and starting this
> > > > discussion.
> > > >
> > > > I'm afraid there is no silver bullet for isolation from third-party
> > > > libraries. However, I agree that resource checking utils might help.
> > > > It seems that you and Till have already raised some feasible ideas.
> > > > Resource leak issues look quite common. It would be great if
> > > > someone could share some experience. I will keep an eye on this
> > > > discussion.
> > > >
> > > > Thanks,
> > > > Biao /'bɪ.aʊ/
> > > >
> > > >
> > > >
> > > > On Tue, 17 Dec 2019 at 20:27, Till Rohrmann <trohrm...@apache.org> wrote:
> > > >
> > > > > Hi Yingjie,
> > > > >
> > > > > Thanks for reporting this issue and starting this discussion. If we
> > > > > are dealing with third-party libraries, I believe there is always the
> > > > > risk that one overlooks closing resources. Ideally, we make this as
> > > > > hard as possible from Flink's perspective, but realistically it is
> > > > > hard to avoid completely. Hence, I believe that it would be beneficial
> > > > > to have some tooling (e.g. stress tests) which could help to surface
> > > > > these kinds of problems. Maybe one could automate it so that a dev
> > > > > only needs to provide a user jar, and then this jar is executed
> > > > > several times and the cluster is checked for anomalies.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Dec 17, 2019 at 8:43 AM Yingjie Cao <kevin.ying...@gmail.com> wrote:
> > > > >
> > > > > > Hi community,
> > > > > >
> > > > > >   After running the TPC-DS test suite for several days on a session
> > > > > > cluster, we found a resource leak problem in OrcInputFormat, which
> > > > > > was reported in FLINK-15239. The problem comes from a dependent
> > > > > > third-party library which creates a new internal thread (pool) and
> > > > > > never releases it. As a result, the user class loader referenced by
> > > > > > these threads will never be garbage collected, nor will the other
> > > > > > classes loaded by the user class loader. This finally leads to
> > > > > > continual growth of the metaspace size for the JM (AM), whose
> > > > > > metaspace size is currently not limited. For the TM, whose metaspace
> > > > > > size is limited, it will eventually result in a metaspace OOM. I am
> > > > > > not sure whether any other connectors/input formats incur a similar
> > > > > > problem.
> > > > > >   In general, it is hard for Flink to restrict the behavior of
> > > > > > third-party dependencies, especially the dependencies of connectors.
> > > > > > However, it would be better if we could supply some mechanism like
> > > > > > stronger isolation or some test facilities to find potential
> > > > > > problems. For example, we could run jobs on a cluster and
> > > > > > automatically check whether the user class loader can be garbage
> > > > > > collected, whether there is a thread leak, whether some shutdown
> > > > > > hooks have been registered, and so on.
> > > > > >   What do you think? Or should we treat it as a problem?
> > > > > >
> > > > > > Best,
> > > > > > Yingjie
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best, Jingsong Lee
