On 2019/10/07 10:18:43, Rémy Maucherat <r...@apache.org> wrote: 
> On Mon, Oct 7, 2019 at 11:15 AM Emmanuel Lecharny <elecha...@apache.org>
> wrote:
> 
> >
> >
> > On 2019/10/05 11:12:46, Rémy Maucherat <r...@apache.org> wrote:
> > > On Fri, Oct 4, 2019 at 10:38 PM Emmanuel Lecharny <elecha...@apache.org>
> > > wrote:
> > >
> > > > Hi remy,
> > > >
> > > > On 2019/10/04 15:37:36, Rémy Maucherat <r...@apache.org> wrote:
> > > > > On Fri, Oct 4, 2019 at 3:40 PM Emmanuel Lecharny <
> > elecha...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi !
> > > > > >
> > > > > > > I filed a ticket yesterday about a problem we face with many NIO
> > > > > > > frameworks, which I think could hit Tomcat too (see
> > > > > > > https://bz.apache.org/bugzilla/show_bug.cgi?id=63802). Actually, I think
> > > > > > > I'm facing this problem on a project I'm working on atm.
> > > > > >
> > > > > > Remy suggested we discuss it on this mailing list.
> > > > > >
> > > > > > > Bottom line, what happens is that under some not well defined
> > > > > > > circumstances, the call to select() might end up in an infinite loop
> > > > > > > eating all the CPU (select() returns 0, so select() is immediately
> > > > > > > called again, and we loop).
> > > > > >
> > > > > > > In various NIO frameworks (and being a MINA committer, I have implemented
> > > > > > > the workaround discussed here), we control this situation by breaking
> > > > > > > the infinite loop this way:
> > > > > > > - if the select() call returns 0
> > > > > > > - then if we have called select() more than N times in less than M ms
> > > > > > > (N=10, M=100 in MINA)
> > > > > > > - then we create a new Selector, register all the SelectionKeys that were
> > > > > > > registered on the broken selector, and ditch the old selector.
> > > > > >
> > > > > > This workaround does not cost a lot when the selector works as
> > > > designed,
> > > > > > as a select() call should never return 0.
> > > > > >
> > > > >
> > > > > There's actually a very similar hack for APR that has been placed by
> > > > myself
> > > > > a long time ago [
> > > > >
> > > >
> > https://github.com/apache/tomcat/blob/master/java/org/apache/tomcat/util/net/AprEndpoint.java#L1410
> > > > > ], I don't even know if it's actually useful and it's certainly not
> > > > > testable. Overall what it does is pretty terrible :(
> > > > >
> > > > > Personally I would like to know more about this "long lived bug
> > either in
> > > > > the JDK or even in Linux epoll implementation" like actual platform
> > > > details
> > > > > and JVM versions used since I've never heard about it in the first
> > place.
> > > >
> > > > For the record, I had a discussion yesterday with a close friend who was
> > > > my co-worker back in the 90's. He clearly remembers, from working on the
> > > > SUN TCP stack, that such a problem occurred back then. Yes, 25 years
> > > > ago... OK, that was just for fun, it's likely to be perfectly unrelated ;-)
> > > >
> > > > At MINA, we were hit by this bug in 2009 (see
> > > > https://issues.apache.org/jira/browse/DIRMINA-678), and it was linked
> > to
> > > > a bug reported on Jetty (
> > > >
> > http://jetty.4.x6.nabble.com/jira-Created-JETTY-937-SelectChannelConnector-100-CPU-usage-on-Linux-td36385.html
> > ),
> > > > itself related to some JDK bugs, supposedly fixed since then.
> > > >
> > > > I had a long conversation with Jean-François Arcand somewhere around that
> > > > date, and he suggested we adopt the same workaround he applied to Grizzly.
> > > > We also had a conversation with Alan Bateman during a JavaOne in SF, but
> > > > nothing specific resulted from it, except that AFAICR, he acknowledged
> > > > there is an issue.
> > > >
> > > > So this problem started with JDK 6, but I can't guarantee it wasn't
> > > > already present in JDK 5 or 4. It showed up on Linux, and not on any other
> > > > OS like Windows or Mac OS X. It's not exactly fresh in my mind, because it
> > > > was already 10 years ago.
> > > >
> > >
> > > NIO support was added in Tomcat 6.0, supporting Java 5+, it wasn't very
> > > good then. It's only with Java 6 that NIO started getting epoll support and
> > > I'm pretty sure the original issue did not actually survive. Despite the
> > > popularity of the NIO connector this was not reported for Tomcat, if we got
> > > the report at the same time as the others it would be more logical so
> > > something is different here.
> > > https://github.com/netty/netty/issues/327 has details but I'm still not
> > > very convinced. You should give details on your platform and everything
> > > else since it's obvious at this point this is far less common with
> > Tomcat.
> >
> > There is not much I can tell about this issue, besides what I already said.
> > I can just stress that for a few users of MINA, this was a real burden,
> > and the very same for Netty, Grizzly and Jetty. I would be *very* surprised
> > if those four different projects, all based on NIO, were facing such an
> > issue, but Tomcat were immune to it.
> >
> 
> One person on the Netty issue I linked reported it on Tomcat, that's the
> only one I could find so it's far less common. It could still be useful to
> give info on the platform (was Java 11 and a recent Linux like RHEL8/Fedora
> tested ?) 

It's not about testing; the issue is pretty random. Suffice it to say that it 
happens in Java 8, as it happened in previous versions of Java. As you said 
later on in your response, whether it happens in Java 11 or not is beside the 
point as long as it happens in Java 8. It's pretty hard to make users switch to 
a newer version of Java, especially those who pay Oracle for support (OTOH, if 
they pay Oracle, and if it's a JDK bug...)

> and use pattern. If the issue still happens, I think this needs
> to be reported with OpenJDK (with details since it needs to be reproducible
> ...).

If only it were reproducible... We don't know what triggers this behavior, we 
just know that we have a working workaround.
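
For anyone who has not seen it, the workaround described at the top of this 
thread boils down to roughly the following. This is only a minimal sketch, not 
the actual MINA (or Grizzly, or Netty) code: the SpinGuard class and its method 
names are made up for illustration, and N=10 / M=100 ms are the MINA values 
quoted above.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

/** Hypothetical helper for illustration only; not MINA or Tomcat code. */
final class SpinGuard {
    static final int MAX_EMPTY_SELECTS = 10;              // N, the MINA value
    static final long WINDOW_NANOS = 100L * 1_000_000L;   // M = 100 ms

    private int emptySelects;
    private long windowStart;

    /**
     * Call after every select(). Returns true once select() has returned 0
     * more than N times within M ms. select() can legitimately return 0
     * (timeout, wakeup()), which is why the count is tied to a time window.
     */
    boolean recordSelect(int selected, long nowNanos) {
        if (selected > 0) {
            emptySelects = 0;              // real work arrived, start over
            return false;
        }
        if (emptySelects == 0 || nowNanos - windowStart > WINDOW_NANOS) {
            windowStart = nowNanos;        // open a new observation window
            emptySelects = 1;
            return false;
        }
        if (++emptySelects > MAX_EMPTY_SELECTS) {
            emptySelects = 0;
            return true;                   // the selector looks broken
        }
        return false;
    }

    /** Moves every valid registration to a new Selector and closes the old one. */
    static Selector rebuild(Selector broken) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : broken.keys()) {
            if (!key.isValid()) {
                continue;                  // already cancelled, nothing to move
            }
            SelectableChannel channel = key.channel();
            channel.register(fresh, key.interestOps(), key.attachment());
        }
        broken.close();                    // also cancels the old keys
        return fresh;
    }
}
```

In the select loop this would be used as something like
"if (guard.recordSelect(n, System.nanoTime())) selector = SpinGuard.rebuild(selector);". 
Anything that keeps a reference to the old SelectionKey objects has to be 
updated too, which the sketch leaves out.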

Also unless we can exhibit this behavior by logging the spinning select() 
(something we do in MINA), there is not much to show: consecutive thread dumps 
just show that one thread is always active in select(), but this is expected no 
matter what (except that it should be waiting, not active). It only proves that 
there is no other thread eating CPU. Also the stack ends in a native method 
(sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)), so there is not much 
we can tell about what's going on there. Maybe a flame graph could help as 
evidence?
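
To give an idea of what "logging the spinning select()" can look like, here is 
a rough sketch. It is not the MINA logging code: the TimedSelect class is 
hypothetical, it uses java.util.logging only to stay self-contained, and it 
assumes the loop calls select() with a positive timeout.

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

/** Hypothetical wrapper for illustration only; not MINA or Tomcat code. */
final class TimedSelect {
    private static final Logger LOG = Logger.getLogger(TimedSelect.class.getName());

    /** True when select() reported nothing ready but came back before its timeout. */
    static boolean returnedEarly(int selected, long timeoutMillis, long elapsedNanos) {
        return selected == 0
                && elapsedNanos < TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    /** select(timeout) plus a log line for every early, empty return. */
    static int select(Selector selector, long timeoutMillis) throws IOException {
        long start = System.nanoTime();
        int selected = selector.select(timeoutMillis);
        long elapsed = System.nanoTime() - start;
        if (returnedEarly(selected, timeoutMillis, elapsed)) {
            // A wakeup() or a thread interrupt also ends up here, so a single
            // line proves nothing; a burst of them within a few ms is the symptom.
            LOG.warning("select() returned 0 after " + elapsed / 1_000
                    + " us (timeout " + timeoutMillis + " ms)");
        }
        return selected;
    }
}
```

Thousands of such lines per second from a single thread, with no wakeup() 
calls to explain them, would be much stronger evidence than a thread dump.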

> 
> 
> >
> > > You should try the NIO2 connector first.
> >
> > I'll do that right away. If it fixes the 100% CPU usage I see from time to
> > time, then I would consider the issue resolved (there is no point in
> > working around something in the NIO code if NIO2 solves it...)
> >
> 
> Well, the main point is to know the behavior of NIO2 and that's it, what
> happens with NIO is independent.

Indeed.
Emmanuel
