[Redirecting to net-dev, nio-dev] Martin
On Tue, Jul 21, 2009 at 12:52, Ariel Weisberg <ar...@weisberg.ws> wrote:
> Hi all,
>
> It took a while for us to convince ourselves that this wasn't an
> application problem. I am attaching a test case that reliably reproduces
> the dead socket problem on some systems. The flow is essentially the same
> as the networking code in our messaging system.
>
> I had the best luck reproducing this on Dell Poweredge 2970s (two socket
> AMD) running CentOS 5.3. I dual booted two of them with Ubuntu server 9.04
> and have not succeeded in reproducing the problem with Ubuntu. I was not
> able to reproduce the problem on the Dell R610 (2 socket Nehalem) machines
> running CentOS 5.3 with the test application, although the actual app
> (messaging system) does have this issue on the R610s.
>
> I am very interested in hearing about what happens when other people run
> it. I am also interested in confirming that this is a sane use of
> Selectors, SocketChannels, and SelectionKeys.
>
> Thanks,
> Ariel Weisberg
>
> On Wed, 15 Jul 2009 14:24 -0700, "Martin Buchholz" <marti...@google.com>
> wrote:
>
> In summary, there are two different bugs at work here, and neither of
> them is in LBD. The hotspot team is working on the LBD deadlock. (As
> always) it would be good to have a good test case for the dead socket
> problem.
>
> Martin
>
> On Wed, Jul 15, 2009 at 12:24, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>> Hi,
>>
>> I have found that there are two different failure modes without involving
>> -XX:+UseMembar. There is the LBD deadlock, and then there is the dead
>> socket between two nodes. Either failure can occur with the same code and
>> settings. It appears that the dead socket problem is more common. The LBD
>> failure is also not correlated with any specific LBD (originally I saw it
>> with only the LBD for an Initiator's mailbox).
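[For readers who don't have the attachment: below is a minimal, self-contained sketch (my own invention, not Ariel's actual test case) of the kind of flow being discussed - one Selector, a non-blocking SocketChannel registered for OP_READ, and a loop that selects and drains. The class and method names are hypothetical, it uses Java 7 channel APIs for brevity, and it assumes an ASCII message so byte length equals string length.]

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class SelectorFlowSketch {
    /** Sends msg over a loopback connection and reads it back via a Selector. */
    public static String echoOnce(String msg) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel peer = server.accept();
        peer.write(ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8)));

        client.configureBlocking(false);
        Selector selector = Selector.open();
        SelectionKey key = client.register(selector, SelectionKey.OP_READ);

        ByteBuffer in = ByteBuffer.allocate(1024);
        while (in.position() < msg.length()) {  // assumes ASCII payload
            selector.select();                  // block until the channel is ready
            if (selector.selectedKeys().remove(key)) {
                client.read(in);                // drain whatever has arrived
            }
        }
        selector.close();
        client.close();
        peer.close();
        server.close();
        in.flip();
        return StandardCharsets.UTF_8.decode(in).toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(echoOnce("ping"));
    }
}
```

The real system's selector thread presumably also rewrites interest ops each pass, per the assertion code later in the thread; this sketch only covers the select-and-drain shape.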
>>
>> With -XX:+UseMembar the system is noticeably more reliable and tends to
>> run much longer without failing (although it can still fail immediately).
>> When it does fail it has been due to a dead connection. I have not
>> reproduced a deadlock on an LBD with -XX:+UseMembar.
>>
>> I also found that the dead socket issue was reproducible twice on Dell
>> Poweredge 2970s (two socket AMD). It takes an hour or so to reproduce the
>> dead socket problem on the 2970s. I have not recreated the LBD issue on
>> them, although given how difficult the socket issue is to reproduce it
>> may be that I have not run them long enough. On the AMD machines I did
>> not use -XX:+UseMembar.
>>
>> Ariel
>>
>> On Mon, 13 Jul 2009 18:59 -0400, "Ariel Weisberg" <ar...@weisberg.ws>
>> wrote:
>>
>> Hi all.
>>
>> Sorry Martin, I missed reading your last email. I am not confident that I
>> will get a small reproducible test case in a reasonable time frame.
>> Reproducing it with the application is easy, and I will see what I can do
>> about getting the source available.
>>
>> One interesting thing I can tell you is that if I remove the
>> LinkedBlockingDeque from the mailbox of the Initiator the system still
>> deadlocks. The cluster has a TCP mesh topology, so any node can deliver
>> messages to any other node. One of the connections goes dead and neither
>> side detects that there is a problem. I added some assertions to the
>> network selection thread to check that all the connections in the
>> cluster are still healthy and that they have the correct interest ops
>> set.
>>
>> Here are the things it checks to make sure each connection is working:
>>
>> > for (ForeignHost.Port port : foreignHostPorts) {
>> >     assert(port.m_selectionKey.isValid());
>> >     assert(port.m_selectionKey.selector() == m_selector);
>> >     assert(port.m_channel.isOpen());
>> >     assert(((SocketChannel)port.m_channel).isConnected());
>> >     assert(((SocketChannel)port.m_channel).socket().isInputShutdown() == false);
>> >     assert(((SocketChannel)port.m_channel).socket().isOutputShutdown() == false);
>> >     assert(((SocketChannel)port.m_channel).isOpen());
>> >     assert(((SocketChannel)port.m_channel).isRegistered());
>> >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) != null);
>> >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) == port.m_selectionKey);
>> >     if (m_selector.selectedKeys().contains(port.m_selectionKey)) {
>> >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_READ) != 0);
>> >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_WRITE) != 0);
>> >     } else {
>> >         if (port.isRunning()) {
>> >             assert(port.m_selectionKey.interestOps() == 0);
>> >         } else {
>> >             port.m_selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
>> >             assert((port.interestOps() & SelectionKey.OP_READ) != 0);
>> >             assert((port.interestOps() & SelectionKey.OP_WRITE) != 0);
>> >         }
>> >     }
>> >     assert(m_selector.isOpen());
>> >     assert(m_selector.keys().contains(port.m_selectionKey));
>> > }
>>
>> OP_READ | OP_WRITE is set as the interest ops every time through, and
>> there is no other code that changes the interest ops during execution.
>> The application will run for a while and then one of the connections
>> will stop being selected on both sides. If I step in with the debugger
>> on either side everything looks correct. The keys have the correct
>> interest ops and the selectors have the keys in their key sets.
>>
>> What I suspect is happening is that a bug on one node stops the socket
>> from being selected (for both read and write), and eventually the socket
>> fills up and can't be written to by the other side.
>>
>> If I can get my VPN access together tomorrow I will run with
>> -XX:+UseMembar and also try running on some 8-core AMD machines.
>> Otherwise I will have to get to it Wednesday.
>>
>> Thanks,
>>
>> Ariel Weisberg
>>
>> On Tue, 14 Jul 2009 05:00 +1000, "David Holmes" <davidchol...@aapt.net.au>
>> wrote:
>>
>> Martin,
>>
>> I don't think this is due to LBQ/D. This is looking similar to a couple
>> of other ReentrantLock/AQS "lost wakeup" hangs that I've got on the
>> radar. We have a reproducible test case for one issue, but it only fails
>> on one kind of system - x4450. I'm on vacation most of this week but
>> will try to get back to this next week.
>>
>> Ariel: one thing to try - please see if -XX:+UseMembar fixes the problem.
>>
>> Thanks,
>> David Holmes
>>
>> -----Original Message-----
>> *From:* Martin Buchholz [mailto:marti...@google.com]
>> *Sent:* Tuesday, 14 July 2009 8:38 AM
>> *To:* Ariel Weisberg
>> *Cc:* davidchol...@aapt.net.au; core-libs-dev;
>> concurrency-inter...@cs.oswego.edu
>> *Subject:* Re: [concurrency-interest] LinkedBlockingDeque deadlock?
>>
>> I did some stack trace eyeballing and did a mini-audit of the
>> LinkedBlockingDeque code, with a view to finding possible bugs, and came
>> up empty. Maybe it's a deep bug in hotspot?
>>
>> Ariel, it would be good if you could get a reproducible test case
>> soonish, while someone on the planet has the motivation and familiarity
>> to fix it. In another month I may disavow all knowledge of
>> j.u.c.*Blocking*
>>
>> Martin
>>
>> On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>>> Hi,
>>>
>>> > The poll()ing thread is blocked waiting for the internal lock, but
>>> > there's no indication of any thread owning that lock.
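[Ariel's hypothesis above - that once a socket stops being selected, the peer's writes eventually stall - is easy to demonstrate in isolation. The sketch below is my own hypothetical illustration, not code from the thread: a non-blocking sender writes to a peer that never reads, and once the kernel send and receive buffers fill, write() starts returning 0, which is exactly the stuck state a connection would reach if the reading side's selector went quiet. Names are invented; Java 7 channel APIs are assumed.]

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class SocketFillSketch {
    /** Writes until the kernel buffers fill and non-blocking write() returns 0. */
    public static long fillUntilBlocked() throws IOException, InterruptedException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        SocketChannel sender = SocketChannel.open();
        sender.setOption(StandardSocketOptions.SO_SNDBUF, 8 * 1024); // keep buffers small
        sender.connect(server.getLocalAddress());
        sender.configureBlocking(false);

        SocketChannel receiver = server.accept(); // accepted, but never read from

        ByteBuffer chunk = ByteBuffer.allocate(4096);
        long written = 0;
        int consecutiveZeroWrites = 0;
        while (consecutiveZeroWrites < 3) {
            chunk.clear();
            int n = sender.write(chunk);
            if (n == 0) {
                consecutiveZeroWrites++;          // send buffer is full: the "dead" state
                Thread.sleep(10);
            } else {
                consecutiveZeroWrites = 0;
                written += n;
            }
        }
        sender.close();
        receiver.close();
        server.close();
        return written;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fillUntilBlocked()
                + " bytes written before the connection backed up");
    }
}
```

In the real failure the reading side still has OP_READ set but is never selected, so from the writer's perspective the effect would look the same as this deliberate non-reader.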
>>> > You're using an OpenJDK 6 build ... can you try JDK7 ?
>>>
>>> I got a chance to do that today. I downloaded JDK 7 from
>>> http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
>>> and was able to reproduce the problem. I have attached the stack trace
>>> from running the 1.7 version. It is the same situation as before,
>>> except there are 9 execution sites running on each host. There are no
>>> threads that are missing or that have been restarted. The Foo Network
>>> thread (selector thread) and Network Thread - 0 are waiting on
>>> 0x00002aaab43d3b28. I also ran with JDK 7 and 6 and LinkedBlockingQueue
>>> and was not able to recreate the problem using that structure.
>>>
>>> > I don't recall anything similar to this, but I don't know what version
>>> > that OpenJDK6 build relates to.
>>>
>>> The cluster is running on CentOS 5.3.
>>>
>>> > [aweisb...@3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
>>> > Name        : java-1.6.0-openjdk     Relocations: (not relocatable)
>>> > Version     : 1.6.0.0                Vendor: CentOS
>>> > Release     : 0.30.b09.el5           Build Date: Tue 07 Apr 2009 07:24:52 PM EDT
>>> > Install Date: Thu 11 Jun 2009 03:27:46 PM EDT  Build Host: builder10.centos.org
>>> > Group       : Development/Languages  Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
>>> > Size        : 76336266               License: GPLv2 with exceptions
>>> > Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
>>> > URL         : http://icedtea.classpath.org/
>>> > Summary     : OpenJDK Runtime Environment
>>> > Description :
>>> > The OpenJDK runtime environment.
>>>
>>> > Make sure you haven't missed any exceptions occurring in other threads.
>>>
>>> There are no threads missing in the application (terminated threads are
>>> not replaced), and there is a try/catch pair (prints error and rethrows)
>>> around the run loop of each thread. It is possible that an exception
>>> may have been swallowed up somewhere.
>>>
>>> > A small reproducible test case from you would be useful.
>>>
>>> I am working on that. I wrote a test case that mimics the application's
>>> use of the LBD, but I have not succeeded in reproducing the problem in
>>> the test case. The app has a single thread (network selector) that
>>> polls the LBD and several threads (ExecutionSites, and network threads
>>> that return results from remote ExecutionSites) that offer results into
>>> the queue. About 120k items will go into/out of the deque each second.
>>> In the actual app the problem is reproducible but inconsistent. If I
>>> run on my dual core laptop I can't reproduce it, and it is less likely
>>> to occur with a small cluster, but with 6 nodes (~560k
>>> transactions/sec) the problem will usually appear. Sometimes the
>>> cluster will run for several minutes without issue and other times it
>>> will deadlock immediately.
>>>
>>> Thanks,
>>>
>>> Ariel
>>>
>>> On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"
>>> <marti...@google.com> wrote:
>>> > [+core-libs-dev]
>>> >
>>> > Doug Lea and I are (slowly) working on a new version of
>>> > LinkedBlockingDeque. I was not aware of a deadlock but can vaguely
>>> > imagine how it might happen. A small reproducible test case from you
>>> > would be useful.
>>> >
>>> > Unfinished work in progress can be found here:
>>> > http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
>>> >
>>> > Martin
>>>
>>> On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"
>>> <davidchol...@aapt.net.au> wrote:
>>> >
>>> > Ariel,
>>> >
>>> > The poll()ing thread is blocked waiting for the internal lock, but
>>> > there's no indication of any thread owning that lock. You're using an
>>> > OpenJDK 6 build ... can you try JDK7 ?
>>> >
>>> > I don't recall anything similar to this, but I don't know what version
>>> > that OpenJDK6 build relates to.
>>> >
>>> > Make sure you haven't missed any exceptions occurring in other
>>> > threads.
>>> >
>>> > David Holmes
>>> >
>>> > > -----Original Message-----
>>> > > From: concurrency-interest-boun...@cs.oswego.edu
>>> > > [mailto:concurrency-interest-boun...@cs.oswego.edu] On Behalf Of
>>> > > Ariel Weisberg
>>> > > Sent: Wednesday, 8 July 2009 8:31 AM
>>> > > To: concurrency-inter...@cs.oswego.edu
>>> > > Subject: [concurrency-interest] LinkedBlockingDeque deadlock?
>>> > >
>>> > > Hi all,
>>> > >
>>> > > I did a search on LinkedBlockingDeque and didn't find anything
>>> > > similar to what I am seeing. Attached is the stack trace from an
>>> > > application that is deadlocked, with three threads waiting for
>>> > > 0x00002aaab3e91080 (threads "ExecutionSite: 26", "ExecutionSite: 27",
>>> > > and "Network Selector"). The execution sites are attempting to offer
>>> > > results to the deque, and the network thread is trying to poll for
>>> > > them using the non-blocking version of poll. I am seeing the network
>>> > > thread never return from poll (straight poll()). Do my eyes deceive
>>> > > me?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Ariel Weisberg
>>
>> _______________________________________________
>> Concurrency-interest mailing list
>> concurrency-inter...@cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
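[For anyone who wants to poke at the LBD side of this without Ariel's application, here is a minimal, hypothetical sketch of the usage pattern described in the thread: several threads offer()ing into a LinkedBlockingDeque while a single thread drains it with straight non-blocking poll(). The names and counts are invented; on a healthy JVM it completes with matching counts, and on an affected machine one would expect the consumer to eventually spin forever while offering threads park on the deque's internal lock.]

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicLong;

public class LbdStressSketch {
    /** Several offering threads and one non-blocking poll()ing thread. */
    public static long[] run(final int producers, final int itemsPerProducer)
            throws InterruptedException {
        final LinkedBlockingDeque<Integer> deque = new LinkedBlockingDeque<Integer>();
        final CountDownLatch done = new CountDownLatch(producers);
        final AtomicLong offered = new AtomicLong();

        // Offering threads, standing in for the ExecutionSites / network threads.
        for (int p = 0; p < producers; p++) {
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < itemsPerProducer; i++) {
                        deque.offer(Integer.valueOf(i));
                        offered.incrementAndGet();
                    }
                    done.countDown();
                }
            }).start();
        }

        // Single consumer using straight non-blocking poll(), like the
        // network selector thread in the thread above.
        long polled = 0;
        final long expected = (long) producers * itemsPerProducer;
        while (polled < expected) {
            if (deque.poll() != null) {
                polled++;
            }
        }
        done.await();
        return new long[] { offered.get(), polled };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] counts = run(4, 100000);
        System.out.println(counts[0] + " offered, " + counts[1] + " polled");
    }
}
```

This only matches the shape of the traffic (one poller, many offerers, high throughput); it makes no claim to reproduce the hang, which per the thread seems to need specific hardware and sustained load.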