This is definitely not a Cassandra bug; something external is causing those connection failures.
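
The stack trace below is a plain "Connection refused" from the OS on the storage port, which means nothing was accepting connections on 7000 on the target at that moment, or something in between (a firewall, for example) was sending resets. One way to pin that down while the errors are happening is to probe the storage port from the affected node and see exactly how the attempt fails. A rough sketch follows; the hosts and the port are taken from the logs below, and this is only a connectivity check, not a Cassandra client:

#!/usr/bin/env python
"""Probe the Cassandra storage port on each peer and report how the
connection attempt fails. "refused" means nothing is listening on that
port (or a firewall is actively rejecting); a timeout points at packet
loss or filtering on the path. Hosts/port taken from this thread."""
import socket

PEERS = ["10.33.2.70", "10.33.3.10", "10.33.3.20"]
STORAGE_PORT = 7000

for host in PEERS:
    try:
        s = socket.create_connection((host, STORAGE_PORT), timeout=5)
        s.close()
        print("%s:%d connect OK" % (host, STORAGE_PORT))
    except socket.timeout:
        print("%s:%d TIMEOUT (filtered or lossy path?)" % (host, STORAGE_PORT))
    except socket.error as e:
        print("%s:%d FAILED: %s" % (host, STORAGE_PORT, e))

If lpc03 gets refused while a telnet to 7000 on the box itself succeeds, look at iptables/conntrack or whatever else sits between the racks. A sketch of the keepalive "pinger" read that AJ mentions below is at the end of this message.
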
On Sat, Jun 19, 2010 at 3:12 PM, AJ Slater <a...@zuno.com> wrote:
> Logging with TRACE reveals immediate problems even with no client requests coming in to the servers. The problem was immediate and persisted over the course of half an hour:
>
> 10.33.2.70 lpc03
> 10.33.3.10 fs01
> 10.33.3.20 fs02
>
> a...@lpc03:~$ grep unable /var/log/cassandra/output.log
> TRACE 14:07:52,104 unable to connect to /10.33.3.10
> ...
> TRACE 14:42:00,008 unable to connect to /10.33.3.20
> ...
> TRACE 14:42:06,751 unable to connect to /10.33.3.20
>
> Note that lpc03 has trouble talking to fs01 and fs02. But after seeing this I started logging TRACE on fs01 and fs02.
>
> During the six seconds before I restarted fs02:
>
> a...@fs01:~/logs$ grep unable /var/log/cassandra/output.log | grep unable
> Bad configuration; unable to start server
> TRACE 14:42:00,865 unable to connect to /10.33.3.20
> ...
> TRACE 14:42:06,730 unable to connect to /10.33.3.20
>
> Restarted fs02 and no issues in any of the logs.
>
> a...@fs02:~$ grep unable /var/log/cassandra/output.log
> a...@fs02:~$
>
> The unfiltered log messages all look more like:
>
> TRACE 14:42:06,248 unable to connect to /10.33.3.20
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>         at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>         at java.net.Socket.connect(Socket.java:529)
>         at java.net.Socket.connect(Socket.java:478)
>         at java.net.Socket.<init>(Socket.java:375)
>         at java.net.Socket.<init>(Socket.java:276)
>         at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:149)
>         at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:85)
>
> On Sat, Jun 19, 2010 at 2:19 PM, AJ Slater <a...@zuno.com> wrote:
>> I shall do just that. I did a bunch of tests this morning and the situation appears to be this:
>>
>> I have three nodes A, B and C, with RF=2. I understand now why this issue wasn't apparent with RF=3.
>>
>> If there are regular intranode column requests going on (e.g. I set up a pinger to get remote columns), the cluster functions normally. However, if no intranode column requests happen for a few hours (3 hours is the minimum I've seen, but sometimes it takes longer), things go wrong. Using node A as the point of contact from the client, all columns that live on A are returned in a timely fashion, but for columns that only live on B & C, the retrieval times out, with this in the log:
>>
>> INFO 13:13:28,345 error writing to /10.33.3.20
>>
>> No request for replicas, or consistency checks, are seen in the logs of B & C at this time. Using 'nodetool ring' from each of the three nodes shows all nodes as Up. Telnet from A to B on port 7000 connects. Tcpdump logs look, at first glance, like gossip communication, perhaps heartbeats, is proceeding normally, but I haven't really analyzed them.
>>
>> Fifteen minutes later, the cluster decided to behave normally again. Everyone talks to each other like buddies and delivers columns fast and regularly.
>>
>> This is really looking like a Cassandra bug. I'll report back with my TRACE log later and I expect I'll be opening a ticket.
>> The confidence level of my employer in my Cassandra solution to their petabyte data storage project is... uh... well... it could be better.
>>
>> AJ
>>
>> On Fri, Jun 18, 2010 at 8:16 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> Set log level to TRACE and see if the OutboundTcpConnection is going bad. That would explain the message never arriving.
>>>
>>> On Fri, Jun 18, 2010 at 10:39 AM, AJ Slater <a...@zuno.com> wrote:
>>>> To summarize:
>>>>
>>>> If a request for a column comes in *after a period of several hours with no requests*, then the node servicing the request hangs while looking for its peer rather than servicing the request like it should. It then throws either a TimedOutException or a (wrong) NotFoundException.
>>>>
>>>> And it doesn't appear to actually send the message it says it does to its peer. Or at least its peer doesn't report the request being received.
>>>>
>>>> And then the situation magically clears up after approximately 2 minutes.
>>>>
>>>> However, if the idle period never occurs, then the problem does not manifest. If I run a cron job with wget against my server every minute, I do not see the problem.
>>>>
>>>> I'll be looking at some tcpdump logs to see if I can suss out what's really happening, and perhaps file this as a bug. The several hours between reproducible events makes this whole thing aggravating for detection, debugging and, I'll assume, fixing, if it is indeed a Cassandra problem.
>>>>
>>>> It was suggested on IRC that it may be my network. But gossip is continually sending heartbeats, and nodetool and the logs show the nodes as up and available. If my network were flaking out I'd think it would be dropping heartbeats and I'd see that.
>>>>
>>>> AJ
>>>>
>>>> On Thu, Jun 17, 2010 at 2:26 PM, AJ Slater <a...@zuno.com> wrote:
>>>>> These are physical machines.
>>>>>
>>>>> storage-conf.xml.fs03 is here:
>>>>>
>>>>> http://pastebin.com/weL41NB1
>>>>>
>>>>> Diffs from that for the other two storage-confs are inline here:
>>>>>
>>>>> a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs01
>>>>> 185c185
>>>>>
>>>>>> <InitialToken>71603818521973537678586548668074777838</InitialToken>
>>>>> 229c229
>>>>> < <ListenAddress>10.33.2.70</ListenAddress>
>>>>> ---
>>>>>> <ListenAddress>10.33.3.10</ListenAddress>
>>>>> 241c241
>>>>> < <ThriftAddress>10.33.2.70</ThriftAddress>
>>>>> ---
>>>>>> <ThriftAddress>10.33.3.10</ThriftAddress>
>>>>> 341c341
>>>>> < <ConcurrentReads>16</ConcurrentReads>
>>>>> ---
>>>>>> <ConcurrentReads>4</ConcurrentReads>
>>>>>
>>>>> a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs02
>>>>> 185c185
>>>>> < <InitialToken>0</InitialToken>
>>>>> ---
>>>>>> <InitialToken>120215585224964746744782921158327379306</InitialToken>
>>>>> 206d205
>>>>> < <Seed>10.33.3.20</Seed>
>>>>> 229c228
>>>>> < <ListenAddress>10.33.2.70</ListenAddress>
>>>>> ---
>>>>>> <ListenAddress>10.33.3.20</ListenAddress>
>>>>> 241c240
>>>>> < <ThriftAddress>10.33.2.70</ThriftAddress>
>>>>> ---
>>>>>> <ThriftAddress>10.33.3.20</ThriftAddress>
>>>>> 341c340
>>>>> < <ConcurrentReads>16</ConcurrentReads>
>>>>> ---
>>>>>> <ConcurrentReads>4</ConcurrentReads>
>>>>>
>>>>> Thank you for your attention,
>>>>>
>>>>> AJ
>>>>>
>>>>> On Thu, Jun 17, 2010 at 2:09 PM, Benjamin Black <b...@b3k.us> wrote:
>>>>>> Are these physical machines or virtuals?
>>>>>> Did you post your cassandra.in.sh and storage-conf.xml someplace?
>>>>>>
>>>>>> On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater <a...@zuno.com> wrote:
>>>>>>> Total data size in the entire cluster is about twenty 12k images, with no other load on the system. I just ask for one column and I get these timeouts. Performing multiple gets on the columns leads to multiple timeouts for a period of a few seconds or minutes, and then the situation magically resolves itself and response times are down to single-digit milliseconds for a column get.
>>>>>>>
>>>>>>> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater <a...@zuno.com> wrote:
>>>>>>>> Cassandra 0.6.2 from the apache debian source. Ubuntu Jaunty. Sun Java6 JVM.
>>>>>>>>
>>>>>>>> All nodes in separate racks at 365 Main.
>>>>>>>>
>>>>>>>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater <a...@zuno.com> wrote:
>>>>>>>>> I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce consistently, but it seems to happen most often after it's been a long time between reads. After presenting itself for a couple of minutes, the problem then goes away.
>>>>>>>>>
>>>>>>>>> I've got a three node cluster with replication factor 2, reading at consistency level ONE. The columns being read are around 12k each. The nodes are 8GB multicore boxes with the JVM limits between 4GB and 6GB.
>>>>>>>>>
>>>>>>>>> Here's an application log from early this morning when a developer in Belgrade accessed the system:
>>>>>>>>>
>>>>>>>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: 5827067133c3d670071c17d9144f0b49
>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 | TimedOutException for Image 5827067133c3d670071c17d9144f0b49
>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 10005.388975 ms
>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: af8caf3b76ce97d13812ddf795104a5c
>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 3.658056 ms
>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Transform took 0.978947 ms
>>>>>>>>>
>>>>>>>>> That's a Timeout and then a successful get of another column.
>>>>>>>>>
>>>>>>>>> Here's the cassandra log for 10.33.2.70:
>>>>>>>>>
>>>>>>>>> DEBUG 03:54:17,070 get_slice
>>>>>>>>> DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>>>>>>> DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100) from 45138@/10.33.3.10
>>>>>>>>> DEBUG 03:54:27,077 get_slice
>>>>>>>>> DEBUG 03:54:27,078 weakreadlocal reading SliceFromReadCommand(table='jolitics.com', key='af8caf3b76ce97d13812ddf795104a5c', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>>>>>>> DEBUG 03:54:27,079 collecting body:false:1...@1275951327610885
>>>>>>>>> DEBUG 03:54:27,080 collecting body:false:1...@1275951327610885
>>>>>>>>> DEBUG 03:54:27,080 Reading consistency digest for af8caf3b76ce97d13812ddf795104a5c from 45...@[/10.33.2.70, /10.33.3.10]
>>>>>>>>> DEBUG 03:54:50,779 Disseminating load info ...
>>>>>>>>>
>>>>>>>>> It looks like it asks for key='5827067133c3d670071c17d9144f0b49' from the local host and also queries 10.33.3.10 for the first one, and then for 'af8caf3b76ce97d13812ddf795104a5c' it only queries the local host and then returns appropriately.
>>>>>>>>>
>>>>>>>>> Here's the log for 10.33.3.10 around that time:
>>>>>>>>>
>>>>>>>>> DEBUG 03:54:19,645 Disseminating load info ...
>>>>>>>>> DEBUG 03:55:19,645 Disseminating load info ...
>>>>>>>>> DEBUG 03:56:19,646 Disseminating load info ...
>>>>>>>>> DEBUG 03:57:19,645 Disseminating load info ...
>>>>>>>>> DEBUG 03:58:19,645 Disseminating load info ...
>>>>>>>>> DEBUG 03:59:19,646 Disseminating load info ...
>>>>>>>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving 55875144 used; max is 6580535296
>>>>>>>>>
>>>>>>>>> No record of communication from 10.33.2.70.
>>>>>>>>>
>>>>>>>>> Does this ring any bells for anyone? I can of course attach storage-confs for all nodes if that sounds useful, and I'll be on #cassandra as ajslater.
>>>>>>>>>
>>>>>>>>> Much thanks for taking a look and any suggestions. We fear we'll have to abandon Cassandra if this bug cannot be resolved.
>>>>>>>>>
>>>>>>>>> AJ
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
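
For reference, here is a minimal sketch of the keepalive "pinger" AJ describes above: a periodic read at consistency level ONE that keeps the inter-node connections from sitting idle for hours. This assumes a pycassa-style client (the ConnectionPool/ColumnFamily API of newer pycassa; the 0.6-era pycassa API differs), the default Thrift port 9160, and the keyspace ('jolitics.com'), column family ('Images'), and key taken from the logs earlier in the thread. It is a workaround sketch, not a fix for whatever is dropping the connections.

#!/usr/bin/env python
"""Keepalive 'pinger': periodically read one column at CL.ONE so the
inter-node connections never sit idle for hours. Workaround sketch only.
Assumes a newer pycassa client; keyspace, column family, and key are the
ones that appear in this thread's logs."""
import time
import pycassa

# Connect through one node; 9160 is the default Thrift port (assumption).
pool = pycassa.ConnectionPool('jolitics.com', ['10.33.2.70:9160'])
images = pycassa.ColumnFamily(pool, 'Images')
KEY = '5827067133c3d670071c17d9144f0b49'  # pick a key that lives on a remote replica

while True:
    try:
        # A CL.ONE read that has to touch a remote replica keeps the
        # OutboundTcpConnection between nodes exercised.
        images.get(KEY, column_count=1,
                   read_consistency_level=pycassa.ConsistencyLevel.ONE)
    except Exception as exc:
        print('keepalive read failed: %s' % exc)
    time.sleep(60)

A cron job running wget against the application every minute, as AJ describes, accomplishes the same thing, as long as the request it triggers reads a column that lives on a remote replica.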