Set the log level to TRACE and see if the OutboundTcpConnection is going bad; that would explain the message never arriving.
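A minimal sketch of that change, assuming the stock log4j.properties shipped with 0.6.x (the file name and class names are assumptions; adjust to your packaging). Scoping TRACE to the inter-node connection class keeps the extra log volume manageable:

    # Leave the root logger at its current level; turn up only the
    # inter-node connection class (names assume Cassandra 0.6.x).
    log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=TRACE
    # Or widen to the whole messaging package to see both directions:
    # log4j.logger.org.apache.cassandra.net=TRACE

The node needs a restart to pick up the change unless log4j is configured to watch the file.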
On Fri, Jun 18, 2010 at 10:39 AM, AJ Slater <a...@zuno.com> wrote:
> To summarize:
>
> If a request for a column comes in *after a period of several hours
> with no requests*, then the node servicing the request hangs while
> looking for its peer rather than servicing the request like it should.
> It then throws either a TimedOutException or a (wrong)
> NotFoundException.
>
> And it doesn't appear to actually send the message it says it does to
> its peer. Or at least its peer doesn't report the request being
> received.
>
> And then the situation magically clears up after approximately 2 minutes.
>
> However, if the idle period never occurs, then the problem does not
> manifest. If I run a cron job with wget against my server every
> minute, I do not see the problem.
>
> I'll be looking at some tcpdump logs to see if I can suss out what's
> really happening, and perhaps file this as a bug. The several hours
> between reproducible events makes this whole thing aggravating for
> detection, debugging and, I'll assume, fixing, if it is indeed a
> Cassandra problem.
>
> It was suggested on IRC that it may be my network. But gossip is
> continually sending heartbeats, and nodetool and the logs show the
> nodes as up and available. If my network were flaking out I'd think it
> would be dropping heartbeats and I'd see that.
>
> AJ
>
> On Thu, Jun 17, 2010 at 2:26 PM, AJ Slater <a...@zuno.com> wrote:
>> These are physical machines.
>>
>> storage-conf.xml.fs03 is here:
>>
>> http://pastebin.com/weL41NB1
>>
>> Diffs from that for the other two storage-confs are inline here:
>>
>> a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs01
>> 185c185
>> < <InitialToken>0</InitialToken>
>> ---
>>> <InitialToken>71603818521973537678586548668074777838</InitialToken>
>> 229c229
>> < <ListenAddress>10.33.2.70</ListenAddress>
>> ---
>>> <ListenAddress>10.33.3.10</ListenAddress>
>> 241c241
>> < <ThriftAddress>10.33.2.70</ThriftAddress>
>> ---
>>> <ThriftAddress>10.33.3.10</ThriftAddress>
>> 341c341
>> < <ConcurrentReads>16</ConcurrentReads>
>> ---
>>> <ConcurrentReads>4</ConcurrentReads>
>>
>> a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs02
>> 185c185
>> < <InitialToken>0</InitialToken>
>> ---
>>> <InitialToken>120215585224964746744782921158327379306</InitialToken>
>> 206d205
>> < <Seed>10.33.3.20</Seed>
>> 229c228
>> < <ListenAddress>10.33.2.70</ListenAddress>
>> ---
>>> <ListenAddress>10.33.3.20</ListenAddress>
>> 241c240
>> < <ThriftAddress>10.33.2.70</ThriftAddress>
>> ---
>>> <ThriftAddress>10.33.3.20</ThriftAddress>
>> 341c340
>> < <ConcurrentReads>16</ConcurrentReads>
>> ---
>>> <ConcurrentReads>4</ConcurrentReads>
>>
>> Thank you for your attention,
>>
>> AJ
>>
>>
>> On Thu, Jun 17, 2010 at 2:09 PM, Benjamin Black <b...@b3k.us> wrote:
>>> Are these physical machines or virtuals? Did you post your
>>> cassandra.in.sh and storage-conf.xml someplace?
>>>
>>> On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater <a...@zuno.com> wrote:
>>>> Total data size in the entire cluster is about twenty 12k images, with
>>>> no other load on the system. I just ask for one column and I get these
>>>> timeouts. Performing multiple gets on the columns leads to multiple
>>>> timeouts for a period of a few seconds or minutes, and then the
>>>> situation magically resolves itself and response times are down to
>>>> single-digit milliseconds for a column get.
>>>>
>>>> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater <a...@zuno.com> wrote:
>>>>> Cassandra 0.6.2 from the Apache Debian source.
>>>>> Ubuntu Jaunty. Sun Java 6 JVM.
>>>>>
>>>>> All nodes in separate racks at 365 Main.
>>>>>
>>>>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater <a...@zuno.com> wrote:
>>>>>> I'm seeing 10s timeouts on reads a few times a day. It's hard to
>>>>>> reproduce consistently, but it seems to happen most often after it's
>>>>>> been a long time between reads. After presenting itself for a couple
>>>>>> of minutes, the problem then goes away.
>>>>>>
>>>>>> I've got a three-node cluster with replication factor 2, reading at
>>>>>> consistency level ONE. The columns being read are around 12k each. The
>>>>>> nodes are 8GB multicore boxes with the JVM limits between 4GB and 6GB.
>>>>>>
>>>>>> Here's an application log from early this morning when a developer in
>>>>>> Belgrade accessed the system:
>>>>>>
>>>>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: 5827067133c3d670071c17d9144f0b49
>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 | TimedOutException for Image 5827067133c3d670071c17d9144f0b49
>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 10005.388975 ms
>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: af8caf3b76ce97d13812ddf795104a5c
>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 3.658056 ms
>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Transform took 0.978947 ms
>>>>>>
>>>>>> That's a timeout and then a successful get of another column.
>>>>>>
>>>>>> Here's the Cassandra log for 10.33.2.70:
>>>>>>
>>>>>> DEBUG 03:54:17,070 get_slice
>>>>>> DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>>>> DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100) from 45138@/10.33.3.10
>>>>>> DEBUG 03:54:27,077 get_slice
>>>>>> DEBUG 03:54:27,078 weakreadlocal reading SliceFromReadCommand(table='jolitics.com', key='af8caf3b76ce97d13812ddf795104a5c', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>>>> DEBUG 03:54:27,079 collecting body:false:1...@1275951327610885
>>>>>> DEBUG 03:54:27,080 collecting body:false:1...@1275951327610885
>>>>>> DEBUG 03:54:27,080 Reading consistency digest for af8caf3b76ce97d13812ddf795104a5c from 45...@[/10.33.2.70, /10.33.3.10]
>>>>>> DEBUG 03:54:50,779 Disseminating load info ...
>>>>>>
>>>>>> It looks like it asks for key='5827067133c3d670071c17d9144f0b49' from
>>>>>> the local host and also queries 10.33.3.10 for the first one, and then
>>>>>> for 'af8caf3b76ce97d13812ddf795104a5c' it only queries the local host
>>>>>> and then returns appropriately.
>>>>>>
>>>>>> Here's the log for 10.33.3.10 around that time:
>>>>>>
>>>>>> DEBUG 03:54:19,645 Disseminating load info ...
>>>>>> DEBUG 03:55:19,645 Disseminating load info ...
>>>>>> DEBUG 03:56:19,646 Disseminating load info ...
>>>>>> DEBUG 03:57:19,645 Disseminating load info ...
>>>>>> DEBUG 03:58:19,645 Disseminating load info ...
>>>>>> DEBUG 03:59:19,646 Disseminating load info ...
>>>>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving 55875144 used; max is 6580535296
>>>>>>
>>>>>> No record of communication from 10.33.2.70.
>>>>>>
>>>>>> Does this ring any bells for anyone? I can of course attach
>>>>>> storage-confs for all nodes if that sounds useful, and I'll be on
>>>>>> #cassandra as ajslater.
>>>>>>
>>>>>> Much thanks for taking a look and any suggestions. We fear we'll have
>>>>>> to abandon Cassandra if this bug cannot be resolved.
>>>>>>
>>>>>> AJ
>>>>>>
>>>>>
>>>>
>>>
>>
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
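For the tcpdump pass AJ mentions above, restricting the capture to inter-node traffic between the coordinator and its peer keeps the trace small. A rough sketch, assuming the default StoragePort of 7000 and an interface name of eth0 (both are assumptions; check storage-conf.xml and the actual NIC):

    # on 10.33.2.70: watch traffic to/from the peer on the storage port
    sudo tcpdump -i eth0 -n -tttt 'host 10.33.3.10 and port 7000' -w internode.pcap

The every-minute wget that masks the problem would look something like this as a crontab entry (the URL is a placeholder for whatever pinhole endpoint triggers a column read):

    * * * * * wget -q -O /dev/null http://lpc03.example.com/image/some-image-id

Comparing a capture taken across an idle gap with one taken while the keep-alive is running should show whether the connection to 10.33.3.10 is being torn down or silently dropped during the quiet period.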
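On the application side, the 10005 ms "Image Get" above is the node's RPC timeout expiring (presumably the RpcTimeoutInMillis setting, 10000 ms by default), and pinhole.py appears to surface the TimedOutException directly. Below is a minimal retry sketch against the 0.6 Thrift API; it assumes the generated Python bindings (cassandra.Cassandra / cassandra.ttypes), the default unframed transport on port 9160, and the keyspace/column family names visible in the logs. The connect/get_image_columns helpers and the retry policy are illustrative, not pinhole's actual code:

    import time

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import (ColumnParent, ConsistencyLevel,
                                  SlicePredicate, SliceRange,
                                  TimedOutException)

    KEYSPACE = 'jolitics.com'   # from the logs above
    IMAGES_CF = 'Images'

    def connect(host, port=9160, timeout_ms=12000):
        # client-side socket timeout a bit above the server's 10s rpc timeout
        socket = TSocket.TSocket(host, port)
        socket.setTimeout(timeout_ms)
        transport = TTransport.TBufferedTransport(socket)
        transport.open()
        return Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))

    def get_image_columns(client, image_id, retries=2, backoff_s=0.5):
        # get_slice at ConsistencyLevel.ONE, retrying on TimedOutException
        parent = ColumnParent(column_family=IMAGES_CF)
        predicate = SlicePredicate(slice_range=SliceRange(
            start='', finish='', reversed=False, count=100))
        for attempt in range(retries + 1):
            try:
                return client.get_slice(KEYSPACE, image_id, parent,
                                        predicate, ConsistencyLevel.ONE)
            except TimedOutException:
                if attempt == retries:
                    raise               # let the caller log the failure
                time.sleep(backoff_s * (attempt + 1))

    # example:
    #   client = connect('10.33.2.70')
    #   columns = get_image_columns(client, '5827067133c3d670071c17d9144f0b49')

Retrying only papers over the stall, of course; it doesn't explain why the forwarded read never shows up on 10.33.3.10.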