Are these physical machines or virtuals? Did you post your cassandra.in.sh and storage-conf.xml someplace?
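The parts I'd look at first are RpcTimeoutInMillis and the keyspace/replication section. For reference, the relevant bits of a 0.6 storage-conf.xml look roughly like the following (values are illustrative placeholders, not your actual config; the CompareWith and placement strategy in particular are guesses):

<Keyspaces>
  <Keyspace Name="jolitics.com">
    <ColumnFamily Name="Images" CompareWith="BytesType"/>
    <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
    <ReplicationFactor>2</ReplicationFactor>
    <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
  </Keyspace>
</Keyspaces>

<!-- How long a coordinator waits on a replica before giving up.
     If this is still the stock 10000 ms, it would line up with the
     ~10005 ms "Image Get" in the log you pasted below. -->
<RpcTimeoutInMillis>10000</RpcTimeoutInMillis>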
On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater <a...@zuno.com> wrote:
> Total data size in the entire cluster is about twenty 12k images, with
> no other load on the system. I just ask for one column and I get these
> timeouts. Performing multiple gets on the columns leads to multiple
> timeouts for a period of a few seconds or minutes, and then the
> situation magically resolves itself and response times are down to
> single-digit milliseconds for a column get.
>
> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater <a...@zuno.com> wrote:
>> Cassandra 0.6.2 from the Apache Debian source.
>> Ubuntu Jaunty. Sun Java 6 JVM.
>>
>> All nodes in separate racks at 365 Main.
>>
>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater <a...@zuno.com> wrote:
>>> I'm seeing 10s timeouts on reads a few times a day. It's hard to
>>> reproduce consistently, but it seems to happen most often after it's
>>> been a long time between reads. After presenting itself for a couple
>>> of minutes the problem then goes away.
>>>
>>> I've got a three-node cluster with replication factor 2, reading at
>>> consistency level ONE. The columns being read are around 12k each. The
>>> nodes are 8GB multicore boxes with the JVM limits between 4GB and 6GB.
>>>
>>> Here's an application log from early this morning when a developer in
>>> Belgrade accessed the system:
>>>
>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>> Requested image_id: 5827067133c3d670071c17d9144f0b49
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 |
>>> TimedOutException for Image 5827067133c3d670071c17d9144f0b49
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>> Get took 10005.388975 ms
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>> Requested image_id: af8caf3b76ce97d13812ddf795104a5c
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>> Get took 3.658056 ms
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>> Transform took 0.978947 ms
>>>
>>> That's a timeout, and then a successful get of another column.
>>>
>>> Here's the Cassandra log for 10.33.2.70:
>>>
>>> DEBUG 03:54:17,070 get_slice
>>> DEBUG 03:54:17,071 weakreadremote reading
>>> SliceFromReadCommand(table='jolitics.com',
>>> key='5827067133c3d670071c17d9144f0b49',
>>> column_parent='QueryPath(columnFamilyName='Images',
>>> superColumnName='null', columnName='null')', start='', finish='',
>>> reversed=false, count=100)
>>> DEBUG 03:54:17,071 weakreadremote reading
>>> SliceFromReadCommand(table='jolitics.com',
>>> key='5827067133c3d670071c17d9144f0b49',
>>> column_parent='QueryPath(columnFamilyName='Images',
>>> superColumnName='null', columnName='null')', start='', finish='',
>>> reversed=false, count=100) from 45138@/10.33.3.10
>>> DEBUG 03:54:27,077 get_slice
>>> DEBUG 03:54:27,078 weakreadlocal reading
>>> SliceFromReadCommand(table='jolitics.com',
>>> key='af8caf3b76ce97d13812ddf795104a5c',
>>> column_parent='QueryPath(columnFamilyName='Images',
>>> superColumnName='null', columnName='null')', start='', finish='',
>>> reversed=false, count=100)
>>> DEBUG 03:54:27,079 collecting body:false:1...@1275951327610885
>>> DEBUG 03:54:27,080 collecting body:false:1...@1275951327610885
>>> DEBUG 03:54:27,080 Reading consistency digest for
>>> af8caf3b76ce97d13812ddf795104a5c from 45...@[/10.33.2.70, /10.33.3.10]
>>> DEBUG 03:54:50,779 Disseminating load info ...
>>>
>>> It looks like it asks for key='5827067133c3d670071c17d9144f0b49' from
>>> the local host and also queries 10.33.3.10 for the first one, and then
>>> for 'af8caf3b76ce97d13812ddf795104a5c' it only queries the local host
>>> and returns promptly.
>>>
>>> Here's the log for 10.33.3.10 around that time:
>>>
>>> DEBUG 03:54:19,645 Disseminating load info ...
>>> DEBUG 03:55:19,645 Disseminating load info ...
>>> DEBUG 03:56:19,646 Disseminating load info ...
>>> DEBUG 03:57:19,645 Disseminating load info ...
>>> DEBUG 03:58:19,645 Disseminating load info ...
>>> DEBUG 03:59:19,646 Disseminating load info ...
>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving
>>> 55875144 used; max is 6580535296
>>>
>>> No record of communication from 10.33.2.70.
>>>
>>> Does this ring any bells for anyone? I can of course attach
>>> storage-confs for all nodes if that sounds useful, and I'll be on
>>> #cassandra as ajslater.
>>>
>>> Many thanks for taking a look, and for any suggestions. We fear we'll
>>> have to abandon Cassandra if this bug cannot be resolved.
>>>
>>> AJ
>>>
>>
>
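Not a fix for whatever is actually going wrong, but as a stopgap your read path could fall back to the other node when one coordinator times out. A rough sketch against the generated 0.6 Thrift bindings (illustrative only, not your pinhole.py code; the node IPs, keyspace, column family, and slice parameters are taken from your logs, everything else is made up):

# Sketch: retry the same get_slice against the other node when one
# coordinator throws TimedOutException. Assumes the Python bindings
# generated from interface/cassandra.thrift are importable as 'cassandra'.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              ConsistencyLevel, TimedOutException)

NODES = ['10.33.2.70', '10.33.3.10']   # from your logs

def open_client(host, port=9160):
    # Buffered transport on the standard Thrift port for 0.6.
    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    transport.open()
    return Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))

def get_image_columns(key):
    # Same slice your app issues: whole row, up to 100 columns, CL.ONE.
    parent = ColumnParent(column_family='Images')
    predicate = SlicePredicate(slice_range=SliceRange(
        start='', finish='', reversed=False, count=100))
    for host in NODES:
        try:
            client = open_client(host)
            return client.get_slice('jolitics.com', key, parent,
                                    predicate, ConsistencyLevel.ONE)
        except TimedOutException:
            continue   # this coordinator gave up; try the next one
    raise RuntimeError('all nodes timed out for key %r' % key)

That only masks the symptom, though; the interesting question is still why 10.33.3.10 never logs the read that 10.33.2.70 forwarded to it.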