These are physical machines.

storage-conf.xml.lpc03 is here:

http://pastebin.com/weL41NB1

Diffs from that for the other two storage-confs are inline here:

a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs01
185c185
<   <InitialToken>0</InitialToken>
---
>   <InitialToken>71603818521973537678586548668074777838</InitialToken>
229c229
<   <ListenAddress>10.33.2.70</ListenAddress>
---
>   <ListenAddress>10.33.3.10</ListenAddress>
241c241
<   <ThriftAddress>10.33.2.70</ThriftAddress>
---
>   <ThriftAddress>10.33.3.10</ThriftAddress>
341c341
<   <ConcurrentReads>16</ConcurrentReads>
---
>   <ConcurrentReads>4</ConcurrentReads>


a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs02
185c185
<   <InitialToken>0</InitialToken>
---
>   <InitialToken>120215585224964746744782921158327379306</InitialToken>
206d205
<       <Seed>10.33.3.20</Seed>
229c228
<   <ListenAddress>10.33.2.70</ListenAddress>
---
>   <ListenAddress>10.33.3.20</ListenAddress>
241c240
<   <ThriftAddress>10.33.2.70</ThriftAddress>
---
>   <ThriftAddress>10.33.3.20</ThriftAddress>
341c340
<   <ConcurrentReads>16</ConcurrentReads>
---
>   <ConcurrentReads>4</ConcurrentReads>
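
For reference, the InitialToken values above sit in the 0 .. 2**127 token
space of RandomPartitioner (the 0.6 default; assuming that's the partitioner
in use here). A throwaway Python sketch of how evenly spaced tokens for an
N-node ring are usually computed, not necessarily how the tokens above were
chosen:

    # Evenly spaced tokens for a RandomPartitioner ring (token space 0 .. 2**127).
    def balanced_tokens(node_count):
        return [i * (2 ** 127) // node_count for i in range(node_count)]

    for i, token in enumerate(balanced_tokens(3)):
        print "node %d: <InitialToken>%d</InitialToken>" % (i, token)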


Thank you for your attention,

AJ


On Thu, Jun 17, 2010 at 2:09 PM, Benjamin Black <b...@b3k.us> wrote:
> Are these physical machines or virtuals?  Did you post your
> cassandra.in.sh and storage-conf.xml someplace?
>
> On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater <a...@zuno.com> wrote:
>> Total data size in the entire cluster is about twenty 12k images, with
>> no other load on the system. I just ask for one column and I get these
>> timeouts. Performing multiple gets on the columns leads to multiple
>> timeouts for a period of a few seconds or minutes and then the
>> situation magically resolves itself and response times are down to
>> single digit milliseconds for a column get.
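>>
>> For context, the read itself is just a whole-row get of a single ~12k
>> column with timing around it. A rough sketch of the app-side call (pycassa
>> assumed; the function and variable names here are placeholders, not the
>> actual pinhole.py code):
>>
>>     import time
>>     from pycassa.cassandra.ttypes import TimedOutException
>>
>>     def timed_get(images_cf, image_id):
>>         # images_cf: a pycassa ColumnFamily handle on the 'Images' CF in the
>>         # 'jolitics.com' keyspace (names taken from the debug logs below).
>>         start = time.time()
>>         try:
>>             # pycassa's default get() fetches up to 100 columns, matching the
>>             # get_slice(count=100) calls in the Cassandra logs below.
>>             return images_cf.get(image_id)
>>         except TimedOutException:
>>             print "TimedOutException for Image %s" % image_id
>>             raise
>>         finally:
>>             print "Image Get took %f ms" % ((time.time() - start) * 1000)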
>>
>> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater <a...@zuno.com> wrote:
>>> Cassandra 0.6.2 from the Apache Debian source.
>>> Ubuntu Jaunty. Sun Java 6 JVM.
>>>
>>> All nodes are in separate racks at 365 Main.
>>>
>>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater <a...@zuno.com> wrote:
>>>> I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce
>>>> consistently, but it seems to happen most often after it's been a long time
>>>> between reads. After presenting itself for a couple of minutes, the
>>>> problem then goes away.
>>>>
>>>> I've got a three-node cluster with replication factor 2, reading at
>>>> consistency level ONE. The columns being read are around 12k each. The
>>>> nodes are 8GB multicore boxes with JVM limits between 4GB and 6GB.
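>>>>
>>>> For reference, a consistency level ONE read against this cluster looks
>>>> roughly like the following with pycassa (a sketch, not our actual client
>>>> code; the 0.6-era pycassa connection API differs slightly, and the default
>>>> Thrift port 9160 is assumed). Keyspace and column family names are the
>>>> ones that appear in the debug logs below:
>>>>
>>>>     import pycassa
>>>>     from pycassa.cassandra.ttypes import ConsistencyLevel
>>>>
>>>>     pool = pycassa.ConnectionPool('jolitics.com',
>>>>                                   server_list=['10.33.2.70:9160'])
>>>>     images = pycassa.ColumnFamily(pool, 'Images')
>>>>     # With RF=2 and CL ONE, a single replica's response satisfies the read.
>>>>     images.read_consistency_level = ConsistencyLevel.ONE
>>>>
>>>>     row = images.get('5827067133c3d670071c17d9144f0b49')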
>>>>
>>>> Here's an application log from early this morning when a developer in
>>>> Belgrade accessed the system:
>>>>
>>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>>> Requested image_id: 5827067133c3d670071c17d9144f0b49
>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 |
>>>> TimedOutException for Image 5827067133c3d670071c17d9144f0b49
>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>>> Get took 10005.388975 ms
>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>>> Requested image_id: af8caf3b76ce97d13812ddf795104a5c
>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>>> Get took 3.658056 ms
>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>>> Transform took 0.978947 ms
>>>>
>>>> That's a Timeout and then a successful get of another column.
>>>>
>>>> Here's the Cassandra log for 10.33.2.70:
>>>>
>>>> DEBUG 03:54:17,070 get_slice
>>>> DEBUG 03:54:17,071 weakreadremote reading
>>>> SliceFromReadCommand(table='jolitics.com',
>>>> key='5827067133c3d670071c17d9144f0b49',
>>>> column_parent='QueryPath(columnFamilyName='Images',
>>>> superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>> DEBUG 03:54:17,071 weakreadremote reading
>>>> SliceFromReadCommand(table='jolitics.com',
>>>> key='5827067133c3d670071c17d9144f0b49',
>>>> column_parent='QueryPath(columnFamilyName='Images',
>>>> superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100) from 45138@/10.33.3.10
>>>> DEBUG 03:54:27,077 get_slice
>>>> DEBUG 03:54:27,078 weakreadlocal reading
>>>> SliceFromReadCommand(table='jolitics.com',
>>>> key='af8caf3b76ce97d13812ddf795104a5c',
>>>> column_parent='QueryPath(columnFamilyName='Images',
>>>> superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>> DEBUG 03:54:27,079 collecting body:false:1...@1275951327610885
>>>> DEBUG 03:54:27,080 collecting body:false:1...@1275951327610885
>>>> DEBUG 03:54:27,080 Reading consistency digest for af8caf3b76ce97d13812ddf795104a5c from 45...@[/10.33.2.70, /10.33.3.10]
>>>> DEBUG 03:54:50,779 Disseminating load info ...
>>>>
>>>> It looks like, for the first key ('5827067133c3d670071c17d9144f0b49'), it
>>>> asks the local host and also queries 10.33.3.10, and then for the second
>>>> key ('af8caf3b76ce97d13812ddf795104a5c') it only queries the local host
>>>> and returns appropriately.
>>>>
>>>> Here's the log for 10.33.3.10 around that time:
>>>>
>>>> DEBUG 03:54:19,645 Disseminating load info ...
>>>> DEBUG 03:55:19,645 Disseminating load info ...
>>>> DEBUG 03:56:19,646 Disseminating load info ...
>>>> DEBUG 03:57:19,645 Disseminating load info ...
>>>> DEBUG 03:58:19,645 Disseminating load info ...
>>>> DEBUG 03:59:19,646 Disseminating load info ...
>>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving
>>>> 55875144 used; max is 6580535296
>>>>
>>>> No record of communication from 10.33.2.70.
>>>>
>>>> Does this ring any bells for anyone? I can of course attach the
>>>> storage-confs for all nodes if that sounds useful, and I'll be on
>>>> #cassandra as ajslater.
>>>>
>>>> Many thanks for taking a look, and for any suggestions. We fear we'll have
>>>> to abandon Cassandra if this bug cannot be resolved.
>>>>
>>>> AJ
>>>>
>>>
>>
>
