Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Curt Bererton Mon, 17 May 2010 17:26:59 -0700

Thanks for the help, guys.

To answer the first question: both cores are pegged.


Cpu0  : 43.8%us, 34.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si, 22.1%st
Cpu1  : 40.5%us, 36.2%sy,  0.0%ni,  0.4%id,  0.0%wa,  0.0%hi,  0.2%si, 22.6%st
Mem:   7872040k total,  3620180k used,  4251860k free,   388052k buffers
Swap:        0k total,        0k used,        0k free,  1655920k cached

Here's our current set up (mostly default) for storage-conf.xml:

<Storage>
  <ClusterName>zzpproduction</ClusterName>
  <AutoBootstrap>true</AutoBootstrap>

  <Keyspaces>
    <Keyspace Name="ks1">
      <ColumnFamily Name="A" CompareWith="BytesType"/>
      <ColumnFamily Name="B" CompareWith="BytesType"/>
      <ColumnFamily Name="C" CompareWith="BytesType"/>
      <ColumnFamily Name="D" CompareWith="BytesType"/>
      <ColumnFamily Name="E" CompareWith="BytesType"/>
      <ColumnFamily Name="F" CompareWith="BytesType"/>

      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>3</ReplicationFactor>
      <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>
  </Keyspaces>

  <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
  <!-- this gets set at server boot time -->
  <InitialToken>@CASSANDRA_TOKEN@</InitialToken>
  <CommitLogDirectory>/mnt/cassandra/commitlog</CommitLogDirectory>
  <DataFileDirectories>
      <DataFileDirectory>/mnt/cassandra/data</DataFileDirectory>
  </DataFileDirectories>
  <!-- gets set at server boot -->
  <Seeds>
   @DEPLOY_SEEDS@
  </Seeds>

  <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>
  <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB>
  <ListenAddress>@CASSANDRA_ADDRESS@</ListenAddress>
  <StoragePort>7000</StoragePort>
  <ThriftAddress>@CASSANDRA_ADDRESS@</ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <ThriftFramedTransport>false</ThriftFramedTransport>
  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <!-- <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS> -->
  <GCGraceSeconds>864000</GCGraceSeconds>
</Storage>

We use QUORUM for all writes and reads. All instances are in the same
availability zone (us-east-1b).
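
For what it's worth, with a ReplicationFactor of 3 a QUORUM read or write needs 2 of the 3 replicas to respond, so each operation can tolerate one replica being down. A minimal sketch of that arithmetic (the function name is just illustrative, not part of Cassandra's API):

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


rf = 3  # ReplicationFactor from the storage-conf.xml above
q = quorum(rf)

# Quorum reads and quorum writes overlap on at least one replica
# whenever (read replicas + write replicas) > replication factor.
overlap = q + q > rf

print("quorum size:", q)
print("replicas that may be down:", rf - q)
print("reads see latest quorum write:", overlap)
```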

We haven't set up RAID on the machines yet (though I want to), so for now the
commit log and the data files are on the same disk. Mostly I just want
to get the sucker up and running, and then I'll optimize the crap out of it.


I'm currently thinking that it might be some outright stupidity in our
client code. We use PHP with Pandra, but looking through our client code,
we don't call "disconnect" anywhere. We've got around 8 middle app servers
talking to 4 Cassandra nodes. This might explain why there are 178 threads
showing up in jconsole? I don't know how many threads are typical. Looking
at a given thread in jconsole, it typically says something like:
State: Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@6c7db013
Total Blocked:34 Total waited 3,631

Is that indicating that a thread is just sitting there waiting with an open
connection?

I'm looking into the above as well as trying out jmap right now.
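
The recipe Brandon describes (quoted below) boils down to taking the decimal thread id from top -H, converting it to hex, and matching it against the nid= field in a JVM thread dump. As far as I know, the dump with those nid values comes from jstack <main java pid> rather than jmap. A rough sketch of the conversion and matching, with made-up helper names and the dump passed in as plain text:

```python
import re


def tid_to_nid(tid):
    """Format a decimal thread id the way HotSpot thread dumps print it."""
    return "nid={:#x}".format(tid)


def find_thread(dump, tid):
    """Return the thread-dump lines whose nid matches the given thread id."""
    # \b keeps nid=0xa06 from also matching nid=0xa067.
    pattern = re.compile(re.escape(tid_to_nid(tid)) + r"\b")
    return [line for line in dump.splitlines() if pattern.search(line)]


# 2566 is only used here to show the conversion; in practice this would be
# the id of the busy thread reported by top -H.
print(tid_to_nid(2566))
```

Saving the thread dump to a file and passing its contents to find_thread along with the busy thread's id should point at the stack that is burning the CPU.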

Thanks for the suggestions, keep 'em coming. I'm hoping that it's just the
stupidity of not closing the connection from the client side.
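
To make the suspected bug concrete: if every request opens a connection and never closes it, connections just pile up, and as I understand it the server ends up holding a thread for each one. Below is a sketch of the fix pattern, written in Python rather than PHP, with a hypothetical FakeConnection standing in for the real client. None of these names are Pandra's API.

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical stand-in for a client connection (not a real driver)."""

    open_count = 0  # how many connections are currently open

    def __init__(self):
        FakeConnection.open_count += 1
        self.closed = False

    def get(self, key):
        return "value-for-" + key

    def close(self):
        # Safe to call more than once.
        if not self.closed:
            self.closed = True
            FakeConnection.open_count -= 1


def handle_request(key):
    # closing() calls conn.close() on the way out, even if get() raises.
    with closing(FakeConnection()) as conn:
        return conn.get(key)


for _ in range(100):
    handle_request("some-key")

print("connections left open:", FakeConnection.open_count)
```

The point is only that every code path that opens a connection also closes it; a pooled or persistent connection would be another way to keep the count bounded.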

Best,
Curt

--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat


On Mon, May 17, 2010 at 5:00 PM, Brandon Williams <dri...@gmail.com> wrote:

> On Mon, May 17, 2010 at 6:02 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>
>> So pretty much the defaults aside from the 7 GB max heap. CPU is totally
>> hammered right now, and it is receiving 0 ops/sec from me since I
>> disconnected it from our application until I can figure out what's
>> going on.
>>
>> running top on the machine I get:
>> top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24, 15.13
>> Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
>> Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st
>> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
>> Swap:        0k total,        0k used,        0k free,  1655556k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java
>>
>
> Since your heap isn't anywhere near exhausted, I don't think you have a GC
> storm happening.  Is it one core or both that are pegged?  One way to tell
> which thread is using all the CPU is to run top -H so you can see the
> threads, get the pid of the one using the CPU, convert that to hex, then run
> jmap <main java pid> and grep for the hex.
>
> -Brandon
>
