Re: constant CMS GC using CPU time

2012-10-26 Thread aaron morton
> How does compaction_throughput relate to memory usage?  
It throttles compaction, which reduces the rate of memory allocation. 
e.g. Say ParNew can normally keep up with the allocation rate without stopping 
for too long: the rate of promotion stays lowish and everything is allocated in 
Eden. If the allocation rate gets higher, ParNew collections become more 
frequent and objects that don't really need to be there may get promoted to the 
tenured generation.  

>  I assumed that was more for IO tuning.  I noticed that lowering 
> concurrent_compactors to 4 (from default of 8) lowered the memory used during 
> compactions.
Similar to the above: fewer concurrent compactors may reduce the number of rows 
held in memory at any instant during compaction. 

Only rows smaller than in_memory_compaction_limit are loaded fully into memory 
during compaction (larger rows take the slower incremental path). So reducing 
that limit may reduce the memory usage.
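
For reference, all of these knobs live in cassandra.yaml. A rough sketch of what a 
memory-constrained node might use (names are from the 1.x yaml; the values are 
illustrative, not recommendations):

    # cassandra.yaml -- illustrative values only
    compaction_throughput_mb_per_sec: 8    # throttle compaction, and with it the allocation rate
    concurrent_compactors: 4               # fewer concurrent compactions, fewer rows in memory at once
    in_memory_compaction_limit_in_mb: 32   # rows above this take the slower, lower-memory path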

>  Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so 
> the number of rows and data dropped to a level it can handle.
Cool. Sorry it took so long to get there. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 8:08 AM, Bryan Talbot  wrote:

> On Thu, Oct 25, 2012 at 4:15 AM, aaron morton  wrote:
>> This sounds very much like "my heap is so consumed by (mostly) bloom
>> filters that I am in steady state GC thrash."
>> 
>> Yes, I think that was at least part of the issue.
> 
> The rough numbers I've used to estimate working set are:
> 
> * bloom filter size for 400M rows at 0.00074 fp without java fudge (they are 
> just a big array) 714 MB
> * memtable size 1024 MB 
> * index sampling:
>   *  24 bytes + key (16 bytes for UUID) = 32 bytes 
>   * 400M / 128 default sampling = 3,125,000
>   *  3,125,000 * 32 = 95 MB
>   * java fudge X5 or X10 = 475MB to 950MB
> * ignoring row cache and key cache
>  
> So the high side number is 2,213 to 2,688 MB. High because the fudge is a 
> delicious sticky guess and the memtable space would rarely be full. 
> 
> On a 5120 MB heap, with 800MB new, you have roughly 4300 MB tenured (some 
> goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the 
> working set and how quickly stuff gets tenured, which depends on the 
> workload. 
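> 
> A quick back-of-the-envelope version of the above in Python, just re-running the 
> same guesses (the 32 byte sample size and the X5/X10 fudge are carried over as 
> assumptions, not measurements):
> 
>     # heap working set estimate, using the figures quoted above
>     bloom_filter_mb = 714                  # 400M rows at 0.00074 fp, no java fudge
>     memtable_mb = 1024
>     samples = 400_000_000 // 128           # default index sampling -> 3,125,000 entries
>     sample_bytes = 32                      # per-sample guess used above (incl. 16 byte UUID key)
>     index_mb = samples * sample_bytes / 1024 / 1024      # ~95 MB raw
>     index_fudged = (index_mb * 5, index_mb * 10)          # ~475 to ~950 MB with the java fudge
> 
>     low  = bloom_filter_mb + memtable_mb + index_fudged[0]   # roughly the 2,213 MB above
>     high = bloom_filter_mb + memtable_mb + index_fudged[1]   # roughly the 2,688 MB above
> 
>     heap_mb, new_mb = 5120, 800
>     tenured_mb = heap_mb - new_mb          # ~4300 MB, a little also goes to perm
>     cms_start_mb = tenured_mb * 0.75       # roughly the 3,225 MB figure above
>     print(low, high, tenured_mb, cms_start_mb)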
> 
> These values seem reasonable and in line with what I was seeing.  There are 
> other CF and apps sharing this cluster but this one was the largest.  
> 
> 
>   
> 
> You can confirm these guesses somewhat manually by enabling all the GC 
> logging in cassandra-env.sh. Restart the node and let it operate normally, 
> probably best to keep repair off.
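> 
> (The GC logging options ship commented out in cassandra-env.sh; enabling them 
> looks roughly like the below -- exact flags and the log path vary a little by 
> version:)
> 
>     # cassandra-env.sh -- typical GC logging flags
>     JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
>     JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
>     JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
>     JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
>     JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
>     JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"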
> 
> 
> 
> I was using jstat to monitor GC activity and some snippets from that are in 
> my original email in this thread. The key behavior was that full GC was 
> running pretty often and never able to reclaim much (if any) space.
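> 
> (A typical way to watch that with jstat is something like the below; <pid> is a 
> placeholder for the Cassandra process id and the interval is arbitrary. O is old 
> gen occupancy %, FGC/FGCT are the full GC count and time:)
> 
>     jstat -gcutil <pid> 5000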
> 
> 
>  
> 
> There are a few things you could try:
> 
> * increase the JVM heap by say 1Gb and see how it goes
> * increase the bloom filter false positive chance, try 0.1 first (see 
> http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
>  
> * increase index_interval sampling in yaml.  
> * decreasing compaction_throughput and in_memory_compaction_limit can lessen 
> the additional memory pressure compaction adds. 
> * disable caches or ensure off heap caches are used.
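> 
> A hedged sketch of what the first two look like in practice ("MyCF" is a 
> placeholder; a changed fp chance only applies to newly written sstables):
> 
>     # cassandra-cli, per column family:
>     update column family MyCF with bloom_filter_fp_chance = 0.1;
> 
>     # cassandra.yaml, per node (needs a restart); default sampling is 128:
>     index_interval: 256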
> 
> I've done several of these already in addition to changing the app to reduce 
> the number of rows retained.  How does compaction_throughput relate to memory 
> usage?  I assumed that was more for IO tuning.  I noticed that lowering 
> concurrent_compactors to 4 (from default of 8) lowered the memory used during 
> compactions.  in_memory_compaction_limit_in_mb seems to only be used for wide 
> rows and this CF didn't have any wider than in_memory_compaction_limit_in_mb. 
>  My multithreaded_compaction is still false.
> 
>  
> 
> Watching the gc logs and the cassandra log is a great way to get a feel for 
> what works in your situation. Also take note of any scheduled processing your 
> app does which may impact things, and look for poorly performing queries. 
> 
> Finally this book is a good reference on Java GC http://amzn.com/0137142528 
> 
> For my understanding, what was the average row size for the 400 million keys? 
> 
> 
> 
> The compacted row mean size for the CF is 8815 bytes (as reported by cfstats), but 
> that comes out to be much larger than the real load per node I was seeing.  
> Each node had about 200GB of data for the CF with 4 nodes in the cluster and 
> RF=3.  At the time, the TTL for all columns was 3 days and gc_grace_seconds 
> was 5 days.  Since then I've reduced the TTL to 1 hour and set 
> gc_grace_seconds to 0 so the number of rows and data dropped to a level it 
> can handle.
> 
> 
> -Bryan
> 



Re: What does ReadRepair exactly do?

2012-10-26 Thread aaron morton
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
> 
This is correct.
It's not that a quorum of nodes agree, it's that a quorum of nodes participate. 
If a quorum participates in both the write and the read, you are guaranteed that 
one node was involved in both. The Wikipedia definition helps here: "A quorum is 
the minimum number of members of a deliberative assembly necessary to conduct 
the business of that group" http://en.wikipedia.org/wiki/Quorum  

It's a two-step process: first, do we have enough people to make a decision? 
Second, following the rules, what was the decision?

In C* the rule is to use the value with the highest timestamp, not the value 
with the highest number of "votes". The red boxes on this slide are the 
winning values: 
http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67  
(thinking one of my slides in that deck may have been misleading in the past). 
In Riak the rule is to use vector clocks. 

So 
> I agree that returning val4 is the right thing to do if quorum (two) nodes
> among (node1,node2,node3) have the val4
Is incorrect.
We return the value with the highest timestamp from the nodes involved in the 
read. Only one of them needs to have val4. 

> The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment is issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint. 
and
> My intuition behind saying this is because we
> would respond to the client without the replicas having confirmed their
> meeting the consistency requirement.

It is not necessary for the coordinator to wait. 

Consider an example: the app has stopped writing to the cluster. For a certain 
column, nodes 1, 2 and 3 have value:timestamp pairs bar:2, bar:2 and foo:1 
respectively. The last write was a successful CL QUORUM write of bar with 
timestamp 2. However node 3 did not acknowledge this write for some reason. 

To make it interesting, the commit log volume on node 3 is full. Mutations are 
blocking in the commit log queue, so any write on node 3 will time out and fail, 
but reads are still working. We could imagine this is why node 3 did not commit 
bar:2. 

Some read examples, RR is not active:

1) Client reads from node 4 (a non-replica) with CL QUORUM, request goes to 
nodes 1 and 2. Both agree on bar as the value. 
2) Client reads from node 3 with CL QUORUM, request is processed locally and on 
node 2.
* There is a digest mismatch.
* A Row Repair read runs against nodes 2 and 3.
* The super set resolves to bar:2.
* Node 3 (the coordinator) queues a delta write locally to write bar:2. 
No other delta writes are sent.
* Node 3 returns bar:2 to the client.
3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and 
bar:2 is returned. 
4) Client reads from node 2 at CL QUORUM, read goes to nodes 2 and 3. Roughly the 
same thing as (2) happens and bar:2 is returned. 
5) Client reads from node 1 at CL ONE. The read happens locally only and returns 
bar:2.
6) Client reads from node 3 at CL ONE. The read happens locally only and returns 
foo:1.
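
To make the resolution rule concrete, here is a minimal sketch in plain Python (not 
Cassandra code) of the example above: the value with the highest timestamp among the 
contacted replicas wins, no matter how many replicas hold it.

    # replica state from the example: node -> (value, timestamp)
    replicas = {1: ("bar", 2), 2: ("bar", 2), 3: ("foo", 1)}

    def read(contacted):
        """Resolve a read across the contacted replicas: highest timestamp wins."""
        return max((replicas[n] for n in contacted), key=lambda vt: vt[1])

    print(read({1, 2}))   # ('bar', 2) -- QUORUM read via nodes 1 and 2
    print(read({2, 3}))   # ('bar', 2) -- QUORUM read via 2 and 3, even though 3 holds foo:1
    print(read({3}))      # ('foo', 1) -- CL ONE read served only by node 3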

So:
* A read at CL QUORUM will always return bar:2, even if node 3 only has foo:1 on 
disk. 
* A read at CL ONE may return no value or any previous write.

The delta write from the Row Repair goes to a single node, so R + W > N cannot 
be applied to it. It can almost be thought of as an internal implementation 
detail. The delta writes from a digest mismatch, HH writes, full RR writes and 
nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE
* Eventually reach a state where reads at any CL return the last write. 

They are not used to ensure strong consistency when R + W > N. You could turn 
those things off and R + W > N would still work. 
 
Hope that helps. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn  wrote:

> manuzhang wrote
>> read quorum doesn't mean we read newest values from a quorum number of
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
> 
> I beg to differ here. Any read/write, by definition of quorum, should have
> at least n/2 + 1 replicas that agree on that read/write value. Responding to
> the user with a newer value, even if the write creating the new value hasn't
> completed cannot guarantee any read consistency > 1. 
> 
> 
> Hiller, Dean wrote
>>> Kind of an interesting question
>>> 
>>> I think you are saying if a client read resolved only the two nodes as
>>> said in Aaron's email back to the client and read -repair was kicked off
>>> because of the inconsistent values and the write did not complete yet and
>>> I guess you would have two nodes go dow

Re: Hinted Handoff storage inflation

2012-10-26 Thread Mattias Larsson

On Oct 24, 2012, at 6:05 PM, aaron morton wrote:

> Hints store the columns, row key, KS name and CF id(s) for each mutation to 
> each node, whereas an executed mutation will store the most recent columns 
> collated with others under the same row key. So depending on the type of 
> mutation, hints will take up more space. 
> 
> The worst case would be lots of overwrites. After that, writing a small amount 
> of data to many rows would result in a lot of the serialised space being 
> devoted to row keys, KS name and CF id.
> 
> 16GB is a lot though. What was the write workload like?

Each write is new data only (no overwrites). Each mutation adds a row to one 
column family with a column containing about 100 bytes of data and a new row 
to another column family with a SuperColumn containing 2x17KiB payloads. These 
are sent in batches with several in them, but I found that the storage overhead 
was the same regardless of the size of the batch mutation (i.e., 5 vs 25 
mutations made no difference). A total of 1,000,000 mutations like these are 
sent over the duration of the test.


> You can get an estimate on the number of keys in the Hints CF using nodetool 
> cfstats. Also some metrics in the JMX will tell you how many hints are 
> stored. 
> 
>> This has a huge impact on write performance as well.
> Yup. Hints are added to the same Mutation thread pool as normal mutations. 
> They are processed async to the mutation request but they still take 
> resources to store. 
> 
> You can adjust how long hints are collected for with max_hint_window_in_ms in 
> the yaml file. 
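> 
> (A hedged sketch of both checks; in 1.x the hints live in the system keyspace 
> as HintsColumnFamily, and the window value below is only illustrative:)
> 
>     # look for the system / HintsColumnFamily section in the output:
>     nodetool cfstats
> 
>     # cassandra.yaml -- cap how long hints are collected for a down node:
>     max_hint_window_in_ms: 3600000    # 1 hour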
> 
> How long did the test run for? 
> 

With both data centers functional, the test takes just a few minutes to run; 
with one data center down, it takes about 15x as long.

/dml