Re: constant CMS GC using CPU time
> How does compaction_throughput relate to memory usage?

Lowering it reduces the rate of memory allocation. Say that normally ParNew can keep up with the allocation rate without stopping for too long: the rate of promotion is lowish and everything is allocated in Eden. If the allocation rate gets higher, ParNew runs become more frequent and objects that don't really need to be there may get promoted to tenured.

> I assumed that was more for IO tuning. I noticed that lowering concurrent_compactors to 4 (from default of 8) lowered the memory used during compactions.

Similar thing to the above: it may reduce the number of rows held in memory at any instant for compaction. Only rows smaller than in_memory_compaction_limit are loaded into memory during compaction, so reducing that may also reduce memory usage.

> Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so the number of rows and data dropped to a level it can handle.

Cool. Sorry it took so long to get there.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 8:08 AM, Bryan Talbot wrote:

> On Thu, Oct 25, 2012 at 4:15 AM, aaron morton wrote:
>
>> This sounds very much like "my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash."
>
> Yes, I think that was at least part of the issue.
>
>> The rough numbers I've used to estimate working set are:
>>
>> * bloom filter size for 400M rows at 0.00074 fp without java fudge (they are just a big array): 714 MB
>> * memtable size: 1024 MB
>> * index sampling:
>>   * 24 bytes + key (16 bytes for UUID) = 32 bytes
>>   * 400M / 128 default sampling = 3,125,000 samples
>>   * 3,125,000 * 32 bytes = 95 MB
>> * java fudge x5 or x10 = 475 MB to 950 MB
>> * ignoring row cache and key cache
>>
>> So the high side number is 2,213 to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full.
>>
>> On a 5120 MB heap with 800 MB new, you have roughly 4300 MB tenured (some goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload.
>
> These values seem reasonable and in line with what I was seeing. There are other CFs and apps sharing this cluster but this one was the largest.
>
>> You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally; probably best to keep repair off.
>
> I was using jstat to monitor gc activity and some snippets from that are in my original email in this thread. The key behavior was that full gc was running pretty often and never able to reclaim much (if any) space.
>
>> There are a few things you could try:
>>
>> * increase the JVM heap by say 1 GB and see how it goes
>> * increase the bloom filter false positive chance, try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
>> * increase index_interval sampling in the yaml
>> * decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds
>> * disable caches or ensure off-heap caches are used
>
> I've done several of these already, in addition to changing the app to reduce the number of rows retained. How does compaction_throughput relate to memory usage? I assumed that was more for IO tuning. I noticed that lowering concurrent_compactors to 4 (from default of 8) lowered the memory used during compactions.
> in_memory_compaction_limit_in_mb seems to only be used for wide rows and this CF didn't have any wider than in_memory_compaction_limit_in_mb. My multithreaded_compaction is still false.
>
>> Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries.
>>
>> Finally, this book is a good reference on Java GC: http://amzn.com/0137142528
>>
>> For my understanding, what was the average row size for the 400 million keys?
>
> The compacted row mean size for the CF is 8815 bytes (as reported by cfstats) but that comes out to be much larger than the real load per node I was seeing. Each node had about 200GB of data for the CF with 4 nodes in the cluster and RF=3. At the time, the TTL for all columns was 3 days and gc_grace_seconds was 5 days. Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so the number of rows and data dropped to a level it can handle.
>
> -Bryan
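For reference, here is the back-of-envelope working-set arithmetic from the thread above in runnable form. This is only a sketch: the 714 MB bloom filter figure and the 5x-10x "java fudge" are the same guesses used in the thread, not measured values.

    # Rough working-set estimate for ~400M rows, mirroring the numbers above.
    ROWS = 400_000_000

    bloom_filter_mb = 714        # 400M rows at 0.00074 fp, figure from the thread
    memtable_mb = 1024           # configured memtable space

    index_interval = 128         # default sampling
    bytes_per_sample = 32        # per-sample estimate used in the thread
    samples = ROWS // index_interval                       # 3,125,000
    index_mb = samples * bytes_per_sample / 2**20          # ~95 MB

    low = bloom_filter_mb + memtable_mb + index_mb * 5     # ~2,215 MB
    high = bloom_filter_mb + memtable_mb + index_mb * 10   # ~2,690 MB

    heap_mb, new_gen_mb = 5120, 800
    tenured_mb = heap_mb - new_gen_mb                      # ~4,300 MB (ignoring perm)
    cms_threshold_mb = tenured_mb * 0.75                   # ~3,240 MB

    print(f"estimated working set: {low:,.0f} to {high:,.0f} MB")
    print(f"old gen available before CMS kicks in: ~{cms_threshold_mb:,.0f} MB")

If the first number is close to (or above) the second, the node sits in exactly the steady-state GC thrash described at the top of the thread.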
Re: What does ReadRepair exactly do?
>> replicas but to ensure we read at least one newest value as long as write quorum succeeded beforehand and W+R > N.

This is correct.

It's not that a quorum of nodes agree, it's that a quorum of nodes participate. If a quorum participate in both the write and the read you are guaranteed that one node was involved in both. The Wikipedia definition helps here: "A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group" (http://en.wikipedia.org/wiki/Quorum).

It's a two-step process: first, do we have enough people to make a decision? Second, following the rules, what was the decision?

In C* the rule is to use the value with the highest timestamp, not the value with the highest number of "votes". The red boxes on this slide are the winning values: http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67 (I think one of my slides in that deck may have been misleading in the past). In Riak the rule is to use Vector Clocks.

So this:

> I agree that returning val4 is the right thing to do if quorum (two) nodes among (node1, node2, node3) have the val4

is incorrect. We return the value with the highest timestamp from the nodes involved in the read. Only one of them needs to have val4.

> The heart of the problem here is that the coordinator responds to a client request "assuming" that the consistency has been achieved the moment it issues a row repair with the super-set of the resolved value; without receiving acknowledgement on the success of a repair from the replicas for a given consistency constraint.

and

> My intuition behind saying this is because we would respond to the client without the replicas having confirmed their meeting the consistency requirement.

It is not necessary for the coordinator to wait.

Consider an example. The app has stopped writing to the cluster, and for a certain column nodes 1, 2 and 3 hold value:timestamp of bar:2, bar:2 and foo:1 respectively. The last write was a successful CL QUORUM write of bar with timestamp 2; however node 3 did not acknowledge this write for some reason. To make it interesting, the commit log volume on node 3 is full. Mutations are blocking in the commit log queue so any write on node 3 will time out and fail, but reads are still working. We could imagine this is why node 3 did not commit bar:2.

Some read examples, with RR (read repair) not active:

1) Client reads from node 4 (a non-replica) with CL QUORUM; the request goes to nodes 1 and 2. Both agree on bar as the value.

2) Client reads from node 3 with CL QUORUM; the request is processed locally and on node 2.
   * There is a digest mismatch.
   * A Row Repair read runs against nodes 2 and 3.
   * The super set resolves to bar:2.
   * Node 3 (the coordinator) queues a delta write locally to write bar:2. No other delta writes are sent.
   * Node 3 returns bar:2 to the client.

3) Client reads from node 3 at CL QUORUM again. The same thing as (2) happens and bar:2 is returned.

4) Client reads from node 2 at CL QUORUM; the read goes to nodes 2 and 3. Roughly the same thing as (2) happens and bar:2 is returned.

5) Client reads from node 1 at CL ONE. The read happens locally only and returns bar:2.

6) Client reads from node 3 at CL ONE. The read happens locally only and returns foo:1.

So:

* A read at CL QUORUM will always return bar:2, even if node 3 only has foo:1 on disk.
* A read at CL ONE may return no value or any previous write.

The delta write from the Row Repair goes to a single node, so R + W > N cannot be applied to it. It can almost be thought of as an internal implementation detail.
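To make the resolution rule concrete, here is a minimal sketch of it. The names and structures are invented for the example; this is not Cassandra's code.

    # A quorum of replicas participate in the read and the cell with the
    # highest timestamp wins; it is not a vote on the most common value.
    from dataclasses import dataclass

    @dataclass
    class Cell:
        value: str
        timestamp: int

    def quorum(n: int) -> int:
        return n // 2 + 1

    def resolve(responses: list) -> Cell:
        # The rule: highest timestamp wins.
        return max(responses, key=lambda c: c.timestamp)

    # Nodes 1 and 2 hold bar:2, node 3 still has the stale foo:1.
    node1, node2, node3 = Cell("bar", 2), Cell("bar", 2), Cell("foo", 1)

    assert quorum(3) == 2   # two of the three replicas must participate

    # A CL QUORUM read that happens to touch nodes 2 and 3 resolves to bar:2,
    # because only one participant needs to have seen the last write.
    print(resolve([node2, node3]))   # Cell(value='bar', timestamp=2)

    # A CL ONE read served only by node 3 can still return the stale value.
    print(resolve([node3]))          # Cell(value='foo', timestamp=1)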
The delta writes from a Digest Mismatch, HH writes, full RR writes and nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE.
* Eventually reach a state where reads at any CL return the last write.

They are not used to ensure strong consistency when R + W > N. You could turn those things off and R + W > N would still work.

Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn wrote:

> manuzhang wrote
>> read quorum doesn't mean we read newest values from a quorum number of replicas but to ensure we read at least one newest value as long as write quorum succeeded beforehand and W+R > N.
>
> I beg to differ here. Any read/write, by definition of quorum, should have at least n/2 + 1 replicas that agree on that read/write value. Responding to the user with a newer value, even if the write creating the new value hasn't completed, cannot guarantee any read consistency > 1.
>
> Hiller, Dean wrote
>>> Kind of an interesting question
>>>
>>> I think you are saying if a client read resolved only the two nodes as said in Aaron's email back to the client and read-repair was kicked off because of the inconsistent values and the write did not complete yet and I guess you would have two nodes go dow
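Coming back to the R + W > N point: the guarantee is purely set overlap between the replicas that take part in the write and those that take part in the read, independent of the repair mechanisms. A quick brute-force check, for illustration only:

    # With N=3, W=2, R=2 every possible write set shares at least one replica
    # with every possible read set, so some participant in the read has the
    # last successful QUORUM write.
    from itertools import combinations

    N, W, R = 3, 2, 2
    replicas = range(N)
    assert all(set(w) & set(r)
               for w in combinations(replicas, W)
               for r in combinations(replicas, R))
    print("every write quorum overlaps every read quorum")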
Re: Hinted Handoff storage inflation
On Oct 24, 2012, at 6:05 PM, aaron morton wrote:

> Hints store the columns, row key, KS name and CF id(s) for each mutation to each node, whereas an executed mutation stores the most recent columns collated with others under the same row key. So depending on the type of mutation, hints will take up more space.
>
> The worst case would be lots of overwrites. After that, writing a small amount of data to many rows would result in a lot of the serialised space being devoted to row keys, KS name and CF id.
>
> 16GB is a lot though. What was the write workload like?

Each write is new data only (no overwrites). Each mutation adds a row to one column family with a column containing about 100 bytes of data, and a new row to another column family with a SuperColumn containing two 17 KiB payloads. These are sent in batches with several mutations in them, but I found that the storage overhead was the same regardless of the size of the batch mutation (i.e., 5 vs 25 mutations made no difference). A total of 1,000,000 mutations like these are sent over the duration of the test.

> You can get an estimate of the number of keys in the Hints CF using nodetool cfstats. Also, some metrics in JMX will tell you how many hints are stored.
>
>> This has a huge impact on write performance as well.
>
> Yup. Hints are added to the same Mutation thread pool as normal mutations. They are processed async to the mutation request but they still take resources to store.
>
> You can adjust how long hints are collected for with max_hint_window_in_ms in the yaml file.
>
> How long did the test run for?

With both data centers functional, the test takes just a few minutes to run; with one data center down, it takes about 15x as long.

/dml
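A toy illustration of the "lots of overwrites" worst case Aaron describes above. The byte counts are invented placeholders, not Cassandra's actual serialization sizes; the point is only that hints are stored per mutation while applied writes are collated per row.

    # Applied mutations are collated under the row key, so N overwrites leave a
    # single value on disk, while the hints CF keeps one serialized mutation
    # per write per down replica. All sizes below are assumptions.
    COLUMN_BYTES = 100     # column value size (assumed)
    HINT_OVERHEAD = 40     # row key + KS name + CF id per hint (assumed)
    DOWN_REPLICAS = 1      # hints held for one unreachable node

    for overwrites in (1, 10, 100):
        applied = COLUMN_BYTES                                        # latest value only
        hinted = overwrites * DOWN_REPLICAS * (COLUMN_BYTES + HINT_OVERHEAD)
        print(f"{overwrites:>3} overwrites: ~{applied} B applied vs ~{hinted} B of hints")

For the workload described in this thread (no overwrites, two large payloads per mutation) that particular effect doesn't apply, which is part of why the 16GB of hints is surprising.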