Re: Replacing Redis
Thanks for replying, but it is not at all clear to me that Redis was ever the right solution to the problem at hand, and I'm still working out whether, in fact, Cassandra is. Up to a fairly large scale, this particular problem needs no 'cache' at all. It needs a write-behind persistence mechanism -- a log -- the simplest form of which is, of course, just a file, though that solution poses problems for backup and some failure scenarios.

There are two parts to the data footprint, one growing more rapidly than the other. The faster-growing part has a natural 'redis-friendly' data model (keys map to values); but it would map to EHCache as well. However, the initial redis/jredis implementation used Redis, rather than ordinary local JVM objects, for the slower-growing part. The lpush/ltrim usage was in here.

At really gigantic scales, which I don't claim to have entirely thought through, even the slow-growing part will stop fitting into memory. However, it has a natural architecture for distribution, so there is a dilemma in my head between following the logic of the algorithm out to implement distribution, or letting something like Cassandra do that for me.
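The write-behind log idea above, in its simplest file-based form, can be sketched in a few lines of Java. This is a minimal illustration, not the poster's actual implementation; the class and method names are hypothetical, and it deliberately ignores the backup/failure concerns the post raises (no rotation, no fsync policy).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Minimal append-only write-behind log: writes go straight to the end of a
// file; recovery is a sequential replay of the records in write order.
public class AppendOnlyLog {
    private final Path path;

    public AppendOnlyLog(Path path) { this.path = path; }

    // Append one record as a line of UTF-8 text.
    public void append(String record) throws IOException {
        Files.write(path, (record + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Replay all records in write order (e.g. to rebuild in-memory state).
    public List<String> replay() throws IOException {
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("wblog", ".log");
        AppendOnlyLog log = new AppendOnlyLog(p);
        log.append("key1=value1");
        log.append("key2=value2");
        System.out.println(log.replay()); // [key1=value1, key2=value2]
    }
}
```

The trade-off the post alludes to is visible here: the write path is trivially fast and sequential, but everything about durability beyond a single disk (backup, node failure) is left to the operator -- which is exactly the part Cassandra would take over.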
Re: Health of the cluster
Have a read of this page: http://wiki.apache.org/cassandra/Operations

Aaron

On 18/02/2011, at 12:40 PM, mcasandra wrote:
>
> I have just started setting up Hector. I have a 2 node cluster running in
> "foreground" for my test so that I can watch everything. One big question
> came to my mind: as I increase the cluster, how do I verify the health of
> the cluster? How do I see that it is all good?
>
> Since I am new I can't quantify what I mean by "in good health". But I
> guess what I mean is that key distributions are fine.
>
> CF levels are being maintained for the keys.
>
> Check consistency: how many stale or old object versions are sitting around.
>
> Also, do I need to repair/cleanup anything etc.
>
> I have been using Oracle so far, so in Oracle for instance I look at ADDM/AWR
> reports or internal (dba*) tables that give more insight.
>
> Here things become more complicated because of RF, hinted handoff etc.
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Health-of-the-cluster-tp6038092p6038092.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
Re: memory consumption
Thanks Peter for the extra detail. I thought there may have been something more mysterious going on, but it sounds like it was just the semantics of the term "use".

Cheers
Aaron

On 18/02/2011, at 9:25 PM, Peter Schuller wrote:
>> The main argument for using mmap() instead of standard I/O is the fact
>> that reading entails just touching memory -- in the case of the memory
>> being resident, you just read it; you don't even take a page fault
>> (so no overhead in entering the kernel and doing a semi-context
>> switch).
>
> Oh, and in the case of Java/Cassandra, as Jonathan clued me in on
> earlier, there is also the issue that byte[] arrays are mandated to be
> zeroed when allocated, which typically causes overhead because there
> has to be a loop[1] somewhere writing a bunch of zeroes in that
> you're then just going to replace immediately. Mapping a file has no
> such implications as long as you read directly from the underlying
> direct ByteBuffer.
>
> [1] Not *necessarily*; a JVM could theoretically do byte[] allocations
> in such a way that it already knows the contents are zeroed, but whether
> this is practical would be highly dependent on the GC/memory management
> technique used by the JVM. (It just occurred to me that
> Azul should get this for 'free' in their GC. Wonder if that's true.)
>
> --
> / Peter Schuller
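Peter's point can be seen directly in Java: the language guarantees that a freshly allocated byte[] reads back as all zeros (work the JVM has to do somewhere), whereas reading a memory-mapped file just touches the OS page cache with no allocation-and-zeroing step. A small standalone illustration (the mapped-file half uses a local temp file for demonstration):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ZeroedAllocations {
    public static void main(String[] args) throws Exception {
        // The JLS guarantees a new array is zero-filled -- that zeroing is
        // the overhead Peter describes when you immediately overwrite it.
        byte[] buf = new byte[4096];
        for (byte b : buf) {
            if (b != 0) throw new AssertionError("arrays must be zeroed");
        }

        // With a memory-mapped file there is no allocate-then-zero step:
        // reads go straight through to the page cache backing the file.
        Path p = Files.createTempFile("mmap-demo", ".dat");
        Files.write(p, new byte[]{1, 2, 3});
        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(mapped.get(0)); // 1
        }
    }
}
```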
Re: How to use NetworkTopologyStrategy
The best examples I know of are in the internal cli help and conf/cassandra.yaml.

Aaron

On 19/02/2011, at 12:51 AM, Héctor Izquierdo Seliva wrote:
> Hi!
>
> Can somebody give me some hints about how to configure a keyspace with
> NetworkTopologyStrategy via cassandra-cli? Or what is the preferred
> method to do so?
>
> Thanks!
Re: R and N
My understanding...

1) Read repair involves the coordinator sending a full data read to CL nodes, resolving the differences and sending writes back. For CL ONE this happens after returning; for higher CLs this happens before. (My understanding of the internals of RR is a little rough, though.)
2) Not sure.
3) RR is not used in writes; hinted handoff is.
4) The node responsible for the key is often the node asked for the full data of the request; the other nodes are asked for a digest of their response. However, the dynamic snitch can re-order the nodes based on load. It's also the starting point when the partitioner is working out which nodes replicas should be stored on. It's not a point of failure.
5) The partitioner knows where the data was written to.

http://thelastpickle.com/2011/02/07/Introduction-to-Cassandra/

Aaron

On 19/02/2011, at 6:28 AM, Anthony John wrote:
> OK - let me state the facts first (as I see/know them):
> - I do not know the inner workings, so interpret my response with that
> caveat. Although, at an architectural level, one should be able to keep
> detailed implementation at bay.
> - Quorum is (N+1)/2 where N is the Replication Factor (RF)
> - And consistency is guaranteed if R(ead) + W(rite) > RF (which Quorum gives
> you, but which can be achieved via other permutations, depending on whether read or
> write performance is desired)
>
> Now getting to your questions:
> 1. If a read at Q is nondeterministic, it would likely have to read the other
> (RF-Q) nodes to achieve quorum on a deterministic value. At which point,
> syncing all with writes should not be that expensive. But at what point
> precisely the read is returned - I do not know - you will have to look at the
> code. IMO, at this level it should not matter.
> 2. Should be at the granularity of data divergence.
> 3. Read repair or nodetool repair (whichever comes first).
> 4. All peers - there is no primary. There might be a connected node - but no
> special role/privileges.
> 5. Tries Q - returns on a deterministic read. If not - see (1).
> 6. The writer supplies the timestamp value - it can be any value that makes sense within
> the scope of the data/application.
>
> HTH,
>
> -JA
>
> On Fri, Feb 18, 2011 at 10:28 AM, A J wrote:
>> Couple of more related questions:
>>
>> 5. For reads, does Cassandra first read N nodes or just the R nodes it
>> selects? I am thinking unless it reads all the N nodes, how will it
>> know which node has the latest write.
>>
>> 6. Who decides the timestamp that gets inserted into the timestamp
>> field of every column? I would guess the coordinator node picks up its
>> system's timestamp. If that is true, the clocks on all the nodes
>> should be synchronized, right? Otherwise conflict resolution cannot
>> be done correctly.
>> For a distributed system, this is not always possible. How do folks
>> get around this issue?
>>
>> Thanks.
>>
>> On Fri, Feb 18, 2011 at 10:23 AM, A J wrote:
>>> Questions about R and N (and W):
>>> 1. If I set R to Quorum and Cassandra identifies a need for read
>>> repair before returning, would the read repair happen on R nodes (I
>>> mean the subset of R that needs repair) or N nodes before the data is
>>> delivered to the client?
>>> 2. Also, does the repair happen at the level of the row (key) or at the level of the column?
>>>
>>> 3. During a write, if W is met but N-W is not met for some reason, would
>>> Cassandra try to repair the N-W nodes in the background as and when it
>>> can? Or are the N-W only repaired when a read is issued?
>>>
>>> 4. What is the significance of the 'primary' replica for writes, from a
>>> usage point of view? Writes to primary and non-primary replicas all happen
>>> simultaneously. Meeting W is decided irrespective of a replica being primary
>>> or not. Meeting R is decided by any R nodes out of the N.
>>> I know the tokens are divided per the primary replica. But other than
>>> that, for read and write operations, does the primary replica play any
>>> special role?
>>>
>>> Thanks.
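The quorum arithmetic discussed in this thread is easy to check mechanically: Cassandra computes QUORUM as floor(RF/2) + 1 (which equals (N+1)/2 for odd N, as stated above), and the overlap rule says reads see the latest write whenever R + W > RF. A small standalone sketch, not Cassandra code:

```java
public class QuorumMath {
    // Quorum as Cassandra defines it: floor(RF / 2) + 1.
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    // Strong consistency holds when the read and write node sets must
    // overlap: R + W > RF guarantees at least one node saw the latest write.
    static boolean isConsistent(int r, int w, int rf) {
        return r + w > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("RF=3 quorum = " + q);    // RF=3 quorum = 2
        System.out.println(isConsistent(q, q, rf));  // true: QUORUM + QUORUM
        System.out.println(isConsistent(1, 1, rf));  // false: ONE + ONE can miss the latest write
        System.out.println(isConsistent(1, rf, rf)); // true: write at ALL, read at ONE
    }
}
```

This also shows the trade-off Anthony mentions: QUORUM/QUORUM is only one of several R/W pairings that satisfy the inequality; W=ALL with R=ONE buys fast reads at the cost of write availability.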
Re: frequent client exceptions on 0.7.0
AFAIK the MemtablePostFlusher is the thread pool that writes sstables; if it has a queue then there is the potential for writes to block while it waits for memtables to be flushed.

Take a look at your memtable settings per CF - could it be that all the memtables are flushing at once? There is info in the logs about when this happens. One approach is to set the timeout high, so they are more likely to flush due to operations or throughput.

Aaron

On 19/02/2011, at 10:09 AM, Andy Skalet wrote:
> On Thu, Feb 17, 2011 at 12:22 PM, Aaron Morton wrote:
>> Messages being dropped means the node is overloaded. Look at the
>> thread pool stats to see which thread pools have queues. It may be IO
>> related, so also check the read and write latency on the CF and use iostat.
>>
>> I would try those first, then jump into GC land.
>
> Thanks, Aaron. I am looking at the thread pool queues; not enough
> data on that yet, but so far I've seen queues in the ReadStage from
> 4-30 (once 100) and MemtablePostFlusher as high as 70, though not
> consistently.
>
> The read latencies on the CFs on this cluster are sitting around
> 20-40ms, and the write latencies are all around .01ms. That seems
> good to me, but I don't have a baseline.
>
> I do see high (90-100%) utilization from time to time on the disk that
> holds the data, based on reads. This doesn't surprise me too much
> because IO on these machines is fairly limited in performance.
>
> Does this sound like the node is overloaded?
>
> Andy
Re: Async write
AFAIK the write across the DC is sync as well. I think there is a ticket out there to make them async; take a look at Jira.

Aaron

On 19/02/2011, at 10:24 AM, Anthony John wrote:
> Facts as I understand them:
> - A write call to the db triggers a number of async writes to all nodes where the
> particular write should be recorded (and the nodes are up per Gossip and so
> on)
> - Once the desired CL number of writes acknowledge - the call returns
>
> So your issue is moot. That is what is happening under the covers!
>
> On Fri, Feb 18, 2011 at 3:08 PM, mcasandra wrote:
>> So does it mean there is no way to say use sync + async? I am thinking if I
>> have to write across data centers, doing it synchronously is going to be
>> very slow and will be bad for clients to have to wait. What are my options
>> or alternatives?
>>
>> Use N=3 and W=2? And the 3rd one (assuming it will be async) will be in the other
>> DC? How do I set that up - is it possible to set up?
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Async-write-tp6041440p6041479.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
>> Nabble.com.
Re: Timeout
Some of the schema operations wait to check agreement between the nodes before proceeding. Are there any messages in your server-side logs?

Aaron

On 19/02/2011, at 2:47 PM, mcasandra wrote:
>
> Forgot to mention: the replication factor is 1 and I am running Cassandra 0.7.0.
> It's using SimpleStrategy.
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-tp6042052p6042150.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
Re: Async write
On Sun, Feb 20, 2011 at 1:36 PM, Aaron Morton wrote:
>> On 19/02/2011, at 10:24 AM, Anthony John wrote:
>>
>> Facts as I understand them:
>> - A write call to the db triggers a number of async writes to all nodes where
>> the particular write should be recorded (and the nodes are up per Gossip and
>> so on)
>> - Once the desired CL number of writes acknowledge - the call returns
>
> AFAIK the write across the dc is sync as well.

Anthony was correct: writes are asynchronous at the lowest level, and then we pick which to block for based on CL.

> I think there is a ticket out there to make them async, take a look at Jira.

You may be thinking of https://issues.apache.org/jira/browse/CASSANDRA-1530, which was completed for 0.7.1 but didn't change a/sync behavior.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
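The behavior Jonathan describes (dispatch the write to every replica asynchronously, then block the caller only until CL acknowledgements arrive) can be sketched with a countdown latch. This is a conceptual illustration only, not Cassandra's actual write path, and all names here are made up:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WritePathSketch {
    // Send the write to every replica asynchronously, but only block the
    // caller until `consistencyLevel` of them have acknowledged.
    static boolean write(List<Runnable> replicaWrites, int consistencyLevel,
                         ExecutorService pool, long timeoutMs) throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(consistencyLevel);
        for (Runnable w : replicaWrites) {
            pool.submit(() -> {
                w.run();           // deliver the mutation to one replica
                acks.countDown();  // acks beyond CL are harmless to the latch
            });
        }
        return acks.await(timeoutMs, TimeUnit.MILLISECONDS); // block for CL only
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // Three replicas: two fast, one slow (think: a remote DC).
        List<Runnable> replicas = Arrays.asList(
                () -> {},
                () -> {},
                () -> { try { Thread.sleep(2000); } catch (InterruptedException ignored) {} });
        boolean ok = write(replicas, 2, pool, 5000); // CL=2: no waiting for the slow replica
        System.out.println(ok); // true, returned long before the 2s replica finishes
        pool.shutdownNow();
    }
}
```

Note how this makes mcasandra's earlier question moot, as Anthony said: the slow cross-DC replica still receives the write; the client just doesn't wait for it when CL is already satisfied by the local replicas.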
Re: read latency in cassandra
Thanks for the probing questions. Just got around to looking into this a bit more and discovered it's not exactly what I thought. It's actually a single multiget_slice with a fairly slow response time. I've answered as many of the questions as I can, and have also included a reply to Rob's statements below. In short, I don't fully understand the read latency metrics from JMX, but they don't seem to line up well with my client-side latency measurements. All numeric stats are from a dev environment which has smaller use and data volume than production but still exhibits long multiget_slice waits leading to client timeouts.

Responses inline:

On Fri, Feb 4, 2011 at 3:02 PM, aaron morton wrote:
> What operation are you calling? Are you trying to read the entire row
> back?

It's a multiget_slice of 10-200 rows from a single CF, reading two columns for each row, which is the entire row entry.

> How many SSTables do you have for the CF? Does your data have a lot of
> overwrites? Have you modified the default compaction settings?

Compaction settings are default AFAIK. However this is currently a 1-node cluster, so it's running with gc_grace_seconds=0. If overwrite means an update (?), yes -- though the data will always be the same, we do write the same entry repeatedly as a way of refreshing its TTL. For some entries, this may be multiple times per minute.

LiveDiskSpaceUsed: 1927107766
LiveSSTableCount: 6

> Do you have row cache enabled?

No. Given the fairly large set of data and occasional random access, I doubt that row cache will be a good solution to this problem? Here are the settings:

ColumnFamily: Backtraces
  Columns sorted by: org.apache.cassandra.db.marshal.BytesType
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/3600
  Memtable thresholds: 0.3/64/60
  GC grace seconds: 0
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: []

> Can you use JConsole to check the read latency for the CF?

Used a jython JMX script I found on the mailing list. Output is below, though I'm not sure how to interpret the histograms. I take it RecentReadLatency is the very last observed request? If so, this was for a multiget of 99 items. Or is that for one particular get in the multiget?

objectname=org.apache.cassandra.db:type=ColumnFamilies,keyspace=Tracelytics,columnfamily=Backtraces
PendingTasks: 0
MemtableFlushAfterMins: 60
MemtableOperationsInMillions: 0.3
MinRowSize: 310
MaxRowSize: 43388628
MeanRowSize: 1848
ColumnFamilyName: Backtraces
MemtableColumnsCount: 53880
MemtableDataSize: 51052251
MemtableSwitchCount: 470
RecentSSTablesPerReadHistogram: array('l',[34L, 0L, 6L, 12L, 10L, 101L, 106L, 160L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L])
SSTablesPerReadHistogram: array('l',[1747L, 2196L, 1319L, 578L, 1300L, 5815L, 2108L, 1649L, 13L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L])
ReadCount: 16725
RecentReadLatencyMicros: 133399.0536130536
LifetimeReadLatencyHistogramMicros: array('l',[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 38L, 52L, 57L, 50L, 92L, 166L, 289L, 283L, 318L, 223L, 137L, 108L, 65L, 57L, 287L, 549L, 408L, 357L, 538L, 734L, 883L, 548L, 478L, 329L, 244L, 226L, 168L, 262L, 291L, 290L, 355L, 611L, 965L, 836L, 694L, 381L, 136L, 137L, 178L, 326L, 181L, 140L, 188L, 151L, 146L, 146L, 105L, 130L, 102L, 106L, 130L, 135L, 144L, 149L, 173L, 201L, 191L, 188L, 208L, 203L, 155L, 122L, 74L, 45L, 26L, 18L, 15L, 5L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L])
RecentReadLatencyHistogramMicros: array('l',[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 2L, 8L, 8L, 5L, 4L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 0L, 0L, 0L, 4L, 9L, 3L, 4L, 8L, 13L, 24L, 19L, 32L, 51L, 22L, 17L, 7L, 1L, 1L, 1L, 1L, 4L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 5L, 12L, 9L, 11L, 8L, 8L, 15L, 13L, 20L, 12L, 15L, 13L, 7L, 5L, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L])
TotalReadLatencyMicros: 1121884414
WriteCount: 10944316
TotalWriteLatencyMicros: 386100317
RecentWriteLatencyMicros: 34.614922206506364
LifetimeWriteLatencyHistogramMicros: array('l',[0L, 9L, 1181L, 5950L, 12524L, 19751L, 27521L, 34963L, 96032L, 158925L, 278830L, 615547L, 664807L, 951370L, 1334996L, 1652907L, 1745019L, 1517550L, 1091827L, 520153L, 164867L, 34155L, 6750L, 3239L, 1890L, 909L, 534L, 398L, 285L, 232L, 179L, 159L, 133L, 130L, 85L, 71L, 64L, 42L, 40L, 27L, 40L, 40L, 33L, 15L, 8L, 12L, 8L, 7L, 7L, 11L, 10L, 8L, 5L, 7L, 5L, 12L, 19L, 11L, 5L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L])
RecentWriteLatencyHistogramMicros: array('l',[0L, 0L, 0L, 0L, 0L, 2L, 3L, 3L, 17L, 127L, 228L, 526L, 544L, 967L, 1521L, 1953L, 2177L, 1801L, 1043L, 333L, 47L, 6L, 7L, 1L, 2L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0
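One way to read the histograms above: each array entry is a count of operations that fell into a latency bucket, with bucket boundaries spaced roughly exponentially (Cassandra's EstimatedHistogram), so percentiles come from walking the cumulative counts. The sketch below shows the idea with made-up bucket boundaries, not the real EstimatedHistogram offsets:

```java
public class HistogramPercentile {
    // Given bucket upper bounds (micros) and per-bucket counts, return the
    // upper bound of the bucket containing the p-th percentile observation.
    static long percentile(long[] bucketUpperBounds, long[] counts, double p) {
        long total = 0;
        for (long c : counts) total += c;
        long target = (long) Math.ceil(total * p);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) return bucketUpperBounds[i];
        }
        return bucketUpperBounds[bucketUpperBounds.length - 1];
    }

    public static void main(String[] args) {
        // Hypothetical buckets: <=100us, <=1ms, <=10ms, <=100ms, <=1s.
        long[] bounds = {100, 1_000, 10_000, 100_000, 1_000_000};
        long[] counts = {50, 30, 10, 8, 2}; // 100 reads total
        System.out.println(percentile(bounds, counts, 0.50)); // 100
        System.out.println(percentile(bounds, counts, 0.99)); // 1000000
    }
}
```

This also explains why a single RecentReadLatencyMicros average (~133ms above) can disagree with client-side timings: the recent-histogram is reset on each read of the attribute, and a few multi-second outliers in the tail buckets dominate the mean.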
Re: Async write
Sorry, was typing faster than my brain was working. I was trying to say that at the higher level the request would block for the write operations in the other DCs to complete, as you say. And that was the ticket I was thinking of.

Thanks
Aaron

On 21 Feb, 2011, at 01:19 PM, Jonathan Ellis wrote:
> On Sun, Feb 20, 2011 at 1:36 PM, Aaron Morton wrote:
>>> On 19/02/2011, at 10:24 AM, Anthony John wrote:
>>>
>>> Facts as I understand them:
>>> - A write call to the db triggers a number of async writes to all nodes where
>>> the particular write should be recorded (and the nodes are up per Gossip and
>>> so on)
>>> - Once the desired CL number of writes acknowledge - the call returns
>>
>> AFAIK the write across the dc is sync as well.
>
> Anthony was correct: writes are asynchronous at the lowest level, and
> then we pick which to block for based on CL.
>
>> I think there is a ticket out there to make them async, take a look at Jira.
>
> You may be thinking of https://issues.apache.org/jira/browse/CASSANDRA-1530,
> which was completed for 0.7.1 but didn't change a/sync behavior.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
Re: Replacing Redis
On Sun, 2011-02-20 at 09:04 -0500, Benson Margulies wrote:
> There are two parts to the data footprint, one growing more rapidly
> than the other. The faster growing part has a natural 'redis-friendly'
> data model (keys map to values); but it would map to EHCache as well.

If you're coming into the non-relational world with the perspective that the main consideration is your "data model," you're approaching the problem from the wrong direction. The non-relational world isn't about BCNF or foreign keys. Storing and retrieving individual items is the easy part; we need to understand how you intend to *use* the data to help you understand if Cassandra is a good fit.

* How do you add new data?
* Other than accessing individual items by their primary key (or equivalent), how do you read it?
* What needs to be real-time? Asynchronous?
* What correctness or consistency constraints can you relax?
* What availability needs do you have?
* What programming languages are you working in?

> However, the initial redis/jredis implementation used redis, rather
> than local ordinary JVM objects, for the slower growing part. The
> lpush/ltrim usage was in here.

It's not clear just from your use of lpush/ltrim what you're doing. Some uses can map to Cassandra's APIs with some creative reinterpretation; others can't.

--
David Strauss | da...@davidstrauss.net | +1 512 577 5827 [mobile]
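For reference, the lpush/ltrim pair is commonly used in Redis as a capped, most-recent-first list. A minimal in-JVM sketch of those semantics (class and method names are hypothetical; the Cassandra analogue, a row of time-ordered columns trimmed by deletion, is exactly the mapping David says needs more detail about usage before recommending):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Emulates the Redis idiom: LPUSH key item; LTRIM key 0 maxLen-1.
// Keeps only the most recent maxLen items, newest first.
public class CappedList<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int maxLen;

    public CappedList(int maxLen) { this.maxLen = maxLen; }

    public void lpushThenTrim(T item) {
        items.addFirst(item);              // LPUSH: prepend the new item
        while (items.size() > maxLen) {    // LTRIM 0 maxLen-1: drop the tail
            items.removeLast();
        }
    }

    public List<T> asList() { return new ArrayList<>(items); }

    public static void main(String[] args) {
        CappedList<String> recent = new CappedList<>(3);
        for (String s : new String[]{"a", "b", "c", "d"}) {
            recent.lpushThenTrim(s);
        }
        System.out.println(recent.asList()); // [d, c, b]
    }
}
```

Whether this particular use maps cleanly to Cassandra depends on the answers to David's questions -- e.g. the trim step becomes a range delete of old columns, which has very different cost and consistency characteristics than Redis's O(1) in-memory operation.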
Re: Understand eventually consistent
On Fri, 2011-02-18 at 12:01 -0600, Anthony John wrote:
> Writes will go through w/ hinted handoff, reads will fail

That is not correct. Hinted handoffs do not count toward reaching QUORUM counts.[1]

[1] http://wiki.apache.org/cassandra/HintedHandoff

--
David Strauss | da...@davidstrauss.net | +1 512 577 5827 [mobile]
Re: Are row-keys sorted by the compareWith?
Hi Carlo,

As Jonathan mentions, the compareWith on a column family definition defines the order of the columns *within* a row. In order to control the ordering of rows you'll need to use the OrderPreservingPartitioner (http://www.datastax.com/docs/0.7/operations/clustering#tokens-partitioners-ring).

As for getColumnsFromRows: it should be returning you a map of lists. The map is insertion-order-preserving and populated based on the provided list of row keys (so if you iterate over the entries in the map they should be in the same order as the list of row keys). The list for each row entry is definitely in the order that Cassandra provides them; take a look at org.scale7.cassandra.pelops.Selector#toColumnList if you need more info.

Cheers,
Dan

--
Dan Washusen
Sent with Sparrow

On Saturday, 19 February 2011 at 8:16 AM, cbert...@libero.it wrote:
> Hi all,
> I created a CF in which I need to get the rows inside, sorted by time. Each
> row represents a comment.
>
> I've created a few rows using a generated TimeUUID as the row key, but when I call
> the Pelops method "GetColumnsFromRows" I don't get the data back as I expect:
> rows are not sorted by TimeUUID.
> I thought it was probably because of the random part of the TimeUUID, so I created
> a new CF ...
>
> This time I created a few rows using the Java System.currentTimeMillis(), which
> returns a long. I called "GetColumnsFromRows" again, and got the same
> result: the data are not sorted!
> I've read many times that rows are sorted as specified in the compareWith, but
> I can't see it.
> To solve this problem for the moment I've used a SuperColumnFamily with a
> UNIQUE ROW ... but I think this is just a workaround and not the solution.
>
> <ColumnFamily ... CompareSubcolumnsWith="BytesType"/>
>
> Now when I call "GetSuperColumnsFromRow" I get all the SuperColumns as I
> expected: sorted by TimeUUID. Why does the same not happen with the rows?
> I'm confused.
>
> TIA for any help.
>
> Best Regards
>
> Carlo
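The behavior Carlo sees follows from the partitioner: under the default RandomPartitioner, rows are ordered by the MD5 hash of their key, not by the key itself, so even monotonically increasing keys come back shuffled. A small standalone illustration of the idea (it just hashes keys the way the partitioner conceptually does; it does not use Cassandra code):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Comparator;

public class RowOrderDemo {
    // A token as the RandomPartitioner conceptually derives it: MD5 of the key.
    static BigInteger token(String key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest); // interpret digest as unsigned
    }

    static BigInteger tokenUnchecked(String key) {
        try { return token(key); }
        catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c"};
        String[] byToken = keys.clone();
        // Rows come back in token order, not key order.
        Arrays.sort(byToken, Comparator.comparing(RowOrderDemo::tokenUnchecked));
        System.out.println(Arrays.toString(byToken)); // [a, c, b]
    }
}
```

Key order a, b, c becomes a, c, b once sorted by hash, which is why switching to OrderPreservingPartitioner (or keeping the time-ordering inside one row's columns, as Carlo's workaround does) is the usual fix.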
Broken image links in Wiki:MemtableThresholds
Hello folks,
I'm translating the Wiki pages to Japanese now.
I found that all of the images in MemtableThresholds are broken:
http://wiki.apache.org/cassandra/MemtableThresholds

Can anyone fix the links?

Thanks
--
maki
RE: Broken image links in Wiki:MemtableThresholds
I think the Apache infra team is working on the issue:

https://issues.apache.org/jira/browse/INFRA-3352

-----Original Message-----
From: Maki Watanabe [mailto:watanabe.m...@gmail.com]
Sent: Monday, February 21, 2011 1:20 PM
To: user@cassandra.apache.org
Subject: Broken image links in Wiki:MemtableThresholds

Hello folks,
I'm translating the Wiki pages to Japanese now.
I found that all of the images in MemtableThresholds are broken:
http://wiki.apache.org/cassandra/MemtableThresholds

Can anyone fix the links?

Thanks
--
maki
Re: Broken image links in Wiki:MemtableThresholds
OK, I got it.

On 21 Feb 2011, at 13:37:
>
> I think the Apache infra team is working on the issue:
>
> https://issues.apache.org/jira/browse/INFRA-3352
>
> -----Original Message-----
> From: Maki Watanabe [mailto:watanabe.m...@gmail.com]
> Sent: Monday, February 21, 2011 1:20 PM
> To: user@cassandra.apache.org
> Subject: Broken image links in Wiki:MemtableThresholds
>
> Hello folks,
> I'm translating the Wiki pages to Japanese now.
> I found that all of the images in MemtableThresholds are broken:
> http://wiki.apache.org/cassandra/MemtableThresholds
>
> Can anyone fix the links?
>
> Thanks
> --
> maki

--
w3m