Nodes Flapping in the Ring
We have a new 6-node cluster running 0.6.13 (due to some client-side issues we need to be on 0.6.x for the time being) that we are injecting data into, and we ran into some issues with nodes going down and then up quickly in the ring. All nodes are affected and we have ruled out the network layer. It happens on all nodes and seems related to GC or memtable flushes. We had things stable, but after a series of data migrations we saw some swapping, so we tuned the max heap down; this helped with the swapping but the flapping still persists. The systems have 6 cores and 24 GB RAM, and the max heap is at 12G. We are using the Parallel GC collector for throughput. Our run file for starting cassandra looks like this:

exec 2>&1
ulimit -n 262144
cd /opt/cassandra-0.6.13
exec chpst -u cassandra java \
  -ea \
  -Xms4G \
  -Xmx12G \
  -XX:TargetSurvivorRatio=90 \
  -XX:+PrintGCDetails \
  -XX:+AggressiveOpts \
  -XX:+UseParallelGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=128 \
  -XX:MaxTenuringThreshold=0 \
  -Djava.rmi.server.hostname=10.20.3.155 \
  -Dcom.sun.management.jmxremote.port=8080 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcassandra-foreground=yes \
  -Dstorage-config=/etc/cassandra \
  -cp '/etc/cassandra:/opt/cassandra-0.6.13/lib/*' \
  org.apache.cassandra.thrift.CassandraDaemon <&-

Our storage conf looks like this for the mem/disk settings: mmap 4 64 32 64 16 64 256 0.3 60 12 32 periodic 1 864000 true

Any thoughts on this would be really appreciated. -- Jake Maizel Head of Network Operations Soundcloud Mail & GTalk: j...@soundcloud.com Skype: jakecloud Rosenthaler strasse 13, 101 19, Berlin, DE
Re: Off-heap caching through ByteBuffer.allocateDirect when JNA not available ?
Thanks for the answer. I saw the move to sun.misc. In what sense is allocateDirect broken? Thanks, Benoit. 2011/11/9 Jonathan Ellis : > allocateDirect is broken for this purpose, but we removed the JNA > dependency using sun.misc.Unsafe instead: > https://issues.apache.org/jira/browse/CASSANDRA-3271 > > On Wed, Nov 9, 2011 at 5:54 AM, Benoit Perroud wrote: >> Hi, >> >> I wonder if you have already discussed about ByteBuffer.allocateDirect >> alternative to JNA memory allocation ? >> >> If so, do someone mind send me a pointer ? >> >> Thanks ! >> >> Benoit. >> > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of DataStax, the source for professional Cassandra support > http://www.datastax.com > -- sent from my Nokia 3210
OOM: key caches, mmap?
Hi, I faced an issue similar to the one described here: http://comments.gmane.org/gmane.comp.db.cassandra.user/11184. I was running Cassandra 1.0.0 with a 3-node cluster on 3 t1.micro instances from Amazon EC2. I have no error in the cassandra logs, but an OOM in /var/log/kern.log which put one of my nodes down. After some searching I have found two possible reasons for it: 1 - I may be keeping too many keys cached (200,000, which is the default). I read that this cache is now off the heap; my heap is about 300M, the HEAP_NEWSIZE is about 75M, and the system uses about 150M, all out of 600M. What happens when I try to cache a row key and my memory is already full? How should I choose the number of keys cached? I saw that it is hard to calculate the amount of memory used by this cache. Should I start by caching 2,000 keys per CF and update my schema, progressively increasing the number of keys cached, if I see that it works fine? 2 - I read a lot about mmap without clearly understanding whether I should keep it "auto" or use "mmap_index_only". Anyway, I didn't find the "disk_access_mode" option in my cassandra.yaml, which makes me think it was removed and I have no choice but to keep the "auto" value. Thank you, Alain
Re: Data model for counting uniques online
what about keeping a record per device and recording there that you've seen it, and only incrementing the counters (or a different set of counters) based on that? On Wed, Nov 9, 2011 at 6:09 PM, Philippe wrote: > Hello, I'd like to get some ideas on how to model counting uniques with > cassandra. > My use-case is that I have various counters that I increment based on data > received from multiple devices. I'd like to be able to know if at least X > unique devices contributed to a counter value. > I've thought of the idea of building a key per counter and then using > columns whose name would be the device ids but it might grow very very > large and I don't see how to prune it effeciently. > Any ideas ? > > Thanks >
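A minimal pycassa sketch of the approach suggested above: remember each device in a regular column family and only bump the counter the first time a device is seen. The keyspace, the column family names ('seen_devices', 'unique_counts') and the use of pycassa itself are illustrative assumptions, not details from the thread:

import pycassa
from pycassa import NotFoundException

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])   # hypothetical keyspace/host
seen = pycassa.ColumnFamily(pool, 'seen_devices')    # row key = counter id, column name = device id
counts = pycassa.ColumnFamily(pool, 'unique_counts') # counter CF: one counter column per counter id

def record_contribution(counter_id, device_id):
    # Increment the unique-device counter only the first time this device shows up.
    try:
        seen.get(counter_id, columns=[device_id])
        # Device already recorded for this counter: nothing to do.
    except NotFoundException:
        # First sighting: remember the device, then bump the counter.
        # The check-then-act is not atomic, so concurrent writers can occasionally
        # double-count; usually acceptable for an "at least X uniques" test.
        seen.insert(counter_id, {device_id: ''})
        counts.add(counter_id, 'uniques', 1)

Pruning then becomes a matter of removing the 'seen_devices' row for a counter when it is retired, rather than walking an ever-growing list of device columns on every increment.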
RE: propertyfilesnitch problem
At first, I was also thinking that one or more nodes in the cluster are broken or not responding. But through nodetool cfstats, it looks like all the nodes are working as expected and pings gives me the expected inter-node latencies. Also the scores calculated by dynamic snitch in the steady state seem to correspond to how we configure the network topology. We're not timing out, but comparing periods when the dynamic snitch has appropriate scores and when it doesn't, the latency of LOCAL_QUORUM operations gets bumped up from ~10ms to ~100ms. Quorum operations remain at ~100ms regardless of dynamic snitch settings. We maintain the consistent load through the tests and there are no feedback mechanisms. Thanks, Shu From: sc...@scode.org [sc...@scode.org] On Behalf Of Peter Schuller [peter.schul...@infidyne.com] Sent: Wednesday, November 09, 2011 11:07 PM To: user@cassandra.apache.org Subject: Re: propertyfilesnitch problem > 2. With the same setup, after each period as defined by > dynamic_snitch_reset_interval_in_ms, the LOCAL_QUORUM performance greatly > degrades before drastically improving again within a minute. This part sounds to me like one or more nodes in the cluster are either broken and not responding at all, or overloaded. Restarts will tend to temporarily cause additional pressure on nodes (particularly I/O due to cache eviction issues). Because the dynamic snitch won't ever know that the node is slow (after a reset) until requests start actually timing out, it can be up to rpc_timeout second before it gets snitched away. That sounds like what you're seeing. On ever reset, an rpc_timeout period of poor latency for clients. Is rpc_timeout 60 seconds? > 4. With dynamic snitch turned on, QUORUM operations' performance is about the > same as using LOCAL_QUORUM when the dynamic snitch is off or the first minute > after a restart with the snitch turned on. This is strange, unless it is co-incidental. Can you be more specific about the performance characteristics you're seeing when degraded? For example: * High latency, or timeouts? * Are you getting Unavailable exceptions? * Are you maintaining the same overall throughput or is there a feedback mechanism such that when queries have high latency the request rate decreases? * Which data points are you using to consider something degraded? What's matching in the QUORUM and LOCAL_QUOROM w/o dynsnitch cases? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
indexes from CassandraSF
hi, i've been looking at the model below from Ed Anuff's presentation at Cassandra SF (http://www.slideshare.net/edanuff/indexing-in-cassandra). Couple of questions: 1) Isn't there still the chance that two concurrent updates may end up with the index containing two entries for the given user, only one of which would match the actual value in the Users cf? 2) What happens if your batch fails partway through the update? If i understand correctly there are no guarantees about ordering when a batch is executed, so isn't it possible that e.g. the previous value entries in Users_Index_Entries may have been deleted, and then the batch fails before the entries in Indexes are deleted, i.e. the mechanism has 'lost' those values? I assume this can be addressed by not deleting the old entries until the batch has succeeded (i.e. put the previous entry deletion into a separate, subsequent batch); this at least lets you retry at a later time. perhaps i'm missing something?

SELECT {"location"}..{"location", *} FROM Users_Index_Entries WHERE KEY = <user_key>;

BEGIN BATCH
DELETE {"location", ts1}, {"location", ts2}, ... FROM Users_Index_Entries WHERE KEY = <user_key>;
DELETE {<old_value1>, <user_key>, ts1}, {<old_value2>, <user_key>, ts2}, ... FROM Indexes WHERE KEY = "Users_By_Location";
UPDATE Users_Index_Entries SET {"location", ts3} = <new_value> WHERE KEY = <user_key>;
UPDATE Indexes SET {<new_value>, <user_key>, ts3} = null WHERE KEY = "Users_By_Location";
UPDATE Users SET location = <new_value> WHERE KEY = <user_key>;
APPLY BATCH
1.0.2 Assertion Error
Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to 1.0.2 (it's been running fine for a few hours now): INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237) Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes, 478913 ops) ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[FlushWriter:9,5,main] java.lang.AssertionError: CF size changed during serialization: was 4 initially but 3 written at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94) at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47) at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) After taking a quick peek at the code, it looks like the numbers ("4 initially but 3 written") refer to the number of columns (in the memtable?). Given the byte size of the memtable being flushed, there are *certainly* more than 4 columns. Besides this error, there are no other unexpected or unusual log entries and the node seems to be behaving normally. Thoughts? Dan Hendry
Bug?
All, In 0.8.6 I got myself into a bit of a fix. First I tried to drop a column family. This failed because I didn't have JNA installed (known and documented). To fix this I drained the node, stopped the process, installed JNA, and restarted C*. Unfortunately this led to an inconsistency in schema versions. Apparently the schema was successfully modified locally during the drop command even though the snapshot failed. In system.log HH then began to fail: ERROR [HintedHandoff:4] 2011-11-10 19:24:35,628 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[HintedHandoff:4,1,main] java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /10.252.229.235 in 6ms at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.lang.RuntimeException: Could not reach schema agreement with /10.252.229.235 in 6ms at org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:293) at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:304) at org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:89) at org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:397) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) This edge case might be prevented if the schema were updated as the final step of a DROP. Thoughts on whether this is worth submitting as a bug? Ian
RE: 1.0.2 Assertion Error
Just happened again, seems to be with the same column family (at least on a flusher thread for which the last activity was flushing a memtable for that CF). It also looks like MemtablePostFlusher tasks are blocked (and not getting cleared) as evidenced by tpstats:

Pool Name              Active   Pending   Completed   Blocked   All time blocked
MemtablePostFlusher         1        18          16         0                  0

Dan

From: Dan Hendry [mailto:dan.hendry.j...@gmail.com] Sent: November-10-11 14:49 To: 'user@cassandra.apache.org' Subject: 1.0.2 Assertion Error Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to 1.0.2 (it's been running fine for a few hours now): INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237) Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes, 478913 ops) ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[FlushWriter:9,5,main] java.lang.AssertionError: CF size changed during serialization: was 4 initially but 3 written at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94) at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47) at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) After taking a quick peek at the code, it looks like the numbers ("4 initially but 3 written") refer to the number of columns (in the memtable?). Given the byte size of the memtable being flushed, there are *certainly* more than 4 columns. Besides this error, there are no other unexpected or unusual log entries and the node seems to be behaving normally. Thoughts? Dan Hendry
Re: 1.0.2 Assertion Error
That would be a bug (as any assertion error would be), likely some race condition. Could you open a ticket? The fact that this block the MemtablePostFlusher is unfortunately related. Restarting the node would fix but we need to make that more solid too. -- Sylvain On Thu, Nov 10, 2011 at 9:04 PM, Dan Hendry wrote: > Just happened again, seems to be with the same column family (at least on a > flusher thread for which the last activity was flushing a memtable for that > CF). > > > > It also looks like MemtablePostFlusher tasks blocked (and not getting > cleared) as evidenced by tpstats: > > > > Pool Name Active Pending Completed Blocked All > time blocked > > MemtablePostFlusher 1 18 16 > 0 0 > > > > Dan > > > > From: Dan Hendry [mailto:dan.hendry.j...@gmail.com] > Sent: November-10-11 14:49 > To: 'user@cassandra.apache.org' > Subject: 1.0.2 Assertion Error > > > > Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to > 1.0.2 (its been running fine for a few hours now): > > > > INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237) > Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes, > 478913 ops) > > ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java > (line 133) Fatal exception in thread Thread[FlushWriter:9,5,main] > > java.lang.AssertionError: CF size changed during serialization: was 4 > initially but 3 written > > at > org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94) > > at > org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112) > > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177) > > at > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264) > > at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47) > > at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289) > > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > > at java.lang.Thread.run(Thread.java:662) > > > > > > After taking a quick peek at the code, it looks like the numbers (“4 > initially but 3 written”) refer to the number of columns (in the memtable?). > Given the byte size of the memtable being flushed, there are *certainly* > more than 4 columns. Besides this error, there are no other unexpected or > unusual log entries and the node seems to be behaving normally. > > > > Thoughts? > > > > Dan Hendry
Data retrieval inconsistent
I am facing an issue in 0.8.7 cluster - - I have two clusters in two DCs (rather one cross dc cluster) and two keyspaces. But i have only configured one keyspace to replicate data to the other DC and the other keyspace to not replicate over to the other DC. Basically this is the way i ran the keyspace creation - create keyspace K1 with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}]; create keyspace K2 with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = [{DC1:2, DC2:2}]; I had to do this because i expect that K1 will get a large volume of data and i do not want this wired over to the other DC. I am writing the data at CL=ONE and reading the data at CL=ONE. I am seeing an issue where sometimes i get the data and other times i do not see the data. Does anyone know what could be going on here? A second larger question is - i am migrating from 0.7.4 to 0.8.7 , i can see that there are large changes in the yaml file, but a specific question i had was - how do i configure disk_access_mode like it used to be in 0.7.4? One observation i have made is that some nodes of the cross dc cluster are at different system times. This is something to fix but could this be why data is sometimes retrieved and other times not? Or is there some other thing to it? Would appreciate a quick response.
Re: Data retrieval inconsistent
On Thu, Nov 10, 2011 at 3:27 PM, Subrahmanya Harve < subrahmanyaha...@gmail.com> wrote: > I am facing an issue in 0.8.7 cluster - > > - I have two clusters in two DCs (rather one cross dc cluster) and two > keyspaces. But i have only configured one keyspace to replicate data to the > other DC and the other keyspace to not replicate over to the other DC. > Basically this is the way i ran the keyspace creation - >create keyspace K1 with > placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and > strategy_options = [{replication_factor:1}]; >create keyspace K2 with > placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' > and strategy_options = [{DC1:2, DC2:2}]; > > I had to do this because i expect that K1 will get a large volume of data > and i do not want this wired over to the other DC. > > I am writing the data at CL=ONE and reading the data at CL=ONE. I am seeing > an issue where sometimes i get the data and other times i do not see the > data. Does anyone know what could be going on here? > > A second larger question is - i am migrating from 0.7.4 to 0.8.7 , i can > see that there are large changes in the yaml file, but a specific question > i had was - how do i configure disk_access_mode like it used to be in > 0.7.4? > > One observation i have made is that some nodes of the cross dc cluster are > at different system times. This is something to fix but could this be why > data is sometimes retrieved and other times not? Or is there some other > thing to it? > > Would appreciate a quick response. > disk_access_mode is no longer required. Cassandra attempts to detect 64 bit and set this correctly. If your systems are not running NTP and as a result their clocks are not in sync this can cause issues (with ttl columns). Your client machines must have correct time as well.
RE: 1.0.2 Assertion Error
https://issues.apache.org/jira/browse/CASSANDRA-3482 I restarted the node and the problem has cropped up again. Is it possible to downgrade back to 0.8? Is there any way to convert 'h' version SSTables to the old 'g' version? Any other data file changes to be aware of? Dan -Original Message- From: Sylvain Lebresne [mailto:sylv...@datastax.com] Sent: November-10-11 15:23 To: user@cassandra.apache.org Subject: Re: 1.0.2 Assertion Error That would be a bug (as any assertion error would be), likely some race condition. Could you open a ticket? The fact that this block the MemtablePostFlusher is unfortunately related. Restarting the node would fix but we need to make that more solid too. -- Sylvain On Thu, Nov 10, 2011 at 9:04 PM, Dan Hendry wrote: > Just happened again, seems to be with the same column family (at least on a > flusher thread for which the last activity was flushing a memtable for that > CF). > > > > It also looks like MemtablePostFlusher tasks blocked (and not getting > cleared) as evidenced by tpstats: > > > > Pool Name Active Pending Completed Blocked All > time blocked > > MemtablePostFlusher 1 18 16 > 0 0 > > > > Dan > > > > From: Dan Hendry [mailto:dan.hendry.j...@gmail.com] > Sent: November-10-11 14:49 > To: 'user@cassandra.apache.org' > Subject: 1.0.2 Assertion Error > > > > Just saw this weird assertion after upgrading one of my nodes from 0.8.6 to > 1.0.2 (its been running fine for a few hours now): > > > > INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237) > Writing Memtable-Data@1388955390(25676955/430716097 serialized/live bytes, > 478913 ops) > > ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java > (line 133) Fatal exception in thread Thread[FlushWriter:9,5,main] > > java.lang.AssertionError: CF size changed during serialization: was 4 > initially but 3 written > > at > org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94) > > at > org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112) > > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177) > > at > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264) > > at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47) > > at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289) > > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > > at java.lang.Thread.run(Thread.java:662) > > > > > > After taking a quick peek at the code, it looks like the numbers ("4 initially but 3 written") refer to the number of columns (in the memtable?). > Given the byte size of the memtable being flushed, there are *certainly* > more than 4 columns. Besides this error, there are no other unexpected or > unusual log entries and the node seems to be behaving normally. > > > > Thoughts? > > > > Dan Hendry
Re: Data retrieval inconsistent
I am pretty sure the way you have K1 configured it will be placed across both DC's as if you had large ring. If you want it only in DC1 you need to say DC1:1, DC2:0. If you are writing and reading at ONE you are not guaranteed to get the data if RF > 1. If RF = 2, and you write with ONE, you data could be written to server 1, and then read from server 2 before it gets over there. The differing on server times will only really matter for TTL's. Most everything else works off comparing user supplied times. -Jeremiah On 11/10/2011 02:27 PM, Subrahmanya Harve wrote: I am facing an issue in 0.8.7 cluster - - I have two clusters in two DCs (rather one cross dc cluster) and two keyspaces. But i have only configured one keyspace to replicate data to the other DC and the other keyspace to not replicate over to the other DC. Basically this is the way i ran the keyspace creation - create keyspace K1 with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}]; create keyspace K2 with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = [{DC1:2, DC2:2}]; I had to do this because i expect that K1 will get a large volume of data and i do not want this wired over to the other DC. I am writing the data at CL=ONE and reading the data at CL=ONE. I am seeing an issue where sometimes i get the data and other times i do not see the data. Does anyone know what could be going on here? A second larger question is - i am migrating from 0.7.4 to 0.8.7 , i can see that there are large changes in the yaml file, but a specific question i had was - how do i configure disk_access_mode like it used to be in 0.7.4? One observation i have made is that some nodes of the cross dc cluster are at different system times. This is something to fix but could this be why data is sometimes retrieved and other times not? Or is there some other thing to it? Would appreciate a quick response.
Not all nodes see the complete ring
I'm curious if anyone has ever seen this happen or has any idea how it would happen. I have a 10-node cluster with 5 nodes in each data center running 0.6 (we're working on the upgrade now). I had several nodes with forgotten deletes, so I failed the nodes and bootstrapped them back into the cluster one at a time. Everything seemed fine, but now I'm noticing that all systems in one of my data centers see all 10 nodes and all systems in the other data center see just 9. I'm figuring now is the time to fail the node that only half the other nodes can see, but what would cause this to happen? -Tim Smith

DC1
10.x.x.45  Up  162065212751151145161126595807335373    |<--|
10.x.x.44  Up  5452449782323250074504667089218893518   | ^
10.x.x.43  Up  8114257989534302620064490155463988554   v |
10.x.x.46  Up  21422567192334579300859480282267974118  | ^
10.x.x.60  Up  54861697885175209049354960363878287097  v |
10.x.x.69  Up  154328840302872203985032035664154382201 | ^
10.x.x.62  Up  156995951391754654763374624484548356765 v |
10.x.x.61  Up  158321671343439891722659169797597266747 | ^
10.x.x.47  Up  159657096314745030420102789742477598562 |-->|

DC2
10.x.x.45  Up  162065212751151145161126595807335373    |<--|
10.x.x.44  Up  5452449782323250074504667089218893518   | ^
10.x.x.43  Up  8114257989534302620064490155463988554   v |
10.x.x.46  Up  21422567192334579300859480282267974118  | ^
10.x.x.60  Up  54861697885175209049354960363878287097  v |
10.x.x.71  Up  149032703168324939856639604911542585192 | ^
10.x.x.69  Up  154328840302872203985032035664154382201 v |
10.x.x.62  Up  156995951391754654763374624484548356765 | ^
10.x.x.61  Up  158321671343439891722659169797597266747 v |
10.x.x.47  Up  159657096314745030420102789742477598562 |-->|
Re: Data retrieval inconsistent
Thanks Ed and Jeremiah for that useful info. "I am pretty sure the way you have K1 configured it will be placed across both DC's as if you had large ring. If you want it only in DC1 you need to say DC1:1, DC2:0." Infact i do want K1 to be available across both DCs as if i had a large ring. I just do not want them to replicate over across DCs. Also i did try doing it like you said DC1:1, DC2:0 but wont that mean that, all my data goes into DC1 irrespective of whether the data is getting into the nodes of DC1 or DC2, thereby creating a "hot DC"? Since the volume of data for this case is huge, that might create a load imbalance on DC1? (Am i missing something?) On Thu, Nov 10, 2011 at 1:30 PM, Jeremiah Jordan < jeremiah.jor...@morningstar.com> wrote: > I am pretty sure the way you have K1 configured it will be placed across > both DC's as if you had large ring. If you want it only in DC1 you need to > say DC1:1, DC2:0. > If you are writing and reading at ONE you are not guaranteed to get the > data if RF > 1. If RF = 2, and you write with ONE, you data could be > written to server 1, and then read from server 2 before it gets over there. > > The differing on server times will only really matter for TTL's. Most > everything else works off comparing user supplied times. > > -Jeremiah > > > On 11/10/2011 02:27 PM, Subrahmanya Harve wrote: > >> >> I am facing an issue in 0.8.7 cluster - >> >> - I have two clusters in two DCs (rather one cross dc cluster) and two >> keyspaces. But i have only configured one keyspace to replicate data to the >> other DC and the other keyspace to not replicate over to the other DC. >> Basically this is the way i ran the keyspace creation - >>create keyspace K1 with placement_strategy='org.** >> apache.cassandra.locator.**SimpleStrategy' and strategy_options = >> [{replication_factor:1}]; >>create keyspace K2 with placement_strategy='org.** >> apache.cassandra.locator.**NetworkTopologyStrategy' and strategy_options >> = [{DC1:2, DC2:2}]; >> >> I had to do this because i expect that K1 will get a large volume of data >> and i do not want this wired over to the other DC. >> >> I am writing the data at CL=ONE and reading the data at CL=ONE. I am >> seeing an issue where sometimes i get the data and other times i do not see >> the data. Does anyone know what could be going on here? >> >> A second larger question is - i am migrating from 0.7.4 to 0.8.7 , i can >> see that there are large changes in the yaml file, but a specific question >> i had was - how do i configure disk_access_mode like it used to be in 0.7.4? >> >> One observation i have made is that some nodes of the cross dc cluster >> are at different system times. This is something to fix but could this be >> why data is sometimes retrieved and other times not? Or is there some other >> thing to it? >> >> Would appreciate a quick response. >> >
changing ownership when nodes have the same token
Hello, I notice that when starting a new node with the same configuration (cluster name, seeds, token, etc.) as an existing ring member, the new node will take over ownership from the existing ring member. Is this expected behavior? I would like to see Cassandra prevent the new node from joining the ring, since in this case the new node is misconfigured. Feng
Re: Data retrieval inconsistent
No, that is what I thought you wanted. I was thinking your machines in DC1 had extra disk space or something... (I stopped replying to the dev list) On 11/10/2011 04:09 PM, Subrahmanya Harve wrote: Thanks Ed and Jeremiah for that useful info. "I am pretty sure the way you have K1 configured it will be placed across both DC's as if you had large ring. If you want it only in DC1 you need to say DC1:1, DC2:0." Infact i do want K1 to be available across both DCs as if i had a large ring. I just do not want them to replicate over across DCs. Also i did try doing it like you said DC1:1, DC2:0 but wont that mean that, all my data goes into DC1 irrespective of whether the data is getting into the nodes of DC1 or DC2, thereby creating a "hot DC"? Since the volume of data for this case is huge, that might create a load imbalance on DC1? (Am i missing something?) On Thu, Nov 10, 2011 at 1:30 PM, Jeremiah Jordan < jeremiah.jor...@morningstar.com> wrote: > I am pretty sure the way you have K1 configured it will be placed across > both DC's as if you had large ring. If you want it only in DC1 you need to > say DC1:1, DC2:0. > If you are writing and reading at ONE you are not guaranteed to get the > data if RF > 1. If RF = 2, and you write with ONE, you data could be > written to server 1, and then read from server 2 before it gets over there. > > The differing on server times will only really matter for TTL's. Most > everything else works off comparing user supplied times. > > -Jeremiah > > > On 11/10/2011 02:27 PM, Subrahmanya Harve wrote: > >> >> I am facing an issue in 0.8.7 cluster - >> >> - I have two clusters in two DCs (rather one cross dc cluster) and two >> keyspaces. But i have only configured one keyspace to replicate data to the >> other DC and the other keyspace to not replicate over to the other DC. >> Basically this is the way i ran the keyspace creation - >>create keyspace K1 with placement_strategy='org.** >> apache.cassandra.locator.**SimpleStrategy' and strategy_options = >> [{replication_factor:1}]; >>create keyspace K2 with placement_strategy='org.** >> apache.cassandra.locator.**NetworkTopologyStrategy' and strategy_options >> = [{DC1:2, DC2:2}]; >> >> I had to do this because i expect that K1 will get a large volume of data >> and i do not want this wired over to the other DC. >> >> I am writing the data at CL=ONE and reading the data at CL=ONE. I am >> seeing an issue where sometimes i get the data and other times i do not see >> the data. Does anyone know what could be going on here? >> >> A second larger question is - i am migrating from 0.7.4 to 0.8.7 , i can >> see that there are large changes in the yaml file, but a specific question >> i had was - how do i configure disk_access_mode like it used to be in 0.7.4? >> >> One observation i have made is that some nodes of the cross dc cluster >> are at different system times. This is something to fix but could this be >> why data is sometimes retrieved and other times not? Or is there some other >> thing to it? >> >> Would appreciate a quick response. >> >
Mass deletion -- slowing down
Hello, My data load comes in batches representing one day in the life of a large computing facility. I index the data by the day it was produced, to be able to quickly pull data for a specific day within the last year or two. There are 6 other indexes. When it comes to retiring the data, I intend to delete it for the oldest date and after that add a fresh batch of data, so I control the disk space. Therein lies a problem -- and it may be Pycassa-related, so I also filed an issue on github -- when I select by 'DATE=blah' and then do a batch remove, it works fine for a while, and then after a few thousand deletions (done in batches of 1000) it grinds to a halt, i.e. I can no longer iterate the result, which manifests in a timeout error. Has this behavior been seen before? Cassandra version is 0.8.6, Pycassa 1.3.0. TIA, Maxim
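For reference, here is roughly what the select-then-batch-remove workflow described above looks like with the pycassa 1.3-style API. The pool, the column family name and the 'DATE' index are assumptions for illustration; this only restates the pattern, it is not a fix for the slowdown:

import pycassa
from pycassa.index import create_index_expression, create_index_clause

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])   # hypothetical
cf = pycassa.ColumnFamily(pool, 'Data')                           # hypothetical CF with an index on 'DATE'

def retire_day(date_value, batch_size=1000):
    # Find rows whose indexed DATE column equals date_value and queue row-level deletes.
    expr = create_index_expression('DATE', date_value)
    clause = create_index_clause([expr], count=batch_size)
    batch = cf.batch(queue_size=batch_size)
    # Depending on the pycassa version, get_indexed_slices may page through all
    # matches or return at most `count` rows, so this may need to be called repeatedly.
    for key, _columns in cf.get_indexed_slices(clause):
        batch.remove(key)
    batch.send()   # flush any deletes still queued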
Is there a way to get only keys with get_indexed_slices?
Is there a way to get only keys with get_indexed_slices? Looking at the code, it's not possible, but -- is there some way anyhow? I don't want to extract any data, just a list of matching keys. TIA, Maxim
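For what it's worth, the Thrift get_indexed_slices call returns key slices (keys plus a column slice) rather than bare keys, so a common workaround is to make the requested slice as small as possible and use only the keys. A sketch with pycassa, where the column family and index names are assumptions:

import pycassa
from pycassa.index import create_index_expression, create_index_clause

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])   # hypothetical
cf = pycassa.ColumnFamily(pool, 'Data')                           # hypothetical CF, 'DATE' is indexed

expr = create_index_expression('DATE', 'blah')
clause = create_index_clause([expr], count=1000)

# Ask for just the indexed column so each returned row is as small as possible,
# then throw the column data away and keep the keys.
keys = [key for key, _cols in cf.get_indexed_slices(clause, columns=['DATE'])]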
Re: Data retrieval inconsistent
Thanks. I'm gonna try and use QUORUM to read and/or write and see if data is returned consistently. On Thu, Nov 10, 2011 at 3:00 PM, Jeremiah Jordan < jeremiah.jor...@morningstar.com> wrote: > No, that is what I thought you wanted. I was thinking your machines in > DC1 had extra disk space or something... > > (I stopped replying to the dev list) > > > On 11/10/2011 04:09 PM, Subrahmanya Harve wrote: > > Thanks Ed and Jeremiah for that useful info. > "I am pretty sure the way you have K1 configured it will be placed across > both DC's as if you had large ring. If you want it only in DC1 you need to > say DC1:1, DC2:0." > Infact i do want K1 to be available across both DCs as if i had a large > ring. I just do not want them to replicate over across DCs. Also i did try > doing it like you said DC1:1, DC2:0 but wont that mean that, all my data > goes into DC1 irrespective of whether the data is getting into the nodes of > DC1 or DC2, thereby creating a "hot DC"? Since the volume of data for this > case is huge, that might create a load imbalance on DC1? (Am i missing > something?) > > > On Thu, Nov 10, 2011 at 1:30 PM, Jeremiah Jordan < > jeremiah.jor...@morningstar.com> wrote: > > > I am pretty sure the way you have K1 configured it will be placed across > > both DC's as if you had large ring. If you want it only in DC1 you need > to > > say DC1:1, DC2:0. > > If you are writing and reading at ONE you are not guaranteed to get the > > data if RF > 1. If RF = 2, and you write with ONE, you data could be > > written to server 1, and then read from server 2 before it gets over > there. > > > > The differing on server times will only really matter for TTL's. Most > > everything else works off comparing user supplied times. > > > > -Jeremiah > > > > > > On 11/10/2011 02:27 PM, Subrahmanya Harve wrote: > > > >> > >> I am facing an issue in 0.8.7 cluster - > >> > >> - I have two clusters in two DCs (rather one cross dc cluster) and two > >> keyspaces. But i have only configured one keyspace to replicate data to > the > >> other DC and the other keyspace to not replicate over to the other DC. > >> Basically this is the way i ran the keyspace creation - > >>create keyspace K1 with placement_strategy='org.** > >> apache.cassandra.locator.**SimpleStrategy' and strategy_options = > >> [{replication_factor:1}]; > >>create keyspace K2 with placement_strategy='org.** > >> apache.cassandra.locator.**NetworkTopologyStrategy' and strategy_options > > >> = [{DC1:2, DC2:2}]; > >> > >> I had to do this because i expect that K1 will get a large volume of > data > >> and i do not want this wired over to the other DC. > >> > >> I am writing the data at CL=ONE and reading the data at CL=ONE. I am > >> seeing an issue where sometimes i get the data and other times i do not > see > >> the data. Does anyone know what could be going on here? > >> > >> A second larger question is - i am migrating from 0.7.4 to 0.8.7 , i > can > >> see that there are large changes in the yaml file, but a specific > question > >> i had was - how do i configure disk_access_mode like it used to be in > 0.7.4? > >> > >> One observation i have made is that some nodes of the cross dc cluster > >> are at different system times. This is something to fix but could this > be > >> why data is sometimes retrieved and other times not? Or is there some > other > >> thing to it? > >> > >> Would appreciate a quick response. > >> > > > >
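For what it's worth, a small pycassa sketch of reading and writing the same column family at QUORUM (the client library and names are assumed here, since the thread does not say which client is in use). With quorum writes and quorum reads, the read set always overlaps the replicas that acknowledged the write (R + W > RF), which closes the read-at-ONE-before-replication window described above:

import pycassa
from pycassa import ConsistencyLevel

pool = pycassa.ConnectionPool('K2', ['10.0.0.1:9160'])   # hypothetical host; K2 from the thread

# Every read and write through this ColumnFamily object defaults to QUORUM.
cf = pycassa.ColumnFamily(pool, 'MyCF',                  # hypothetical CF name
                          read_consistency_level=ConsistencyLevel.QUORUM,
                          write_consistency_level=ConsistencyLevel.QUORUM)

cf.insert('some_key', {'col': 'value'})   # acknowledged by a quorum of replicas
row = cf.get('some_key')                  # read from a quorum, so it sees the write above

Note that for K1, which was created with replication_factor 1, QUORUM and ONE are effectively the same thing, so this only changes behavior where RF > 1.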
is it possible to add more data structures (key-list) in cassandra?
I think cassandra is doing a great job as a key-value data store; it has saved me tremendous work on maintaining data consistency and service availability. But I think it would be great if it could support more data structures such as key-list; currently I am using key-value to store the list, which does not seem very efficient. Redis handles this well but it is not easy to scale. Maybe this is the wrong place and the wrong question, but I am curious whether there is already a solution for this. Thanks a lot!
Efficient map reduce over ranges of Cassandra data
Hey all, I know there are several tickets in the pipe that should make it possible to use secondary indexes to run map reduce jobs that do not have to ingest the entire dataset, such as: https://issues.apache.org/jira/browse/CASSANDRA-1600 I had ended up creating a sharded secondary index in user space (I just call it ordered buckets), described here: http://www.slideshare.net/edwardcapriolo/casbase-presentation/27 Looking at the ordered buckets implementation I realized it is a perfect candidate for "efficient map reduce" since it is easy to split. A unit test of that implementation is here: https://github.com/edwardcapriolo/casbase/blob/master/src/test/java/com/jointhegrid/casbase/hadoop/OrderedBucketInputFormatTest.java With this you can already do efficient map reduce on Cassandra data while waiting for other integrated solutions to come along.
range slice with TimeUUID column names
I am using Hector to do a range query for a column family that uses TimeUUIDs as column names. However, I'm not sure how to create the "range". I figured I'd create some UUIDs using the com.eaio.uuid library with timestamps for the range I was interested in. When trying this, I don't get any results back even though I am sure there are UUIDs in the time range. Is this the correct way to do this? Here's how I'm creating my range UUIDs, where 'start' and 'finish' are timestamps from java.util.Date:

UUID startId = new UUID(UUIDGen.createTime(start), UUIDGen.getClockSeqAndNode());
UUID finishId = new UUID(UUIDGen.createTime(finish), UUIDGen.getClockSeqAndNode());
Re: 1.0.2 Assertion Error
On 10.11.2011 22:18, Dan Hendry wrote: Is it possible to downgrade back to 0.8? Is there any way to convert 'h' version SSTables to the old 'g' version? Any other data file changes to be aware of? Try adding a 0.8 node to the cluster and decommissioning the 1.0 node; maybe 0.8 will understand streams from 1.0.
Re: is it possible to add more data structures (key-list) in cassandra?
On 11.11.2011 5:58, Yan Chunlu wrote: I think cassandra is doing great job on key-value data store, it saved me tremendous work on maintain the data consistency and service availability. But I think it would be great if it could support more data structures such as key-list, currently I am using key-value save the list, it seems not very efficiency. Redis has a good point on this but it is not easy to scale. Maybe it is a wrong place and wrong question, only curious if there is already solution about this, thanks a lot! Use supercolumns unless your lists are very large.
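A rough pycassa illustration of the supercolumn suggestion (all names are hypothetical): one row per list, one supercolumn per list entry, and the subcolumns carry the entry's fields, so the whole list can be read back with a single row read:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])   # hypothetical
lists = pycassa.ColumnFamily(pool, 'Lists')   # hypothetical CF defined with column_type='Super'

# Append two entries to the list stored under row key 'user:42'.
lists.insert('user:42', {
    'item-0001': {'value': 'first element'},
    'item-0002': {'value': 'second element'},
})

# Read the whole list back; entries come back ordered by supercolumn name.
items = lists.get('user:42')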
configurable bloom filters (like HBase)
I have a problem with a large CF (about 200 billion entries per node). While I can configure index_interval to lower memory requirements, I still have to stick with huge bloom filters. It would be ideal to have bloom filters configurable like in HBase. The Cassandra standard is about 1.05% false positives, but in my case I would be fine even with a 20% false positive rate. The data is not often read back; most of it will never be read before it expires via TTL.
Re: Off-heap caching through ByteBuffer.allocateDirect when JNA not available ?
I would like to know this also - actually it should be similar, plus there would be no dependencies on sun.misc packages. Regards, Maciej On Thu, Nov 10, 2011 at 1:46 PM, Benoit Perroud wrote: > Thanks for the answer. > I saw the move to sun.misc. > In what sense allocateDirect is broken ? > > Thanks, > > Benoit. > > > 2011/11/9 Jonathan Ellis : > > allocateDirect is broken for this purpose, but we removed the JNA > > dependency using sun.misc.Unsafe instead: > > https://issues.apache.org/jira/browse/CASSANDRA-3271 > > > > On Wed, Nov 9, 2011 at 5:54 AM, Benoit Perroud > wrote: > >> Hi, > >> > >> I wonder if you have already discussed about ByteBuffer.allocateDirect > >> alternative to JNA memory allocation ? > >> > >> If so, do someone mind send me a pointer ? > >> > >> Thanks ! > >> > >> Benoit. > >> > > > > > > > > -- > > Jonathan Ellis > > Project Chair, Apache Cassandra > > co-founder of DataStax, the source for professional Cassandra support > > http://www.datastax.com > > > > > > -- > sent from my Nokia 3210 >
configurable index_interval per keyspace
It would be good to have index_interval configurable per keyspace, preferably in cassandra.yaml, because I use it to tune nodes that are running out of memory without noticeably affecting performance.