Sorting by timestamp
Hi, I have a list of posts I'm trying to insert into Cassandra. Each post already has a timestamp (in the past) that is not necessarily unique. I'm trying to insert these posts into Cassandra so that when I retrieve them, they come back in chronological order. I can't guarantee that the order in which I insert them matches the actual timestamps of the posts. Can someone help me find a solution for this? Thanks. Erik
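For illustration, a minimal sketch (Python driver, CQL 3; keyspace, table and column names are hypothetical) of the usual way to get chronological reads regardless of insertion order: cluster on the post's own timestamp, with a uuid as a tie-breaker for timestamps that are not unique.

-- sketch --
# Sketch: store posts clustered by their original timestamp so reads come
# back in chronological order. All names below are hypothetical.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.posts (
        blog_id text,           -- partition key: all posts for one blog
        created_at timestamp,   -- the post's original timestamp
        post_id uuid,           -- tie-breaker for identical timestamps
        body text,
        PRIMARY KEY (blog_id, created_at, post_id)
    ) WITH CLUSTERING ORDER BY (created_at ASC, post_id ASC)
""")

def insert_post(blog_id, created_at, body):
    # Insertion order does not matter; within each partition rows are stored
    # (and read back) ordered by (created_at, post_id).
    session.execute(
        "INSERT INTO demo.posts (blog_id, created_at, post_id, body) "
        "VALUES (%s, %s, %s, %s)",
        (blog_id, created_at, uuid.uuid4(), body))

def posts_in_order(blog_id):
    return session.execute(
        "SELECT created_at, body FROM demo.posts WHERE blog_id = %s",
        (blog_id,))
-- end of sketch --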
Re: Datafile Corruption
The dmesg command will usually show information about hardware errors. An example from a spinning disk:

sd 0:0:10:0: [sdi] Unhandled sense code
sd 0:0:10:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:10:0: [sdi] Sense Key : Medium Error [current]
Info fld=0x6fc72
sd 0:0:10:0: [sdi] Add. Sense: Unrecovered read error
sd 0:0:10:0: [sdi] CDB: Read(10): 28 00 00 06 fc 70 00 00 08 00

Also, you can read the file, like "cat /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db > /dev/null". If you get an error message, it's probably a hardware issue.

- Erik -

From: Philip Ó Condúin
Sent: Thursday, August 8, 2019 09:58
To: user@cassandra.apache.org
Subject: Re: Datafile Corruption

Hi Jon,

Good question, I'm not sure if we're using NVMe; I don't see /dev/nvme, but we could still be using it. We are using Cisco UCS C220 M4 SFF, so I'm just going to check the spec. Our kernel is the following; we're using REDHAT, so I'm told we can't upgrade the version until the next major release anyway.

root@cass 0 17:32:28 ~ # uname -r
3.10.0-957.5.1.el7.x86_64

Cheers,
Phil

On Thu, 8 Aug 2019 at 17:35, Jon Haddad <j...@jonhaddad.com> wrote:

Any chance you're using NVMe with an older Linux kernel? I've seen a *lot* of filesystem errors from using older CentOS versions. You'll want to be using a version > 4.15.

On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin <philipocond...@gmail.com> wrote:

@Jeff - If it was hardware that would explain it all, but do you think it's possible to have every server in the cluster with a hardware issue? The data is sensitive and the customer would lose their mind if I sent it off-site, which is a pity because I could really do with the help. The corruption is occurring irregularly on every server, instance, and column family in the cluster. Out of 72 instances, we are getting maybe 10 corrupt files per day. We are using vnodes (256) and it is happening in both DCs.

@Asad - internode compression is set to ALL on every server. I have checked the packets for the private interconnect and I can't see any dropped packets; there are dropped packets for other interfaces, but not for the private ones. I will get the network team to double-check this. The corruption is only on the application schema; we are not getting corruption on any system or cass keyspaces. Corruption is happening in both DCs. We are getting corruption for the one application schema we have, across all tables in the keyspace; it's not limited to one table. I'm not sure why the app team decided not to use default compression; I must ask them.

I have been checking /var/log/messages today going back a few weeks and can see a serious amount of broken pipe errors across all servers and instances. Here is a snippet from one server, but most pipe errors are similar:

Jul 9 03:00:08 cassandra: INFO 02:00:08 Writing Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072 ops, 0%/0% of on/off-heap limit)
Jul 9 03:00:13 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul 9 03:00:19 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul 9 03:00:22 cassandra: ERROR 02:00:22 Got an IOException during write!
Jul 9 03:00:22 cassandra: java.io.IOException: Broken pipe
Jul 9 03:00:22 cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_172]
Jul 9 03:00:22 cassandra: at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
Jul 9 03:00:22 cassandra: at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
Jul 9 03:00:22 cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_172]
Jul 9 03:00:22 cassandra: at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_172]
Jul 9 03:00:22 cassandra: at org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165) ~[libthrift-0.9.2.jar:0.9.2]
Jul 9 03:00:22 cassandra: at com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104) ~[thrift-server-0.3.7.jar:na]
Jul 9 03:00:22 cassandra: at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112) ~[thrift-server-0.3.7.jar:na]
Jul 9 03:00:22 cassandra: at com.thinkaurelius.thrift.Message.write(Message.java:222) ~[thrift-server-0.3.7.jar:na]
Jul 9 03:00:22 cassandra: at com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598) [thrift-server-0.3.7.jar:na]
Jul 9 03:00:22 cassandra: at com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569) [thrift-server-0.3.7.jar:na]
Jul 9 03:00:22 cassandra: at com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423) [thrift-server-0.3.
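For the "read the whole file and see if the disk complains" check suggested at the top of this reply, a rough Python sketch that walks a data directory and reads every Data component (the data directory path is only an example):

-- sketch --
# Sketch of the cat-to-/dev/null check described above, applied to every
# *-Data.db under a data directory. The path below is just an example.
import os

def check_readable(data_dir):
    bad = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if not name.endswith('-Data.db'):
                continue
            path = os.path.join(root, name)
            try:
                with open(path, 'rb') as f:
                    while f.read(1 << 20):   # read 1 MiB at a time
                        pass
            except IOError as e:
                bad.append((path, e))        # likely a medium/hardware error
    return bad

for path, err in check_readable('/data/ssd2/data'):
    print(path, err)
-- end of sketch --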
Lots of hints, but only on a few nodes
I have this situation where a few (like, 3-4 out of 84) nodes misbehave. Very long GC pauses, dropping out of the cluster, etc. This happens while loading data (via CQL), and analyzing metrics it looks like on these few nodes, a lot of hints are being generated close to the time when they start to misbehave. Since this is Cassandra 2.0.13, which has a less-than-optimal hints implementation, large numbers of hints are a GC troublemaker. Again looking at metrics, it looks like hints are being generated for a large number of nodes, so it doesn't look like the destination nodes are at fault. So, I'm confused. Any Hints (pun intended) on what could cause a few nodes to generate more hints than the rest of the cluster? Regards, \EF
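One way to check which destinations the hints on a misbehaving node are queued for: in 2.0.x, hints live in the system.hints table, so a sketch along these lines can tally them per target (host name is hypothetical; the scan itself can be heavy on a node holding many hints):

-- sketch --
# Sketch: on a node that is building up hints, tally how many hints are
# queued per target host. Assumes Cassandra 2.0.x (hints in system.hints)
# and that the node is reachable over the native protocol.
from collections import Counter
from cassandra.cluster import Cluster

cluster = Cluster(['node-under-suspicion'])    # hypothetical host name
session = cluster.connect('system')

counts = Counter()
for row in session.execute('SELECT target_id FROM hints LIMIT 500000'):
    counts[row.target_id] += 1

# target_id is the host ID of the destination node (matches nodetool status).
for target, n in counts.most_common():
    print(target, n)
-- end of sketch --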
Extending a partially upgraded cluster - supported
Hi! I have a 2.0.13 cluster which I need to do two things with:
* Extend it
* Upgrade to 2.1.14
I'm pondering in what order to do things. Is it a supported operation to extend a partially upgraded cluster, i.e. a cluster upgraded to 2.1 where not all sstables have been upgraded? If I do that, will the sstables written on the new nodes be in the 2.1 format when I add them? Or will they be written in the 2.0 format so I'll have to run upgradesstables anyway? The cleanup I do on the existing nodes will write the new 2.1 format, right? There might be other reasons not to do this, one being that it's seldom wise to do many operations at once. So please enlighten me on how bad an idea this is :-) Thanks, \EF
Re: Extending a partially upgraded cluster - supported
On 2016-05-18 20:19, Jeff Jirsa wrote:
> You can’t stream between versions, so in order to grow the cluster, you’ll need to be entirely on 2.0 or entirely on 2.1.

OK. I was sure you can't stream between a 2.0 node and a 2.1 node, but if I understand you correctly you can't stream between two 2.1 nodes unless the sstables on the source node have been upgraded to "ka", i.e. the 2.1 sstable version? Looks like it's extend first, upgrade later, given that we're a bit close on disk capacity. Thanks, \EF

> If you go to 2.1 first, be sure you run upgradesstables before you try to extend the cluster.
>
> On 5/18/16, 11:17 AM, "Erik Forsberg" wrote:
> Hi! I have a 2.0.13 cluster which I need to do two things with:
> * Extend it
> * Upgrade to 2.1.14
> I'm pondering in what order to do things. Is it a supported operation to extend a partially upgraded cluster, i.e. a cluster upgraded to 2.1 where not all sstables have been upgraded? If I do that, will the sstables written on the new nodes be in the 2.1 format when I add them? Or will they be written in the 2.0 format so I'll have to run upgradesstables anyway? The cleanup I do on the existing nodes will write the new 2.1 format, right? There might be other reasons not to do this, one being that it's seldom wise to do many operations at once. So please enlighten me on how bad an idea this is :-) Thanks, \EF
upgradesstables/cleanup/compaction strategy change
Hi! I have a 2.0.13 cluster which I have just extended, and I'm now looking into upgrading it to 2.1.
* The cleanup after the extension is partially done.
* I'm also looking into changing a few tables to Leveled Compaction Strategy.
In the interest of speeding things up by avoiding unnecessary rewrites of data, I'm pondering if I can:
1. Upgrade to 2.1, then run cleanup instead of upgradesstables, getting cleanup + upgrade of the sstable format to ka at the same time?
2. Upgrade to 2.1, then change compaction strategy and get LCS + upgrade of the sstable format to ka at the same time?
Comments on that? Thanks, \EF
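For reference, the compaction strategy change in point 2 is just a table property change; a sketch via the Python driver (keyspace/table names are hypothetical, and sstable_size_in_mb is shown only as an example option):

-- sketch --
# Sketch: switching a table to LeveledCompactionStrategy through the driver.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')   # hypothetical keyspace

session.execute("""
    ALTER TABLE my_table
    WITH compaction = {'class': 'LeveledCompactionStrategy',
                       'sstable_size_in_mb': 160}
""")
-- end of sketch --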
RE: Doing rolling upgrades on two DCs
You didn't mention which version you're upgrading to, but generally you shouldn't add or remove nodes while the nodes in your cluster are not all on the same version. So if you're going to add nodes, you should do that before you upgrade the first node, or after you have upgraded all of them. Shifting traffic away from DC1 may not be necessary, but shouldn't hurt.

- EF

From: Timon Wong [timon86.w...@gmail.com]
Sent: Saturday, August 6, 2016 02:03
To: user@cassandra.apache.org
Subject: Doing rolling upgrades on two DCs

Hello, newbie here. Currently we are using Cassandra v2.2.1 (two DCs, naming them DC1 and DC2, respectively), and recently we have two issues:
1. Cannot add a new node, due to CASSANDRA-10961
2. Cannot repair a node, due to CASSANDRA-10501
We learned that we can do a rolling upgrade. Since we are only using DC1 for now (DC2 is just for replicas), we have the following steps in mind:
1. Backup;
2. Do the upgrade on DC2 (following this guide: https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandraDetails.html);
3. Do some checks (client compatibility, etc);
4. Add nodes to DC2 (using the new Cassandra);
5. Shift our traffic to DC2;
6. Upgrade DC1.
Is that possible, or did we miss something? Thanks in advance! Timon
Re: Doing rolling upgrades on two DCs
The low-risk approach would be to change one thing at a time, but I don't think there's a technical reason you couldn't change the commit log and data directory locations while updating to a new version. - EF On 08/08/2016 07:58 AM, Timon Wong wrote: Thank you! We would do a patch level upgrade, from 2.2.1 to 2.2.7 (the latest stable). Besides, we want to adjust both `commitlog_directory` and `data_file_directories` settings (separate them into two better IOPS volumes), and we assumed they should be done before or after upgrade? On Sun, Aug 7, 2016 at 5:38 AM, Forkalsrud, Erik <mailto:eforkals...@cj.com>> wrote: You didn't mention which version you're upgrading to, but generally you shouldn't add or remove nodes while the nodes your cluster is not all on on the same version. So if you're going to add nodes, you should do that before you upgrade the first node, or after you have upgraded all of them. Shifting traffic away from DC1 may not be necessary, but shouldn't hurt. - EF _ From: Timon Wong [timon86.w...@gmail.com <mailto:timon86.w...@gmail.com>] Sent: Saturday, August 6, 2016 02:03 To: user@cassandra.apache.org <mailto:user@cassandra.apache.org> Subject: Doing rolling upgrades on two DCs Hello, newbie here. Currently we are using Cassandra v2.2.1 (two DCs, naming them DC1, DC2, respectively), and recently we have two issues: 1. Cannot adding new node due to CASSANDRA-10961 2. Cannot repair node, due to CASSANDRA-10501 We learned that we can do rolling upgrade, since we are only using one DC1 for now (DC2 is just for replica), we have following steps in mind: 1. Backup; 2. Doing upgrade on DC2. (Follows this guide: https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandraDetails.html <https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandraDetails.html>); 3. Do some checks (client compatibility, etc); 4. Adding nodes to DC2 (using new Cassandra); 5. Traffic our loads to DC2; 6. Upgrading DC1. Is that possible, or do we miss something? Thanks in advance! Timon
How are writes handled while adding nodes to cluster?
Hi! How are writes handled while I'm adding a node to a cluster, i.e. while the new node is in JOINING state? Are they queued up as hinted handoffs, or are they being written to the joining node? In the former case I guess I have to make sure my max_hint_window_in_ms is long enough for the node to become NORMAL or hints will get dropped and I must do repair. Am I right? Thanks, \EF
A few misbehaving nodes
Hi! I have this problem where 3 of my 84 nodes misbehave with too long GC times, leading to them being marked as DN. This happens when I load data to them using CQL from a Hadoop job, so quite a lot of inserts at a time. The CQL loading job is using TokenAwarePolicy with fallback to DCAwareRoundRobinPolicy. Cassandra Java driver version 2.1.7.1 is in use. My other observation is that around the time the GC starts to work like crazy, there is a lot of outbound network traffic from the troublesome nodes. If a healthy node has around 25 Mbit/s in and 25 Mbit/s out, an unhealthy node sees 25 Mbit/s in and 200 Mbit/s out. So, something is iffy with these 3 nodes, but I have some trouble finding out exactly what makes them differ. This is Cassandra 2.0.13 (yes, old) using vnodes. The keyspace is using NetworkTopologyStrategy with replication factor 2, in one datacenter. One thing I know I'm doing wrong is that I have a slightly differing number of hosts in each of my 6 chassis (one of them has 15 nodes, one has 13, the remaining have 14). Could what I'm seeing here be the effect of that? Other ideas on what could be wrong? Some kind of vnode imbalance? How can I diagnose that? What metrics should I be looking at? Thanks, \EF
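One way to look for the vnode imbalance asked about above is to compute primary-range ownership per host straight from the token ring; a sketch follows (it assumes Murmur3Partitioner, and it ignores replication factor and rack placement, so it only shows raw token balance):

-- sketch --
# Sketch: estimate primary-range ownership per host from the token ring,
# to look for vnode imbalance. Assumes Murmur3Partitioner.
from collections import defaultdict
from cassandra.cluster import Cluster

RING = 2 ** 64   # size of the Murmur3 token space

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('system')

owner_of = {}
for row in session.execute('SELECT peer, tokens FROM peers'):
    for t in row.tokens:
        owner_of[int(t)] = str(row.peer)
local = list(session.execute('SELECT broadcast_address, tokens FROM local'))[0]
for t in local.tokens:
    owner_of[int(t)] = str(local.broadcast_address)

tokens = sorted(owner_of)
share = defaultdict(int)
# Each token owns the range from the previous token (wrapping around the ring).
for prev, cur in zip([tokens[-1] - RING] + tokens[:-1], tokens):
    share[owner_of[cur]] += cur - prev

for host, size in sorted(share.items(), key=lambda kv: kv[1]):
    print('%-40s %.2f%%' % (host, 100.0 * size / RING))
-- end of sketch --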
Re: A few misbehaving nodes
On 2016-04-19 15:54, sai krishnam raju potturi wrote:
hi; do we see any hung process like Repairs on those 3 nodes? what does "nodetool netstats" show??

No hung process from what I can see.

root@cssa02-06:~# nodetool tpstats
Pool Name                  Active Pending Completed Blocked All time blocked
ReadStage                  0 01530227 0 0
RequestResponseStage       0 0 19230947 0 0
MutationStage              0 0 37059234 0 0
ReadRepairStage            0 0 80178 0 0
ReplicateOnWriteStage      0 0 0 0 0
GossipStage                0 0 43003 0 0
CacheCleanupExecutor       0 0 0 0 0
MigrationStage             0 0 0 0 0
MemoryMeter                0 0267 0 0
FlushWriter                0 0202 0 5
ValidationExecutor         0 0212 0 0
InternalResponseStage      0 0 0 0 0
AntiEntropyStage           0 0427 0 0
MemtablePostFlusher        0 0669 0 0
MiscStage                  0 0212 0 0
PendingRangeCalculator     0 0 70 0 0
CompactionExecutor         0 0 1206 0 0
commitlog_archiver         0 0 0 0 0
HintedHandoff              0 1113 0 0

Message type       Dropped
RANGE_SLICE              1
READ_REPAIR              0
PAGED_RANGE              0
BINARY                   0
READ                   219
MUTATION                 3
_TRACE                   0
REQUEST_RESPONSE         2
COUNTER_MUTATION         0

root@cssa02-06:~# nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 75317
Mismatch (Blocking): 0
Mismatch (Background): 11
Pool Name    Active Pending Completed
Commands        n/a       1  19248846
Responses       n/a       0  19875699

\EF
Dynamic Column Families in CQLSH v3
Hello All, Attempting to create what the Datastax 1.1 documentation calls a Dynamic Column Family (http://www.datastax.com/docs/1.1/ddl/column_family#dynamic-column-families) via CQLSH. This works in v2 of the shell: "create table data ( key varchar PRIMARY KEY) WITH comparator=LongType;" When defined this way via v2 shell, I can successfully switch to v3 shell and query the CF fine. The same syntax in v3 yields: "Bad Request: comparator is not a valid keyword argument for CREATE TABLE" The 1.1 documentation indicates that comparator is a valid option for at least ALTER TABLE: http://www.datastax.com/docs/1.1/configuration/storage_configuration#comparator This leads me to believe that the correct way to create a dynamic column family is to create a table with no named columns and alter the table later but that also does not work: "create table data (key varchar PRIMARY KEY);" yields: "Bad Request: No definition found that is not part of the PRIMARY KEY" So, my question is, how do I create a Dynamic Column Family via the CQLSH v3? Thanks! -erik
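For reference, the CQL3 shape that corresponds to a Thrift "dynamic column family" with a LongType comparator is a COMPACT STORAGE table with a single clustering column; a sketch (keyspace name and values are hypothetical):

-- sketch --
# Sketch: a wide "dynamic" row modelled in CQL3 as one clustering column plus
# a value column, using COMPACT STORAGE. The keyspace name is hypothetical.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

session.execute("""
    CREATE TABLE data (
        key varchar,
        column1 bigint,     -- plays the role of the LongType column name
        value text,
        PRIMARY KEY (key, column1)
    ) WITH COMPACT STORAGE
""")

# One "dynamic column" per CQL row: (key, column1) identifies the cell.
session.execute(
    "INSERT INTO data (key, column1, value) VALUES (%s, %s, %s)",
    ('row-1', 1326400000000, 'payload'))
-- end of sketch --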
Re: best way to clean up a column family? 60Gig of dangling data
Have you tried to (via JMX) call org.apache.cassandra.db.CompactionManager.forceUserDefinedCompaction() and give it the name of your SSTable file? It's a trick I use to aggressively get rid of expired data, i.e. if I have a column family where all data is written with a TTL of 30 days, any SSTable files with a last-modified time of more than 30 days ago will contain only expired data, so I call the above function to compact those files one by one. In your case it sounds like it's not expired data, but data that belongs on other nodes that you want to get rid of. I'm not sure if compaction will drop data that doesn't fall within the node's key range, but if it does, this method should have the effect you're after.

- Erik -

On 02/27/2013 08:51 PM, Hiller, Dean wrote:
Okay, we had 6 nodes of 130Gig and it was slowly increasing. Through our operations to modify bloom filter fp chance, we screwed something up, as trying to relieve memory pressure was tough. Anyways, somehow, this caused nodes 1, 2, and 3 to jump to around 200Gig and our incoming data stream is completely constant at around 260 points/second. Sooo, we know this dangling data (around 60Gigs) is in one single column family. Nodes 1, 2, and 3 are for the first token range according to ringdescribe. It is almost like the issue is now replicated to the other two nodes. Is there any way we can go about debugging this and releasing the 60 gigs of disk space?

Also, the upgradesstables when memory is already close to max is not working too well. Can we do this instead (i.e. is it safe?)?
1. Bring down the node
2. Move all the *Index.db files to another directory
3. Start the node and run upgradesstables
We know this relieves a ton of memory out of the gate for us. We are trying to get memory back down by a gig, then upgrade to 1.2.2 and switch to leveled compaction, as we have ZERO I/O really going on most of the time and really just have this bad, bad memory bottleneck (iostat shows nothing typically, as we are bottlenecked by memory).

Thanks,
Dean
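A sketch of the candidate-selection half of that trick - finding SSTables whose last-modified time is older than the TTL used for all writes - with the actual JMX call left to whatever JMX client you normally use (the path and TTL below are hypothetical):

-- sketch --
# Sketch: list SSTables old enough that every column in them must have
# expired, i.e. the candidates for forceUserDefinedCompaction() above.
import os
import time

TTL_DAYS = 30
cf_dir = '/var/lib/cassandra/data/my_keyspace/my_cf'   # hypothetical path
cutoff = time.time() - TTL_DAYS * 86400

candidates = [
    name for name in os.listdir(cf_dir)
    if name.endswith('-Data.db')
    and os.path.getmtime(os.path.join(cf_dir, name)) < cutoff
]
for name in sorted(candidates):
    print(name)   # feed these, one by one, to forceUserDefinedCompaction()
-- end of sketch --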
Re: Unable to connect to C* nodes except one node (same configuration)
On 07/25/2017 05:13 AM, Junaid Nasir wrote:

listen_address: 10.128.1.1
rpc_address: 10.128.1.1

Are these the values on all three nodes? If so, try with empty values:

listen_address:
rpc_address:

or make sure each node has its own IP address configured.
sstableloader and ttls
Hi! If I use sstableloader to load data to a cluster, and the source sstables contain some columns where the TTL has expired, i.e. the sstable has not yet been compacted - will those entries be properly removed on the destination side? Thanks, \EF
Running sstableloader from live Cassandra server
Hi! I'm looking into moving some data from one Cassandra cluster to another, both of them running Cassandra 1.2.13 (or maybe some later 1.2 version if that helps me avoid some fatal bug). Sstableloader will probably be the right thing for me, and given the size of my tables, I will want to run the sstableloader on the source cluster, but at the same time, that source cluster needs to keep running to serve data to clients. If I understand the docs right, this means I will have to:

1. Bring up a new network interface on each of my source nodes. No problem, I have an IPv6 /64 to choose from :-)

2. Put a cassandra.yaml in the classpath of the sstableloader that differs from the one in /etc/cassandra/conf, i.e. the one used by the source cluster's cassandra, with the following:
* listen_address set to my new interface.
* rpc_address set to my new interface.
* rpc_port set as on the destination cluster (i.e. 9160)
* cluster_name set as on the destination cluster.
* storage_port as on the destination cluster (i.e. 7000)

Given the above I should be able to run sstableloader on the nodes of my source cluster, even with source cluster cassandra daemon running. Am I right, or did I miss anything? Thanks, \EF
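For the loading step itself, a sketch of driving sstableloader from a script, one table directory at a time (host names and paths are hypothetical; it assumes sstableloader is on the PATH and picks up the alternate cassandra.yaml described above from its classpath):

-- sketch --
# Sketch: run sstableloader against the destination cluster for each table
# directory on this source node. Hosts and paths are hypothetical.
import subprocess

DEST_NODES = 'dest-node1,dest-node2'   # nodes in the destination cluster
TABLE_DIRS = [
    '/var/lib/cassandra/data/my_keyspace/users',
    '/var/lib/cassandra/data/my_keyspace/events',
]

for table_dir in TABLE_DIRS:
    # sstableloader expects a directory laid out as .../<keyspace>/<table>/
    subprocess.check_call(['sstableloader', '-d', DEST_NODES, table_dir])
-- end of sketch --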
LeveledCompaction, streaming bulkload, and lots of small sstables
Hi! I'm bulkloading via streaming from Hadoop to my Cassandra cluster. This results in a rather large set of relatively small (~1MiB) sstables as the number of mappers that generate sstables on the hadoop cluster is high. With SizeTieredCompactionStrategy, the cassandra cluster would quickly compact all these small sstables into decently sized sstables. With LeveledCompactionStrategy however, it takes a much longer time. I have multithreaded_compaction: true, but it is only taking on 32 sstables at a time in one single compaction task, so when it starts with ~1500 sstables, it takes quite some time. I'm not running out of I/O. Is there some configuration knob I can tune to make this happen faster? I'm getting a bit confused by the description for min_sstable_size, bucket_high, bucket_low etc - and I'm not sure if they apply in this case. I'm pondering options for decreasing the number of sstables being streamed from the hadoop side, but if that is possible remains to be seen. Thanks! \EF
Re: LeveledCompaction, streaming bulkload, and lots of small sstables
On 2014-08-18 19:52, Robert Coli wrote: > On Mon, Aug 18, 2014 at 6:21 AM, Erik Forsberg <mailto:forsb...@opera.com>> wrote: > > Is there some configuration knob I can tune to make this happen faster? > I'm getting a bit confused by the description for min_sstable_size, > bucket_high, bucket_low etc - and I'm not sure if they apply in this > case. > > > You probably don't want to use multi-threaded compaction, it is removed > upstream. > > nodetool setcompactionthroughput 0 > > Assuming you have enough IO headroom etc. OK. I disabled multithreaded and gave it a bit more throughput to play with, but I still don't think that's the full story. What I see is the following case: 1) My hadoop cluster is bulkloading around 1000 sstables to the Cassandra cluster. 2) Cassandra will start compacting. With SizeTiered, I would see multiple ongoing compactions on the CF in question, each taking on 32 sstables and compacting to one, all of them running at the same time. With Leveled, I see only one compaction, taking on 32 sstables compacting to one. When that finished, it will start another one. So it's essentially a serial process, and it takes a much longer time than what it does with SizeTiered. While this compaction is ongoing, read performance is not very good. http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 mentions LCS is parallelized in Cassandra 1.2, but maybe that patch doesn't cover my use case (although I realize that my use case is maybe a bit weird) So my question is if this is something I can tune? I'm running 1.2.18 now, but am strongly considering upgrade to 2.0.X. Regards, \EF
Running out of disk at bootstrap in low-disk situation
Hi! We have unfortunately managed to put ourselves in a situation where we are really close to full disks on our existing 27 nodes. We are now trying to add 15 more nodes, but are running out of disk space on the new nodes while they are joining. We're using vnodes, on Cassandra 1.2.18 (yes, I know that's old, and I'll upgrade as soon as I'm out of this problematic situation). I've added all 15 nodes, with some time in between - definitely more than the 2-minute rule. But it seems like compaction is not keeping up with the incoming data. Or at least that's my theory. What are the recommended settings to avoid this problem? I have now set compaction throughput to 0 for unlimited compaction bandwidth, hoping that will help (will it?). Will it help to lower the streaming throughput too? I'm unsure about the latter, since from observation it seems that compaction will not start until streaming from a node has finished. With 27 nodes sharing the incoming bandwidth, all of them will take equally long to finish, and only then can the compaction occur. I guess I could limit streaming bandwidth on some of the source nodes too. Or am I completely wrong here? Other ideas most welcome. Regards, \EF
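A sketch of applying those throttling changes across the cluster with nodetool (host lists are hypothetical; if your nodetool lacks setstreamthroughput, the equivalent knob is stream_throughput_outbound_megabits_per_sec in cassandra.yaml):

-- sketch --
# Sketch: unthrottle compaction on the joining nodes and throttle outbound
# streaming on the existing ones. Host names are hypothetical.
import subprocess

JOINING = ['new-node-01', 'new-node-02']
EXISTING = ['old-node-%02d' % i for i in range(1, 28)]

def nodetool(host, *args):
    subprocess.check_call(['nodetool', '-h', host] + list(args))

for host in JOINING:
    nodetool(host, 'setcompactionthroughput', '0')   # 0 = unthrottled

for host in EXISTING:
    nodetool(host, 'setstreamthroughput', '50')      # Mbit/s of outbound streaming
-- end of sketch --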
Restart joining node
Hi! On the same subject as before - due to full disk during bootstrap, my joining nodes are stuck. What's the correct procedure here, will a plain restart of the node do the right thing, i.e. continue where bootstrap stopped, or is it better to clean the data directories before new start of daemon? Regards, \EF
Working with legacy data via CQL
Hi! I have some data in a table created using Thrift. In cassandra-cli, the 'show schema' output for this table is:

create column family Users
  with column_type = 'Standard'
  and comparator = 'AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'LexicalUUIDType'
  and column_metadata = [
    {column_name : 'date_created', validation_class : LongType},
    {column_name : 'active', validation_class : IntegerType, index_name : 'Users_active_idx_1', index_type : 0},
    {column_name : 'email', validation_class : UTF8Type, index_name : 'Users_email_idx_1', index_type : 0},
    {column_name : 'username', validation_class : UTF8Type, index_name : 'Users_username_idx_1', index_type : 0},
    {column_name : 'default_account_id', validation_class : LexicalUUIDType}];

From cqlsh, it looks like this:

[cqlsh 4.1.1 | Cassandra 2.0.11 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh:test> describe table Users;

CREATE TABLE "Users" (
  key 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  column1 ascii,
  active varint,
  date_created bigint,
  default_account_id 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  email text,
  username text,
  value text,
  PRIMARY KEY ((key), column1)
) WITH COMPACT STORAGE;

CREATE INDEX Users_active_idx_12 ON "Users" (active);
CREATE INDEX Users_email_idx_12 ON "Users" (email);
CREATE INDEX Users_username_idx_12 ON "Users" (username);

Now, when I try to extract data from this using cqlsh or the python-driver, I have no problems getting data for the columns which are actually UTF8, but for those where column_metadata has been set to something else, there's trouble. Example using the Python driver:

-- snip --
In [8]: u = uuid.UUID("a6b07340-047c-4d4c-9a02-1b59eabf611c")

In [9]: sess.execute('SELECT column1,value from "Users" where key = %s and column1 = %s', [u, 'username'])
Out[9]: [Row(column1='username', value=u'uc6vf')]

In [10]: sess.execute('SELECT column1,value from "Users" where key = %s and column1 = %s', [u, 'date_created'])
---
UnicodeDecodeError                        Traceback (most recent call last)
in ()
> 1 sess.execute('SELECT column1,value from "Users" where key = %s and column1 = %s', [u, 'date_created'])

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc in execute(self, query, parameters, timeout, trace)
   1279         future = self.execute_async(query, parameters, trace)
   1280         try:
-> 1281             result = future.result(timeout)
   1282         finally:
   1283             if trace:

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc in result(self, timeout)
   2742             return PagedResult(self, self._final_result)
   2743         elif self._final_exception:
-> 2744             raise self._final_exception
   2745         else:
   2746             raise OperationTimedOut(errors=self._errors, last_host=self._current_host)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected end of data
-- snap --

cqlsh gives me similar errors. Can I tell the python driver to parse some column values as integers, or is this an unsupported case? For sure this is an ugly table, but I have data in it, and I would like to avoid having to rewrite all my tools at once, so if I could support it from CQL that would be great. Regards, \EF
Re: Working with legacy data via CQL
On 2014-11-11 19:40, Alex Popescu wrote:
> On Tuesday, November 11, 2014, Erik Forsberg <forsb...@opera.com> wrote:
>
> You'll have better chances to get an answer about the Python driver on its own mailing list:
> https://groups.google.com/a/lists.datastax.com/forum/#!forum/python-driver-user

As I said, this also happens when using cqlsh:

cqlsh:test> SELECT column1,value from "Users" where key = a6b07340-047c-4d4c-9a02-1b59eabf611c and column1 = 'date_created';

 column1      | value
--------------+------------------------------
 date_created | '\x00\x00\x00\x00Ta\xf3\xe0'

(1 rows)

Failed to decode value '\x00\x00\x00\x00Ta\xf3\xe0' (for column 'value') as text: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected end of data

So let me rephrase: How do I work with data where the table has metadata that makes some columns differ from the main validation class? From cqlsh, or the python driver, or any driver? Thanks, \EF
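For what it's worth, the bytes shown above are simply LongType on disk - a big-endian 64-bit integer - so once you can get at the raw value (e.g. via Thrift/pycassa, or a driver hook) it decodes trivially; a sketch using the exact bytes from the output above:

-- sketch --
# Sketch: the failing 'value' is LongType bytes (big-endian signed 64-bit),
# here decoded client-side just to illustrate the byte layout.
import datetime
import struct

raw = b'\x00\x00\x00\x00Ta\xf3\xe0'           # the bytes cqlsh printed above
(date_created,) = struct.unpack('>q', raw)     # big-endian signed 64-bit
print(date_created)                            # 1415705568, a Unix timestamp from November 2014
print(datetime.datetime.utcfromtimestamp(date_created))
-- end of sketch --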
Re: Working with legacy data via CQL
On 2014-11-15 01:24, Tyler Hobbs wrote: > What version of cassandra did you originally create the column family > in? Have you made any schema changes to it through cql or > cassandra-cli, or has it always been exactly the same? Oh that's a tough question given that the cluster has been around since 2011. So CF was probably created in Cassandra 0.7 or 0.8 via thrift calls from pycassa, and I don't think there has been any schema changes to it since. Thanks, \EF > > On Wed, Nov 12, 2014 at 2:06 AM, Erik Forsberg <mailto:forsb...@opera.com>> wrote: > > On 2014-11-11 19:40, Alex Popescu wrote: > > On Tuesday, November 11, 2014, Erik Forsberg <mailto:forsb...@opera.com> > > <mailto:forsb...@opera.com <mailto:forsb...@opera.com>>> wrote: > > > > > > You'll have better chances to get an answer about the Python driver on > > its own mailing > > list > https://groups.google.com/a/lists.datastax.com/forum/#!forum/python-driver-user > > As I said, this also happens when using cqlsh: > > cqlsh:test> SELECT column1,value from "Users" where key = > a6b07340-047c-4d4c-9a02-1b59eabf611c and column1 = 'date_created'; > > column1 | value > --+-- > date_created | '\x00\x00\x00\x00Ta\xf3\xe0' > > (1 rows) > > Failed to decode value '\x00\x00\x00\x00Ta\xf3\xe0' (for column 'value') > as text: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected > end of data > > So let me rephrase: How do I work with data where the table has metadata > that makes some columns differ from the main validation class? From > cqlsh, or the python driver, or any driver? > > Thanks, > \EF > > > > > -- > Tyler Hobbs > DataStax <http://datastax.com/>
Re: Working with legacy data via CQL
On 2014-11-17 09:56, Erik Forsberg wrote:
> On 2014-11-15 01:24, Tyler Hobbs wrote:
>> What version of cassandra did you originally create the column family
>> in? Have you made any schema changes to it through cql or
>> cassandra-cli, or has it always been exactly the same?
>
> Oh that's a tough question given that the cluster has been around since
> 2011. So CF was probably created in Cassandra 0.7 or 0.8 via thrift
> calls from pycassa, and I don't think there has been any schema changes
> to it since.

Actually, I don't think it matters. I created a minimal repeatable set of python code (see below). Running that against a 2.0.11 server, creating a fresh keyspace and CF, then inserting some data with thrift/pycassa, then trying to extract the data that has a different validation class, the python-driver and cqlsh bail out.

cqlsh example after running the below script:

cqlsh:badcql> select * from "Users" where column1 = 'default_account_id' ALLOW FILTERING;
value "\xf9\x8bu}!\xe9C\xbb\xa7=\xd0\x8a\xff';\xe5" (in col 'value') can't be deserialized as text: 'utf8' codec can't decode byte 0xf9 in position 0: invalid start byte

cqlsh:badcql> select * from "Users" where column1 = 'date_created' ALLOW FILTERING;
value '\x00\x00\x00\x00Ti\xe0\xbe' (in col 'value') can't be deserialized as text: 'utf8' codec can't decode bytes in position 6-7: unexpected end of data

So the question remains - how do I work with this data from cqlsh and / or the python driver? Thanks, \EF

--repeatable example--
#!/usr/bin/env python
# Run this in a virtualenv with pycassa and cassandra-driver installed via pip
import pycassa
import cassandra
import calendar
import traceback
import time
from uuid import uuid4

keyspace = "badcql"

sysmanager = pycassa.system_manager.SystemManager("localhost")
sysmanager.create_keyspace(keyspace, strategy_options={'replication_factor': '1'})
sysmanager.create_column_family(keyspace, "Users",
                                key_validation_class=pycassa.system_manager.LEXICAL_UUID_TYPE,
                                comparator_type=pycassa.system_manager.ASCII_TYPE,
                                default_validation_class=pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "username", pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "email", pycassa.system_manager.UTF8_TYPE)
sysmanager.alter_column(keyspace, "Users", "default_account_id", pycassa.system_manager.LEXICAL_UUID_TYPE)
sysmanager.create_index(keyspace, "Users", "active", pycassa.system_manager.INT_TYPE)
sysmanager.alter_column(keyspace, "Users", "date_created", pycassa.system_manager.LONG_TYPE)

pool = pycassa.pool.ConnectionPool(keyspace, ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, "Users")

user_uuid = uuid4()
cf.insert(user_uuid, {'username': 'test_username',
                      'auth_method': 'ldap',
                      'email': 't...@example.com',
                      'active': 1,
                      'date_created': long(calendar.timegm(time.gmtime())),
                      'default_account_id': uuid4()})

from cassandra.cluster import Cluster
cassandra_cluster = Cluster(["localhost"])
cassandra_session = cassandra_cluster.connect(keyspace)

print "username", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'username',))
print "email", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'email',))

try:
    print "default_account_id", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'default_account_id',))
except Exception as e:
    print "Exception trying to get default_account_id", traceback.format_exc()
    cassandra_session = cassandra_cluster.connect(keyspace)

try:
    print "active", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'active',))
except Exception as e:
    print "Exception trying to get active", traceback.format_exc()
    cassandra_session = cassandra_cluster.connect(keyspace)

try:
    print "date_created", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'date_created',))
except Exception as e:
    print "Exception trying to get date_created", traceback.format_exc()
-- end of example --
Re: Working with legacy data via CQL
On 2014-11-19 01:37, Robert Coli wrote: > > Thanks, I can reproduce the issue with that, and I should be able to > look into it tomorrow. FWIW, I believe the issue is server-side, > not in the driver. I may be able to suggest a workaround once I > figure out what's going on. > > > Is there a JIRA tracking this issue? I like being aware of potential > issues with "legacy tables" ... :D I created one, just for you! :-) https://issues.apache.org/jira/browse/CASSANDRA-8339 \EF
Anonymous user in permissions system?
Hi! Is there such a thing as the anonymous/unauthenticated user in the cassandra permissions system? What I would like to do is to grant select, i.e. provide read-only access, to users which have not presented a username and password. Then grant update/insert to other users which have presented a username and (correct) password. Doable? Regards, \EF
changes to metricsReporterConfigFile requires restart of cassandra?
Hi! I was pleased to find out that cassandra 2.0.x has added support for pluggable metrics export, which even includes a graphite metrics sender. Question: Will changes to the metricsReporterConfigFile require a restart of cassandra to take effect? I.e, if I want to add a new exported metric to that file, will I have to restart my cluster? Thanks, \EF
Re: Anonymous user in permissions system?
On 2015-02-05 12:39, Carlos Rolo wrote:
> Hello Erik,
>
> It seems possible, refer to the following documentation to see if it fits your needs:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/security/secureInternalAuthenticationTOC.html
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/security/secureInternalAuthorizationTOC.html

I think you misunderstand what I want. I want to distinguish between authenticated users, i.e. users which have provided a username/password, and anonymous users, i.e. users which haven't. Reading the source code, the standard PasswordAuthenticator does not allow this, but by subclassing it I can come up with something that does what I want, since AuthenticatedUser has some partial support for the "anonymous" user. Regards, \EF
Re: Cluster status instability
To elaborate a bit on what Marcin said: * Once a node starts to believe that a few other nodes are down, it seems to stay that way for a very long time (hours). I'm not even sure it will recover without a restart. * I've tried to stop then start gossip with nodetool on the node that thinks several other nodes is down. Did not help. * nodetool gossipinfo when run on an affected node claims STATUS:NORMAL for all nodes (including the ones marked as down in status output) * It is quite possible that the problem starts at the time of day when we have a lot of bulkloading going on. But why does it then stay for several hours after the load goes down? * I have the feeling this started with our upgrade from 1.2.18 to 2.0.12 about a month ago, but I have no hard data to back that up. Regarding region/snitch - this is not an AWS deployment, we run on our own datacenter with GossipingPropertyFileSnitch. Right now I have this situation with one node (04-05) thinking that there are 4 nodes down. The rest of the cluster (56 nodes in total) thinks all nodes are up. Load on cluster right now is minimal, there's no GC going on. Heap usage is approximately 3.5/6Gb. root@cssa04-05:~# nodetool status|grep DN DN 2001:4c28:1:413:0:1:2:5 1.07 TB256 1.8% 114ff46e-57d0-40dd-87fb-3e4259e96c16 rack2 DN 2001:4c28:1:413:0:1:2:6 1.06 TB256 1.8% b161a6f3-b940-4bba-9aa3-cfb0fc1fe759 rack2 DN 2001:4c28:1:413:0:1:2:13 896.82 GB 256 1.6% 4a488366-0db9-4887-b538-4c5048a6d756 rack2 DN 2001:4c28:1:413:0:1:3:7 1.04 TB256 1.8% 95cf2cdb-d364-4b30-9b91-df4c37f3d670 rack3 Excerpt from nodetool gossipinfo showing one node that status thinks is down (2:5) and one that status thinks is up (3:12): /2001:4c28:1:413:0:1:2:5 generation:1427712750 heartbeat:2310212 NET_VERSION:7 RPC_ADDRESS:0.0.0.0 RELEASE_VERSION:2.0.13 RACK:rack2 LOAD:1.172524771195E12 INTERNAL_IP:2001:4c28:1:413:0:1:2:5 HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16 DC:iceland SEVERITY:0.0 STATUS:NORMAL,100493381707736523347375230104768602825 SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda /2001:4c28:1:413:0:1:3:12 generation:1427714889 heartbeat:2305710 NET_VERSION:7 RPC_ADDRESS:0.0.0.0 RELEASE_VERSION:2.0.13 RACK:rack3 LOAD:1.047542503234E12 INTERNAL_IP:2001:4c28:1:413:0:1:3:12 HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00 DC:iceland SEVERITY:0.0 STATUS:NORMAL,100163259989151698942931348962560111256 SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda I also tried disablegossip + enablegossip on 02-05 to see if that made 04-05 mark it as up, with no success. Please let me know what other debug information I can provide. Regards, \EF On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle wrote: > Do you happen to be using a tool like Nagios or Ganglia that are able to > report utilization (CPU, Load, disk io, network)? There are plugins for > both that will also notify you of (depending on whether you enabled the > intermediate GC logging) about what is happening. > > > > On Thu, Apr 2, 2015 at 8:35 AM, Jan wrote: > >> Marcin ; >> >> are all your nodes within the same Region ? >> If not in the same region, what is the Snitch type that you are using >> ? >> >> Jan/ >> >> >> >> On Thursday, April 2, 2015 3:28 AM, Michal Michalski < >> michal.michal...@boxever.com> wrote: >> >> >> Hey Marcin, >> >> Are they actually going up and down repeatedly (flapping) or just down >> and they never come back? >> There might be different reasons for flapping nodes, but to list what I >> have at the top of my head right now: >> >> 1. Network issues. 
I don't think it's your case, but you can read about >> the issues some people are having when deploying C* on AWS EC2 (keyword to >> look for: phi_convict_threshold) >> >> 2. Heavy load. Node is under heavy load because of massive number of >> reads / writes / bulkloads or e.g. unthrottled compaction etc., which may >> result in extensive GC. >> >> Could any of these be a problem in your case? I'd start from >> investigating GC logs e.g. to see how long does the "stop the world" full >> GC take (GC logs should be on by default from what I can see [1]) >> >> [1] https://issues.apache.org/jira/browse/CASSANDRA-5319 >> >> Michał >> >> >> Kind regards, >> Michał Michalski, >> michal.michal...@boxever.com >> >> On 2 April 2015 at 11:05, Marcin Pietraszek >> wrote: >> >> Hi! >> >> We have 56 node cluster with C* 2.0.13 + CASSANDRA-9036 patch >> installed. Assume we have nodes A, B, C, D, E. On some irregular basis >> one of those nodes starts to report that subset of other nodes is in >> DN state although C* deamon on all nodes is running: >> >> A$ nodetool status >> UN B >> DN C >> DN D >> UN E >> >> B$ nodetool status >> UN A >> UN C >> UN D >> UN E >> >> C$ nodetool status >> DN A >> UN B >> UN D >> UN E >> >> After restart of A node, C and D report that A it's in UN and also A >> claims that whole cluster is in UN state. Right now I don't have any >> clear steps to
One node misbehaving (lots of GC), ideas?
Hi! We are having problems with one node (out of 56 in total) misbehaving. Symptoms are:
* High number of full CMS old space collections during early morning when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of garbage collections *seems* to correspond to doing a lot of compactions (SizeTiered for most of our CFs, Leveled for a few small ones).
* Node losing track of what other nodes are up and keeping that state until restart (this I think is a bug triggered by the GC behaviour, with the stop-the-world pauses making the node not accept gossip connections from other nodes).
This is on 2.0.13 with vnodes (256 per node). All other nodes behave normally, with a few (2-3) full CMS old space collections in the same 3h period in which the troublesome node does some 30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the problem was even worse (it seems; this is a bit hard to debug as it happens *almost* every night). nodetool status shows that although we have a certain imbalance in the cluster, this node is neither the most nor the least loaded, i.e. we have between 1.6% and 2.1% in the "Owns" column, and the troublesome node reports 1.7%. All nodes are under puppet control, so configuration is the same everywhere. We're running NetworkTopologyStrategy with rack awareness, and here's a deviation from recommended settings - we have slightly varying numbers of nodes in the racks:
15 cssa01
15 cssa02
13 cssa03
13 cssa04
The affected node is in the cssa04 rack. Could this mean I have some kind of hotspot situation? Why would that show up as more GC work? I'm quite puzzled here, so I'm looking for hints on how to identify what is causing this. Regards, \EF
Re: CentOS - Could not setup cluster(snappy error)
You can add something like this to cassandra-env.sh : JVM_OPTS="$JVM_OPTS -Dorg.xerial.snappy.tempdir=/path/that/allows/executables" - Erik - On 12/28/2013 08:36 AM, Edward Capriolo wrote: Check your fstabs settings. On some systems /tmp has noexec set and unpacking a library into temp and trying to run it does not work. On Fri, Dec 27, 2013 at 5:33 PM, Víctor Hugo Oliveira Molinar mailto:vhmoli...@gmail.com>> wrote: Hi, I'm not being able to start a multiple node cluster in a CentOs environment due to snappy loading error. Here is my current setup for both machines(Node 1 and 2), CentOs: CentOS release 6.5 (Final) Java java version "1.7.0_25" Java(TM) SE Runtime Environment (build 1.7.0_25-b15) Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode) Cassandra Version: 2.0.3 Also, I've already replaced the current snappy jar(snappy-java-1.0.5.jar) by the older(snappy-java-1.0.4.1.jar). Although the following error is still happening when I try to start the second node: INFO 20:25:51,879 Handshaking version with /200.219.219.51 <http://200.219.219.51> java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:312) at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219) at org.xerial.snappy.Snappy.(Snappy.java:44) at org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79) at org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:66) at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:359) at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:150) Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.4.1-libsnappyjava.so <http://snappy-1.0.4.1-libsnappyjava.so>: /tmp/snappy-1.0.4.1-libsnappyjava.so <http://snappy-1.0.4.1-libsnappyjava.so>: failed to map segment from shared object: Operation not permitted at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843) at java.lang.Runtime.load0(Runtime.java:795) at java.lang.System.load(System.java:1061) at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39) ... 
11 more ERROR 20:25:52,201 Exception in thread Thread[WRITE-/200.219.219.51 <http://200.219.219.51>,5,main] org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229) at org.xerial.snappy.Snappy.(Snappy.java:44) at org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79) at org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:66) at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:359) at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:150) ERROR 20:26:22,924 Exception encountered during startup java.lang.RuntimeException: Unable to gossip with any seeds at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1160) at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:416) at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:608) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:576) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:475) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504) What else can I do to fix it? Att, /Víctor Hugo Molinar/
Re: Cassandra consuming too much memory in ubuntu as compared to within windows, same machine.
On 01/04/2014 08:04 AM, Ertio Lew wrote: ... my dual boot 4GB(RAM) machine. ... -Xms4G -Xmx4G - You are allocating all your ram to the java heap. Are you using the same JVM parameters on the windows side? You can try to lower the heap size or add ram to your machine. - Erik -
EOFException in bulkloader, then IllegalStateException
Hi! I'm bulkloading from Hadoop to Cassandra. Currently in the process of moving to new hardware for both Hadoop and Cassandra, and while testrunning bulkload, I see the following error: Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.RuntimeException: java.io.EOFException at com.google.common.base.Throwables.propagate(Throwables.java:155) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193) at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180) at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ... 3 more I see no exceptions related to this on the destination node (2001:4c28:1:413:0:1:1:12:1). This makes the whole map task fail with: 2014-01-27 10:46:50,878 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:forsberg (auth:SIMPLE) cause:java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12] 2014-01-27 10:46:50,878 WARN org.apache.hadoop.mapred.Child: Error running child java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12] at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:244) at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:209) at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:540) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:650) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322) at org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.Child.main(Child.java:260) 2014-01-27 10:46:50,880 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task The failed task was on hadoop worker node hdp01-12-4. However, hadoop later retries this map task on a different hadoop worker node (hdp01-10-2), and that retry succeeds. So that's weird, but I could live with it. 
Now, however, comes the real trouble - the hadoop job does not finish due to one task running on hdp01-12-4 being stuck with this: Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.IllegalStateException: target reports current file is /opera/log2/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000473_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db but is /opera/log6/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_00_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:154) at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:45) at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:199) at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180) at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) This just sits there forever, or at least until the hadoop task timeout kicks in. So two questions here: 1) Any clues on what might cause the first EOFException? It seems to appear for *some* of my bulkloads. Not all, but frequent enough to be a problem. Like, every 10:th bulkload I do seems to have the problem. 2) The second problem I have a feeling could be related to https://issues.apache.org/jira/browse/CASSANDRA-4223, but with the extra quirk that with the bulkload case, we have *multiple java processes* creating streaming sessions on the same host, so streaming session IDs are not unique. I'm thinking 2) happens because the EOFException made the streaming session in 1) sit around on the target node without being closed. This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid upgradin
Re: EOFException in bulkloader, then IllegalStateException
On 2014-01-27 12:56, Erik Forsberg wrote: This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid upgrading until I have made this migration from old to new hardware. Upgrading to 1.2.13 might be an option. Update: Exactly the same behaviour on Cassandra 1.2.13. Thanks, \EF
Multiple large disks in server - setup considerations
Hi! I'm considering setting up a small (4-6 nodes) Cassandra cluster on machines that each have 3x2TB disks. There's no hardware RAID in the machine, and if there were, it could only stripe single disks together, not parts of disks. I'm planning RF=2 (or higher). I'm pondering what the best disk configuration is. Two alternatives: 1) Make small partition on first disk for Linux installation and commit log. Use Linux' software RAID0 to stripe the remaining space on disk1 + the two remaining disks into one large XFS partition. 2) Make small partition on first disk for Linux installation and commit log. Mount rest of disk 1 as /var/cassandra1, then disk2 as /var/cassandra2 and disk3 as /var/cassandra3. Is it unwise to put the commit log on the same physical disk as some of the data? I guess it could impact write performance, but maybe it's bad from a data consistency point of view? How does Cassandra handle replacement of a bad disk in the two alternatives? With option 1) I guess there's risk of files being corrupt. With option 2) they will simply be missing after replacing the disk with a new one. With option 2) I guess I'm limiting the size of the total amount of data in the largest CF at compaction to, hmm.. the free space on the disk with most free space, correct? Comments welcome! Thanks, \EF -- Erik Forsberg Developer, Opera Software - http://www.opera.com/
Re: Multiple large disks in server - setup considerations
On Tue, 31 May 2011 13:23:36 -0500 Jonathan Ellis wrote: > Have you read http://wiki.apache.org/cassandra/CassandraHardware ? I had, but it was a while ago so I guess I kind of deserved an RTFM! :-) After re-reading it, I still want to know: * If we disregard the performance hit caused by having the commitlog on the same physical device as parts of the data, are there any other grave effects on Cassandra's functionality with a setup like that? * How does Cassandra handle a case where one of the disks in a striped RAID0 partition goes bad and is replaced? Is the only option to wipe everything from that node and reinit the node, or will it handle corrupt files? I.e, what's the recommended thing to do from an operations point of view when a disk dies on one of the nodes in a RAID0 Cassandra setup? What will cause the least risk for data loss? What will be the fastest way to get the node up to speed with the rest of the cluster? Thanks, \EF > > On Tue, May 31, 2011 at 7:47 AM, Erik Forsberg > wrote: > > Hi! > > > > I'm considering setting up a small (4-6 nodes) Cassandra cluster on > > machines that each have 3x2TB disks. There's no hardware RAID in the > > machine, and if there were, it could only stripe single disks > > together, not parts of disks. > > > > I'm planning RF=2 (or higher). > > > > I'm pondering what the best disk configuration is. Two alternatives: > > > > 1) Make small partition on first disk for Linux installation and > > commit log. Use Linux' software RAID0 to stripe the remaining space > > on disk1 > > + the two remaining disks into one large XFS partition. > > > > 2) Make small partition on first disk for Linux installation and > > commit log. Mount rest of disk 1 as /var/cassandra1, then disk2 > > as /var/cassandra2 and disk3 as /var/cassandra3. > > > > Is it unwise to put the commit log on the same physical disk as > > some of the data? I guess it could impact write performance, but > > maybe it's bad from a data consistency point of view? > > > > How does Cassandra handle replacement of a bad disk in the two > > alternatives? With option 1) I guess there's risk of files being > > corrupt. With option 2) they will simply be missing after replacing > > the disk with a new one. > > > > With option 2) I guess I'm limiting the size of the total amount of > > data in the largest CF at compaction to, hmm.. the free space on the > > disk with most free space, correct? > > > > Comments welcome! > > > > Thanks, > > \EF > > -- > > Erik Forsberg > > Developer, Opera Software - http://www.opera.com/ > > > > > -- Erik Forsberg Developer, Opera Software - http://www.opera.com/
Re: 0.7.9 RejectedExecutionException
On 10/12/2011 11:33 AM, Ashley Martens wrote: java version "1.6.0_20" OpenJDK Runtime Environment (IcedTea6 1.9.9) (6b20-1.9.9-0ubuntu1~10.10.2) OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode) This may have been mentioned before, but is it an option to use the Sun/Oracle JDK? - Erik -
Re: 0.7.9 RejectedExecutionException
My suggestion would be to put a recent Sun JVM on the problematic node and see if that eliminates the crashes. The Sun JVM appears to be the mainstream choice when running Cassandra, so that's a better-tested configuration. You can search the list archives for OpenJDK-related bugs to see that they do exist. For example: http://mail-archives.apache.org/mod_mbox/cassandra-user/201107.mbox/%3CCAB-=z42ihihr8svhdrbvpfyjjysspckes1zoxkvcje9axkd...@mail.gmail.com%3E - Erik - On 10/12/2011 12:41 PM, Ashley Martens wrote: I guess it could be an option but I can't puppet the Oracle JDK install so I would rather not. On Wed, Oct 12, 2011 at 12:35 PM, Erik Forkalsrud <mailto:eforkals...@cj.com>> wrote: On 10/12/2011 11:33 AM, Ashley Martens wrote: java version "1.6.0_20" OpenJDK Runtime Environment (IcedTea6 1.9.9) (6b20-1.9.9-0ubuntu1~10.10.2) OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode) This may have been mentioned before, but is it an option to use the Sun/Oracle JDK? - Erik - -- gpg --keyserver pgpkeys.mit.edu <http://pgpkeys.mit.edu/> --recv-keys 0x23e861255b0d6abb Key fingerprint = 0E9E 0E22 3957 BB04 DD72 B093 23E8 6125 5B0D 6ABB
Re: Updates Question
On 10/24/2011 05:23 AM, Sam Hodgson wrote: I can write new columns/rows etc no problem however when updating I get no errors back and the update does not occur. I've also tried updating manually using the Cassandra cluster admin gui and the same thing happens. I can write but updates do not work, the gui even reports a success but the data remains unchanged. A basic thing to check is that the updates have a higher timestamp than the initial write. I was scratching my head with a similar problem a while back, and it turned out that cassandra-cli used java's System.currentTimeMillis() * 1000 for timestamps while the application code used just System.currentTimeMillis() and those values would always be lower. - Erik -
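To make the unit mismatch concrete, here is a tiny stand-alone illustration (not code from either client, just the arithmetic): Cassandra resolves competing writes per column by timestamp, and the cassandra-cli convention is microseconds since the epoch, so a later update stamped in plain milliseconds always loses and appears to be silently ignored.

public class TimestampUnits {
    public static void main(String[] args) throws Exception {
        // cassandra-cli convention: microseconds since the epoch
        long cliTimestamp = System.currentTimeMillis() * 1000;

        Thread.sleep(1000); // even a full second later...

        // ...an application stamping in milliseconds produces a far smaller value
        long appTimestamp = System.currentTimeMillis();

        // Cassandra keeps the column version with the larger timestamp, so the
        // "newer" millisecond-stamped update loses and seems to vanish.
        System.out.println("cli=" + cliTimestamp + " app=" + appTimestamp
                + " update wins? " + (appTimestamp > cliTimestamp)); // false
    }
}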
Re: Internal error processing get: Null pointer exception
On 11/01/2011 09:02 PM, Jonathan Ellis wrote: That doesn't make sense to me. CS:147 is columnFamilyKeyMap.put(row.key, row.cf); where cFKM is Map columnFamilyKeyMap = new HashMap(); So cFKM can't be null, and HashMap accommodates both null key and null value, so I'm not sure what there is to throw an NPE. It must be that row == null.
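A stripped-down illustration of that reasoning (hypothetical types, not the actual CassandraServer code): HashMap itself accepts null keys and values without complaint, so the NPE has to come from dereferencing a null row.

import java.util.HashMap;
import java.util.Map;

public class NpeSource {
    static class Row { Object key; Object cf; }      // stand-in for the real Row type

    public static void main(String[] args) {
        Map<Object, Object> columnFamilyKeyMap = new HashMap<Object, Object>();
        columnFamilyKeyMap.put(null, null);           // fine: HashMap allows null key and null value

        Row row = null;
        columnFamilyKeyMap.put(row.key, row.cf);      // NPE is thrown here by row.key, not by the map
    }
}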
Can I use BulkOutputFormat from 1.1 to load data to older Cassandra versions?
Hi! Can the new BulkOutputFormat (https://issues.apache.org/jira/browse/CASSANDRA-3045) be used to load data to servers running cassandra 0.8.7 and/or Cassandra 1.0.6? I'm thinking of using jar files from the development version to load data onto a production cluster which I want to keep on a production version of Cassandra. Can I do that, or does BulkOutputFormat require an API level that is only in the development version of Cassandra? Thanks, \EF
Recommended configuration for good streaming performance?
Hi! We're experimenting with streaming from Hadoop to Cassandra using BulkOutputFormat, on the cassandra-1.1 branch. Are there any specific settings we should tune on the Cassandra servers in order to get the best streaming performance? Our Cassandra servers have 16 cores (including HT cores) and 24GiB of RAM. They have two disks each. So far we've configured them with the commitlog on one disk and sstables on the other, but since streaming doesn't use the commitlog (correct?) maybe it makes sense to have sstables on both disks, doubling available I/O? Thoughts on the number of parallel streaming clients? Thanks, \EF
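As a starting point for experimenting (an assumption-laden sketch, not advice from this thread), the knobs most directly related to receiving bulk-loaded sstables live in cassandra.yaml; the option names below are recalled from the 1.1-era sample config, so double-check them against the cassandra.yaml shipped with your build:

# Hypothetical tuning sketch for a 1.1-era node receiving streamed sstables.
data_file_directories:        # spread sstables over both disks, since the bulk
    - /data1/cassandra        # loader streams sstables and bypasses the commitlog
    - /data2/cassandra

stream_throughput_outbound_megabits_per_sec: 400   # per-node cap on outbound streaming
compaction_throughput_mb_per_sec: 32               # keep post-stream compaction from saturating the disks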
Streaming sessions from BulkOutputFormat job being listed long after they were killed
Hi! If I run a hadoop job that uses BulkOutputFormat to write data to Cassandra, and that hadoop job is aborted, i.e. streaming sessions are not completed, it seems like the streaming sessions hang around for a very long time, I've observed at least 12-15h, in output from 'nodetool netstats'. To me it seems like they go away only after a restart of Cassandra. Is this a known behaviour? Does it cause any problems, f. ex. consuming memory, or should I just ignore it? Regards, \EF
Max TTL?
Hi! When setting ttl on columns, is there a maximum value (other than MAXINT, 2**31-1) that can be used? I have a very odd behaviour here, where I try to set ttl to 9 622 973 (~111 days) which works, but setting it to 11 824 305 (~137 days) does not - it seems columns are deleted instantly at insertion. This is using the BulkOutputFormat. And it could be a problem with our code, i.e. the code using BulkOutputFormat. So, uhm, just asking to see if we're hitting something obvious. Regards, \EF
Re: Max TTL?
On 2012-02-20 21:20, aaron morton wrote: Nothing obvious. Samarth (working on the same project) found that his patch to CASSANDRA-3754 was cleaned up a bit too much, which caused a negative ttl. https://issues.apache.org/jira/browse/CASSANDRA-3754?focusedCommentId=13212395&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13212395 So, problem found. Regards, \EF
On Bloom filters and Key Cache
Hi! We're currently testing Cassandra with a large number of row keys per node - nodetool cfstats approximated the number of keys at something like 700M per node. This seems to have caused very high heap consumption. After reading http://wiki.apache.org/cassandra/LargeDataSetConsiderations I think I've tracked this down to the bloom filter and the sampled index entries. Regarding bloom filters, have I understood correctly that they are stored on the heap, and that the "Bloom Filter Space Used" reported by 'nodetool cfstats' is an approximation of the heap space used by bloom filters? It reports the on-disk size, but if I understand CASSANDRA-3497, the on-disk size is smaller than the on-heap size? I understand that increasing bloom_filter_fp_chance will decrease the bloom filter size, but at the cost of worse performance when asking for keys that don't exist. I do have a fair amount of queries for keys that don't exist. How much would increasing the key cache help, i.e. decreasing the bloom filter size but increasing the key cache size? Will the key cache cache negative results, i.e. the fact that a key didn't exist? Regards, \EF
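For a rough sense of scale, the textbook Bloom filter formula (bits per key = -ln(p) / (ln 2)^2) estimates what 700M keys per node could cost at different false-positive chances. This is generic math, not Cassandra's exact on-heap layout, but it shows the order of magnitude involved:

public class BloomFilterEstimate {
    // Textbook sizing: bits per key = -ln(p) / (ln 2)^2, where p is the false-positive chance.
    static double bitsPerKey(double fpChance) {
        return -Math.log(fpChance) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        long keysPerNode = 700000000L;   // the ~700M keys per node mentioned above
        for (double p : new double[] {0.01, 0.1}) {
            double megabytes = keysPerNode * bitsPerKey(p) / 8 / (1024 * 1024);
            System.out.printf("fp_chance=%.2f -> %.1f bits/key, roughly %.0f MB of filter per node%n",
                    p, bitsPerKey(p), megabytes);
        }
        // Prints roughly 9.6 bits/key (~800 MB) at 0.01 and 4.8 bits/key (~400 MB) at 0.1.
    }
}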
sstable size increase at compaction
Hi! We're using the bulkloader to load data to Cassandra. During and after bulkloading, the minor compaction process seems to result in larger sstables being created. An example: INFO [CompactionExecutor:105] 2012-03-21 15:18:46,608 CompactionTask.java (line 115) Compacting [SSTableReader(pat h='/cassandra/OSP5/Data/OSP5-Data-hc-1755-Data.db'), (REMOVED A BUNCH OF OTHER SSTABLE PATHS), SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1749-Data.db'), SSTableReader(path='/cassandra/O SP5/Data/OSP5-Data-hc-1753-Data.db')] INFO [CompactionExecutor:105] 2012-03-21 15:30:04,188 CompactionTask.java (line 226) Compacted to [/cassandra/OSP5/Data/OSP5-Data-hc-3270-Data.db,]. 84,214,484 to 105,498,673 (~125% of original) bytes for 2,132,056 keys at 0.148486MB/s. Time: 677,580ms. The sstables are compressed (DeflateCompressor with chunk size 128) on the Hadoop cluster before being transferred to Cassandra, and the CF has the same compression settings: [default@Keyspace1] describe Data; ColumnFamily: Data (Super) Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.LongType Columns sorted by: org.apache.cassandra.db.marshal.LongType/org.apache.cassandra.db.marshal.UTF8Type GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 DC Local Read repair chance: 0.0 Replicate on write: true Caching: KEYS_ONLY Bloom Filter FP chance: 0.01 Built indexes: [] Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy Compression Options: chunk_length_kb: 128 sstable_compression: org.apache.cassandra.io.compress.DeflateCompressor Any clues on this? Regards, \EF
Re: sstable size increase at compaction
On 2012-03-21 16:36, Erik Forsberg wrote: Hi! We're using the bulkloader to load data to Cassandra. During and after bulkloading, the minor compaction process seems to result in larger sstables being created. An example: This is on Cassandra 1.1, btw. \EF
Graveyard compactions, when do they occur?
Hi! I was trying out the "truncate" command in cassandra-cli. http://wiki.apache.org/cassandra/CassandraCli08 says "A snapshot of the data is created, which is deleted asyncronously during a 'graveyard' compaction." When do "graveyard" compactions happen? Do I have to trigger them somehow? Thanks, \EF
Re: blob fields, bynary or hexa?
On 04/18/2012 03:02 AM, mdione@orange.com wrote: We're building a database to store the avatars for our users in three sizes. The thing is, we planned to use the blob field with a ByteType validator, but if we try to inject the binary data as read from the image file, we get an error. Which client are you using? With Hector or straight thrift, you should be able to store byte[] directly. - Erik -
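A hedged sketch of what "store byte[] directly" can look like with Hector (class and method names recalled from memory, so verify them against your Hector version; the keyspace, column family, key and file names are invented):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class AvatarWriter {
    public static void main(String[] args) throws Exception {
        // Read the image as raw bytes -- no hex or base64 encoding involved.
        File file = new File("avatar-small.png");
        byte[] image = new byte[(int) file.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(file));
        try { in.readFully(image); } finally { in.close(); }

        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("Avatars", cluster);

        // BytesArraySerializer passes the byte[] through untouched, which is what a
        // bytes-validated column expects.
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("user42", "Avatar",
                HFactory.createColumn("small", image, StringSerializer.get(), BytesArraySerializer.get()));
    }
}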
Re: How often to run `nodetool repair`
On 08/01/2013 01:16 PM, Andrey Ilinykh wrote: TTL is effectively DELETE; you need to run a repair once every gc_grace_seconds. If you don't, data might un-delete itself. How is it possible? Every replica has the TTL, so when it expires every replica has a tombstone. I don't see how you can get data with no tombstone. What am I missing? The only way I can think of is this scenario: - value "A" for some key is written with ttl=30days to all replicas (i.e. a long ttl or no ttl at all) - value "B" for the same key is written with ttl=1day, but doesn't reach all replicas - one day passes and the ttl=1day values turn into deletes - gc_grace passes and the tombstones are purged. At this point, the replica that didn't get the ttl=1day value will think the older value "A" is live. I'm no expert on this so I may be mistaken, but in any case it's a corner case, as overwriting columns with shorter ttls would be unusual. - Erik -
Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5
but that doesn't really explain a two minute pause during garbage collection. My guess then, is that one or more of the following are at play in this scenario: 1) a core nehalem bug - the nehalem architecture made a lot of changes to the way it manges TLBs for memory, largely as a virtualization optimization. I doubt this is the case but assuming the guest isn't seeing a different architecture, we did see this issue only on E5507 processors. 2) the physical servers in the new AZ are drastically overcommitted - maybe AMZN bought into the notion that Nehalems are better at virtualization and is allowing more guests to run on physical servers running Nehalems. I've no idea, just a hypothesis. 3) a hypervisor bug - I've deployed large JVMs to big physical Nehalem boxen running Cassandra clusters under high load and never seen behavior like the above. If I could see more of what the hypervisor was doing I'd have a pretty good idea here, but such is life in the cloud. I also should say that I don't think any issues we had were at all related specifically to Cassandra. We were running fine in the first AZ, no problems other than needing to grow capacity. Only when we saw the different architecture in the new EC2 AZ did we experience problems and when we shackled the new generation collector, the bad problems went away. Sorry for the long tirade. This was originally going to be a blog post but I though it would have more value in context here. I hope ultimately it helps someone else. -erik On Thu, Jan 13, 2011 at 5:26 PM, Mike Malone wrote: > Hey folks, > > We've discovered an issue on Ubuntu/Lenny with libc6 2.11.1-0ubuntu7.5 (it > may also affect versions between 2.11.1-0ubuntu7.1 and 2.11.1-0ubuntu7.4). > The bug affects systems when a large number of threads (or processes) are > created rapidly. Once triggered, the system will become completely > unresponsive for ten to fifteen minutes. We've seen this issue on our > production Cassandra clusters under high load. Cassandra seems particularly > susceptible to this issue because of the large thread pools that it creates. > In particular, we suspect the unbounded thread pool for connection > management may be pushing some systems over the edge. > > We're still trying to narrow down what changed in libc that is causing this > issue. We also haven't tested things outside of xen, or on non-x86 > architectures. But if you're seeing these symptoms, you may want to try > upgrading libc6. > > I'll send out an update if we find anything else interesting. If anyone has > any thoughts as to what the cause is, we're all ears! > > Hope this saves someone some heart-ache, > > Mike >
Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5
Forgot one critical point: we don't use swap on any of these hosts.
Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5
Too similar to be a coincidence I'd say: Good node (old AZ): 2.11.1-0ubuntu7.5 Bad node (new AZ): 2.11.1-0ubuntu7.6 You beat me to the punch with the test program. I was working on something similar to test it out and got side tracked. I'll try the test app tomorrow and verify the versions of the AMIs used for provisioning. On Thu, Jan 13, 2011 at 11:31 PM, Mike Malone wrote: > Erik, the scenario you're describing is almost identical to what we've been > experiencing. Sounds like you've been pulling your hair out too! You're also > running the same distro and kernel as us. And we also run without swap. > Which begs the question... what version of libc6 are you running!? Here's > the output from one of our upgraded boxes: > > $ dpkg --list | grep libc6 > ii libc62.11.1-0ubuntu7.7 > Embedded GNU C Library: Shared libraries > ii libc6-dev2.11.1-0ubuntu7.7 > Embedded GNU C Library: Development Librarie > > Before upgrading the version field showed 2.11.1-0ubuntu7.5. Wondering what > yours is. > > We also found ourselves in a similar situation with different regions. > We're using the canonical ubuntu ami as the base for our systems. But there > appear to be small differences between the packages included in the amis > from different regions. Seems libc6 is one of the things that changed. I > discovered by diff'ing `dpkg --list` on a node that was good, and one that > was bad. > > The architecture hypothesis is also very interesting. If we could reproduce > the bug with the latest libc6 build I'd escalate it back up to Amazon. But I > can't repro it, so nothing to escalate. > > For what it's worth, we were able to reproduce the lockup behavior that > you're describing by running a tight loop that spawns threads. Here's a gist > of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd be > interested to know whether that locks things up on your system with a new > libc6. > > Mike > > On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen wrote: > >> May or may not be related but I thought I'd recount a similar experience >> we had in EC2 in hopes it helps someone else. >> >> As background, we had been running several servers in a 0.6.8 ring with no >> Cassandra issues (some EC2 issues, but none related to Cassandra) on >> multiple EC2 XL instances in a single availability zone. We decided to add >> several other nodes to a second AZ for reasons beyond the scope of this >> email. As we reached steady operational state in the new AZ, we noticed that >> the new nodes in the new AZ were repeatedly getting dropped from the ring. >> At first we attributed the drops to phi and expected cross-AZ latency. As we >> tried to pinpoint the issue, we found something very similar to what you >> describe - the EC2 VMs in the new AZ would become completely unresponsive. >> Not just the Java process hosting Cassandra, but the entire host. Shell >> commands would not execute for existing sessions, we could not establish new >> SSH sessions and tails we had on active files wouldn't show any progress. It >> appeared as if the machines in the new AZ would seize for several minutes, >> then come back to life with little rhyme or reason as to why. Tickets opened >> with AMZN resulted in responses of "the physical server looks normal". >> >> After digging deeper, here's what we found. 
To confirm all nodes in both >> AZs were identical at the following levels: >> * Kernel (2.6.32-305-ec2 #9-Ubuntu SMP) distro (Ubuntu 10.04.1 LTS) and >> glibc on x86_64 >> * All nodes were running identical Java distributions that we deployed >> ourselves, sun 1.6.0_22-b04 >> * Same amount of virtualized RAM visible to the guest, same RAID stripe >> configuration across the same size/number of ephemeral drives >> >> We noticed two things that were different across the VMs in the two AZs: >> * The class of CPU exposed to the guest OSes across the two AZs (and >> presumably the same physical server above that guest). >> ** On hosts in the AZ not having issues, we see from the guest older >> Harpertown class Intel CPUs: "model name : Intel(R) Xeon(R) CPU >> E5430 @ 2.66GHz" >> ** On hosts in the AZ having issues, we see from the guest newer Nehalem >> class Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5507 >> @ 2.27GHz" >> * Percent steal was consistently higher on the new nodes, on average 25% >> where as the older (stable) VMs were around 9% at peak load >> >> Consist
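For anyone who wants to poke at this class of problem themselves, the reproduction idea discussed in this thread (a tight loop that rapidly spawns threads) can be sketched roughly as follows; this is an illustrative stand-in, not the gist linked above:

public class ThreadSpawnStress {
    public static void main(String[] args) throws InterruptedException {
        long last = System.currentTimeMillis();
        for (long i = 0; ; i++) {
            // Rapidly create and retire short-lived threads, similar in spirit to
            // Cassandra's bursty thread creation under load.
            Thread t = new Thread(new Runnable() { public void run() { } });
            t.start();
            t.join();

            long now = System.currentTimeMillis();
            if (now - last > 1000) {
                // On an affected libc6/kernel combination the whole host stalls;
                // a gap this large between iterations is the symptom to watch for.
                System.out.println("stall of " + (now - last) + " ms at iteration " + i);
            }
            last = now;
        }
    }
}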
Re: Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5
Unfortunately, the previous AMI we used to provision the 7.5 version is no longer available. More unfortunately, the two test nodes we spun up in each AZ did not get Nehalem architectures so the only things I can say for certain after running Mike's test 10x on each test node are: 1) I could not repro using Mike's test in either AZ. We had previously only seen failures in one AZ. 2) I could not repro using Mike's test against libc7.7, the only version of libc we could provision with a nominal amount of effort. 3) 1&2 above were only tested on older Harpertown cores, an architecture where we've never seen the issue. I think the only way I'll be able to further test is to wait until we shuffle around our production configuration and I can abuse the existing guests without taking a node out of my ring. The only silver lining so far is that we've realized we need to better lock down how we source AMIs for our EC2 nodes. That will give us a more reliable system, but apparently we have no control over the actual architecture that AMZN lets us have. -erik On Fri, Jan 14, 2011 at 10:09 AM, Mike Malone wrote: > That's interesting. For us, the 7.5 version of libc was causing problems. > Either way, I'm looking forward to hearing about anything you find. > > Mike > > > On Thu, Jan 13, 2011 at 11:47 PM, Erik Onnen wrote: > >> Too similar to be a coincidence I'd say: >> >> Good node (old AZ): 2.11.1-0ubuntu7.5 >> Bad node (new AZ): 2.11.1-0ubuntu7.6 >> >> You beat me to the punch with the test program. I was working on something >> similar to test it out and got side tracked. >> >> I'll try the test app tomorrow and verify the versions of the AMIs used >> for provisioning. >> >> On Thu, Jan 13, 2011 at 11:31 PM, Mike Malone wrote: >> >>> Erik, the scenario you're describing is almost identical to what we've >>> been experiencing. Sounds like you've been pulling your hair out too! You're >>> also running the same distro and kernel as us. And we also run without swap. >>> Which begs the question... what version of libc6 are you running!? Here's >>> the output from one of our upgraded boxes: >>> >>> $ dpkg --list | grep libc6 >>> ii libc62.11.1-0ubuntu7.7 >>> Embedded GNU C Library: Shared libraries >>> ii libc6-dev2.11.1-0ubuntu7.7 >>> Embedded GNU C Library: Development Librarie >>> >>> Before upgrading the version field showed 2.11.1-0ubuntu7.5. Wondering >>> what yours is. >>> >>> We also found ourselves in a similar situation with different regions. >>> We're using the canonical ubuntu ami as the base for our systems. But there >>> appear to be small differences between the packages included in the amis >>> from different regions. Seems libc6 is one of the things that changed. I >>> discovered by diff'ing `dpkg --list` on a node that was good, and one that >>> was bad. >>> >>> The architecture hypothesis is also very interesting. If we could >>> reproduce the bug with the latest libc6 build I'd escalate it back up to >>> Amazon. But I can't repro it, so nothing to escalate. >>> >>> For what it's worth, we were able to reproduce the lockup behavior that >>> you're describing by running a tight loop that spawns threads. Here's a gist >>> of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd >>> be interested to know whether that locks things up on your system with a new >>> libc6. >>> >>> Mike >>> >>> On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen wrote: >>> >>>> May or may not be related but I thought I'd recount a similar experience >>>> we had in EC2 in hopes it helps someone else. 
>>>> >>>> As background, we had been running several servers in a 0.6.8 ring with >>>> no Cassandra issues (some EC2 issues, but none related to Cassandra) on >>>> multiple EC2 XL instances in a single availability zone. We decided to add >>>> several other nodes to a second AZ for reasons beyond the scope of this >>>> email. As we reached steady operational state in the new AZ, we noticed >>>> that >>>> the new nodes in the new AZ were repeatedly getting dropped from the ring. >>>> At first we attributed the drops to phi and expected cross-AZ latency. As >>>> we >>>> tried to pinpoint the issue, we found something very similar to wha
Re: Cassandra taking 100% cpu over multiple cores(0.7.0)
One of the developers will have to confirm but this looks like a bug to me. MessagingService is a singleton and there's a Multimap used for targets that isn't accessed in a thread safe manner. The thread dump would seem to confirm this, when you hammer what is ultimately a standard HashMap with multiple readers while a write happens, there's a race that can be triggered where all subsequent reads on that map will get stuck. Your thread dump would seem to indicate that is happening with many (>50) threads trying to read from what appears to be the same HashMap. "pool-1-thread-6451" prio=10 tid=0x7fa5242c9000 nid=0x10f4 runnable [0x7fa52fde4000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.get(HashMap.java:303) at com.google.common.collect.AbstractMultimap.getOrCreateCollection(AbstractMultimap.java:205) at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:194) at com.google.common.collect.AbstractListMultimap.put(AbstractListMultimap.java:72) at com.google.common.collect.ArrayListMultimap.put(ArrayListMultimap.java:60) at org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:303) at org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:353) at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:229) at org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98) at org.apache.cassandra.thrift.CassandraServer.get(CassandraServer.java:289) at org.apache.cassandra.thrift.Cassandra$Processor$get.process(Cassandra.java:2655) at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) When two threads get to this state, they won't ever come back and any other thread that tries to read that HashMap will also get wedged (looks like that's happened in your case). On Sun, Jan 23, 2011 at 10:11 AM, Thibaut Britz wrote: > > Hi, > > After a while some of my cassandra nodes (out of a pool of 20) get > unresponsive. All cores (8) are being used to 100% and uptime load > climbs to 80. > > I first suspected the gargabe collector to force the node to 100% cpu > usage , but the problem even occurs with the default garbage > collector. A nodetool info also reveals that the heap isn't even > completely used. > > The node has JNA enabled. There are no ERROR message in the logfile > (except CASSANDRA-1959 which happened a few days before. But I > restarted the node afterwards). Cassandra runs on non virtualized > hardware. (e.g. no amazon ec2). The version I use is 0.7.0. I haven't > applied any patches (e.g. CASSANDRA-1959). > > I attached the jstack of the process. (there are many many threads) > > Thanks, > Thibaut > > > ./nodetool -h localhost info > ccc > Load : 1.66 GB > Generation No : 1295609201 > Uptime (seconds) : 196155 > Heap Memory (MB) : 1795.55 / 3912.06 > > > root 2870 730 57.4 6764500 4703712 ? 
SLl Jan21 23913:00 > /usr/bin/java -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 > -Xms3997M -Xmx3997M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseLargePages > -XX:+UseCompressedOops -Xss128k -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true > -Dcom.sun.management.jmxremote.port=8080 > -Dcom.sun.management.jmxremote.ssl=false > -Dcom.sun.management.jmxremote.authenticate=false > -Dlog4j.configuration=log4j-server.properties -cp > /software/cassandra/bin/../conf:/software/cassandra/bin/../build/classes:/software/cassandra/bin/../lib/antlr-3.1.3.jar:/software/cassandra/bin/../lib/apache-cassandra-0.7.0.jar:/software/cassandra/bin/../lib/avro-1.4.0-fixes.jar:/software/cassandra/bin/../lib/avro-1.4.0-sources-fixes.jar:/software/cassandra/bin/../lib/commons-cli-1.1.jar:/software/cassandra/bin/../lib/commons-codec-1.2.jar:/software/cassandra/bin/../lib/commons-collections-3.2.1.jar:/software/cassandra/bin/../lib/commons-lang-2.4.jar:/software/cassandra/bin/../lib/concurrentlinkedhashmap-lru-1.1.jar:/software/cassandra/bin/../lib/guava-r05.jar:/software/cassandra/bin/../lib/high-scale-lib.jar:/software/cassandra/bin/../lib/ivy-2.1.0.jar:/software/cassandra/bin/../lib/jackson-core-asl-1.4.0.jar:/software/cassandra/bin/../lib/jackson-mapper-asl-1.4.0.jar:/software/cassandra/bin/../lib/jetty-6.1.21.jar:/software/cassandra/bin/../lib/jetty-util-6.1.21.jar:/software/cassandra/bin/../lib/jline-0.9.94.jar:/software/cassandra/bin/../lib/jna.jar:/software/cassandra/bin/../lib/json-simple-1.1.jar:/software/cassand
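For context, the conventional remedy for this kind of unsynchronized-HashMap race is to guard every access to the map; with Guava that can be as simple as the synchronized multimap wrapper. This is a generic illustration of the technique, not the actual change that went into Cassandra:

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.ListMultimap;
import com.google.common.collect.Multimaps;

public class CallbackTargets {
    // ArrayListMultimap is backed by a plain HashMap, so concurrent readers and a writer
    // can corrupt it -- including leaving readers spinning inside HashMap.get().
    // The synchronized wrapper serializes every access to the underlying map.
    private final ListMultimap<String, String> targets =
            Multimaps.synchronizedListMultimap(ArrayListMultimap.<String, String>create());

    public void register(String messageId, String endpoint) {
        targets.put(messageId, endpoint);
    }

    public java.util.List<String> endpointsFor(String messageId) {
        // Note: iterating the returned view still requires synchronizing on "targets".
        return targets.get(messageId);
    }
}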
Re: Cassandra taking 100% cpu over multiple cores(0.7.0)
Filed as https://issues.apache.org/jira/browse/CASSANDRA-2037 I can't see how the code would be correct as written but I'm usually wrong about most things. On Sun, Jan 23, 2011 at 12:14 PM, Erik Onnen wrote: > One of the developers will have to confirm but this looks like a bug > to me. MessagingService is a singleton and there's a Multimap used for > targets that isn't accessed in a thread safe manner. > > The thread dump would seem to confirm this, when you hammer what is > ultimately a standard HashMap with multiple readers while a write > happens, there's a race that can be triggered where all subsequent > reads on that map will get stuck. > > Your thread dump would seem to indicate that is happening with many > (>50) threads trying to read from what appears to be the same HashMap. > > "pool-1-thread-6451" prio=10 tid=0x7fa5242c9000 nid=0x10f4 > runnable [0x7fa52fde4000] > java.lang.Thread.State: RUNNABLE > at java.util.HashMap.get(HashMap.java:303) > at > com.google.common.collect.AbstractMultimap.getOrCreateCollection(AbstractMultimap.java:205) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:194) > at > com.google.common.collect.AbstractListMultimap.put(AbstractListMultimap.java:72) > at > com.google.common.collect.ArrayListMultimap.put(ArrayListMultimap.java:60) > at > org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:303) > at > org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:353) > at > org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:229) > at > org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98) > at > org.apache.cassandra.thrift.CassandraServer.get(CassandraServer.java:289) > at > org.apache.cassandra.thrift.Cassandra$Processor$get.process(Cassandra.java:2655) > at > org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555) > at > org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > > When two threads get to this state, they won't ever come back and any > other thread that tries to read that HashMap will also get wedged > (looks like that's happened in your case). > > On Sun, Jan 23, 2011 at 10:11 AM, Thibaut Britz > wrote: >> >> Hi, >> >> After a while some of my cassandra nodes (out of a pool of 20) get >> unresponsive. All cores (8) are being used to 100% and uptime load >> climbs to 80. >> >> I first suspected the gargabe collector to force the node to 100% cpu >> usage , but the problem even occurs with the default garbage >> collector. A nodetool info also reveals that the heap isn't even >> completely used. >> >> The node has JNA enabled. There are no ERROR message in the logfile >> (except CASSANDRA-1959 which happened a few days before. But I >> restarted the node afterwards). Cassandra runs on non virtualized >> hardware. (e.g. no amazon ec2). The version I use is 0.7.0. I haven't >> applied any patches (e.g. CASSANDRA-1959). >> >> I attached the jstack of the process. (there are many many threads) >> >> Thanks, >> Thibaut >> >> >> ./nodetool -h localhost info >> ccc >> Load : 1.66 GB >> Generation No : 1295609201 >> Uptime (seconds) : 196155 >> Heap Memory (MB) : 1795.55 / 3912.06 >> >> >> root 2870 730 57.4 6764500 4703712 ? 
SLl Jan21 23913:00 >> /usr/bin/java -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 >> -Xms3997M -Xmx3997M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseLargePages >> -XX:+UseCompressedOops -Xss128k -XX:SurvivorRatio=8 >> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 >> -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true >> -Dcom.sun.management.jmxremote.port=8080 >> -Dcom.sun.management.jmxremote.ssl=false >> -Dcom.sun.management.jmxremote.authenticate=false >> -Dlog4j.configuration=log4j-server.properties -cp >> /software/cassandra/bin/../conf:/software/cassandra/bin/../build/classes:/software/cassandra/bin/../lib/antlr-3.1.3.jar:/software/cassandra/bin/../lib/apache-cassandra-0.7.0.jar:/software/cassandra/bin/../lib/avro-1.4.0-fixes.jar:/software/cassandra/bin/../lib/avro-1.4.0-
CompactionExecutor EOF During Bootstrap
Stuff/stuffid_reg_idx-f-4 The same behavior has happened on both attempts. Logs from the node giving up tokens show activity by the StreamStage thread, but after the failure on the bootstrapping node there is not much else relative to the stream. Lastly, the behavior in both cases seems to involve the third data file. Files f-1, f-2 and f-4 are present but f-3 is not. Any help would be appreciated. -erik
Re: CompactionExecutor EOF During Bootstrap
Thanks Jonathan. Filed: https://issues.apache.org/jira/browse/CASSANDRA-2283 We'll start the scrub during our normal compaction cycle and update this thread and the bug with the results. -erik On Mon, Mar 7, 2011 at 11:27 AM, Jonathan Ellis wrote: > It sounds like it doesn't realize the data it's streaming over is > older-version data. Can you create a ticket? > > In the meantime nodetool scrub (on the existing nodes) will rewrite > the data in the new format which should workaround the problem. > > On Mon, Mar 7, 2011 at 1:23 PM, Erik Onnen wrote: >> During a recent upgrade of our cassandra ring from 0.6.8 to 0.7.3 and >> prior to a drain on the 0.6.8 nodes, we lost a node for reasons >> unrelated to cassandra. We decided to push forward with the drain on >> the remaining healthy nodes. The upgrade completed successfully for >> the remaining nodes and the ring was healthy. However, we're unable to >> boostrap in a new node. The bootstrap process starts and we can see >> streaming activity in the logs for the node giving up tokens, but the >> bootstrapping node encounters the following: >> >> >> INFO [main] 2011-03-07 10:37:32,671 StorageService.java (line 505) >> Joining: sleeping 3 ms for pending range setup >> INFO [main] 2011-03-07 10:38:02,679 StorageService.java (line 505) >> Bootstrapping >> INFO [HintedHandoff:1] 2011-03-07 10:38:02,899 >> HintedHandOffManager.java (line 304) Started hinted handoff for >> endpoint /10.211.14.200 >> INFO [HintedHandoff:1] 2011-03-07 10:38:02,900 >> HintedHandOffManager.java (line 360) Finished hinted handoff of 0 rows >> to endpoint /10.211.14.200 >> INFO [CompactionExecutor:1] 2011-03-07 10:38:04,924 >> SSTableReader.java (line 154) Opening >> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuff-f-1 >> INFO [CompactionExecutor:1] 2011-03-07 10:38:05,390 >> SSTableReader.java (line 154) Opening >> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuff-f-2 >> INFO [CompactionExecutor:1] 2011-03-07 10:38:05,768 >> SSTableReader.java (line 154) Opening >> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-1 >> INFO [CompactionExecutor:1] 2011-03-07 10:38:06,389 >> SSTableReader.java (line 154) Opening >> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-2 >> INFO [CompactionExecutor:1] 2011-03-07 10:38:06,581 >> SSTableReader.java (line 154) Opening >> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-3 >> ERROR [CompactionExecutor:1] 2011-03-07 10:38:07,056 >> AbstractCassandraDaemon.java (line 114) Fatal exception in thread >> Thread[CompactionExecutor:1,1,main] >> java.io.EOFException >> at >> org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65) >> at >> org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:303) >> at >> org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:923) >> at >> org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:916) >> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> at java.lang.Thread.run(Thread.java:662) >> INFO [CompactionExecutor:1] 2011-03-07 10:38:08,480 >> SSTableReader.java (line 154) Opening >> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid-f-5 >> INFO [CompactionExecutor:1] 2011-03-07 10:38:08,582 >> SSTableReader.java 
(line 154) Opening >> /mnt/services/cassandra/var/data/0.7.3/data/Stuff/stuffid_reg_idx-f-1 >> ERROR [CompactionExecutor:1] 2011-03-07 10:38:08,635 >> AbstractCassandraDaemon.java (line 114) Fatal exception in thread >> Thread[CompactionExecutor:1,1,main] >> java.io.EOFException >> at >> org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65) >> at >> org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:303) >> at >> org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:923) >> at >> org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:916) >> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >> at >> java.util.concurrent.Thread
Re: Aamzon EC2 & Cassandra to ebs or not..
I'd recommend not storing commit logs or data files on EBS volumes if your machines are under any decent amount of load. I say that for three reasons. First, EBS volumes contend directly for network throughput, with what appears to be a peer QoS policy to standard packets. In other words, if you're saturating a network link, EBS throughput falls. The same has not been true of ephemeral volumes in all of our testing: ephemeral I/O speeds tend to take only a minor hit under network pressure and are consistently faster in raw speed tests. Second, at some point it's a given that you will encounter misbehaving EBS volumes. They won't completely fail; worse, they will just get really, really slow. Oftentimes this is worse than a total failure because the system just backlogs reads/writes but doesn't totally fall over until the entire cluster becomes overwhelmed. We've never had single-volume ephemeral problems. Lastly, I think people have a tendency to bolt a large number of EBS volumes to a host and think that because they have the disk capacity they can serve more data from fewer hosts. If you push that too far, you'll outstrip the ability of the system to keep effective buffer caches and concurrently serve requests for all the data it is responsible for managing. IME there is pretty good parity between an EC2 XL and its available ephemeral disks relative to how Cassandra uses disk and RAM, so adding more storage puts you right at the breaking point of overcommitting your hardware. If you want protection from AZ failure, split your ring across AZs (Cassandra is quite good at this) or copy snapshots to EBS volumes. There are a lot of benefits to EBS volumes; I/O throughput and reliability are not among them. -erik On Wed, Mar 9, 2011 at 8:39 AM, William Oberman wrote: > I thought nodetool snapshot writes the snapshot locally, requiring 2x of > expensive storage allocation 24x7 (vs. cheap storage allocation of an ebs > snapshot). By that I mean EBS allocation is GB allocated per month costs at > one rate, and EBS snapshots are delta compressed copies to S3. > > Can you point the snapshot to an external filesystem? > > will > > On Wed, Mar 9, 2011 at 11:31 AM, Sasha Dolgy wrote: >> >> Could you not nodetool snapshot the data into a mounted ebs/s3 bucket and >> satisfy your development requirement? >> -sd >> >> On Wed, Mar 9, 2011 at 5:23 PM, William Oberman >> wrote: >>> >>> For me, to transition production data into a development environment for >>> real world testing. Also, backups are never a bad idea, though I agree most >>> all risk is mitigated due to cassandra's design. >>> >>> will > > > > -- > Will Oberman > Civic Science, Inc. > 3030 Penn Avenue., First Floor > Pittsburgh, PA 15201 > (M) 412-480-7835 > (E) ober...@civicscience.com >
Re: FW: Very slow batch insert using version 0.7.2
I see the same behavior with smaller batch sizes. It appears to happen when starting Cassandra with the defaults on relatively large systems. Attached is a script I created to reproduce the problem. (usage: mutate.sh /path/to/apache-cassandra-0.7.3-bin.tar.gz) It extracts a stock cassandra distribution to a temp dir and starts it up, (single node) then creates a keyspace with a column family and does a batch mutate to insert 5000 columns. When I run it on my laptop (Fedora 14, 64-bit, 4 cores, 8GB RAM) it flushes one Memtable with 5000 operations When I run it on a server (RHEL5, 64-bit, 16 cores, 96GB RAM) it flushes 100 Memtables with anywhere between 1 operation and 359 operations (35 bytes and 12499 bytes) I'm guessing I can override the JVM memory parameters to avoid the frequent flushing on the server, but I haven't experimented with that yet. The only difference in the effective command line between the laptop and server is "-Xms3932M -Xmx3932M -Xmn400M" on the laptop and "-Xms48334M -Xmx48334M -Xmn1600M" on the server. -- Erik Forkalsrud Commission Junction On 03/10/2011 09:18 AM, Ryan King wrote: Why use such a large batch size? -ryan On Thu, Mar 10, 2011 at 6:31 AM, Desimpel, Ignace wrote: Hello, I had a demo application with embedded cassandra version 0.6.x, inserting about 120 K row mutations in one call. In version 0.6.x that usually took about 5 seconds, and I could repeat this step adding each time the same amount of data. Running on a single CPU computer, single hard disk, XP 32 bit OS, 1G memory I tested this again on CentOS 64 bit OS, 6G memory, different settings of memtable_throughput_in_mb and memtable_operations_in_millions. Also tried version 0.7.3. Also the same behavior. Now with version 0.7.2 the call returns with a timeout exception even using a timeout of 12 (2 minutes). I see the CPU time going to 100%, a lot of disk writing ( giga bytes), a lot of log messages about compacting, flushing, commitlog, … Below you can find some information using the nodetool at start of the batch mutation and also after 14 minutes. The MutationStage is clearly showing how slow the system handles the row mutations. Attached : Cassandra.yaml with at end the description of my database structure using yaml Attached : log file with cassandra output. Any idea what I could be doing wrong? 
Regards, Ignace Desimpel ignace.desim...@nuance.com At start of the insert (after inserting 124360 row mutations) I get the following info from the nodetool : C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com info Starting NodeTool 34035877798200531112672274220979640561 Gossip active: true Load : 5.49 MB Generation No: 1299502115 Uptime (seconds) : 1152 Heap Memory (MB) : 179,84 / 1196,81 C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com tpstats Starting NodeTool Pool NameActive Pending Completed ReadStage 0 0 40637 RequestResponseStage 0 0 30 MutationStage32121679 72149 GossipStage 0 0 0 AntiEntropyStage 0 0 0 MigrationStage0 0 1 MemtablePostFlusher 0 0 6 StreamStage 0 0 0 FlushWriter 0 0 5 MiscStage 0 0 0 FlushSorter 0 0 0 InternalResponseStage 0 0 0 HintedHandoff 0 0 0 After 14 minutes (timeout exception after 2 minutes : see log file) I get : C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com info Starting NodeTool 34035877798200531112672274220979640561 Gossip active: true Load : 10.31 MB Generation No: 1299502115 Uptime (seconds) : 2172 Heap Memory (MB) : 733,82 / 1196,81 C:\apache-cassandra-07.2\bin>nodetool --host ads.nuance.com tpstats Starting NodeTool Pool NameActive Pending Completed ReadStage 0 0 40646 RequestResponseStage 0 0 30 MutationStage32103310 90526 GossipStage 0 0 0 AntiEntropyStage 0 0 0 MigrationStage0 0 1 MemtablePostFlusher 0 0 69 StreamStage 0 0 0 FlushWriter 0 0 68 FILEUTILS-DELETE-POOL 0 0 42 MiscStage
Re: FW: Very slow batch insert using version 0.7.2
On 03/11/2011 04:56 AM, Zhu Han wrote: When I run it on my laptop (Fedora 14, 64-bit, 4 cores, 8GB RAM) it flushes one Memtable with 5000 operations When I run it on a server (RHEL5, 64-bit, 16 cores, 96GB RAM) it flushes 100 Memtables with anywhere between 1 operation and 359 operations (35 bytes and 12499 bytes) What's the settings of commit log flush, periodic or in batch? It's whatever the default setting is, (in the cassandra.yaml that is packaged in the apache-cassandra-0.7.3-bin.tar.gz download) specifically: commitlog_rotation_threshold_in_mb: 128 commitlog_sync: periodic commitlog_sync_period_in_ms: 1 flush_largest_memtables_at: 0.75 If I describe keyspace I get: [default@unknown] describe keyspace Events; Keyspace: Events: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Replication Factor: 1 Column Families: ColumnFamily: Event Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType Row cache size / save period: 0.0/0 Key cache size / save period: 20.0/14400 Memtable thresholds: 14.109375/3010/1440 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Built indexes: [] It turns out my suspicion was right. When I tried overriding the jvm memory parameters calculated in conf/cassandra-env.sh to use the values calculated on my 8GB laptop like this: MAX_HEAP_SIZE=3932m HEAP_NEWSIZE=400m ./mutate.sh That made the server behave much nicer. This time it kept all 5000 operations in a single Memtable. Also, when running with these memory settings the Memtable thresholds changed to "1.1390625/243/1440" (from "14.109375/3010/1440") (all the other output from "describe keyspace" remains the same) So it looks like something goes wrong when cassandra gets too much memory. -- Erik Forkalsrud Commission Junstion
Re: FW: Very slow batch insert using version 0.7.2
On 03/11/2011 12:13 PM, Jonathan Ellis wrote: https://issues.apache.org/jira/browse/CASSANDRA-2158, fixed in 0.7.3 you could have saved a lot of time just by upgrading first. :) Hmm, I'm testing with 0.7.3 ... but now I know at least which knob to turn. - Erik -
Re: FW: Very slow batch insert using version 0.7.2
On 03/11/2011 12:13 PM, Jonathan Ellis wrote: https://issues.apache.org/jira/browse/CASSANDRA-2158, fixed in 0.7.3 you could have saved a lot of time just by upgrading first. :) It looks like the fix isn't entirely correct. The bug is still in 0.7.3. In Memtable.java, the line: THRESHOLD = cfs.getMemtableThroughputInMB() * 1024 * 1024; should be changed to: THRESHOLD = cfs.getMemtableThroughputInMB() * 1024L * 1024L; Here's some code that illustrates the difference: public void testMultiplication() { int memtableThroughputInMB = 2300; long thresholdA = memtableThroughputInMB * 1024 * 1024; long thresholdB = memtableThroughputInMB * 1024L * 1024L; System.out.println("a=" + thresholdA + " b=" + thresholdB); } - Erik - On Fri, Mar 11, 2011 at 2:02 PM, Erik Forkalsrud wrote: On 03/11/2011 04:56 AM, Zhu Han wrote: When I run it on my laptop (Fedora 14, 64-bit, 4 cores, 8GB RAM) it flushes one Memtable with 5000 operations When I run it on a server (RHEL5, 64-bit, 16 cores, 96GB RAM) it flushes 100 Memtables with anywhere between 1 operation and 359 operations (35 bytes and 12499 bytes) What's the settings of commit log flush, periodic or in batch? It's whatever the default setting is, (in the cassandra.yaml that is packaged in the apache-cassandra-0.7.3-bin.tar.gz download) specifically: commitlog_rotation_threshold_in_mb: 128 commitlog_sync: periodic commitlog_sync_period_in_ms: 1 flush_largest_memtables_at: 0.75 If I describe keyspace I get: [default@unknown] describe keyspace Events; Keyspace: Events: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Replication Factor: 1 Column Families: ColumnFamily: Event Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType Row cache size / save period: 0.0/0 Key cache size / save period: 20.0/14400 Memtable thresholds: 14.109375/3010/1440 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Built indexes: [] It turns out my suspicion was right. When I tried overriding the jvm memory parameters calculated in conf/cassandra-env.sh to use the values calculated on my 8GB laptop like this: MAX_HEAP_SIZE=3932m HEAP_NEWSIZE=400m ./mutate.sh That made the server behave much nicer. This time it kept all 5000 operations in a single Memtable. Also, when running with these memory settings the Memtable thresholds changed to "1.1390625/243/1440" (from "14.109375/3010/1440") (all the other output from "describe keyspace" remains the same) So it looks like something goes wrong when cassandra gets too much memory. -- Erik Forkalsrud Commission Junstion
SSTable Corruption
After an upgrade from 0.7.3 to 0.7.4, we're seeing the following on several data files: ERROR [main] 2011-03-23 18:58:33,137 ColumnFamilyStore.java (line 235) Corrupt sstable /mnt/services/cassandra/var/data/0.7.4/data/Helium/dp_idx-f-4844=[Index.db, Statistics.db, Data.db, Filter.db]; skipped java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:207) at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:166) at org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:226) at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:472) at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:453) at org.apache.cassandra.db.Table.initCf(Table.java:317) at org.apache.cassandra.db.Table.(Table.java:254) at org.apache.cassandra.db.Table.open(Table.java:110) at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:160) at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314) at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79) Prior to the upgrade, there weren't problems opening these data files with the exception that we were not able to bootstrap new nodes into the ring due to what we believe was caused by CASSANDRA-2283. In trying to fix 2283, we did run a scrub on this node. With an RF of 3 in our ring, what are the implications of this? Specifically, 1) If the node is a natural endpoint for rows in the corrupt sstable, will it simply not respond to requests for that row or will there be errors? 2) Should a repair fix the broken sstable? This was my expectation based on similar threads on the mailing list. 3) If #2 is correct and the repair did not fix the corrupt sstables (which it did not), how should we proceed to repair the tables? Thanks, -erik
Re: SSTable Corruption
Thanks, so is it the "[Index.db, Statistics.db, Data.db, Filter.db]; skipped" that indicates it's in Statistics? Basically I need a way to know if the same is true of all the other tables showing this issue. -erik
Re: ParNew (promotion failed)
It's been about 7 months now but at the time G1 would regularly segfault for me under load on Linux x64. I'd advise extra precautions in testing and make sure you test with representative load.
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
I'll capture what we're seeing here for anyone else who may look into this in more detail later. Our standard heap growth is ~300K in between collections, with regular ParNew collections happening on average about every 4 seconds. All very healthy. The memtable flush (where we see almost all our CMS activity) seems to have some balloon effect that, despite a 64MB memtable size, causes over 512MB of heap to be consumed in half a second. In addition to the hefty amount of garbage it causes, due to the MaxTenuringThreshold=1 setting most of that garbage seems to spill immediately into the tenured generation, which quickly fills and triggers a CMS. The rate of garbage overflowing into tenured seems to outstrip the speed of the concurrent mark worker, which is almost always interrupted, failing the concurrent collection (concurrent mode failure). However, the tenured collection is usually hugely effective, recovering over half the total heap. Two questions for the group then: 1) Does this seem like a sane amount of garbage (512MB) to generate when flushing a 64MB table to disk? 2) Is this possibly a case of the MaxTenuringThreshold=1 setting working against Cassandra? The flush seems to create a lot of garbage very quickly, such that a normal CMS isn't even possible. I'm sure there was a reason to introduce this setting but I'm not sure it's universally beneficial. Is there any history on the decision to opt for immediate promotion rather than using an adaptable number of survivor generations?
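Purely to make question 2 concrete, these are the JVM knobs under discussion as they would appear in conf/cassandra-env.sh; the values are hypothetical experiments to compare against your own GC logs, not recommendations from this thread:

# conf/cassandra-env.sh -- illustrative values only
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"              # larger survivor spaces than the shipped default of 8
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"       # let flush garbage age a few ParNew cycles instead of 1
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"   # shows how long objects actually survive
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"  # confirm whether promotion still spills into CMS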
CL.ONE reads and SimpleSnitch unnecessary timeouts
Sorry for the complex setup, took a while to identify the behavior and I'm still not sure I'm reading the code correctly. Scenario: Six node ring w/ SimpleSnitch and RF3. For the sake of discussion assume the token space looks like: node-0 1-10 node-1 11-20 node-2 21-30 node-3 31-40 node-4 41-50 node-5 51-60 In this scenario we want key 35 where nodes 3,4 and 5 are natural endpoints. Client is connected to node-0, node-1 or node-2. node-3 goes into a full GC lasting 12 seconds. What I think we're seeing is that as long as we read with CL.ONE *and* are connected to 0,1 or 2, we'll never get a response for the requested key until the failure detector kicks in and convicts 3 resulting in reads spilling over to the other endpoints. We've tested this by switching to CL.QUORUM and since haven't seen read timeouts during big GCs. Assuming the above, is this behavior really correct? We have copies of the data on two other nodes but because this snitch config always picks node-3, we always timeout until conviction which can take up to 8 seconds sometimes. Shouldn't the read attempt to pick a different endpoint in the case of the first timeout rather than repeatedly trying a node that isn't responding? Thanks, -erik
Re: CL.ONE reads and SimpleSnitch unnecessary timeouts
So we're not currently using a dynamic snitch, only the SimpleSnitch is at play (lots of history as to why, I won't go into it). If this would solve our problems I'm fine changing it. Understood re: client contract. I guess in this case my issue is that the server we're connected to never tries more than the one failing server until the failure detector has kicked in - it keeps flogging the bad server so subsequent requests never produce a different result until conviction. Regarding clients retrying, in this configuration the situation doesn't improve and it still times out because our client libraries don't try another host. They still have a valid connection to a working host, it's just that given our configuration that one node keeps proxying to a bad server and never routes around it. It sounds like switching to the dynamic snitch would adjust for the first timeout on subsequent attempts, so maybe that's the most advisable thing in this case. On Wed, Apr 13, 2011 at 10:58 AM, Jonathan Ellis wrote: > First, our contract with the client says "we'll give you the answer or > a timeout after rpc_timeout." Once we start trying to cheat on that > the client has no guarantee anymore when it should expect a response > by. So that feels iffy to me. > > Second, retrying to a different node isn't expected to give > substantially better results than the client issuing a retry itself if > that's what it wants, since by the time we timeout once then FD and/or > dynamic snitch should route the request to another node for the retry > without adding additional complexity to StorageProxy. (If that's not > what you see in practice, then we probably have a dynamic snitch bug.) > > On Wed, Apr 13, 2011 at 12:32 PM, Erik Onnen wrote: >> Sorry for the complex setup, took a while to identify the behavior and >> I'm still not sure I'm reading the code correctly. >> >> Scenario: >> >> Six node ring w/ SimpleSnitch and RF3. For the sake of discussion >> assume the token space looks like: >> >> node-0 1-10 >> node-1 11-20 >> node-2 21-30 >> node-3 31-40 >> node-4 41-50 >> node-5 51-60 >> >> In this scenario we want key 35 where nodes 3,4 and 5 are natural >> endpoints. Client is connected to node-0, node-1 or node-2. node-3 >> goes into a full GC lasting 12 seconds. >> >> What I think we're seeing is that as long as we read with CL.ONE *and* >> are connected to 0,1 or 2, we'll never get a response for the >> requested key until the failure detector kicks in and convicts 3 >> resulting in reads spilling over to the other endpoints. >> >> We've tested this by switching to CL.QUORUM and since haven't seen >> read timeouts during big GCs. >> >> Assuming the above, is this behavior really correct? We have copies of >> the data on two other nodes but because this snitch config always >> picks node-3, we always timeout until conviction which can take up to >> 8 seconds sometimes. Shouldn't the read attempt to pick a different >> endpoint in the case of the first timeout rather than repeatedly >> trying a node that isn't responding? >> >> Thanks, >> -erik >> > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of DataStax, the source for professional Cassandra support > http://www.datastax.com >
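For reference, enabling the dynamic snitch on a 0.7-era node is a cassandra.yaml change along these lines (option names as recalled from that era's sample config, so verify against the cassandra.yaml shipped with your version):

# Keep the configured snitch, but let Cassandra wrap it with the dynamic snitch,
# which scores replicas by recent latency and routes reads away from slow nodes.
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
dynamic_snitch: true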
Error in TBaseHelper compareTo(byte [] a , byte [] b)
Hey! We are currently using Cassandra 0.5.1 and I'm getting a StackOverflowError when comparing two ColumnOrSuperColumn objects. It turns out that the compareTo function for byte[] recurses endlessly in libthrift-r820831.jar. We are planning to upgrade to 0.6.1 but we're not ready to do it today, so I just wanted to check if it is possible to get a jar that works with 0.5 where that bug has been fixed, so we can just replace it? -- Regards Erik
Re: Error in TBaseHelper compareTo(byte [] a , byte [] b)
Thanks Jonathan! Yeah, I will just wait until we are ready for the upgrade and hold off on that project for now. Erik
Re: Batch mutate doesn't work
Try this. You need to use the dict module. MutationMap = dict:store(Key, dict:store("KeyValue", [ #mutation {column_or_supercolumn = #columnOrSuperColumn {column = #column {name = "property", value="value", timestamp = 2}}} ], dict:new()), dict:new()), Erik On 30 apr 2010, at 15.58, Zubair Quraishi wrote: > I have the following code in Erlang to set a value and then add a > property. The first set works but the mutate fails. Can anyone > enlighten me? > Thanks > > {ok, C} = thrift_client:start_link("127.0.0.1",9160, cassandra_thrift), > > Key = "Key1", > > % > % set first property > % > thrift_client:call( C, > 'insert', > [ "Keyspace1", >Key, >#columnPath{column_family="KeyValue", column="value"}, >"value1", >1, >1 >] ), > > % > % set second property ( fails! - why? ) > % > MutationMap = > { > Key, > { > <<"KeyValue">>, > [ > #mutation{ > column_or_supercolumn = #column{ name = "property" , value = > "value" , timestamp = 2 } > } > ] > } > }, > thrift_client:call( C, > 'batch_mutate', > [ "Keyspace1", > MutationMap, > 1 > ] ) > > : The error returned is : > > ** exception exit: {bad_return_value,{error,{function_clause,[{dict,size, > [{"Key1", > > {<<"KeyValue">>, > > [{mutation,{column,"property","value",2},undefined}]}}]}, > > {thrift_protocol,write,2}, > > {thrift_protocol,struct_write_loop,3}, > > {thrift_protocol,write,2}, > > {thrift_client,send_function_call,3}, > > {thrift_client,'-handle_call/3-fun-0-',3}, > > {thrift_client,catch_function_exceptions,2}, > > {thrift_client,handle_call,3}]}}} > 7> > =ERROR REPORT 30-Apr-2010::15:13:42 === > ** Generic server <0.55.0> terminating > ** Last message in was {call,batch_mutate, > ["Keyspace1", > {"Key1", >{<<"KeyValue">>, > [{mutation, > {column,"property","value",2}, > undefined}]}}, > 1]} > ** When Server state == {state,cassandra_thrift, >{protocol,thrift_binary_protocol, > {binary_protocol, > {transport,thrift_buffered_transport,<0.58.0>}, > true,true}}, >0} > ** Reason for termination == > ** {bad_return_value, > {error, > {function_clause, > [{dict,size, > [{"Key1", > {<<"KeyValue">>, > [{mutation, > {column,"property","value",2}, > undefined}]}}]}, > {thrift_protocol,write,2}, > {thrift_protocol,struct_write_loop,3}, > {thrift_protocol,write,2}, > {thrift_client,send_function_call,3}, > {thrift_client,'-handle_call/3-fun-0-',3}, > {thrift_client,catch_function_exceptions,2}, > {thrift_client,handle_call,3}]}}}
Re: non blocking Cassandra with Tornado
On Thu, Jul 29, 2010 at 9:57 PM, Ryan Daum wrote: > > Barring this we (place where I work, Chango) will probably eventually fork > Cassandra to have a RESTful interface and use the Jetty async HTTP client to > connect to it. It's just ridiculous for us to have threads and associated > resources tied up on I/O-blocked operations. > We've done exactly this but with Netty rather than Jetty. Helps too because we can easily have testers look at what we put into our CFs. Took some cheating to marshall the raw binaries into JSON but I'm pretty happy with what it's bought us so far.
Cassandra suitability for model?
...or perhaps vice versa: how would I tweak a model to suit Cassandra? I have in mind data that could be _almost_ shoehorned into the (S)CF structure, and I'd love to hammer this nail with something hadoopy, but I have a niggling suspicion I'm setting myself up for frustration. I have * a relatively small set of primary tags (dozens or hundreds per cluster) * under each primary tag a large number (on the scale of 1E6) of arbitrary length hierarchical paths ("foo/bar/xyzzy", typically consisting of descriptive labels, usually totaling 20-40 chars) * under each path an arbitrary number (usually a few or a few dozen, but in some systematic cases ~1000) of leaf tags (typically descriptive labels, say 4-16 chars in length) * under each leaf tag a value (arbitrary; string, number, perhaps binary) On the surface, it would seem that the primary tag would correspond well with Supercolumn keys, the intermediate path with ColumnFamily names, and the final key-value-pairs with Columns. Any warning bells here? (Seems like I could also use the primary tag as a Keyspace name, but I seem to recall some warnings about using excessive keyspaces.) The gist is that each and every leaf tag, across the whole data set, receives a value every few seconds, indefinitely, and history must be preserved. In practice, all Columns in the ColumnFamily receive a value at the same time. 100k ColumnFamily updates a second would be routine. Nodes would be added whenever storage or per-node I/O became an issue. A query is a much more rare occurrence, and would nearly always involve retrieving the full contents of a ColumnFamily over some time range (usually thousands of snapshots, not at all rarely millions). Just by browsing online documentation I can't resolve whether this timestamping could work in conjunction with Cassandra's internal native timestamping as is. Is it possible - out of the box, or with minor coding - to retain history, and to incorporate time ranges into queries? If so, does the distributed storage cope with the accumulating data of a single ColumnFamily flowing over to new nodes? Or: should I twist the whole thing around and incorporate my timestamp into the ColumnFamily identifier to enjoy automatic scaling? Would the sheer number of resulting identifiers become a performance issue? Thanks for your comments; //e
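Purely to make the shape concrete, here is one flattened way of writing down the data described above as row keys and time-ordered column names; the identifiers are hypothetical and this is only a sketch of the design space being asked about, not a recommendation:

// Hypothetical sketch of one flattened layout for the data described above:
// row key = primaryTag + "/" + hierarchicalPath   (e.g. "plant7/foo/bar/xyzzy")
// column  = zero-padded timestamp + ":" + leafTag (time-ordered, then by leaf)
// value   = the sampled value for that leaf at that instant
public final class TagLayout {
    static String rowKey(String primaryTag, String path) {
        return primaryTag + "/" + path;
    }
    static String columnName(long timestampMillis, String leafTag) {
        // Zero-pad the timestamp so string-ordered columns stay chronological.
        return String.format("%013d:%s", timestampMillis, leafTag);
    }
}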
0.6 to 0.7 upgrade assumptions
Hello, We're planning an upgrade from 0.6.7 (after it's released) to 0.7.0 (after it's released) and I wanted to validate my assumptions about what can be expected. Obviously I'll need to test my own assumptions but I was hoping for some guidance to make sure my understanding is correct. My core assumptions are: 1) After taking the upstream services offline and inducing a flush on the 0.6 nodes so that all in-memory data is written out to SSTables, I should be able to start the 0.7 binaries on top of the 0.6 data files with no issue. 2) When upgrading, I can change from RackUnawareStrategy to a PropertyFileSnitch configuration. My understanding is that read repair will kick in for the missing facility replicas at a rate roughly equivalent to the read repair chance. 3) While read repair from #2 is occurring, the now-incorrect replica nodes will continue to be capable of servicing requests for the data that has not been migrated to the correct replicas. Thanks! -erik
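For assumption #1, the step usually recommended right before stopping a 0.6 node for the upgrade is nodetool drain, which flushes the memtables to SSTables and stops the node accepting writes; a minimal sketch (the host is a placeholder, and drain is available in recent 0.6.x releases):

# Flush memtables to SSTables and stop the node accepting writes before shutdown.
bin/nodetool -h 127.0.0.1 drain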
Re: Model to store biggest score
Another approach you can take is to add the userid to the score, like => (column=140_uid2, value=[], timestamp=1268841641979), and if you need the scores time-sorted you can add => (column=140_1268841641979_uid2, value=[], timestamp=1268841641979). But I do think that in any case you need to remove the old entry so that you don't get duplicates, unless I'm missing something here. On Wed, Mar 17, 2010 at 9:52 AM, Brandon Williams wrote: > On Wed, Mar 17, 2010 at 11:48 AM, Richard Grossman wrote: > >> But in the case of simple column family I've the same problem when I >> update the score of 1 user then I need to remove his old score too. For >> example here the user uid5 was at 130 now he is at 140 because I add the >> random number cassandra will keep all the score evolution. >> > > You can maintain another index mapping users to the values. Depending on > your use case though, if this is time-based, you can name the rows by the > date and just create new rows as time goes on. > > -Brandon > -- Regards Erik
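One detail worth noting if the scores live in string column names: plain string ordering only matches numeric ordering when the numbers are zero-padded to a fixed width. A small sketch of composing such column names (the helper and padding widths are made up for illustration):

// Hypothetical helper for composing score-based column names so that
// lexicographic (string) column ordering matches numeric ordering.
public final class ScoreColumns {
    // e.g. score=140, ts=1268841641979, uid="uid2" -> "0000000140_1268841641979_uid2"
    static String scoreColumn(long score, long timestampMillis, String userId) {
        return String.format("%010d_%013d_%s", score, timestampMillis, userId);
    }
}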
Re: Timestamp for versioning?
If I'm not totally mistaken the timestamp is used for conflict resolution on the server. Have a look at: http://wiki.apache.org/cassandra/DataModel for more info 2010/3/23 Waaij, B.D. (Bram) van der > Can you explain then to me what the purpose is of the timestamp? > > > > -Original Message- > > From: Jonathan Ellis [mailto:jbel...@gmail.com] > > Sent: dinsdag 23 maart 2010 15:35 > > To: user@cassandra.apache.org > > Subject: Re: Timestamp for versioning? > > > > No, we're not planning to add support for retrieving old versions. > > > > 2010/3/23 Waaij, B.D. (Bram) van der : > > > Are there any plans to add versioning in the near future to > > cassandra? > > > > > > What is the purpose of the timestamp in the current implementation? > > > > > > Best regards, > > > Bram > > > > > > > > > From: 张宇 [mailto:zyxt...@gmail.com] > > > Sent: dinsdag 23 maart 2010 14:56 > > > To: user@cassandra.apache.org > > > Cc: Waaij, B.D. (Bram) van der > > > Subject: Re: Timestamp for versioning? > > > > > > HBase supports the version through the timestamp,but Cassandra not. > > > > > > 2010/3/23 Waaij, B.D. (Bram) van der > > >> > > >> Hi there, > > >> > > >> Is there already support for doing something with the timestamp? I > > >> would like to use it as a versioning mechanism of the > > value belonging > > >> to a certain key. When i store a column with the same key, > > different > > >> value and different timestamp, would i still be able to > > retrieve both versions? > > >> > > >> Looking at the API I can not find such method, but perhaps I am > > >> missing something as I am new to cassandra. > > >> > > >> Best regards, > > >> Bram > > >> > > >> This e-mail and its contents are subject to the DISCLAIMER at > > >> http://www.tno.nl/disclaimer/email.html > > > > > > This e-mail and its contents are subject to the DISCLAIMER at > > > http://www.tno.nl/disclaimer/email.html > > > > > > This e-mail and its contents are subject to the DISCLAIMER at > http://www.tno.nl/disclaimer/email.html > -- Regards Erik
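Put differently, the timestamp only decides which write wins when the same column is written more than once (last write wins); the losing version is discarded rather than kept, which is why it cannot be used for versioning. Conceptually the reconciliation is just this (an illustrative sketch, not Cassandra's actual code):

// Illustrative last-write-wins reconciliation: given two versions of the same
// column, the one with the higher client-supplied timestamp is kept.
final class VersionedValue {
    final byte[] value;
    final long timestamp;
    VersionedValue(byte[] value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }
    static VersionedValue reconcile(VersionedValue a, VersionedValue b) {
        // The losing version is eventually discarded, so old versions
        // cannot be retrieved later.
        return a.timestamp >= b.timestamp ? a : b;
    }
}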
Re: Auto Increament
On Wed, Mar 24, 2010 at 11:00 AM, Jesus Ibanez wrote: > You can generate UUIDs based on time with http://jug.safehaus.org/ if you > use Java. And it's easy to use, you just have to insert one line: > UUID uuid = UUIDGenerator.getInstance().generateTimeBasedUUID(); > > Maybe a solution to your question: > > To "replace" the autoincrement of MySQL, you can create a column family of > type Super ordering super columns in decreasing order and naming the super > columns with a numeric value. So then, if you want to insert a new value, > first read the last inserted column name and add 1 to the returned value. > Then insert a new super column whose name will be the new value. > > I'm a little bit confused about this. Where do you get the last inserted column from? Why is this better than having a fixed counter row where you increment the columnValue? Neither of them is thread-safe. > Be careful with super columns, because I think that if you want to read > just a column of a super column, all columns will be deserialized. > > So you will have this: > > Super_Columns_Family_Name = { // this is a ColumnFamily name of type Super > one_key: {// this is the key to this row inside the Super CF > > n: {column1: "value 1", column2: "value 2", ... , columnN: "value n"}, > > n-1: {column1: "value 1", column2: "value 2", ... , columnN: "value > n"}, > > ... > > 2: {column1: "value 1", column2: "value 2", ... , columnN: "value n"}, > 1: {column1: "value 1", column2: "value 2", ... , columnN: "value n"}, > }, > > another_key: {// this is the key to this row inside the Super CF > > > n: {column1: "value 1", column2: "value 2", ... , columnN: "value n"}, > > n-1: {column1: "value 1", column2: "value 2", ... , columnN: "value > n"}, > > ... > > 2: {column1: "value 1", column2: "value 2", ... , columnN: "value n"}, > 1: {column1: "value 1", column2: "value 2", ... , columnN: "value n"}, > }, > > } > > The super columns that represent the autoincrement value are the ones named > as: n, n-1, ... , 2, 1. > > Hope it helps! > > Jesus. > > > > 2010/3/24 Sylvain Lebresne > > > How can I replace the "auto increment" attribute in MySQL >> > with Cassandra? >> >> You can't. Not easily at least. >> >> > If I can't, how can I generate an ID which is globally >> > unique for each of the columns? >> >> Check UUIDs: >> http://en.wikipedia.org/wiki/Universally_Unique_Identifier >> >> > >> > Thanks, >> > >> > Sent from my iPhone >> > >> > > -- Regards Erik
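If you only need global uniqueness and not ordering, a random (version 4) UUID from the JDK avoids any read-before-write; a minimal sketch:

// Minimal sketch: globally unique IDs without coordination.
// java.util.UUID.randomUUID() gives a version 4 (random) UUID -- unique but not
// time-ordered; for time-ordered IDs a time-based (version 1) UUID library such
// as the one mentioned above is the usual choice.
import java.util.UUID;

public final class IdExample {
    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(id);
    }
}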
Replicating data over the wan?
Is anyone using datacenter-aware replication where the replication takes place over the WAN and not over super-fast optical cable between the centers? I tried to look at all posts related to the topic but haven't really found much, only some things about not doing that if using ZooKeeper and some other small comments. What are the limitations for this kind of replication, and what happens when the writes are coming in too fast for the replication to keep up, assuming some kind of write buffer for the replication? I'm not really worried too much about the inconsistent state of the cluster, just about how well it would work in this kind of environment. -- Regards Erik
Re: Replicating data over the wan?
Thanks David and Jonathan for the info. Those two links were pretty much the only thing that I did find about this issue, but I wasn't sure that just because it works across different zones it would also work across different regions. -- Regards Erik
PropertyFileEndPointSnitch
When building the PropertyFileEndPointSnitch into the jar cassandra-propsnitch.jar, the files in the jar end up at src/java/org/apache/cassandra/locator/PropertyFileEndPointSnitch.class instead of org/apache/cassandra/locator/PropertyFileEndPointSnitch.class. Am I doing something wrong, is this intended behavior, or is it a bug? -- Regards Erik
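Entries rooted at src/java/... usually mean the jar was packed relative to the source tree rather than the directory that directly contains org/. As a generic sketch only (not the actual contrib build file; the classes directory is a placeholder), an Ant jar task that roots the entries correctly looks like:

<!-- basedir must be the directory that directly contains org/, e.g. the javac output dir -->
<jar destfile="cassandra-propsnitch.jar" basedir="build/classes"
     includes="org/apache/cassandra/locator/PropertyFileEndPointSnitch*.class"/>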
Re: PropertyFileEndPointSnitch
Thanks Jonathan!
Re: Best way to store millisecond-accurate data
On Fri, Apr 23, 2010 at 5:54 PM, Miguel Verde wrote: > TimeUUID's time component is measured in 100-nanosecond intervals. The > library you use might calculate it with poorer accuracy or precision, but > from a storage/comparison standpoint in Cassandra millisecond data is easily > captured by it. > > One typical way of dealing with the data explosion of sampled time series > data is to bucket/shard rows (i.e. Bob-20100423-bloodpressure) so that you > put an upper bound on the row length. > > > On Apr 23, 2010, at 7:01 PM, Andrew Nguyen < > andrew-lists-cassan...@ucsfcti.org> wrote: > > Hello, >> >> I am looking to store patient physiologic data in Cassandra - it's being >> collected at rates of 1 to 125 Hz. I'm thinking of storing the timestamps >> as the column names and the patient/parameter combo as the row key. For >> example, Bob is in the ICU and is currently having his blood pressure, >> intracranial pressure, and heart rate monitored. I'd like to collect this >> with the following row keys: >> >> Bob-bloodpressure >> Bob-intracranialpressure >> Bob-heartrate >> >> The column names would be timestamps but that's where my questions start: >> >> I'm not sure what the best data type and CompareWith would be. From my >> searching, it sounds like the TimeUUID may be suitable but isn't really >> designed for millisecond accuracy. My other thought is just to store them >> as strings (2010-04-23 10:23:45.016). While space isn't the foremost >> concern, we will be collecting this data 24/7 so we'll be creating many >> columns over the long-term. >> > You could just get an 8-byte millisecond timestamp and store that as part of the key. > >> I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states >> that the entire row must fit in memory. Does this include the values as >> well as the column names? >> > Yes. The alternative is to store one insert per row; you are not going to be able to do backwards slices this way without an extra index, but you can scale much better. > >> In considering the limits of cassandra and the best way to model this, we >> would be adding 3.9 billion rows per year (assuming 125 Hz @ 24/7). >> However, I can't really think of a better way to model this... So, am I >> thinking about this all wrong or am I on the right track? >> >> Thanks, >> Andrew >> > -- Regards Erik
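A small sketch of the bucketing idea above (one row per patient, parameter and day, with the millisecond timestamp as the column name); the names and date format are illustrative only:

// Illustrative only: compose bucketed row keys and millisecond column names
// for sampled time-series data, bounding row length to one day per row.
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public final class VitalsKeys {
    // e.g. ("Bob", "bloodpressure", t) -> "Bob-20100423-bloodpressure"
    static String rowKey(String patient, String parameter, long timestampMillis) {
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        return patient + "-" + day.format(new Date(timestampMillis)) + "-" + parameter;
    }
    // The column name is just the raw 8-byte millisecond timestamp, so columns
    // sort chronologically under a LongType-style comparator.
    static long columnName(long timestampMillis) {
        return timestampMillis;
    }
}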
Re: The Difference Between Cassandra and HBase
I would say that HBase is a little bit more focused on reads and Cassandra on writes. HBase has better scans and Cassandra better multi datacenter functionality. Erik
Re: Performance drop of current Java drivers
Matthias, Thanks for sharing your findings and test code. We were able to track this to a regression in the underlying Netty library and already have a similar issue reported here: https://datastax-oss.atlassian.net/browse/JAVA-2676 The regression seems to be with the upgrade to Netty version 4.1.45, which affects driver versions 4.5.0, 4.5.1 and 4.6.0. You can work around this issue by explicitly downgrading Netty to version 4.1.43. If you are using Maven, you can do this by explicitly declaring the version of Netty in your dependencies as follows: io.netty netty-handler 4.1.43.Final We will be addressing this issue and likely release a fixed version very soon. Many thanks again, Erik On Mon, May 4, 2020 at 6:58 AM Matthias Pfau wrote: > Hi Chris and Adam, > thanks for looking into this! > > You can find my tests for old/new client here: > https://gist.github.com/mpfau/7905cea3b73d235033e4f3319e219d15 > https://gist.github.com/mpfau/a62cce01b83b56afde0dbb588470bc18 > > > May 1, 2020, 16:22 by adam.holmb...@datastax.com: > > > Also, if you can share your schema and benchmark code, that would be a > good start. > > > > On Fri, May 1, 2020 at 7:09 AM Chris Splinter <> > chris.splinter...@gmail.com> > wrote: > > > >> Hi Matthias, > >> > >> I have forwarded this to the developers that work on the Java driver > and they will be looking into this first thing next week. > >> > >> Will circle back here with findings, > >> > >> Chris > >> > >> On Fri, May 1, 2020 at 12:28 AM Erick Ramirez <>> > erick.rami...@datastax.com>> > wrote: > >> > >>> Matthias, I don't have an answer to your question but I just wanted to > note that I don't believe the driver contributors actively watch this > mailing list (I'm happy to be corrected 🙂 ) so I'd recommend you > cross-post in the Java driver channels as well. Cheers! > >>> > > > > > > -- > > Adam Holmberg > > e.> > adam.holmb...@datastax.com > > > w.> > www.datastax.com <http://www.datastax.com> > > > > > > > - > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org > For additional commands, e-mail: user-h...@cassandra.apache.org > > -- Erik Merkle e. erik.mer...@datastax.com w. www.datastax.com
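Written out as pom.xml markup, the override above is just the Netty dependency pinned to the older version (a minimal sketch; whether it goes in your dependencies or in dependencyManagement depends on how your build manages versions):

<!-- Pin Netty to 4.1.43.Final to work around the regression described above. -->
<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-handler</artifactId>
  <version>4.1.43.Final</version>
</dependency>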
Re: Integration tests - is anything wrong with Cassandra beta4 docker??
I opened a ticket when beta4 was released because no Docker image was available for it. It turned out the reason the image wasn't published is that the tests DockerHub uses to verify the builds were timing out, so it may be a similar issue you are running into. The ticket is here: https://github.com/docker-library/cassandra/issues/221#issuecomment-761187461 DockerHub ended up altering some of the default Cassandra settings, for their verification tests only, to allow the tests to pass. For your integration tests, you may want to do something similar. On Wed, Mar 10, 2021 at 5:08 AM Attila Wind wrote: > Hi Guys, > > We are using dockerized Cassandra to run our integration tests. > So far we were using the 4.0-beta2 docker image ( > https://hub.docker.com/layers/cassandra/library/cassandra/4.0-beta2/images/sha256-77aa30c8e82f0e761d1825ef7eb3adc34d063b009a634714af978268b71225a4?context=explore > ) > Recently I tried to switch to the 4.0-beta4 docker but noticed a few > problems... > >- the image starts much, much slower for me (4x more time to come up and >accept connections) >- unlike the beta2 version, beta4 barely survives all of our test >cases... typically it gets stuck and fails with timeouts at around 60-80% >of completed test cases > > Any similar / related experiences maybe? > > thanks > -- > Attila Wind > > http://www.linkedin.com/in/attilaw > Mobile: +49 176 43556932 > > > -- Erik Merkle e. erik.mer...@datastax.com w. www.datastax.com
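As a sketch of "doing something similar" for your own integration tests, the official image lets you override a few settings through CASSANDRA_* environment variables and by mounting your own cassandra.yaml; the specific values below are placeholders, since what actually needs tuning depends on what is timing out for you:

# Sketch only: values are placeholders, tune to whatever is timing out for you.
# The official image honors a handful of CASSANDRA_* environment variables and
# reads its config from /etc/cassandra inside the container.
docker run -d --name cassandra-it \
  -e CASSANDRA_CLUSTER_NAME=it-cluster \
  -e CASSANDRA_NUM_TOKENS=1 \
  -v "$PWD/it-cassandra.yaml:/etc/cassandra/cassandra.yaml" \
  -p 9042:9042 \
  cassandra:4.0-beta4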