[RELEASE] Apache Cassandra 2.0.3 released
The Cassandra team is pleased to announce the release of Apache Cassandra version 2.0.3. Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. You can read more here: http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download section: http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 2.0 series. As always, please pay attention to the release notes[2] and let us know[3] if you encounter any problems. Enjoy!

[1]: http://goo.gl/epkcOu (CHANGES.txt)
[2]: http://goo.gl/V66rGy (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[RELEASE] Apache Cassandra 1.2.12 released
The Cassandra team is pleased to announce the release of Apache Cassandra version 1.2.12. Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. You can read more here: http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download section: http://cassandra.apache.org/download/

This version is a maintenance/bug fix release[1] on the 1.2 series. As always, please pay attention to the release notes[2] and let us know[3] if you encounter any problems. Enjoy!

[1]: http://goo.gl/iN7sl1 (CHANGES.txt)
[2]: http://goo.gl/qgO6NI (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
Nodetool cleanup
Hi, I'm working with Cassandra 1.2.2 and I have a question about nodetool cleanup. The documentation says "Wait for cleanup to complete on one node before doing the next". I would like to know why we can't run cleanup on several nodes at the same time. Thanks
Multiple writers writing to a cassandra node...
Hello, I am a newbie to the Cassandra world. I would like to know whether it's possible for two different clients to write to a single Cassandra node. I have packet collector software which runs on two different systems, and I would like both of them to write packets to the same node (same keyspace and column family). I am currently using Cassandra 2.0.0 with the libQtCassandra library. At the moment I get an IllegalRequestException (what(): Default TException) on the first system as soon as I try to store from the second system, while the second system works fine. When I restart the program on the first system, the second system gets the exception and the first one works fine. Occasionally I also hit the "frame size has negative value" Thrift exception when traffic is high and packets are being stored very fast. Can someone please point out what I am doing wrong? Thanks in advance.
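For reference, Cassandra itself handles many concurrent client connections writing to the same node and table; each writer should simply open its own connection rather than share one Thrift socket. Below is a minimal sketch of one such writer process using the DataStax Java driver rather than libQtCassandra; the contact point, keyspace, and table names are invented for illustration, and the 2.0-era driver API is assumed.

    import java.nio.ByteBuffer;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class PacketWriter {
        public static void main(String[] args) {
            // Each collector process opens its own Cluster/Session; nothing is shared
            // between the two systems, so concurrent writes do not interfere.
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
            Session session = cluster.connect("packets");

            PreparedStatement insert = session.prepare(
                    "INSERT INTO capture (collector, seq, payload) VALUES (?, ?, ?)");

            // In the real capture loop each packet would be bound and executed like this.
            ByteBuffer payload = ByteBuffer.wrap(new byte[] { 0x01, 0x02, 0x03 });
            session.execute(insert.bind("collector-1", 1L, payload));

            cluster.close();
        }
    }

Running one copy of this per collector machine against the same keyspace and table is a supported setup; the exceptions described above are more likely a client-side connection or framing issue than a Cassandra limitation.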
Re: Nodetool cleanup
Hi Julien, I hope I get this right :) A repair will trigger a major compaction on your node, which takes up a lot of CPU and IO. It needs to do this to build up the data structure that is used for the repair. After the compaction this is streamed to the other nodes in order to repair them. If you trigger this on every node simultaneously you basically take the performance away from your whole cluster. I would expect Cassandra to still function, just much slower than before. Triggering it node after node leaves your cluster with more resources to handle incoming requests. Cheers, Artur

On 25/11/13 15:12, Julien Campan wrote: Hi, I'm working with Cassandra 1.2.2 and I have a question about nodetool cleanup. The documentation says "Wait for cleanup to complete on one node before doing the next". I would like to know why we can't run cleanup on several nodes at the same time. Thanks
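If the goal is simply to automate "one node at a time", a driver script only has to wait for each nodetool invocation to finish before moving on to the next host. A minimal, hedged sketch in Java follows; the host names are invented, and it assumes nodetool is on the PATH and can reach each node's JMX port.

    import java.util.Arrays;
    import java.util.List;

    public class SequentialCleanup {
        public static void main(String[] args) throws Exception {
            List<String> hosts = Arrays.asList("cass-node1", "cass-node2", "cass-node3");
            for (String host : hosts) {
                // waitFor() blocks until cleanup on this node has finished,
                // so the next node is only started afterwards.
                Process p = new ProcessBuilder("nodetool", "-h", host, "cleanup")
                        .inheritIO()
                        .start();
                int exit = p.waitFor();
                if (exit != 0) {
                    throw new RuntimeException("cleanup failed on " + host + " (exit " + exit + ")");
                }
            }
        }
    }

The blocking waitFor() is the whole point: it is what the documentation's "wait for cleanup to complete on one node before doing the next" boils down to in practice.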
Data loss when swapping out cluster
Hello, We recently experienced (pretty severe) data loss after moving our 4 node Cassandra cluster from one EC2 availability zone to another. Our strategy for doing so was as follows: - One at a time, bring up new nodes in the new availability zone and have them join the cluster. - One at a time, decommission the old nodes in the old availability zone and turn them off (stop the Cassandra process). Everything seemed to work as expected. As we decommissioned each node, we checked the logs for messages indicating "yes, this node is done decommissioning" before turning the node off. Pretty quickly after the old nodes left the cluster, we started getting client calls about data missing. We immediately turned the old nodes back on and when they rejoined the cluster *most* of the reported missing data returned. For the rest of the missing data, we had to spin up a new cluster from EBS snapshots and copy it over. What did we do wrong? In hindsight, we noticed a few things which may be clues... - The new nodes had much lower load after joining the cluster than the old ones (3-4 gb as opposed to 10 gb). - We have EC2Snitch turned on, although we're using SimpleStrategy for replication. - The new nodes showed even ownership (via nodetool status) after joining the cluster. Here's more info about our cluster... - Cassandra 1.2.10 - Replication factor of 3 - Vnodes with 256 tokens - All tables made via CQL - Data dirs on EBS (yes, we are aware of the performance implications) Thanks for the help.
Re: Cassandra high heap utilization under heavy reads and writes.
Yes, we saw this same behavior. A couple of months ago, we moved a large portion of our data out of Postgres and into Cassandra. The initial migration was done in a "distributed" manner: we had 600 (or 800, can't remember) processes reading from Postgres and writing to Cassandra in tight loops. This caused the exact behavior you described. We also did a read before a write. After we got through the initial data migration, our normal workload is *much* less writes (and reads for that matter) such that our cluster can easily handle it, so we didn't investigate further. -- C On Sat, Nov 23, 2013 at 10:55 PM, srmore wrote: > Hello, > We moved to cassandra 1.2.9 from 1.0.11 to take advantage of the off-heap > bloom filters and other improvements. > > We see a lot of messages dropped under high load conditions. We noticed > that when we do heavy read AND write simultaneously (we read first and > check whether the key exists if not we write it) Cassandra heap increases > dramatically and then gossip marks the node down (as a result of high load > on the node). > > > Under heavy 'reads only' we don't see this behavior. Has anyone seen this > behavior ? any suggestions. > > Thanks ! > > >
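For anyone reading along, this is roughly what the read-before-write pattern described above looks like with the DataStax Java driver; it is a sketch with invented keyspace, table, and column names, not the poster's actual migration code. Every iteration issues a read (memtables, bloom filters, and potentially several SSTables) on top of the possible write, so a bulk load written this way puts roughly double the pressure on the cluster compared with writing blindly.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class ReadBeforeWrite {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("migration");

            PreparedStatement exists = session.prepare(
                    "SELECT key FROM items WHERE key = ?");
            PreparedStatement insert = session.prepare(
                    "INSERT INTO items (key, value) VALUES (?, ?)");

            for (int i = 0; i < 1000; i++) {
                String key = "row-" + i;
                // One read per row...
                boolean present = session.execute(exists.bind(key)).one() != null;
                if (!present) {
                    // ...plus one write for every row that is missing.
                    session.execute(insert.bind(key, "value-" + i));
                }
            }
            cluster.close();
        }
    }

With 600-800 of these loops running at once, the extra read volume may well contribute to the heap and GC pressure described above. If the write is idempotent the read can often be dropped entirely (Cassandra INSERTs are upserts), and on Cassandra 2.0 the check can be pushed server-side with INSERT ... IF NOT EXISTS, at the cost of a Paxos round.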
Re: nodetool repair seems to increase linearly with number of keyspaces
We have the same setup: one keyspace per client, and currently about 300 keyspaces. nodetool repair takes a long time, 4 hours with -pr on a single node. We have a 4 node cluster with about 10 gb per node. Unfortunately, we haven't been keeping track of the running time as keyspaces, or load, increases. -- C On Wed, Nov 20, 2013 at 6:53 AM, John Pyeatt wrote: > We have an application that has been designed to use potentially 100s of > keyspaces (one for each company). > > One thing we are noticing is that nodetool repair across all of the > keyspaces seems to increase linearly based on the number of keyspaces. For > example, if we have a 6 node ec2 (m1.large) cluster across 3 Availability > Zones and create 20 keyspaces a nodetool repair -pr on one node takes 3 > hours even with no data in any of the keyspaces. If I bump that up to 40 > keyspaces it takes 6 hours. > > Is this the behaviour you would expect? > > Is there anything you can think of (short of redesigning the cluster to > limit keyspaces) to increase the performance of the nodetool repairs? > > My obvious concern is that as this application grows and we get more > companies using our it we will eventually have too many keyspaces to > perform repairs on the cluster. > > -- > John Pyeatt > Singlewire Software, LLC > www.singlewire.com > -- > 608.661.1184 > john.pye...@singlewire.com >
Prepare is 100 to 200 times slower after migration to 1.2.11
Hi, I have migrated my DEV environment from 1.2.8 to 1.2.11 to finally move to 2.0.2, and prepare is 100 to 200 times slower; something that was sub-millisecond now is 150 ms. Other CQL operations are normal. I am not planning to move to 2.0.2 until I fix this. I do not see any warn or error in the log; the only thing I saw was a live ratio around 6. Please help. Thanks Shahryar --
Re: How to set Cassandra config directory path
> I noticed when I gave the path directly to cassandra.yaml, it works fine. > Can't I give the directory path here, as mentioned in the doc? Documentation is wrong, the -Dcassandra.config param is used for the path of the yaml file not the config directory. I’ve emailed d...@datastax.com to let them know. > What I really want to do is to give the cassandra-topology.properties path to > Cassandra. Set the CASSANDRA_CONF env var in cassandra-in.sh Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 6:15 am, Bhathiya Jayasekara wrote: > Hi all, > > I'm trying to set conf directory path to Cassandra. According to [1], I can > set it using a system variable as cassandra.config= > > But it doesn't seem to work for me when I give conf directory path. I get > following exception. > > [2013-11-20 22:24:38,273] ERROR > {org.apache.cassandra.config.DatabaseDescriptor} - Fatal configuration error > org.apache.cassandra.exceptions.ConfigurationException: Cannot locate > /home/bhathiya/cassandra/conf/etc > at > org.apache.cassandra.config.DatabaseDescriptor.getStorageConfigURL(DatabaseDescriptor.java:117) > at > org.apache.cassandra.config.DatabaseDescriptor.loadYaml(DatabaseDescriptor.java:134) > at > org.apache.cassandra.config.DatabaseDescriptor.(DatabaseDescriptor.java:126) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:216) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:446) > at > org.wso2.carbon.cassandra.server.CassandraServerController$1.run(CassandraServerController.java:48) > at java.lang.Thread.run(Thread.java:662) > Cannot locate /home/bhathiya/cassandra/conf/etc > Fatal configuration error; unable to start server. See log for stacktrace. > > I noticed when I gave the path directly to cassandra.yaml, it works fine. > Can't I give the directory path here, as mentioned in the doc? > > What I really want to do is to give the cassandra-topology.properties path to > Cassandra. > > [1] > http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/tools/toolsCUtility_t.html > > > Thanks, > Bhathiya > >
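A small, hedged illustration of the distinction Aaron is drawing, for the embedded case the stack trace above shows (the wso2 controller calls CassandraDaemon directly): the cassandra.config system property has to name the cassandra.yaml file itself, as a path or URL, not the conf directory that contains it. The path below is borrowed from the log above; the class is a made-up launcher, not part of Cassandra.

    public class StartEmbeddedCassandra {
        public static void main(String[] args) throws Exception {
            // Point at the yaml FILE, not at .../conf -- a directory here produces the
            // "Cannot locate ..." ConfigurationException seen in the log above.
            System.setProperty("cassandra.config",
                    "file:///home/bhathiya/cassandra/conf/cassandra.yaml");

            // Same entry point the CassandraServerController uses in the stack trace.
            new org.apache.cassandra.service.CassandraDaemon().activate();
        }
    }

For cassandra-topology.properties, the conf directory itself still needs to be on the classpath, which is what setting CASSANDRA_CONF in the launch script achieves; at least in 1.2 that file is looked up as a classpath resource rather than through cassandra.config.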
Re: Cannot TRUNCATE
If it’s just a test system nuke it and try again :) Was there more than one node at any time ? Does nodetool status show only one node ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 7:45 am, Robert Wille wrote: > I've got a single node with all empty tables, and truncate fails with the > following error: Unable to complete request: one or more nodes were > unavailable. > > Everything else seems fine. I can insert, update, delete, etc. > > The only thing in the logs that looks relevant is this: > > INFO [HANDSHAKE-/192.168.98.121] 2013-11-20 11:36:59,064 > OutboundTcpConnection.java (line 386) Handshaking version with /192.168.98.121 > INFO [HANDSHAKE-/192.168.98.121] 2013-11-20 11:37:04,064 > OutboundTcpConnection.java (line 395) Cannot handshake version with > /192.168.98.121 > > I'm running Cassandra 2.0.2. I get the same error in cqlsh as I do with the > java driver. > > Thanks > > Robert
Re: Config changes to leverage new hardware
> However, for both writes and reads there was virtually no difference in the > latencies. What sort of latency were you getting ? > I’m still not very sure where the current *write* bottleneck is though. What numbers are you getting ? Could the bottle neck be the client ? Can it send writes fast enough to saturate the nodes ? As a rule of thumb you should get 3,000 to 4,000 (non counter) writes per second per core. > Sample iostat data (captured every 10s) for the dedicated disk where commit > logs are written is below. Does this seem like a bottle neck? Does not look too bad. > Another interesting thing is that the linux disk cache doesn’t seem to be > growing in spite of a lot of free memory available. Things will only get paged in when they are accessed. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 12:42 pm, Arindam Barua wrote: > > Thanks for the suggestions Aaron. > > As a follow up, we ran a bunch of tests with different combinations of these > changes on a 2-node ring. The load was generated using cassandra-stress, run > with default values to write 30 million rows, and read them back. > However, for both writes and reads there was virtually no difference in the > latencies. > > The different combinations attempted: > 1. Baseline test with none of the below changes. > 2. Grabbing the TLAB setting from 1.2 > 3. Moving the commit logs too to the 7 disk RAID 0. > 4. Increasing the concurrent_read to 32, and concurrent_write to 64 > 5. (3) + (4), i.e. moving commit logs to the RAID + increasing > concurrent_read and concurrent_write config to 32 and 64. > > The write latencies were very similar, except them being ~3x worse for the > 99.9th percentile and above for scenario (5) above. > The read latencies were also similar, with (3) and (5) being a little worse > for the 99.99th percentile. > > Overall, not making any changes, i.e. (1) performed as well or slightly > better than any of the other changes. > > Running cassandra-stress on both the old and new hardware without making any > config changes, the write performance was very similar, but the new hardware > did show ~10x improvement in the read for the 99.9th percentile and higher. > After thinking about this, the reason why we were not seeing any difference > with our test framework was perhaps the nature of the test where we write the > rows, and then do a bunch of reads to read the rows that were just written > immediately following. The data is read back from the memtables, and never > from the disk/sstables. Hence the new hardware’s increased RAM and size of > the disk cache or higher number of disks never helps. > > I’m still not very sure where the current *write* bottleneck is though. The > new hardware has 32 cores vs 8 cores of the old hardware. Moving the commit > log from a dedicated disk to a 7 RAID-0 disk system (where it would be shared > by other data though) didn’t make a difference too. (unless the extra > contention on the RAID nullified the positive effects of the RAID). > > Sample iostat data (captured every 10s) for the dedicated disk where commit > logs are written is below. Does this seem like a bottle neck? When the commit > logs are written the await/svctm ratio is high. 
> Device:   rrqm/s   wrqm/s   r/s    w/s     rMB/s   wMB/s   avgrq-sz   avgqu-sz   await   svctm   %util
>           0.00     8.09     0.04   8.85    0.00    0.07    15.74      0.00       0.12    0.03    0.02
>           0.00     768.03   0.00   9.49    0.00    3.04    655.41     0.04       4.52    0.33    0.31
>           0.00     8.10     0.04   8.85    0.00    0.07    15.75      0.00       0.12    0.03    0.02
>           0.00     752.65   0.00   10.09   0.00    2.98    604.75     0.03       3.00    0.26    0.26
>
> Another interesting thing is that the linux disk cache doesn’t seem to be growing in spite of a lot of free memory available. The total disk cache used reported by ‘free’ is less than the size of the sstables written with over 100 GB unused RAM.
> Even in production, where we have the older hardware running with 32 GB RAM for a long time now, looking at 5 hosts in 1 DC, only 2.5 GB to 8 GB was used for the disk cache. The Cassandra java process uses the 8 GB allocated to it, and at least 10-15 GB on all the hosts is not used at all.
>
> Thanks,
> Arindam
>
> From: Aaron Morton [mailto:aa...@thelastpickle.com]
> Sent: Wednesday, November 06, 2013 8:34 PM
> To: Cassandra User
> Subject: Re: Config changes to leverage new hardware
>
> Running Cassandra 1.1.5 currently, but evaluating to upgrade to 1.2.11 soon.
> You will make more use of the extra memory moving to 1.2 as it moves bloom filters and compression data off heap.
>
> Also grab the TLAB setting from cassandra-env.sh in v1.2
>
> As
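To put Aaron's rule of thumb into numbers (hedged, since the real ceiling depends on schema, replication, and client concurrency), the raw CPU-bound write capacity of the two boxes being compared would be roughly:

     8 cores x 3,000-4,000 writes/s/core  ~  24,000-32,000 writes/s
    32 cores x 3,000-4,000 writes/s/core  ~  96,000-128,000 writes/s

If a single cassandra-stress client on the old-spec hardware cannot generate load anywhere near the second figure, both machines will report similar latencies simply because neither is being pushed, which is one way the client itself becomes the bottleneck Aaron asks about.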
Re: Is there any open source software for automatized deploy C* in PRD?
> Thanks, But I suppose it’s just for Debian? Am I right?
There are debian and rpm packages, and people deploy them or the binary packages with chef and similar tools. It may be easier to answer your question if you describe the specific platform / needs. cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 21/11/2013, at 10:35 pm, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 wrote:
> Thanks, But I suppose it’s just for Debian? Am I right?
> Any others?
>
> Best Regards,
> Boole Guo
> Software Engineer, NESC-SH.MIS
> +86-021-51530666*41442
> Floor 19, KaiKai Plaza, 888, Wanhangdu Rd, Shanghai (200042)
>
> From: Mike Adamson [mailto:mikeat...@gmail.com]
> Sent: 21 November 2013 17:16
> To: user@cassandra.apache.org
> Subject: Re: Is there any open source software for automatized deploy C* in PRD?
>
> Hi Boole,
>
> Have you tried chef? There is this cookbook for deploying cassandra:
>
> http://community.opscode.com/cookbooks/cassandra
>
> MikeA
>
> On 21 November 2013 01:33, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 wrote:
> Hi all,
> Is there any open source software for automatized deploy C* in PRD?
>
> Best Regards,
> Boole Guo
> Software Engineer, NESC-SH.MIS
> +86-021-51530666*41442
> Floor 19, KaiKai Plaza, 888, Wanhangdu Rd, Shanghai (200042)
> ONCE YOU KNOW, YOU NEWEGG.
>
> CONFIDENTIALITY NOTICE: This email and any files transmitted with it may contain privileged or otherwise confidential information. It is intended only for the person or persons to whom it is addressed. If you received this message in error, you are not authorized to read, print, retain, copy, disclose, disseminate, distribute, or use this message any part thereof or any information contained therein. Please notify the sender immediately and delete all copies of this message. Thank you in advance for your cooperation.
Re: Migration Cassandra 2.0 to Cassandra 2.0.2
> Mr Coli What's the difference between deploy binaries and the binary package ?
> I upload the binary package on the Apache Cassandra Homepage, Am I wrong ?
Yes, you can use the instructions here for the binary package: http://wiki.apache.org/cassandra/DebianPackaging When you use the binary package it creates the directory locations, installs the init scripts and makes it a lot easier to start and stop cassandra. I recommend using them. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 21/11/2013, at 11:06 pm, Bonnet Jonathan. wrote:
> Thanks Mr Coli and Mr Wee for your answers,
>
> Mr Coli What's the difference between deploy binaries and the binary package ?
> I upload the binary package on the Apache Cassandra Homepage, Am I wrong ?
>
> Mr Wee I think you are on the right track, because the lib directories in my Cassandra_Home are different between the two versions. In the home for the old version, /produits/cassandra/install_cassandra/apache-cassandra-2.0.0/lib, I have:
>
> [cassandra@s00vl9925761 lib]$ ls -ltr
> total 14564
> -rw-r- 1 cassandra cassandra  123898 Aug 28 15:07 thrift-server-0.3.0.jar
> -rw-r- 1 cassandra cassandra   42854 Aug 28 15:07 thrift-python-internal-only-0.7.0.zip
> -rw-r- 1 cassandra cassandra   55066 Aug 28 15:07 snaptree-0.1.jar
> -rw-r- 1 cassandra cassandra 1251514 Aug 28 15:07 snappy-java-1.0.5.jar
> -rw-r- 1 cassandra cassandra  270552 Aug 28 15:07 snakeyaml-1.11.jar
> -rw-r- 1 cassandra cassandra    8819 Aug 28 15:07 slf4j-log4j12-1.7.2.jar
> -rw-r- 1 cassandra cassandra   26083 Aug 28 15:07 slf4j-api-1.7.2.jar
> -rw-r- 1 cassandra cassandra  134133 Aug 28 15:07 servlet-api-2.5-20081211.jar
> -rw-r- 1 cassandra cassandra 1128961 Aug 28 15:07 netty-3.5.9.Final.jar
> -rw-r- 1 cassandra cassandra   80800 Aug 28 15:07 metrics-core-2.0.3.jar
> -rw-r- 1 cassandra cassandra  134748 Aug 28 15:07 lz4-1.1.0.jar
> -rw-r- 1 cassandra cassandra  481534 Aug 28 15:07 log4j-1.2.16.jar
> -rw-r- 1 cassandra cassandra  347531 Aug 28 15:07 libthrift-0.9.0.jar
> -rw-r- 1 cassandra cassandra   16046 Aug 28 15:07 json-simple-1.1.jar
> -rw-r- 1 cassandra cassandra   91183 Aug 28 15:07 jline-1.0.jar
> -rw-r- 1 cassandra cassandra   17750 Aug 28 15:07 jbcrypt-0.3m.jar
> -rw-r- 1 cassandra cassandra    5792 Aug 28 15:07 jamm-0.2.5.jar
> -rw-r- 1 cassandra cassandra  765648 Aug 28 15:07 jackson-mapper-asl-1.9.2.jar
> -rw-r- 1 cassandra cassandra  228286 Aug 28 15:07 jackson-core-asl-1.9.2.jar
> -rw-r- 1 cassandra cassandra   96046 Aug 28 15:07 high-scale-lib-1.1.2.jar
> -rw-r- 1 cassandra cassandra 1891110 Aug 28 15:07 guava-13.0.1.jar
> -rw-r- 1 cassandra cassandra   66843 Aug 28 15:07 disruptor-3.0.1.jar
> -rw-r- 1 cassandra cassandra   91982 Aug 28 15:07 cql-internal-only-1.4.0.zip
> -rw-r- 1 cassandra cassandra   54345 Aug 28 15:07 concurrentlinkedhashmap-lru-1.3.jar
> -rw-r- 1 cassandra cassandra   25490 Aug 28 15:07 compress-lzf-0.8.4.jar
> -rw-r- 1 cassandra cassandra  284220 Aug 28 15:07 commons-lang-2.6.jar
> -rw-r- 1 cassandra cassandra   30085 Aug 28 15:07 commons-codec-1.2.jar
> -rw-r- 1 cassandra cassandra   36174 Aug 28 15:07 commons-cli-1.1.jar
> -rw-r- 1 cassandra cassandra 1695790 Aug 28 15:07 apache-cassandra-thrift-2.0.0.jar
> -rw-r- 1 cassandra cassandra   71117 Aug 28 15:07 apache-cassandra-clientutil-2.0.0.jar
> -rw-r- 1 cassandra cassandra 3265185 Aug 28 15:07 apache-cassandra-2.0.0.jar
> -rw-r- 1 cassandra cassandra 1928009 Aug 28 15:07 antlr-3.2.jar
> drwxr-x--- 2 cassandra cassandra    4096 Oct 1 14:16 licenses
>
> In my new home I have /produits/cassandra/install_cassandra/apache-cassandra-2.0.2/lib:
>
> [cassandra@s00vl9925761 lib]$ ls -ltr
> total 9956
> -rw-r- 1 cassandra cassandra  123920 Oct 24 09:21 thrift-server-0.3.2.jar
> -rw-r- 1 cassandra cassandra   52477 Oct 24 09:21 thrift-python-internal-only-0.9.1.zip
> -rw-r- 1 cassandra cassandra   55066 Oct 24 09:21 snaptree-0.1.jar
> -rw-r- 1 cassandra cassandra 1251514 Oct 24 09:21 snappy-java-1.0.5.jar
> -rw-r- 1 cassandra cassandra  270552 Oct 24 09:21 snakeyaml-1.11.jar
> -rw-r- 1 cassandra cassandra   26083 Oct 24 09:21 slf4j-api-1.7.2.jar
> -rw-r- 1 cassandra cassandra   22291 Oct 24 09:21 reporter-config-2.1.0.jar
> -rw-r- 1 cassandra cassandra 1206119 Oct 24 09:21 netty-3.6.6.Final.jar
> -rw-r- 1 cassandra cassandra   82123 Oct 24 09:21 metrics-core-2.2.0.jar
> -rw-r- 1 cassandra cassandra  165505 Oct 24 09:21 lz4-1.2.0.jar
> -rw-r- 1 cassandra cassandra  217054 Oct 24 09:21 libthrift-0.9.1.jar
> -rw-r- 1 cassandra cassandra   16046 Oct 24 09:21 json-simple-1.1.jar
> -rw-r- 1 cassandra cassandra   91183 Oct 24 09:21 jline-1.0.jar
> -rw-r- 1 cassandra cassandra   17750 Oct 24 09:21 jbcr
Re: nodetool repair seems to increase linearly with number of keyspaces
Mr. Bottaro,

About how many column families are in your keyspaces? We have 28 per keyspace.
Are you using Vnodes? We are, and they are set to 256.
What version of Cassandra are you running? We are running 1.2.9.

On Mon, Nov 25, 2013 at 11:36 AM, Christopher J. Bottaro <cjbott...@academicworks.com> wrote:
> We have the same setup: one keyspace per client, and currently about 300 keyspaces. nodetool repair takes a long time, 4 hours with -pr on a single node. We have a 4 node cluster with about 10 gb per node. Unfortunately, we haven't been keeping track of the running time as keyspaces, or load, increases.
>
> -- C
>
> On Wed, Nov 20, 2013 at 6:53 AM, John Pyeatt wrote:
>
>> We have an application that has been designed to use potentially 100s of keyspaces (one for each company).
>>
>> One thing we are noticing is that nodetool repair across all of the keyspaces seems to increase linearly based on the number of keyspaces. For example, if we have a 6 node ec2 (m1.large) cluster across 3 Availability Zones and create 20 keyspaces a nodetool repair -pr on one node takes 3 hours even with no data in any of the keyspaces. If I bump that up to 40 keyspaces it takes 6 hours.
>>
>> Is this the behaviour you would expect?
>>
>> Is there anything you can think of (short of redesigning the cluster to limit keyspaces) to increase the performance of the nodetool repairs?
>>
>> My obvious concern is that as this application grows and we get more companies using our it we will eventually have too many keyspaces to perform repairs on the cluster.
>>
>> --
>> John Pyeatt
>> Singlewire Software, LLC
>> www.singlewire.com
>> --
>> 608.661.1184
>> john.pye...@singlewire.com

--
John Pyeatt
Singlewire Software, LLC
www.singlewire.com
--
608.661.1184
john.pye...@singlewire.com
Inefficiency with large set of small documents?
I’m trying to decide what noSQL database to use, and I’ve certainly decided against mongodb due to its use of mmap. I’m wondering if Cassandra would also suffer from a similar inefficiency with small documents. In mongodb, if you have a large set of small documents (each much less than the 4KB page size) you will require far more RAM to fit your working set into memory, since a large percentage of a 4KB chunk could very easily include infrequently accessed data outside of your working set. Cassandra doesn’t use mmap, but it would still have to intelligently discard the excess data that does not pertain to a small document that exists in the same allocation unit on the hard disk when reading it into RAM. As an example lets say your cluster size is 4KB as well, and you have 1000 small 256 byte documents that are scattered on the disk that you want to fetch on a given query (the total number of documents is over 1 billion). I want to make sure it only consumes roughly 256,000 bytes for those 1000 documents and not 4,096,000 bytes. When it first fetches a cluster from disk it may consume 4KB of cache, but it should ultimately only ideally consume the relevant amount of bytes in RAM. If Cassandra just indiscriminately uses RAM in 4KB blocks than that is unacceptable to me, because if my working set at any given time is just 20% of my huge collection of small sized documents, I don’t want to have to use servers with 5X as much RAM. That’s a huge expense. Thanks, Ben P.S. Here’s a detailed post I made this morning in the mongodb user group about this topic. People have often complained that because mongodb memory maps everything and leaves memory management to the OS's virtual memory system, the swapping algorithm isn't optimized for database usage. I disagree with this. For the most part, the swapping or paging algorithm itself can't be much better than the sophisticated algorithms (such as LRU based ones) that OSes have refined over many years. Why reinvent the wheel? Yes, you could potentially ensure that certain data (such as the indexes) never get swapped out to disk, because even if they haven't been accessed recently the cost of reading them back into memory will be too costly when they are in fact needed. But that's not the bigger issue. It breaks down with small documents << than page size This is where using virtual memory for everything really becomes an issue. Suppose you've got a bunch of really tiny documents (e.g. ~256 bytes) that are much smaller than the virtual memory page size (e.g. 4KB). Now let's say that you've determined that your working set (e.g. those documents in your collection that constitute say 99% of those accessed in a given hour) to be 20GB. But your entire collection size is actually 100GB (it's just that 20% of your documents are much much likely to be accessed in a given time period. It's not uncommon that a small minority of documents will be accessed a large majority of the time). If your collection is randomly distributed (such as would happen if you simply inserted new documents into your collection) then in this example only about 20% of the documents that fit onto a 4KB page will be part of the working set (i.e. the data that you need frequent access to at the moment). The rest of the data will be made up of much less frequently accessed documents, that should ideally be sitting on disk. So there's a huge inefficiency here. 80% of the data that is in RAM is not even something I need to frequently access. 
In this example, I would need 5X the amount of RAM to accommodate my working set. Now, as a solution to this problem, you could separate your documents into two (or even a few) collections with the grouping done by access frequency. The problem with this, is that your working set can often change as a function of time of day and day of week. If your application is global, your working set will be far different during 12pm local in NY vs 12pm local in Tokyo. But more even more likely is that the working set is constantly changing as new data is inserted into the database. Popularity of a document is often viral. As an example, a photo that's posted on a social network may start off infrequently accessed but then quickly after hundreds of "likes" could become very frequently accessed and part of your working set. You'd need to actively monitor your documents and manually move a document from one collection to the other, which is very inefficient. Quite frankly this is not a burden that should be placed on the user anyways. By punting the problem of memory management to the OS, mongodb requires the user to essentially do its job and group data in a way that patches the inefficiencies in its memory management. As far as I'm concerned, not until mongodb steps up and takes control of memory management can it be taken seriously for very large datasets that often require many small documents with ever changing w
Re: Prepare is 100 to 200 times slower after migration to 1.2.11
I did some tests and apparently the prepared statement is not cached at all: in a loop (native protocol, DataStax Java driver, both 1.3 and 4.0) I prepared the same statement 20 times and the elapsed times were almost identical. I think it has something to do with CASSANDRA-6107, which was implemented in 1.2.11 and 2.0.2; I did the same test with 2.0.2 and got the same result. Is there a setting for the CQL3 prepared statement cache that I have missed in 1.2.11? Thanks

On Mon, Nov 25, 2013 at 1:02 PM, Shahryar Sedghi wrote:
>
> Hi
>
> I have migrated my DEV environment from 1.2.8 to 1.2.11 to finally move to 2.0.2, and prepare is 100 to 200 times slower; something that was sub-millisecond now is 150 ms. Other CQL operations are normal.
>
> I am not planning to move to 2.0.2 until I fix this. I do not see any warn or error in the log; the only thing I saw was a live ratio around 6.
>
> Please help.
>
> Thanks
>
> Shahryar
>
> --
>
> -- Life has no meaning a priori … It is up to you to give it a meaning, and value is nothing but the meaning that you choose ~ Jean-Paul Sartre
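Whatever the server-side cause turns out to be, the usual client-side pattern is to pay the prepare cost once per Session and reuse the PreparedStatement for every subsequent request, rather than preparing inside the loop as in the test above. A minimal sketch with the DataStax Java driver; the contact point, keyspace, and table names are invented.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class PrepareOnce {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("dev");

            // Prepared exactly once; only bind() and execute() happen per request,
            // so a slow prepare is paid a single time instead of on every call.
            PreparedStatement byId = session.prepare("SELECT value FROM items WHERE id = ?");

            for (int i = 0; i < 20; i++) {
                Row row = session.execute(byId.bind("key-" + i)).one();
                System.out.println(row == null ? "miss" : row.getString("value"));
            }
            cluster.close();
        }
    }

This does not explain why prepare itself got slower between 1.2.8 and 1.2.11, but it keeps the regression off the hot path while the underlying issue is sorted out.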
Re: Cannot TRUNCATE
Blowing away the database does indeed seem to fix the problem, but it doesn't exactly make me feel warm and cozy. I have no idea how the database got screwed up, so I don't know what to avoid doing so that I don't have this happen again on a production server. I never had any other nodes, so it has nothing to do with adding or removing nodes. I guess I just cross my fingers and hope it doesn't happen again. Thanks Robert

From: Aaron Morton Reply-To: Date: Monday, November 25, 2013 12:46 PM To: Cassandra User Subject: Re: Cannot TRUNCATE

If it's just a test system nuke it and try again :) Was there more than one node at any time ? Does nodetool status show only one node ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 21/11/2013, at 7:45 am, Robert Wille wrote:
> I've got a single node with all empty tables, and truncate fails with the following error: Unable to complete request: one or more nodes were unavailable.
>
> Everything else seems fine. I can insert, update, delete, etc.
>
> The only thing in the logs that looks relevant is this:
>
> INFO [HANDSHAKE-/192.168.98.121] 2013-11-20 11:36:59,064 OutboundTcpConnection.java (line 386) Handshaking version with /192.168.98.121
> INFO [HANDSHAKE-/192.168.98.121] 2013-11-20 11:37:04,064 OutboundTcpConnection.java (line 395) Cannot handshake version with /192.168.98.121
>
> I'm running Cassandra 2.0.2. I get the same error in cqlsh as I do with the java driver.
>
> Thanks
>
> Robert
RE: Config changes to leverage new hardware
Here are some calculated 'latency' results reported by cassandra-stress when asked to write 10M rows, i.e. cassandra-stress -d , -n 1000 (we actually had cassandra-stress running in daemon mode for the below tests)

avg_latency (percentile)                  90           99           99.9         99.99
Write: 8 cores, 32 GB, 3-disk RAID 0      0.002982182  0.003963931  0.004692996  0.004792326
Write: 32 cores, 128 GB, 7-disk RAID 0    0.003157515  0.003763181  0.005184429  0.005441946
Read: 8 cores, 32 GB, 3-disk RAID 0       0.002289879  0.057178021  0.173753058  0.24386912
Read: 32 cores, 128 GB, 7-disk RAID 0     0.002317525  0.010937648  0.013205977  0.014270511

The client was another node on the same network with the 8 core, 32 GB RAM specs. I wouldn't expect it to bottleneck, but I can monitor it while generating the load. In general, what would you expect it to bottleneck at?

>> Another interesting thing is that the linux disk cache doesn't seem to be growing in spite of a lot of free memory available.
> Things will only get paged in when they are accessed.

Hmm, interesting. I did a test where I just wrote large files to disk, eg. dd if=/dev/zero of=bigfile18 bs=1M count=1 and checked the disk cache, and it increased by exactly the same size of the file written (no reads were done in this case)

-Original Message-
From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Monday, November 25, 2013 11:55 AM
To: Cassandra User
Subject: Re: Config changes to leverage new hardware

> However, for both writes and reads there was virtually no difference in the latencies.
What sort of latency were you getting ?

> I'm still not very sure where the current *write* bottleneck is though.
What numbers are you getting ? Could the bottle neck be the client ? Can it send writes fast enough to saturate the nodes ? As a rule of thumb you should get 3,000 to 4,000 (non counter) writes per second per core.

> Sample iostat data (captured every 10s) for the dedicated disk where commit logs are written is below. Does this seem like a bottle neck?
Does not look too bad.

> Another interesting thing is that the linux disk cache doesn't seem to be growing in spite of a lot of free memory available.
Things will only get paged in when they are accessed.

Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 21/11/2013, at 12:42 pm, Arindam Barua <aba...@247-inc.com> wrote:
>
> Thanks for the suggestions Aaron.
>
> As a follow up, we ran a bunch of tests with different combinations of these changes on a 2-node ring. The load was generated using cassandra-stress, run with default values to write 30 million rows, and read them back.
> However, for both writes and reads there was virtually no difference in the latencies.
>
> The different combinations attempted:
> 1. Baseline test with none of the below changes.
> 2. Grabbing the TLAB setting from 1.2
> 3. Moving the commit logs too to the 7 disk RAID 0.
> 4. Increasing the concurrent_read to 32, and concurrent_write to 64
> 5. (3) + (4), i.e. moving commit logs to the RAID + increasing concurrent_read and concurrent_write config to 32 and 64.
>
> The write latencies were very similar, except them being ~3x worse for the 99.9th percentile and above for scenario (5) above.
> The read latencies were also similar, with (3) and (5) being a little worse for the 99.99th percentile.
>
> Overall, not making any changes, i.e. (1) performed as well or slightly better than any of the other changes.
>
> Running cassandra-stress on both the old and new hardware without making any config changes, the write performance was very similar, but the new hardware did show ~10x improvement in the read for the 99.9th percentile and higher.
> After thinking about this, the reason why we were not seeing any difference with our test framework was perhaps the nature of the test where we write the rows, and then do a bunch of reads to read the rows that were just written immediately following. The data is read back from the memtables, and never from the disk/sstables. Hence the new hardware's increased RAM and size of the disk cache or higher number of disks never helps.
>
> I'm still not very sure where the current *write* bottleneck is though. The new hardware has 32 cores vs 8 cores of the old hardware. Moving the commit log from a dedicated disk to a 7 RAID-0 disk system (where it would be shared by other data though) didn't make a difference too. (unless the extra contention on the RAID nullified the positive effects of the RAID).
>
> Sample iostat data (captured every 10s) for the dedicated disk where commit logs are written is below. Does this seem like a bottle neck? When the commit logs are written the await/svctm ratio is high.
>
> Device
Re: Cannot TRUNCATE
On Mon, Nov 25, 2013 at 3:35 PM, Robert Wille wrote: > Blowing away the database does indeed seem to fix the problem, but it > doesn't exactly make me feel warm and cozy. I have no idea how the database > got screwed up, so I don't know what to avoid doing so that I don't have > this happen again on a production server. I never had any other nodes, so > it has nothing to do with adding or removing nodes. I guess I just cross my > fingers and hope it doesn't happen again. > [... snip ...] > I'm running Cassandra 2.0.2. I get the same error in cqlsh as I do with > the java driver. > > https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ I know this is a slightly glib response, but hopefully whatever bug you triggered will be resolved by the time 2.0.x series is ready to be run in production.. :D =Rob
Re: nodetool repair seems to increase linearly with number of keyspaces
On Mon, Nov 25, 2013 at 12:28 PM, John Pyeatt wrote: > Are you using Vnodes? We are and they are set to 256 > What version of cassandra are you running. We are running 1.2.9 > Vnode performance vis a vis repair is this JIRA issue : https://issues.apache.org/jira/browse/CASSANDRA-5220 Unfortunately, in Cassandra 2.0 repair has also been changed to be serial per replica in the replica set by default, which is (unless I've misunderstood something...) likely to make it even slower in a direct relationship to RF. https://issues.apache.org/jira/browse/CASSANDRA-5950 This will probably necessitate re-visiting of this : https://issues.apache.org/jira/browse/CASSANDRA-5850 =Rob
Schema disagreement under normal conditions, ALTER TABLE hangs
Recently we had a strange thing happen. Altering schema (gc_grace_seconds) for a column family resulted in a schema disagreement. 3/4 of nodes got it, 1/4 didn't. There was no partition at the time, nor was there multiple schema updates issued. Going to the nodes with stale schema and trying to do the ALTER TABLE there resulted in hanging. We were eventually able to get schema agreement by restarting nodes, but both the initial disagreement under normal conditions and the hanging ALTER TABLE seem pretty weird. Any ideas here? Sound like a bug? We're on 1.2.8. Thanks, Josh -- Josh Dzielak • Keen IO • @dzello (https://twitter.com/dzello)
Re: Prepare is 100 to 200 times slower after migration to 1.2.11
It can be https://issues.apache.org/jira/browse/CASSANDRA-6369 fixed in 1.2.12/2.0.3 -M "Shahryar Sedghi" wrote in message news:cajuqix7_jvwbj7sx5p8hvmwy5od5ze7pbtv1y5ttga2aws6...@mail.gmail.com... I did some test and apparently the prepared statement is not cached at all, in a loop (native protocol, datastax java driver, both 1.3 and 4.0) I prepared the same statement 20 times and the elapsed time where almost identical. I think it has something to do with CASSANDRA-6107 that was implemented in 1.2.11 and 2.0.2, I did the same test with 2.0.2 and got the same result. is there a setting for CQL3 prepared cahe that I have missed in 1.2.11? Thanks On Mon, Nov 25, 2013 at 1:02 PM, Shahryar Sedghi wrote: Hi I have migrated my DEV environment from 1.2.8 to 1.2.11 to finally move to 2.0.2, and prepare is 100 to 200 times slower, something that was sub millisecond now is 150 ms. Other CQL operations are normal. I am nor planning to move to 2,0.2 until I fix this. I do not see any warn or error in the log, the only thing I saw was live ratio around 6. Please help. Thanks Shahryar -- -- Life has no meaning a priori … It is up to you to give it a meaning, and value is nothing but the meaning that you choose ~ Jean-Paul Sartre