Re: Replicate On Write behavior
On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne wrote: > I'm curious... digging through the source, it looks like replicate on write > triggers a read of the entire row, and not just the columns/supercolumns that > are affected by the counter update. Is this the case? It would certainly > explain why my inserts/sec decay over time and why the average insert latency > increases over time. The strange thing is that I'm not seeing disk read IO > increase over that same period, but that might be due to the OS buffer > cache... It does not. It only reads the columns/supercolumns affected by the counter update. In the source, this happens in CounterMutation.java. If you look at addReadCommandFromColumnFamily you'll see that it does a query by name only for the column involved in the update (the update is basically the content of the columnFamily parameter there). And Cassandra does *not* always reads a full row. Never had, never will. > On another note, on a 5-node cluster, I'm only seeing 3 nodes with > ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? > I'm using RandomPartitioner... > > Address DC Rack Status State Load Owns > Token > > 136112946768375385385349842972707284580 > 10.0.0.57 datacenter1 rack1 Up Normal 2.26 GB 20.00% 0 > 10.0.0.56 datacenter1 rack1 Up Normal 2.47 GB 20.00% > 34028236692093846346337460743176821145 > 10.0.0.55 datacenter1 rack1 Up Normal 2.52 GB 20.00% > 68056473384187692692674921486353642290 > 10.0.0.54 datacenter1 rack1 Up Normal 950.97 MB 20.00% > 102084710076281539039012382229530463435 > 10.0.0.72 datacenter1 rack1 Up Normal 383.25 MB 20.00% > 136112946768375385385349842972707284580 > > The nodes with ReplicateOnWrites are the 3 in the middle. The first node and > last node both have a count of 0. This is a clean cluster, and I've been > doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours. > The last time this test ran, it went all the way down to 500 inserts/sec > before I killed it. Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. -- Sylvain
Re: RF=1 w/ hadoop jobs
On Fri, Sep 2, 2011 at 08:54, Mick Semb Wever wrote: > Patrik: is it possible to describe the use-case you have here? Sure. We use Cassandra as storage for web pages: we store the HTML, all URLs that have the same HTML data, and some computed data. We run Hadoop MR jobs to compute lexical and thematic data for each page and to export the data to binary files for later use. A URL gets into Cassandra on user request (a pageview), so if we delete a URL, it comes back quickly if the page is active. Because of that, and because there is a lot of data, we have the keyspace set to RF=1. We can drop the whole keyspace and it will regenerate quickly and contain only fresh data, so we don't care about losing a node. But Hadoop does care; to be specific, the Cassandra ColumnFamilyInputFormat and ColumnFamilyRecordReader are the problem parts. If I stop one Cassandra node, all MR jobs that read/write Cassandra fail. In our case it doesn't matter, we can skip that range of URLs. The MR jobs run in a tight loop, so when the node is back with its data, we use it. It's not only about some HW crash; it also makes maintenance quite difficult. To stop a Cassandra node, you have to stop the tasktracker there too, which is unfortunate as there are other MR jobs that don't need Cassandra and could happily keep running. Regards, P.
Re: Removal of old data files
On Fri, Sep 2, 2011 at 12:11 AM, wrote: > Yes, I see files with name like > Orders-g-6517-Compacted > > However, all of those file have a size of 0. > > Starting from Monday to Thurseday we have 5642 files for -Data.db, > -Filter.db and Statistics.db and only 128 -Compacted files. > and all of -Compacted file has size of 0. > > Is this normal, or we are doing something wrong? You are not doing anything wrong. The -Compacted files are just markers to indicate that the corresponding -Data files (the ones with the same number) have, in fact, been compacted and will eventually be removed. So those files will always have a size of 0. -- Sylvain > > > yuki > > > From: aaron morton [mailto:aa...@thelastpickle.com] > Sent: Thursday, August 25, 2011 6:13 PM > To: user@cassandra.apache.org > Subject: Re: Removal of old data files > > If cassandra does not have enough disk space to create a new file it will > provoke a JVM GC which should result in compacted SStables that are no > longer needed been deleted. Otherwise they are deleted at some time in the > future. > Compacted SSTables have a file written out with a "compacted" extension. > Do you see compacted sstables in the data directory? > Cheers. > - > Aaron Morton > Freelance Cassandra Developer > @aaronmorton > http://www.thelastpickle.com > On 26/08/2011, at 2:29 AM, yuki watanabe wrote: > > We are using Cassandra 0.8.0 with 8 node ring and only one CF. > Every column has TTL of 86400 (24 hours). we also set 'GC grace second' to > 43200 > (12 hours). We have to store massive amount of data for one day now and > eventually for five days if we get more disk space. > Even for one day, we do run out disk space in a busy day. > > We run nodetool compact command at night or as necessary then we run GC from > jconsole. We observed that GC did remove files but not necessarily oldest > ones. > Data files from more than 36 hours ago and quite often three days ago are > still there. > > Does this behavior expected or we need adjust some other parameters? > > > Yuki Watanabe > > ___ > > > > This e-mail may contain information that is confidential, privileged or > otherwise protected from disclosure. If you are not an intended recipient of > this e-mail, do not duplicate or redistribute it by any means. Please delete > it and any attachments and notify the sender that you have received it in > error. Unless specifically indicated, this e-mail is not an offer to buy or > sell or a solicitation to buy or sell any securities, investment products or > other financial product or service, an official confirmation of any > transaction, or an official statement of Barclays. Any views or opinions > presented are solely those of the author and do not necessarily represent > those of Barclays. This e-mail is subject to terms available at the > following link: www.barcap.com/emaildisclaimer. By messaging with Barclays > you consent to the foregoing. Barclays Capital is the investment banking > division of Barclays Bank PLC, a company registered in England (number > 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. > This email may relate to or be sent from other members of the Barclays > Group. > > ___
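A quick way to see this on disk is to list a generation's files side by side (a sketch only; the data directory path and keyspace name below are illustrative placeholders, the file naming follows the 0.8 scheme shown above):

    ls -l /var/lib/cassandra/data/MyKeyspace/Orders-g-6517-*
    # Orders-g-6517-Compacted    0 bytes  - marker only
    # Orders-g-6517-Data.db      ...      - obsolete, removed later
    # Orders-g-6517-Filter.db    ...
    # Orders-g-6517-Statistics.db ...

The zero-byte -Compacted file simply flags that the -Data.db file with the same generation number is no longer needed; the space is reclaimed when that file is eventually deleted.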
Re: Replicate On Write behavior
That's interesting. I did an experiment wherein I added some entropy to the row name based on the time when the increment came in, (e.g. row = row + "/" + (timestamp - (timestamp % 300))) and now not only is the load (in GB) on my cluster more balanced, the performance has not decayed and has stayed steady (inserts/sec) with a relatively low average ms/insert. Each row is now significantly shorter as a result of this change. On Sep 2, 2011, at 12:30 AM, Sylvain Lebresne wrote: > On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne wrote: >> I'm curious... digging through the source, it looks like replicate on write >> triggers a read of the entire row, and not just the columns/supercolumns >> that are affected by the counter update. Is this the case? It would >> certainly explain why my inserts/sec decay over time and why the average >> insert latency increases over time. The strange thing is that I'm not >> seeing disk read IO increase over that same period, but that might be due to >> the OS buffer cache... > > It does not. It only reads the columns/supercolumns affected by the > counter update. > In the source, this happens in CounterMutation.java. If you look at > addReadCommandFromColumnFamily you'll see that it does a query by name > only for the column involved in the update (the update is basically > the content of the columnFamily parameter there). > > And Cassandra does *not* always reads a full row. Never had, never will. > >> On another note, on a 5-node cluster, I'm only seeing 3 nodes with >> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that >> normal? I'm using RandomPartitioner... >> >> Address DC RackStatus State LoadOwns >> Token >> >> 136112946768375385385349842972707284580 >> 10.0.0.57datacenter1 rack1 Up Normal 2.26 GB 20.00% 0 >> 10.0.0.56datacenter1 rack1 Up Normal 2.47 GB 20.00% >> 34028236692093846346337460743176821145 >> 10.0.0.55datacenter1 rack1 Up Normal 2.52 GB 20.00% >> 68056473384187692692674921486353642290 >> 10.0.0.54datacenter1 rack1 Up Normal 950.97 MB 20.00% >> 102084710076281539039012382229530463435 >> 10.0.0.72datacenter1 rack1 Up Normal 383.25 MB 20.00% >> 136112946768375385385349842972707284580 >> >> The nodes with ReplicateOnWrites are the 3 in the middle. The first node >> and last node both have a count of 0. This is a clean cluster, and I've >> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 >> hours. The last time this test ran, it went all the way down to 500 >> inserts/sec before I killed it. > > Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. > > -- > Sylvain
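The change described above amounts to bucketing counter rows by a time window so that no single row grows without bound. A minimal sketch of that key scheme in plain Java (the 300-second bucket and the "/" separator are simply the values mentioned above):

    // Append a 5-minute time bucket to the row key so each row stays small.
    long timestampSec = System.currentTimeMillis() / 1000;
    long bucket = timestampSec - (timestampSec % 300);   // start of the 5-minute window
    String bucketedRow = row + "/" + bucket;
    // All increments within the same 5-minute window hit the same, bounded row.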
SSTableSimpleUnsortedWriter take long time when inserting big rows
Hi All, I started using SSTableSimpleUnsortedWriter to load data; my data has a few rows but a lot of column names in each row. I call SSTableSimpleUnsortedWriter.newRow every 10'000 columns inserted. But the time taken to insert columns increases as the column family grows. The problem appears because every time we call newRow, all the columns of the previous CF are added to the new CF. Attached is a small patch that checks which is the smallest CF and adds the smallest CF to the biggest one. Should I open a bug for that? Thanks in advance, Benoit

Index: src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java
===================================================================
--- src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java  (revision 1164377)
+++ src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java  (working copy)
@@ -73,9 +73,17 @@
         // Note that if the row was existing already, our size estimation will be slightly off
         // since we'll be counting the key multiple times.
-        if (previous != null)
-            columnFamily.addAll(previous);
-
+        if (previous != null) {
+            // Add the smallest CF to the other one
+            if (columnFamily.getSortedColumns().size() < previous.getSortedColumns().size()) {
+                previous.addAll(columnFamily);
+                // Re-add the previous CF to the map because it has been overwritten
+                keys.put(key, previous);
+            } else {
+                columnFamily.addAll(previous);
+            }
+        }
+
         if (currentSize > bufferSize)
             sync();
     }
Re: SSTableSimpleUnsortedWriter take long time when inserting big rows
On Fri, Sep 2, 2011 at 10:29 AM, Benoit Perroud wrote: > Hi All, > > I started using SSTableSimpleUnsortedWriter to load data, and my data > has a few rows but a lot of column name in each rows. > > I call SSTableSimpleUnsortedWriter.newRow every 10'000 columns inserted. > > But the time taken to insert columns is increasing as the column > family is increasing. The problem appears because everytime we call > newRow, all the columns of the previous CF is added to the new CF. If I understand correctly, each row has way more that 10 000 columns, but you call newRow every 10 000 columns, right ? Note that you have the possibility to decrease the frequency of the calls to newRow. But anyway, I agree that the code shouldn't suck like that. > Attached is a small patch that check which is the smallest CF, and add > the smallest CF to the biggest one. > > Should I open I bug for that ? Please do. I'm actually thinking of a slightly different fix: we should not have to add all the previous columns to the new column family, we should just directly reuse the previous column family when adding the new column. But the JIRA ticket will be a better place to discuss this. -- Sylvain
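For reference, a rough sketch of the calling pattern being discussed, assuming the 0.8-era bulk loading API (constructor arguments and helper classes may differ slightly in your version; needs org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter, org.apache.cassandra.db.marshal.BytesType and org.apache.cassandra.utils.ByteBufferUtil):

    // Many columns in one row, with newRow() called in batches of 10'000.
    SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
            new File("/tmp/sstables"), "MyKeyspace", "MyCF",
            BytesType.instance, null, 64);                // flush buffer of 64 MB
    ByteBuffer key = ByteBufferUtil.bytes("row1");
    for (int i = 0; i < totalColumns; i++) {
        if (i % 10000 == 0)
            writer.newRow(key);                           // re-opens the same row every 10'000 columns
        writer.addColumn(ByteBufferUtil.bytes("col" + i),
                         ByteBufferUtil.bytes(""), System.currentTimeMillis());
    }
    writer.close();

With the pre-patch code, every newRow() on an already-seen key copies all previously buffered columns into the new ColumnFamily object, which is why insert time grows as the row grows.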
Re: SSTableSimpleUnsortedWriter take long time when inserting big rows
Thanks for your answer. 2011/9/2 Sylvain Lebresne : > On Fri, Sep 2, 2011 at 10:29 AM, Benoit Perroud wrote: >> Hi All, >> >> I started using SSTableSimpleUnsortedWriter to load data, and my data >> has a few rows but a lot of column name in each rows. >> >> I call SSTableSimpleUnsortedWriter.newRow every 10'000 columns inserted. >> >> But the time taken to insert columns is increasing as the column >> family is increasing. The problem appears because everytime we call >> newRow, all the columns of the previous CF is added to the new CF. > > If I understand correctly, each row has way more that 10 000 columns, but > you call newRow every 10 000 columns, right ? Yes. I call newRow every 10 000 columns to be sure to flush as soon as possible. > Note that you have the possibility to decrease the frequency of the calls to > newRow. > > But anyway, I agree that the code shouldn't suck like that. > >> Attached is a small patch that check which is the smallest CF, and add >> the smallest CF to the biggest one. >> >> Should I open I bug for that ? > > Please do. I'm actually thinking of a slightly different fix: we should not > have > to add all the previous columns to the new column family, we should just > directly reuse the previous column family when adding the new column. > But the JIRA ticket will be a better place to discuss this. Opened : https://issues.apache.org/jira/browse/CASSANDRA-3122 Let's discuss there. Thanks ! Benoit. > -- > Sylvain >
Cassandra, CQL, Thrift Deprecation?? and Erlang
Hi, I'm a fan of erlang, and have been using successive cassandra versions via the erlang thrift interface for a couple of years now. I see that cassandra seems to be moving to using CQL instead, so I was wondering if that means the thrift api will be deprecated and, if so, whether there is any effort underway by anyone to create whatever would be necessary to use cassandra via CQL from erlang? JT
Re: cassandra-cli describe / dump command
Thats brilliant, thanks. On Thu, Sep 1, 2011 at 7:07 PM, Jonathan Ellis wrote: > yes, cli "show schema" in 0.8.4+ > > On Thu, Sep 1, 2011 at 12:52 PM, J T wrote: > > Hi, > > > > I'm probably being blind .. but I can't see any way to dump the schema > > definition (and the data in it for that matter) of a cluster in order to > > capture the current schema in a script file for subsequent replaying in > to a > > different environment. > > > > For example, say I have a DEV env and wanted to create a script > containing > > the cli commands to create that schema in a UAT env. > > > > In my case, I have a cassandra schema I've been tweaking / upgrading over > > the last 2 years and I can't see any easy way to capture the schema > > definition. > > > > Is such a thing on the cards for cassandra-cli ? > > > > JT > > > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of DataStax, the source for professional Cassandra support > http://www.datastax.com >
Re: Cassandra, CQL, Thrift Deprecation?? and Erlang
The Thrift API is not going anywhere any time soon. I'm not aware of anyone working on an erlang CQL client. On Fri, Sep 2, 2011 at 7:39 AM, J T wrote: > Hi, > > I'm a fan of erlang, and have been using successive cassandra versions via > the erlang thrift interface for a couple of years now. > > I see that cassandra seems to be moving to using CQL instead and so I was > wondering if that means the thrift api will be deprecated and if so is there > any effort underway to by anyone to create (whatever would be neccessary) to > use cassandra via cql from erlang ? > > JT > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
RE: Removal of old data files
I see. Thank you for helpful information Yuki -Original Message- From: Sylvain Lebresne [mailto:sylv...@datastax.com] Sent: Friday, September 02, 2011 3:40 AM To: user@cassandra.apache.org Subject: Re: Removal of old data files On Fri, Sep 2, 2011 at 12:11 AM, wrote: > Yes, I see files with name like > Orders-g-6517-Compacted > > However, all of those file have a size of 0. > > Starting from Monday to Thurseday we have 5642 files for -Data.db, > -Filter.db and Statistics.db and only 128 -Compacted files. > and all of -Compacted file has size of 0. > > Is this normal, or we are doing something wrong? You are not doing something wrong. The -Compacted files are just marker, to indicate that the -Data file corresponding (with the same number) are, in fact, compacted and will eventually be removed. So those files will always have a size of 0. -- Sylvain > > > yuki > > > From: aaron morton [mailto:aa...@thelastpickle.com] > Sent: Thursday, August 25, 2011 6:13 PM > To: user@cassandra.apache.org > Subject: Re: Removal of old data files > > If cassandra does not have enough disk space to create a new file it > will provoke a JVM GC which should result in compacted SStables that > are no longer needed been deleted. Otherwise they are deleted at some > time in the future. > Compacted SSTables have a file written out with a "compacted" extension. > Do you see compacted sstables in the data directory? > Cheers. > - > Aaron Morton > Freelance Cassandra Developer > @aaronmorton > http://www.thelastpickle.com > On 26/08/2011, at 2:29 AM, yuki watanabe wrote: > > We are using Cassandra 0.8.0 with 8 node ring and only one CF. > Every column has TTL of 86400 (24 hours). we also set 'GC grace > second' to 43200 > (12 hours). We have to store massive amount of data for one day now > and eventually for five days if we get more disk space. > Even for one day, we do run out disk space in a busy day. > > We run nodetool compact command at night or as necessary then we run > GC from jconsole. We observed that GC did remove files but not > necessarily oldest ones. > Data files from more than 36 hours ago and quite often three days ago > are still there. > > Does this behavior expected or we need adjust some other parameters? > > > Yuki Watanabe > > ___ > > > > This e-mail may contain information that is confidential, privileged > or otherwise protected from disclosure. If you are not an intended > recipient of this e-mail, do not duplicate or redistribute it by any > means. Please delete it and any attachments and notify the sender that > you have received it in error. Unless specifically indicated, this > e-mail is not an offer to buy or sell or a solicitation to buy or sell > any securities, investment products or other financial product or > service, an official confirmation of any transaction, or an official > statement of Barclays. Any views or opinions presented are solely > those of the author and do not necessarily represent those of > Barclays. This e-mail is subject to terms available at the following > link: www.barcap.com/emaildisclaimer. By messaging with Barclays you > consent to the foregoing. Barclays Capital is the investment banking > division of Barclays Bank PLC, a company registered in England (number > 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. > This email may relate to or be sent from other members of the Barclays > Group. > > ___
looking for information on composite columns
Hi, I am looking for information/tutorials on the use of composite columns, including how to use it, what kind of indexing it can offer, and its advantage over super columns. I googled but came up with very little information. There is a blog article from high performance cassandra on the compositeType comparator, but the use case is a composite column name rather than a column value. Does anyone know of some good resources on this and is willing to share with me? Thanks. -- Y.
Re: Cassandra, CQL, Thrift Deprecation?? and Erlang
Ok, thats good to know. If push came to shove I could probably write such a client myself after doing the necessary research but I'd prefer to save myself the hassle. Thanks. On Fri, Sep 2, 2011 at 1:59 PM, Jonathan Ellis wrote: > The Thrift API is not going anywhere any time soon. > > I'm not aware of anyone working on an erlang CQL client. > > On Fri, Sep 2, 2011 at 7:39 AM, J T wrote: > > Hi, > > > > I'm a fan of erlang, and have been using successive cassandra versions > via > > the erlang thrift interface for a couple of years now. > > > > I see that cassandra seems to be moving to using CQL instead and so I was > > wondering if that means the thrift api will be deprecated and if so is > there > > any effort underway to by anyone to create (whatever would be neccessary) > to > > use cassandra via cql from erlang ? > > > > JT > > > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of DataStax, the source for professional Cassandra support > http://www.datastax.com >
removing all column metadata via CLI
I can't find a way to remove all column definitions without CF export/import.

[default@int4] update column family sipdb with column_metadata = [];
Syntax error at position 51: required (...)+ loop did not match anything at input ']'

[default@int4] update column family sipdb with column_metadata = [{}];
Command not found: `update column family sipdb with column_metadata = [{}];`. Type 'help;' or '?' for help.
[default@int4]
Re: 15 seconds to increment 17k keys?
On Thu, Sep 1, 2011 at 5:16 PM, Ian Danforth wrote: > Does this scale with multiples of the replication factor or directly > with number of nodes? Or more succinctly, to double the writes per > second into the cluster how many more nodes would I need? The write throughput scales with number of nodes, so double to get double the write capacity. Increasing the replication factor in general doesn't improve performance (and increasing without increasing number of nodes decreases performance). This is because operations are performed on all available replicas (with the exception of reads with low consistency levels and read_repair_chance < 1.0). Note also that there is just one read per counter increment, not a read per replica. -- Richard Low Acunu | http://www.acunu.com | @acunu
Re: removing all column metadata via CLI
Is this 0.8.4? 2011/9/2 Radim Kolar : > I cant find way how to remove all columns definitions without CF > import/export. > > [default@int4] update column family sipdb with column_metadata = []; > Syntax error at position 51: required (...)+ loop did not match anything at > input ']' > > [default@int4] update column family sipdb with column_metadata = [{}]; > Command not found: `update column family sipdb with column_metadata = > [{}];`. Type 'help;' or '?' for help. > [default@int4] > > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: looking for information on composite columns
On Fri, Sep 2, 2011 at 9:15 AM, Yiming Sun wrote: > Hi, > > I am looking for information/tutorials on the use of composite columns, > including how to use it, what kind of indexing it can offer, and its > advantage over super columns. I googled but came up with very little > information. There is a blog article from high performance cassandra on the > compositeType comparator, but the use case is a composite column name rather > than a column value. Does anyone know of some good resources on this and is > willing to share with me? Thanks. > > -- Y. > I am going to do some more composite recipes on my blog; I noticed from my search referrers that it is a very hot topic. www.anuff.com/2011/02/indexing-in-cassandra.html www.anuff.com/2010/07/secondary-indexes-in-cassandra.html www.datastax.com/2011/06/ed-anuff-to-speak-at-cassandra-sf-2011 Composite columns do not do indexing by themselves, but the way they allow multiple components to live in one column name while still sorting properly is how they relate to indexing. Edward
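To make the "multiple components in one column name" point concrete, here is a small sketch of the common index recipe using the Hector Composite type that appears elsewhere in this digest (class and serializer names are assumed from Hector 0.8.x, and the CF/row names are placeholders):

    // An "index" row whose column names are composites of (indexed value, target key).
    // Columns sort first by the indexed value, then by the key, so a column slice
    // over a value range returns the matching keys in order.
    Composite name = new Composite();
    name.addComponent("san_francisco", StringSerializer.get());   // indexed value
    name.addComponent("user:12345", StringSerializer.get());      // key of the indexed row
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    mutator.insert("users_by_city",                                // index row key
                   "Indexes",                                      // CF whose comparator is a CompositeType
                   HFactory.createColumn(name, new byte[0],
                                         new CompositeSerializer(),
                                         BytesArraySerializer.get()));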
Re: looking for information on composite columns
Thanks Edward. What's the link to your blog? On Fri, Sep 2, 2011 at 10:43 AM, Edward Capriolo wrote: > > On Fri, Sep 2, 2011 at 9:15 AM, Yiming Sun wrote: > >> Hi, >> >> I am looking for information/tutorials on the use of composite columns, >> including how to use it, what kind of indexing it can offer, and its >> advantage over super columns. I googled but came up with very little >> information. There is a blog article from high performance cassandra on the >> compositeType comparator, but the use case is a composite column name rather >> than a column value. Does anyone know of some good resources on this and is >> willing to share with me? Thanks. >> >> -- Y. >> > > I am going to do some more composite recipes in my blog, I noticed from my > search refers that it is a very hot topic. > > www.anuff.com/2011/02/indexing-in-cassandra.html > www.anuff.com/2010/07/secondary-indexes-in-cassandra.html > www.datastax.com/2011/06/ed-anuff-to-speak-at-cassandra-sf-2011 > > Composite columns do not do indexing in themselves but the way they allow > multiple components to live in one column but still sort properly is how > they relate to indexing. > > Edward >
Re: removing all column metadata via CLI
> Is this 0.8.4? yes
Cassandra prod environment
Hey, Currently I'm running Cassandra on Ubuntu 10.04 x86_64 in EC2. I'm wondering if anyone has observed better performance / stability on other distros (CentOS / RHEL / ...) or OSes (e.g. Solaris Intel/SPARC)? Is anyone running prod on VMs, not cloud, but ESXi or Solaris zones? Is there love or hate :)? Any storage best practices in VM environments? I like xfs! Any observations on xfs / ext4 / zfs from a Cassandra usage perspective? Cheers, Sorin
Re: Cassandra prod environment
On 09/02/2011 11:30 AM, Sorin Julean wrote: Hey, Currently I'm running Cassandra on Ubuntu 10.4 x86_64 in EC2. I'm wondering if anyone observed a better performance / stability on other distros ( CentOS / RHEL / ...) or OS (eg. Solaris intel/SPARC) ? Is anyone running prod on VMs, not cloud, but ESXi or Solaris zones ? Is there love or hate :) ? Any storage best-practices on VM environments ? I like xfs ! Any observations on xfs / ext4 / zfs, from Cassandra usage perspective ? Cheers, Sorin We are running 6 nodes in production on KVM virtual machines with CentOS 6 as the host and guest OS, with OpenJDK (I know the Sun JRE is recommended) and Cassandra 0.7.8. We have no problems with stability or performance. We run no RAID, with ext3 on LVM for file systems. -Eric
Re: removing all column metadata via CLI
Then you'll want to create an issue: https://issues.apache.org/jira/browse/CASSANDRA On Fri, Sep 2, 2011 at 10:08 AM, Radim Kolar wrote: >> Is this 0.8.4? > yes > > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Cassandra prod environment
We moved off of ubuntu because of kernel issues in the AMIs we found in 10.04 and 10.10 in ec2. So we're now on debian squeeze with ext4. It's been great for us. One thing that bit us is we'd been using property file snitch and the availability zones as racks and had an equal number of nodes in each availability zone. However we hadn't realized that you need to rotate between racks (AZs) with each token - so for US-East, in token order, we needed to go something like AZ A, B, C, A, B, C for six nodes. Otherwise you will get hotspots because of how replication happens. For some best practices in ec2, check out http://www.slideshare.net/mattdennis/cassandra-on-ec2 On Sep 2, 2011, at 10:30 AM, Sorin Julean wrote: > Hey, > > Currently I'm running Cassandra on Ubuntu 10.4 x86_64 in EC2. > > I'm wondering if anyone observed a better performance / stability on other > distros ( CentOS / RHEL / ...) or OS (eg. Solaris intel/SPARC) ? > Is anyone running prod on VMs, not cloud, but ESXi or Solaris zones ? Is > there love or hate :) ? Any storage best-practices on VM environments ? > I like xfs ! Any observations on xfs / ext4 / zfs, from Cassandra usage > perspective ? > > Cheers, > Sorin
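For the token side of that, a small sketch (assuming RandomPartitioner, whose tokens range over 0..2**127): compute evenly spaced tokens and then assign consecutive tokens to alternating availability zones so replicas spread across "racks". The zone names below are only examples.

    // Evenly spaced RandomPartitioner tokens for a 6-node cluster, rotated across AZs.
    BigInteger range = BigInteger.valueOf(2).pow(127);
    int nodes = 6;
    String[] zones = {"us-east-1a", "us-east-1b", "us-east-1c"};
    for (int i = 0; i < nodes; i++) {
        BigInteger token = range.multiply(BigInteger.valueOf(i))
                                .divide(BigInteger.valueOf(nodes));
        // Adjacent token ranges land in different AZs: A, B, C, A, B, C.
        System.out.println(token + " -> " + zones[i % zones.length]);
    }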
Re: Trying to understand QUORUM and Strategies
So. You have created the keyspace with SimpleStrategy. If you want to use *LOCAL_QUORUM*, you should create the keyspace (or change the existing one) to use NetworkTopologyStrategy. I have provided CLI examples on how to do it. If you are creating the keyspace from Hector, you have to do the same via the Java API. Evgeny.
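For example, in cassandra-cli (0.8 syntax; the keyspace name and replica counts are placeholders, and the DC names must match what your snitch reports):

    create keyspace MyKeyspace
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = [{DC1:3, DC2:3}];

or, to change an existing keyspace:

    update keyspace MyKeyspace
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = [{DC1:3, DC2:3}];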
Re: Replicate On Write behavior
That ticket explains a lot, looking forward to a resolution on it. (Sorry I don't have a patch to offer) Ian On Fri, Sep 2, 2011 at 12:30 AM, Sylvain Lebresne wrote: > On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne wrote: >> I'm curious... digging through the source, it looks like replicate on write >> triggers a read of the entire row, and not just the columns/supercolumns >> that are affected by the counter update. Is this the case? It would >> certainly explain why my inserts/sec decay over time and why the average >> insert latency increases over time. The strange thing is that I'm not >> seeing disk read IO increase over that same period, but that might be due to >> the OS buffer cache... > > It does not. It only reads the columns/supercolumns affected by the > counter update. > In the source, this happens in CounterMutation.java. If you look at > addReadCommandFromColumnFamily you'll see that it does a query by name > only for the column involved in the update (the update is basically > the content of the columnFamily parameter there). > > And Cassandra does *not* always reads a full row. Never had, never will. > >> On another note, on a 5-node cluster, I'm only seeing 3 nodes with >> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that >> normal? I'm using RandomPartitioner... >> >> Address DC Rack Status State Load Owns >> Token >> >> 136112946768375385385349842972707284580 >> 10.0.0.57 datacenter1 rack1 Up Normal 2.26 GB 20.00% 0 >> 10.0.0.56 datacenter1 rack1 Up Normal 2.47 GB 20.00% >> 34028236692093846346337460743176821145 >> 10.0.0.55 datacenter1 rack1 Up Normal 2.52 GB 20.00% >> 68056473384187692692674921486353642290 >> 10.0.0.54 datacenter1 rack1 Up Normal 950.97 MB 20.00% >> 102084710076281539039012382229530463435 >> 10.0.0.72 datacenter1 rack1 Up Normal 383.25 MB 20.00% >> 136112946768375385385349842972707284580 >> >> The nodes with ReplicateOnWrites are the 3 in the middle. The first node >> and last node both have a count of 0. This is a clean cluster, and I've >> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 >> hours. The last time this test ran, it went all the way down to 500 >> inserts/sec before I killed it. > > Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. > > -- > Sylvain >
Re: Trying to understand QUORUM and Strategies
Okay, great I just wanted to confirm that LOCAL_QUORUM will not work with SimpleStrategy. There was somewhat of a debate amongst my devs that said it should work. Anthon On Fri, Sep 2, 2011 at 9:55 AM, Evgeniy Ryabitskiy < evgeniy.ryabits...@wikimart.ru> wrote: > So. > You have created keyspace with SimpleStrategy. > If you want to use *LOCAL_QUORUM, *you should create keyspace (or change > existing) with NetworkTopologyStrategy. > > I have provided CLI examples on how to do it. If you are creating keyspace > from Hector, you have to do same via Java API. > > Evgeny. > > >
JMX TotalReadLatencyMicros sanity check
I've graphed the rate of change of the TotalReadLatencyMicros counter over the last 12 hours, and divided by 1,000,000 to get it in seconds. I'm grabbing it every 10 seconds, so I divided by another 10 to get per-second rates. The result is that I have a CF doing 10 seconds of read *every second*. Does that make sense? If I divide it by the number of reads done, it matches up with the latency I'm seeing from cfstats: 1.5ms/read.
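As a sanity check of that arithmetic, using only the figures above: delta(TotalReadLatencyMicros) per 10-second sample / 1,000,000 / 10 ~= 10 seconds of read latency accumulated per wall-clock second. If the counter is a running sum of per-read latencies, as the name suggests, that is plausible rather than nonsensical: it simply implies roughly 10 s / 0.0015 s ~= 6,700 reads per second being served concurrently at about 1.5 ms each, which is consistent with the cfstats figure quoted.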
HUnavailableException: : May not be enough replicas present to handle consistency level.
I believe I don't quite understand the semantics of this exception: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level. Does it mean there *might be* enough? Does it mean there *is not* enough? My case is as follows - I have 3 nodes with keyspaces configured like this:

Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Durable Writes: true
Options: [DC2:2, DC1:2]

Hector can only connect to nodes in DC1 and is configured to neither see nor connect to nodes in DC2. This is so that replication between datacenters DC1 and DC2 is done by Cassandra itself, asynchronously. Each of the 6 total nodes can see all of the remaining 5. Inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. However, this morning one node went down and I started seeing the HUnavailableException: : May not be enough replicas present to handle consistency level. I believed that if I have 3 nodes and one goes down, the two remaining nodes are sufficient for my configuration. Please help me to understand what's going on.
Streaming stuck on one node during Repair
Hello, I have one node of a cluster that is stuck in a streaming-out state, sending to the node that is being repaired. If I look at the AE thread in jconsole I see this trace:

Name: AE-SERVICE-STAGE:1
State: WAITING on java.util.concurrent.FutureTask$Sync@7e3e0044
Total blocked: 0  Total waited: 23

Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
java.util.concurrent.FutureTask.get(FutureTask.java:83)
org.apache.cassandra.service.AntiEntropyService$Differencer.performStreamingRepair(AntiEntropyService.java:515)
org.apache.cassandra.service.AntiEntropyService$Differencer.run(AntiEntropyService.java:475)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

The Stream stage shows this trace:

Name: STREAM-STAGE:1
State: WAITING on org.apache.cassandra.utils.SimpleCondition@1158f928
Total blocked: 9  Total waited: 16

Stack trace:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
org.apache.cassandra.streaming.StreamOutManager.waitForStreamCompletion(StreamOutManager.java:164)
org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:138)
org.apache.cassandra.service.AntiEntropyService$Differencer$1.runMayThrow(AntiEntropyService.java:511)
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

Is there a way to unstick these threads? Or am I stuck restarting the node and then rerunning the entire repair? All the other nodes seemed to complete properly and one is still running. I am thinking of waiting until the current one finishes, then restarting the stuck node, and once it's up, running repair again on the node that needs it. Thoughts? (0.6.6 on a 7-node cluster) -- Jake Maizel Head of Network Operations Soundcloud Mail & GTalk: j...@soundcloud.com Skype: jakecloud Rosenthaler strasse 13, 101 19, Berlin, DE
Re: Limiting ColumnSlice range in second composite value
Instead of empty strings, try Character.[MAX|MIN-]_VALUE. On Thu, Sep 1, 2011 at 8:27 PM, Anthony Ikeda wrote: > My Column name is of Composite(TimeUUIDType, UTF8Type) and I can query > across the TimeUUIDs correctly, but now I want to also range across the UTF8 > component. Is this possible? > > UUID start = uuidForDate(new Date(1979, 1, 1)); > > UUID end = uuidForDate(new Date(Long.MAX_VALUE)); > > String startState = ""; > > String endState = ""; > > if (desiredState != null) { > > mLog.debug("Restricting state to [" + desiredState.getValue() + "]"); > > startState = desiredState.getValue(); > > endState = desiredState.getValue().concat("_"); > > } > > > > Composite startComp = new Composite(start, startState); > > Composite endComp = new Composite(end, endState); > > query.setRange(startComp, endComp, true, count); > > So far I'm not seeing any effect setting my "endState" String value. > > Anthony
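Applied to the snippet quoted above, that suggestion would look roughly like this (a sketch only, reusing the same Hector Composite API as the original code; exact end-of-range behaviour still depends on how the composite comparator treats the second component):

    // Bracket the UTF-8 component with the smallest/largest possible characters
    // instead of empty strings, so the second component actually bounds the slice.
    String startState = String.valueOf(Character.MIN_VALUE);
    String endState   = String.valueOf(Character.MAX_VALUE);
    if (desiredState != null) {
        startState = desiredState.getValue();
        endState   = desiredState.getValue() + Character.MAX_VALUE; // everything with this prefix
    }
    Composite startComp = new Composite(start, startState);
    Composite endComp   = new Composite(end, endState);
    query.setRange(startComp, endComp, true, count);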
Re: Trying to understand QUORUM and Strategies
Note that this is an implementation detail, not something that inherently can't work with other strategies. LOCAL_QUORUM and EACH_QUORUM are logically equivalent to QUORUM when there is a single datacenter. We tried briefly to add support for non-NTS strategies in https://issues.apache.org/jira/browse/CASSANDRA-2516, but reverted it in https://issues.apache.org/jira/browse/CASSANDRA-2627. On Fri, Sep 2, 2011 at 12:53 PM, Anthony Ikeda wrote: > Okay, great I just wanted to confirm that LOCAL_QUORUM will not work with > SimpleStrategy. There was somewhat of a debate amongst my devs that said it > should work. > Anthon > > On Fri, Sep 2, 2011 at 9:55 AM, Evgeniy Ryabitskiy > wrote: >> >> So. >> You have created keyspace with SimpleStrategy. >> If you want to use LOCAL_QUORUM, you should create keyspace (or change >> existing) with NetworkTopologyStrategy. >> >> I have provided CLI examples on how to do it. If you are creating keyspace >> from Hector, you have to do same via Java API. >> >> Evgeny. >> >> > > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Replicate On Write behavior
Does it always pick the node with the lowest IP address? All of my hosts are in the same /24. The fourth node in the 5 node cluster has the lowest value in the 4th octet (54). I erased the cluster and rebuilt it from scratch as a 3 node cluster using the first 3 nodes, and now the ReplicateOnWrites are all going to the third node, which is also the lowest valued IP address (55). That would explain why only 1 node gets writes in a 3 node cluster (RF=3) and why 3 nodes get writes in a 5 node cluster, and why one of those 3 is taking 66% of the writes. > >> On another note, on a 5-node cluster, I'm only seeing 3 nodes with >> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that >> normal? I'm using RandomPartitioner... >> >> Address DC RackStatus State LoadOwns >> Token >> >> 136112946768375385385349842972707284580 >> 10.0.0.57datacenter1 rack1 Up Normal 2.26 GB 20.00% 0 >> 10.0.0.56datacenter1 rack1 Up Normal 2.47 GB 20.00% >> 34028236692093846346337460743176821145 >> 10.0.0.55datacenter1 rack1 Up Normal 2.52 GB 20.00% >> 68056473384187692692674921486353642290 >> 10.0.0.54datacenter1 rack1 Up Normal 950.97 MB 20.00% >> 102084710076281539039012382229530463435 >> 10.0.0.72datacenter1 rack1 Up Normal 383.25 MB 20.00% >> 136112946768375385385349842972707284580 >> >> The nodes with ReplicateOnWrites are the 3 in the middle. The first node >> and last node both have a count of 0. This is a clean cluster, and I've >> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 >> hours. The last time this test ran, it went all the way down to 500 >> inserts/sec before I killed it. > > Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. > > -- > Sylvain
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
It looks like you only have 2 replicas configured in each data center? If so, LOCAL_QUORUM cannot be achieved with a host down same as with QUORUM on RF=2 in a single DC cluster. On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: > I believe I don't quite understand semantics of this exception: > > me.prettyprint.hector.api.exceptions.HUnavailableException: : May not > be enough replicas present to handle consistency level. > > Does it mean there *might be* enough? > Does it mean there *is not* enough? > > My case is as following - I have 3 nodes with keyspaces configured as > following: > > Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy > Durable Writes: true > Options: [DC2:2, DC1:2] > > Hector can only connect to nodes in DC1 and configured to neither see > nor connect to nodes in DC2. This is for replication by Cassandra > means, asynchronously between datacenters DC1 and DC2. Each of 6 total > nodes can see any of the remaining 5. > > and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. > However, this morning one node went down and I started seeing the > HUnavailableException: : May not be enough replicas present to handle > consistency level. > > I believed if I have 3 nodes and one goes down, two remaining nodes > are sufficient for my configuration. > > Please help me to understand what's going on. >
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
Well, this is the part I don't understand then. I thought that if I configure 2 replicas with 3 nodes and one of 3 nodes goes down, I'll still have 2 nodes to store 3 replicas. Is my logic flawed somehere? On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: > It looks like you only have 2 replicas configured in each data center? > > If so, LOCAL_QUORUM cannot be achieved with a host down same as with > QUORUM on RF=2 in a single DC cluster. > > On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: >> I believe I don't quite understand semantics of this exception: >> >> me.prettyprint.hector.api.exceptions.HUnavailableException: : May not >> be enough replicas present to handle consistency level. >> >> Does it mean there *might be* enough? >> Does it mean there *is not* enough? >> >> My case is as following - I have 3 nodes with keyspaces configured as >> following: >> >> Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy >> Durable Writes: true >> Options: [DC2:2, DC1:2] >> >> Hector can only connect to nodes in DC1 and configured to neither see >> nor connect to nodes in DC2. This is for replication by Cassandra >> means, asynchronously between datacenters DC1 and DC2. Each of 6 total >> nodes can see any of the remaining 5. >> >> and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. >> However, this morning one node went down and I started seeing the >> HUnavailableException: : May not be enough replicas present to handle >> consistency level. >> >> I believed if I have 3 nodes and one goes down, two remaining nodes >> are sufficient for my configuration. >> >> Please help me to understand what's going on. >> >
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
from http://www.datastax.com/docs/0.8/consistency/index: "a quorum is a majority of the replicas: replication_factor / 2 + 1 with any resulting fractions rounded down." I have RF=2, so the majority of replicas is 2/2+1 = 2, which I still have after the 3rd node goes down? On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: > It looks like you only have 2 replicas configured in each data center? > > If so, LOCAL_QUORUM cannot be achieved with a host down same as with > QUORUM on RF=2 in a single DC cluster. > > On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: >> I believe I don't quite understand semantics of this exception: >> >> me.prettyprint.hector.api.exceptions.HUnavailableException: : May not >> be enough replicas present to handle consistency level. >> >> Does it mean there *might be* enough? >> Does it mean there *is not* enough? >> >> My case is as following - I have 3 nodes with keyspaces configured as >> following: >> >> Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy >> Durable Writes: true >> Options: [DC2:2, DC1:2] >> >> Hector can only connect to nodes in DC1 and configured to neither see >> nor connect to nodes in DC2. This is for replication by Cassandra >> means, asynchronously between datacenters DC1 and DC2. Each of 6 total >> nodes can see any of the remaining 5. >> >> and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. >> However, this morning one node went down and I started seeing the >> HUnavailableException: : May not be enough replicas present to handle >> consistency level. >> >> I believed if I have 3 nodes and one goes down, two remaining nodes >> are sufficient for my configuration. >> >> Please help me to understand what's going on. >> >
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
In your options, you have configured 2 replicas for each data center: Options: [DC2:2, DC1:2] If one of those replicas is down, then LOCAL_QUORUM will fail as there is only one replica left 'locally.' On Fri, Sep 2, 2011 at 3:35 PM, Oleg Tsvinev wrote: > from http://www.datastax.com/docs/0.8/consistency/index: > > 2 + 1 with any resulting fractions rounded down.> > > I have RF=2, so majority of replicas is 2/2+1=2 which I have after 3rd > node goes down? > > On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: >> It looks like you only have 2 replicas configured in each data center? >> >> If so, LOCAL_QUORUM cannot be achieved with a host down same as with >> QUORUM on RF=2 in a single DC cluster. >> >> On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: >>> I believe I don't quite understand semantics of this exception: >>> >>> me.prettyprint.hector.api.exceptions.HUnavailableException: : May not >>> be enough replicas present to handle consistency level. >>> >>> Does it mean there *might be* enough? >>> Does it mean there *is not* enough? >>> >>> My case is as following - I have 3 nodes with keyspaces configured as >>> following: >>> >>> Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy >>> Durable Writes: true >>> Options: [DC2:2, DC1:2] >>> >>> Hector can only connect to nodes in DC1 and configured to neither see >>> nor connect to nodes in DC2. This is for replication by Cassandra >>> means, asynchronously between datacenters DC1 and DC2. Each of 6 total >>> nodes can see any of the remaining 5. >>> >>> and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. >>> However, this morning one node went down and I started seeing the >>> HUnavailableException: : May not be enough replicas present to handle >>> consistency level. >>> >>> I believed if I have 3 nodes and one goes down, two remaining nodes >>> are sufficient for my configuration. >>> >>> Please help me to understand what's going on. >>> >> >
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
Do you mean I need to configure 3 replicas in each DC and keep using LOCAL_QUORUM? In which case, if I'm following your logic, even one of the 3 goes down I'll still have 2 to ensure LOCAL_QUORUM succeeds? On Fri, Sep 2, 2011 at 1:44 PM, Nate McCall wrote: > In your options, you have configured 2 replicas for each data center: > Options: [DC2:2, DC1:2] > > If one of those replicas is down, then LOCAL_QUORUM will fail as there > is only one replica left 'locally.' > > > On Fri, Sep 2, 2011 at 3:35 PM, Oleg Tsvinev wrote: >> from http://www.datastax.com/docs/0.8/consistency/index: >> >> > 2 + 1 with any resulting fractions rounded down.> >> >> I have RF=2, so majority of replicas is 2/2+1=2 which I have after 3rd >> node goes down? >> >> On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: >>> It looks like you only have 2 replicas configured in each data center? >>> >>> If so, LOCAL_QUORUM cannot be achieved with a host down same as with >>> QUORUM on RF=2 in a single DC cluster. >>> >>> On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: I believe I don't quite understand semantics of this exception: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level. Does it mean there *might be* enough? Does it mean there *is not* enough? My case is as following - I have 3 nodes with keyspaces configured as following: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [DC2:2, DC1:2] Hector can only connect to nodes in DC1 and configured to neither see nor connect to nodes in DC2. This is for replication by Cassandra means, asynchronously between datacenters DC1 and DC2. Each of 6 total nodes can see any of the remaining 5. and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. However, this morning one node went down and I started seeing the HUnavailableException: : May not be enough replicas present to handle consistency level. I believed if I have 3 nodes and one goes down, two remaining nodes are sufficient for my configuration. Please help me to understand what's going on. >>> >> >
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
Yes - you would need at least 3 replicas per data center to use LOCAL_QUORUM and survive a node failure. On Fri, Sep 2, 2011 at 3:51 PM, Oleg Tsvinev wrote: > Do you mean I need to configure 3 replicas in each DC and keep using > LOCAL_QUORUM? In which case, if I'm following your logic, even one of > the 3 goes down I'll still have 2 to ensure LOCAL_QUORUM succeeds? > > On Fri, Sep 2, 2011 at 1:44 PM, Nate McCall wrote: >> In your options, you have configured 2 replicas for each data center: >> Options: [DC2:2, DC1:2] >> >> If one of those replicas is down, then LOCAL_QUORUM will fail as there >> is only one replica left 'locally.' >> >> >> On Fri, Sep 2, 2011 at 3:35 PM, Oleg Tsvinev wrote: >>> from http://www.datastax.com/docs/0.8/consistency/index: >>> >>> >> 2 + 1 with any resulting fractions rounded down.> >>> >>> I have RF=2, so majority of replicas is 2/2+1=2 which I have after 3rd >>> node goes down? >>> >>> On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: It looks like you only have 2 replicas configured in each data center? If so, LOCAL_QUORUM cannot be achieved with a host down same as with QUORUM on RF=2 in a single DC cluster. On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: > I believe I don't quite understand semantics of this exception: > > me.prettyprint.hector.api.exceptions.HUnavailableException: : May not > be enough replicas present to handle consistency level. > > Does it mean there *might be* enough? > Does it mean there *is not* enough? > > My case is as following - I have 3 nodes with keyspaces configured as > following: > > Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy > Durable Writes: true > Options: [DC2:2, DC1:2] > > Hector can only connect to nodes in DC1 and configured to neither see > nor connect to nodes in DC2. This is for replication by Cassandra > means, asynchronously between datacenters DC1 and DC2. Each of 6 total > nodes can see any of the remaining 5. > > and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. > However, this morning one node went down and I started seeing the > HUnavailableException: : May not be enough replicas present to handle > consistency level. > > I believed if I have 3 nodes and one goes down, two remaining nodes > are sufficient for my configuration. > > Please help me to understand what's going on. > >>> >> >
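Concretely, the arithmetic behind that answer (per datacenter, for LOCAL_QUORUM):

    quorum(RF) = floor(RF / 2) + 1
    RF = 2  ->  quorum = 2   (no replica of a key may be down)
    RF = 3  ->  quorum = 2   (one replica may be down and writes/reads still succeed)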
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
I think that Oleg may have misunderstood how replicas are selected. If you have 3 nodes in your cluster and an RF of 2, Cassandra first selects which two nodes, out of the 3, will get the data, and then, and only then, does it write it out. The selection is based on the row key, the tokens of the nodes, and your choice of partitioner. This means that Cassandra does not need to store which node is responsible for a given row; that information can be recalculated whenever it is needed. The error that you are getting is because, even though you may have 2 nodes up, those may not be the nodes that Cassandra will use to store that data. - Original Message - From: "Nate McCall" To: hector-us...@googlegroups.com Cc: "Cassandra Users" Sent: Friday, September 2, 2011 4:44:01 PM Subject: Re: HUnavailableException: : May not be enough replicas present to handle consistency level. In your options, you have configured 2 replicas for each data center: Options: [DC2:2, DC1:2] If one of those replicas is down, then LOCAL_QUORUM will fail as there is only one replica left 'locally.' On Fri, Sep 2, 2011 at 3:35 PM, Oleg Tsvinev wrote: > from http://www.datastax.com/docs/0.8/consistency/index: > > 2 + 1 with any resulting fractions rounded down.> > > I have RF=2, so majority of replicas is 2/2+1=2 which I have after 3rd > node goes down? > > On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: >> It looks like you only have 2 replicas configured in each data center? >> >> If so, LOCAL_QUORUM cannot be achieved with a host down same as with >> QUORUM on RF=2 in a single DC cluster. >> >> On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: >>> I believe I don't quite understand semantics of this exception: >>> >>> me.prettyprint.hector.api.exceptions.HUnavailableException: : May not >>> be enough replicas present to handle consistency level. >>> >>> Does it mean there *might be* enough? >>> Does it mean there *is not* enough? >>> >>> My case is as following - I have 3 nodes with keyspaces configured as >>> following: >>> >>> Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy >>> Durable Writes: true >>> Options: [DC2:2, DC1:2] >>> >>> Hector can only connect to nodes in DC1 and configured to neither see >>> nor connect to nodes in DC2. This is for replication by Cassandra >>> means, asynchronously between datacenters DC1 and DC2. Each of 6 total >>> nodes can see any of the remaining 5. >>> >>> and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. >>> However, this morning one node went down and I started seeing the >>> HUnavailableException: : May not be enough replicas present to handle >>> consistency level. >>> >>> I believed if I have 3 nodes and one goes down, two remaining nodes >>> are sufficient for my configuration. >>> >>> Please help me to understand what's going on. >>> >> >
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
And now, when I have one node down with no chance of bringing it back anytime soon, can I still change RF to 3 and get restore functionality of my cluster? Should I run 'nodetool repair' or simple keyspace update will suffice? On Fri, Sep 2, 2011 at 1:55 PM, Nate McCall wrote: > Yes - you would need at least 3 replicas per data center to use > LOCAL_QUORUM and survive a node failure. > > On Fri, Sep 2, 2011 at 3:51 PM, Oleg Tsvinev wrote: >> Do you mean I need to configure 3 replicas in each DC and keep using >> LOCAL_QUORUM? In which case, if I'm following your logic, even one of >> the 3 goes down I'll still have 2 to ensure LOCAL_QUORUM succeeds? >> >> On Fri, Sep 2, 2011 at 1:44 PM, Nate McCall wrote: >>> In your options, you have configured 2 replicas for each data center: >>> Options: [DC2:2, DC1:2] >>> >>> If one of those replicas is down, then LOCAL_QUORUM will fail as there >>> is only one replica left 'locally.' >>> >>> >>> On Fri, Sep 2, 2011 at 3:35 PM, Oleg Tsvinev wrote: from http://www.datastax.com/docs/0.8/consistency/index: >>> 2 + 1 with any resulting fractions rounded down.> I have RF=2, so majority of replicas is 2/2+1=2 which I have after 3rd node goes down? On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: > It looks like you only have 2 replicas configured in each data center? > > If so, LOCAL_QUORUM cannot be achieved with a host down same as with > QUORUM on RF=2 in a single DC cluster. > > On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev > wrote: >> I believe I don't quite understand semantics of this exception: >> >> me.prettyprint.hector.api.exceptions.HUnavailableException: : May not >> be enough replicas present to handle consistency level. >> >> Does it mean there *might be* enough? >> Does it mean there *is not* enough? >> >> My case is as following - I have 3 nodes with keyspaces configured as >> following: >> >> Replication Strategy: >> org.apache.cassandra.locator.NetworkTopologyStrategy >> Durable Writes: true >> Options: [DC2:2, DC1:2] >> >> Hector can only connect to nodes in DC1 and configured to neither see >> nor connect to nodes in DC2. This is for replication by Cassandra >> means, asynchronously between datacenters DC1 and DC2. Each of 6 total >> nodes can see any of the remaining 5. >> >> and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. >> However, this morning one node went down and I started seeing the >> HUnavailableException: : May not be enough replicas present to handle >> consistency level. >> >> I believed if I have 3 nodes and one goes down, two remaining nodes >> are sufficient for my configuration. >> >> Please help me to understand what's going on. >> > >>> >> >
Re: HUnavailableException: : May not be enough replicas present to handle consistency level.
Yes, I think I get it now. "quorum of replicas" != "quorum of nodes" and I don't think quorum of nodes is ever defined. Thank you, Konstantin. Now, I believe I need to change my cluster to store data in two remaining nodes in DC1, keeping 3 nodes in DC2. I believe nodetool removetoken is what I need to use. Anything else I can/should do? On Fri, Sep 2, 2011 at 1:56 PM, Konstantin Naryshkin wrote: > I think that Oleg may have misunderstood how replicas are selected. If you > have 3 nodes in your cluster and a RF of 2, Cassandra first selects what two > nodes, out of the 3 will get data, then, and only then does it write it out. > The selection is based on the row key, the token of the node, and you choice > of partitioner. This means that Cassandra does not need to store what node is > responsible for a given row. That information can be recalculated whenever it > is needed. > > The error that you are getting is because you may have 2 nodes up, those are > not the nodes that Cassandra will use to store data. > > - Original Message - > From: "Nate McCall" > To: hector-us...@googlegroups.com > Cc: "Cassandra Users" > Sent: Friday, September 2, 2011 4:44:01 PM > Subject: Re: HUnavailableException: : May not be enough replicas present to > handle consistency level. > > In your options, you have configured 2 replicas for each data center: > Options: [DC2:2, DC1:2] > > If one of those replicas is down, then LOCAL_QUORUM will fail as there > is only one replica left 'locally.' > > > On Fri, Sep 2, 2011 at 3:35 PM, Oleg Tsvinev wrote: >> from http://www.datastax.com/docs/0.8/consistency/index: >> >> > 2 + 1 with any resulting fractions rounded down.> >> >> I have RF=2, so majority of replicas is 2/2+1=2 which I have after 3rd >> node goes down? >> >> On Fri, Sep 2, 2011 at 1:22 PM, Nate McCall wrote: >>> It looks like you only have 2 replicas configured in each data center? >>> >>> If so, LOCAL_QUORUM cannot be achieved with a host down same as with >>> QUORUM on RF=2 in a single DC cluster. >>> >>> On Fri, Sep 2, 2011 at 1:40 PM, Oleg Tsvinev wrote: I believe I don't quite understand semantics of this exception: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level. Does it mean there *might be* enough? Does it mean there *is not* enough? My case is as following - I have 3 nodes with keyspaces configured as following: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [DC2:2, DC1:2] Hector can only connect to nodes in DC1 and configured to neither see nor connect to nodes in DC2. This is for replication by Cassandra means, asynchronously between datacenters DC1 and DC2. Each of 6 total nodes can see any of the remaining 5. and inserts with LOCAL_QUORUM CL work fine when all 3 nodes are up. However, this morning one node went down and I started seeing the HUnavailableException: : May not be enough replicas present to handle consistency level. I believed if I have 3 nodes and one goes down, two remaining nodes are sufficient for my configuration. Please help me to understand what's going on. >>> >> >
Import JSON sstable data
Hi, I am trying to upload sstable data to a Cassandra 0.8.4 cluster with the json2sstable tool. Each time, I have to restart the node after importing the new file and run a repair on the column family, otherwise the new data does not show up. Any thoughts? Thanks, Zhong Li
Re: Limiting ColumnSlice range in second composite value
This is what I'm trying to do: Sample of the data: RowKey: localhost => (column=e3f3c900-d5b0-11e0-aa6b-005056c8:ACTIVE, value=<?xml version="1.0" encoding="UTF-8" standalone="yes"?>, timestamp=1315001665761000) => (column=e4515250-d5b0-11e0-aa6b-005056c8:INACTIVE, value=<?xml version="1.0" encoding="UTF-8" standalone="yes"?>, timestamp=1315001654271000) => (column=e45549f0-d5b0-11e0-aa6b-005056c8:INACTIVE, value=<?xml version="1.0" encoding="UTF-8" standalone="yes"?>, timestamp=1315001654327000) => (column=e45cc400-d5b0-11e0-aa6b-005056c8:INACTIVE, value=<?xml version="1.0" encoding="UTF-8" standalone="yes"?>, timestamp=1315001654355000) => (column=e462de80-d5b0-11e0-aa6b-005056c8:INACTIVE, value=<?xml version="1.0" encoding="UTF-8" standalone="yes"?>, timestamp=1315001654394000) I'll be activating and deactivating the inactive profiles in chronological order. - So I want to first retrieve the current "ACTIVE" record (easy because it's cached) - Put it to use and, when ready, recreate the column - same timeUUID but "EXHAUSTED" status (delete then add) - Next I have to fetch the first "INACTIVE" column after this, delete that and re-create the record with an "ACTIVE" composite (same timeUUID, again add then delete) and repeat the process. The second part of my composite is an ENUM of String literals: Status.ACTIVE, Status.INACTIVE, Status.EXHAUSTED I want to get the current column with name (startTimeUUID, "ACTIVE"), which should be only one column (provided the code works). All earlier columns are (timeUUID, "EXHAUSTED"), all later columns should be (timeUUID, "INACTIVE"). I'm thinking that to find the column that is "ACTIVE" I would set the range: startComp = new Composite(timeUUID, "ACTIVE"); endComp = new Composite(timeUUID, "ACTIVE_"); query.setRange(startComp, endComp, false, 2); //Fetch 2 just in case To get all "INACTIVE" columns I'd use startComp = new Composite(timeUUID, "INACTIVE"); endComp = new Composite(timeUUID, "INACTIVE_"); query.setRange(startComp, endComp, false, 10); The thing is, I'm getting back all columns regardless of what I set for the second half of the composite. Is what I'm trying to do possible? Anthony On Fri, Sep 2, 2011 at 12:29 PM, Nate McCall wrote: > Instead of empty strings, try Character.[MAX|MIN-]_VALUE. > > On Thu, Sep 1, 2011 at 8:27 PM, Anthony Ikeda > wrote: > > My Column name is of Composite(TimeUUIDType, UTF8Type) and I can query > > across the TimeUUIDs correctly, but now I want to also range across the > UTF8 > > component. Is this possible? > > > > UUID start = uuidForDate(new Date(1979, 1, 1)); > > > > UUID end = uuidForDate(new Date(Long.MAX_VALUE)); > > > > String startState = ""; > > > > String endState = ""; > > > > if (desiredState != null) { > > > > mLog.debug("Restricting state to [" + desiredState.getValue() + "]"); > > > > startState = desiredState.getValue(); > > > > endState = desiredState.getValue().concat("_"); > > > > } > > > > > > > > Composite startComp = new Composite(start, startState); > > > > Composite endComp = new Composite(end, endState); > > > > query.setRange(startComp, endComp, true, count); > > > > So far I'm not seeing any effect setting my "endState" String value. > > > > Anthony >
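A likely explanation for why the status bound appears to be ignored: with the comparator declared as Composite(TimeUUIDType, UTF8Type), columns are ordered by the TimeUUID component first, and the string only breaks ties between identical TimeUUIDs. A slice from (start, "INACTIVE") to (end, "INACTIVE_") therefore still contains every column whose TimeUUID falls strictly between start and end, whatever its status. A toy illustration of the ordering rule in plain Java (nothing Hector-specific; the long simply stands in for the time component of the UUID):

    // Composite comparison, sketched: the first component decides the order and
    // the second component is consulted only when the first components are equal.
    static int compareComposite(long time1, String status1, long time2, String status2) {
        if (time1 < time2) return -1;
        if (time1 > time2) return 1;
        return status1.compareTo(status2); // only reached on a tie in the first component
    }

Since every column in the sample has a distinct TimeUUID, the second component never gets a chance to exclude anything inside the time range, which matches the "all columns come back" behaviour described above.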
commodity server spec
Hi, Is there any recommendation about commodity server hardware specs if a 100TB database size is expected and it is a heavily write-oriented application? Should I go with a high-powered CPU (12 cores), 48TB of HDD and 640GB of RAM, for a total of 3 servers of this spec? Or are many smaller commodity servers recommended? Thanks. China
Re: Limiting ColumnSlice range in second composite value
Okay, I reversed the composite and seem to have come up with a solution. Although the rows are sorted by the status, the statuses are sorted temporally which helps. I tell you this type of modeling really breaks the rules :) Anthony On Fri, Sep 2, 2011 at 3:54 PM, Anthony Ikeda wrote: > This is what I'm trying to do: > > Sample of the data: > RowKey: localhost > => (column=e3f3c900-d5b0-11e0-aa6b-005056c8:ACTIVE, value= version="1.0" encoding="UTF-8" standalone="yes"?>, > timestamp=1315001665761000) > => (column=e4515250-d5b0-11e0-aa6b-005056c8:INACTIVE, value= version="1.0" encoding="UTF-8" standalone="yes"?>, > timestamp=1315001654271000) > => (column=e45549f0-d5b0-11e0-aa6b-005056c8:INACTIVE, value= version="1.0" encoding="UTF-8" standalone="yes"?>, > timestamp=1315001654327000) > => (column=e45cc400-d5b0-11e0-aa6b-005056c8:INACTIVE, value= version="1.0" encoding="UTF-8" standalone="yes"?>, > timestamp=1315001654355000) > => (column=e462de80-d5b0-11e0-aa6b-005056c8:INACTIVE, value= version="1.0" encoding="UTF-8" standalone="yes"?>, > timestamp=1315001654394000) > > > I'll be activating and deactivating the inactive profiles in a > chronological order. > > >- So I want to first retrieve current "ACTIVE" record (easy cause it's >cached) >- Put it to use and when ready, recreate the column - same timeUUID but >"EXHAUSTED" status (delete then add) >- Next I have to fetch the first "INACTIVE" column after this, delete >that and re-create the record with an "ACTIVE" composite (same timeuuid, >again add then delete) and repeat the process. > > > The second part of my composite is an ENUM of String literals: > Status.ACTIVE, Status.INACTIVE, Status.EXHAUSTED > > I want to get the current row key of value (startTimeUUID, "ACTIVE") which > should only be one column (provided the code works) > > All earlier columns are (timeUUID, "EXHAUSTED"), all later columns should > be (timeUUID, "INACTIVE") > > I'm thinking to find the column that is "ACTIVE" I would set the range: > > startComp = new Composite(timeUUID, "ACTIVE"); > endComp = new Composite(timeUUID, ""ACTIVE_"); > > query.setRange(startComp, endComp, false, 2); //Fetch 2 just in case > > To get all "INACTIVE" columns I'd use > startComp = new Composite(timeUUID, "INACTIVE"); > endComp = new Composite(timeUUID, ""INACTIVE_"); > > query.setRange(startComp, endComp, false, 10); > > Thing is I'm getting back all columns regardless of what I set for the > second half of the composite. Is what I'm trying to do possible? > > Anthony > > > On Fri, Sep 2, 2011 at 12:29 PM, Nate McCall wrote: > >> Instead of empty strings, try Character.[MAX|MIN-]_VALUE. >> >> On Thu, Sep 1, 2011 at 8:27 PM, Anthony Ikeda >> wrote: >> > My Column name is of Composite(TimeUUIDType, UTF8Type) and I can query >> > across the TimeUUIDs correctly, but now I want to also range across the >> UTF8 >> > component. Is this possible? 
>> > >> > UUID start = uuidForDate(new Date(1979, 1, 1)); >> > >> > UUID end = uuidForDate(new Date(Long.MAX_VALUE)); >> > >> > String startState = ""; >> > >> > String endState = ""; >> > >> > if (desiredState != null) { >> > >> > mLog.debug("Restricting state to [" + desiredState.getValue() + >> "]"); >> > >> > startState = desiredState.getValue(); >> > >> > endState = desiredState.getValue().concat("_"); >> > >> > } >> > >> > >> > >> > Composite startComp = new Composite(start, startState); >> > >> > Composite endComp = new Composite(end, endState); >> > >> > query.setRange(startComp, endComp, true, count); >> > >> > So far I'm not seeing any effect setting my "endState" String value. >> > >> > Anthony >> > >
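For anyone hitting the same problem later, here is a minimal sketch of what the reversed layout looks like, assuming the comparator has been changed to Composite(UTF8Type, TimeUUIDType) and reusing the uuidForDate helper and Hector range pattern from the earlier posts in this thread:

    // Status is now the first component, so all columns sharing a status are
    // contiguous and the slice on the first component does the filtering;
    // within a status the columns remain in TimeUUID (chronological) order.
    UUID start = uuidForDate(new Date(1979, 1, 1));
    UUID end = uuidForDate(new Date(Long.MAX_VALUE));

    Composite startComp = new Composite("INACTIVE", start);
    Composite endComp = new Composite("INACTIVE", end);
    query.setRange(startComp, endComp, false, 10); // only INACTIVE columns, oldest first

The trade-off is the one implied above: fetching by status becomes a straight slice, but a purely chronological scan across all statuses now needs one slice per status value.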
Re: commodity server spec
Many smaller servers are way better.