Agreed with Jack. I don't think there's ever a reason to use CL=ALL in an application in production. I would only use it if I was debugging certain types of consistency problems.
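For anyone wanting to reproduce that kind of debugging session, a minimal cqlsh sketch, assuming the racktest.racktable table from the test further down the thread; CONSISTENCY is a cqlsh session setting rather than CQL, and a read at ALL fails as soon as any replica is unavailable:

    # Debugging only: require every replica to answer, which surfaces stale or
    # missing copies but gives up availability if a single replica is down.
    cqlsh 127.0.0.1 -e "CONSISTENCY ALL; SELECT * FROM racktest.racktable WHERE id = 1;"

    # A more typical application-level setting; unlike ALL it does not require
    # every replica (it still needs RF >= 3 to survive a node loss at QUORUM).
    cqlsh 127.0.0.1 -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM racktest.racktable WHERE id = 1;"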
On Wed, Mar 23, 2016 at 4:56 PM Jack Krupansky <jack.krupan...@gmail.com> wrote:

CL=ALL also means that you won't have HA (High Availability) - if even a single node goes down, you're out of business. I mean, HA is the fundamental reason for using the rack-aware policy - to assure that each replica is on a separate power supply and network connection so that data can be retrieved even when a rack-level failure occurs.

In short, if CL=ALL is acceptable, then you might as well dump the rack-aware approach, which is how you got into this situation in the first place.

-- Jack Krupansky

On Wed, Mar 23, 2016 at 7:31 PM, Anubhav Kale <anubhav.k...@microsoft.com> wrote:

I ran into the following detail from https://wiki.apache.org/cassandra/ReadRepair:

"If a lower ConsistencyLevel than ALL was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data."

I set consistency to ALL, and now I can get data all the time.

From: Anubhav Kale [mailto:anubhav.k...@microsoft.com]
Sent: Wednesday, March 23, 2016 4:14 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks. Read repair is what I thought must be causing this, so I experimented some more with setting read_repair_chance and dc_local_read_repair_chance on the table to 0, and then 1.

Unfortunately, the results were somewhat random depending on which node I ran the queries from. For example, when chance = 1, running the query from 127.0.0.3 would sometimes return 0 results and sometimes 1. I do see a digest mismatch kicking off read repair in the traces in both cases, so I'm running out of ideas here. If you / someone can shed light on why this could be happening, that would be great!

That said, is it expected that "read repair" or a regular "nodetool repair" will shift the data around based on the new replica placement? And if so, is the recommendation to "rebootstrap" mainly to avoid this humongous data movement?

The rationale behind the ignore_rack flag makes sense, thanks. Maybe we should document it better?

Thanks!
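For reference, a minimal sketch of the table-level knobs being toggled above, assuming the racktest.racktable table from the test below; note that in the CQL schema (pre-4.0 Cassandra, where these options still exist) the second option is spelled dclocal_read_repair_chance:

    # Turn probabilistic read repair off entirely on the test table...
    cqlsh 127.0.0.1 -e "ALTER TABLE racktest.racktable
      WITH read_repair_chance = 0 AND dclocal_read_repair_chance = 0;"

    # ...or force a read-repair digest check on every read while debugging.
    cqlsh 127.0.0.1 -e "ALTER TABLE racktest.racktable
      WITH read_repair_chance = 1 AND dclocal_read_repair_chance = 1;"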
From: Paulo Motta [mailto:pauloricard...@gmail.com]
Sent: Wednesday, March 23, 2016 3:40 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

> How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn't contain it? Does "nodetool ring" show all token ranges for a node or just the primary range? I am thinking it's only the primary. Can someone confirm?

The primary replica of id=1 is always 127.0.0.3. What changes when you change racks is that the secondary replica will move to the next replica from a different rack, either 127.0.0.1 or 127.0.0.2.

> How come queries contact 127.0.0.1?

In the last case, 127.0.0.1 is the next node after the primary replica from a different rack (R2), so it should be contacted.

> Is "getendpoints" acting odd here and the data really is on 127.0.0.2? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1 indeed holds the data (SSTables also show it). So, does this mean that the data actually gets moved around when racks change?

Probably during some of your queries 127.0.0.3 (the primary replica) replicated data to 127.0.0.1 via read repair. There is no automatic data move when the rack is changed (at least in OSS C*; not sure if DSE has this ability).

> If we don't want to support this ever, I'd think the ignore_rack flag should just be deprecated.

The ignore_rack flag can be useful if you move your data manually, with rsync or sstableloader.

2016-03-23 19:09 GMT-03:00 Anubhav Kale <anubhav.k...@microsoft.com>:

Thanks for the pointer - appreciate it.

My test is on the latest trunk and slightly different.

I am not exactly sure if the behavior I see is expected (in which case, is the recommendation to re-bootstrap just to avoid data movement?) or if the behavior is not expected and is a bug.

If we don't want to support this ever, I'd think the ignore_rack flag should just be deprecated.

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, March 23, 2016 2:54 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

Actually, I believe you are seeing the behavior described in the ticket I meant to link to, with the detailed exploration:

https://issues.apache.org/jira/browse/CASSANDRA-10238

=Rob

On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale <anubhav.k...@microsoft.com> wrote:

Oh, and the query I ran was "select * from racktest.racktable where id=1".

From: Anubhav Kale [mailto:anubhav.k...@microsoft.com]
Sent: Wednesday, March 23, 2016 2:04 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks.

To test what happens when the rack of a node changes in a running cluster without doing a decommission, I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP address hack):

    IP         Rack
    127.0.0.1  R1
    127.0.0.2  R1
    127.0.0.3  R2

A table was created and a row inserted as follows:

    cqlsh 127.0.0.1
    > create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
    > create table racktest.racktable(id int, PRIMARY KEY(id));
    > insert into racktest.racktable(id) values(1);

    nodetool getendpoints racktest racktable 1

    127.0.0.2
    127.0.0.3

    nodetool ring > ring_1.txt (attached)

So far so good.

Then I changed the racks to below and restarted DSE with -Dcassandra.ignore_rack=true. From what I can find, this option simply avoids the check on startup that compares the rack in system.local with the one in cassandra-rackdc.properties.

    IP         Rack
    127.0.0.1  R1
    127.0.0.2  R2
    127.0.0.3  R1

    nodetool getendpoints racktest racktable 1

    127.0.0.2
    127.0.0.3

So far so good, cqlsh returns the queries fine.

    nodetool ring > ring_2.txt (attached)
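For readers following along, a rough sketch of how a rack change like the one above is usually made, assuming an OSS tarball install and GossipingPropertyFileSnitch (the test above was run through Eclipse against DSE, so the exact files and launcher will differ); the -Dcassandra.ignore_rack=true flag only skips the startup check just described:

    # Re-declare this node's rack; the dc name matches the keyspace's 'datacenter1'.
    cat > conf/cassandra-rackdc.properties <<'EOF'
    dc=datacenter1
    rack=R2
    EOF

    # Restart with the startup check disabled (for a package install the flag can
    # go into JVM_OPTS in cassandra-env.sh instead); without it the node refuses
    # to start because the new rack no longer matches the rack in system.local.
    bin/cassandra -f -Dcassandra.ignore_rack=true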
Now comes the interesting part. I changed the racks to below and restarted DSE.

    IP         Rack
    127.0.0.1  R2
    127.0.0.2  R1
    127.0.0.3  R1

    nodetool getendpoints racktest racktable 1

    127.0.0.1
    127.0.0.3

This is *very* interesting: cqlsh returns the queries fine. With tracing on, it's clear that 127.0.0.1 is being asked for data as well.

    nodetool ring > ring_3.txt (attached)

There is no change in the token information in the ring_* files. The token under question for id=1 (from "select token(id) from racktest.racktable") is -4069959284402364209.

So, a few questions, because things don't add up:

1. How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn't contain it? Does "nodetool ring" show all token ranges for a node or just the primary range? I am thinking it's only the primary. Can someone confirm?
2. How come queries contact 127.0.0.1?
3. Is "getendpoints" acting odd here and the data really is on 127.0.0.2? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1 indeed holds the data (SSTables also show it).
4. So, does this mean that the data actually gets moved around when racks change?

Thanks!

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, March 23, 2016 11:59 AM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <anubhav.k...@microsoft.com> wrote:

> Suppose we change the racks on VMs on a running cluster. (We need to do this while running on Azure, because sometimes when a VM gets moved its rack changes.)
>
> In this situation, new writes will be laid out based on the new rack info on the appropriate replicas. What happens to existing data? Is that data moved around as well, and does that happen when we run repair or on its own?

First, you should understand this ticket if relying on rack awareness:

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move between racks. https://issues.apache.org/jira/browse/CASSANDRA-10242 has some detailed explanations of what blows up if they do.

Note that if you want to preserve any of the data on the node, you need to (roughly sketched below):

1) bring it up and have it join the ring in its new rack (during which time it will serve incorrect reads due to missing data)
2) stop it
3) run cleanup
4) run repair
5) start it again
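A rough shell sketch of those five steps, with the assumptions called out: the nodetool subcommands are the standard ones, "stop it" is read here as "stop serving client traffic" (cleanup and repair need the process up), and the -Dcassandra.ignore_rack=true flag from earlier in the thread is needed for the node to come up in a different rack at all:

    # 1) Bring the node up in its new rack (it will serve some incorrect reads
    #    until repair completes); the flag skips the startup rack check.
    bin/cassandra -Dcassandra.ignore_rack=true

    # 2) "Stop it": take the node out of client traffic while keeping the
    #    process running, since cleanup/repair below require a live node.
    nodetool disablebinary
    nodetool disablethrift

    # 3) Throw away data this node no longer owns under the new placement.
    nodetool cleanup

    # 4) Stream in the data it now owns but is missing.
    nodetool repair

    # 5) Put it back into service.
    nodetool enablethrift
    nodetool enablebinary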
Can't really say that I recommend this practice, but it's better than "rebootstrap it", which is the official advice. If you "rebootstrap it", you decrease the unique replica count by 1, which has a nonzero chance of data loss. The Coli Conjecture says that in practice you probably don't care about this nonzero chance of data loss if you are running your application at CL.ONE, which should be all cases where it matters.

=Rob