CL=ALL also means that you won't have HA (High Availability) - if even a single node goes down, you're out of business. I mean, HA is the fundamental reason for using the rack-aware policy - to ensure that each replica is on a separate power supply and network connection so that data can be retrieved even when a rack-level failure occurs.
In short, if CL=ALL is acceptable, then you might as well dump the rack-aware approach, which was how you got into this situation in the first place.

-- Jack Krupansky

On Wed, Mar 23, 2016 at 7:31 PM, Anubhav Kale <anubhav.k...@microsoft.com> wrote:

> I ran into the following detail from :
> https://wiki.apache.org/cassandra/ReadRepair
>
> “If a lower ConsistencyLevel than ALL was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data.”
>
> I set consistency to ALL, and now I can get data all the time.
>
> *From:* Anubhav Kale [mailto:anubhav.k...@microsoft.com]
> *Sent:* Wednesday, March 23, 2016 4:14 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Rack aware question.
>
> Thanks, Read repair is what I thought must be causing this, so I experimented some more with setting read_repair_chance and dc_local_read_repair_chance on the table to 0, and then 1.
>
> Unfortunately, the results were somewhat random depending on which node I ran the queries from. For example, when chance = 1, running query from 127.0.0.3 would sometimes return 0 results and sometimes 1. I do see digest-mismatch-kicking-off-read-repair in traces in both cases, so running out of ideas here. If you / someone can shed light on why this could be happening, that would be great !
>
> That said, is it expected that “read repair” or a regular “nodetool repair” will shift the data around based on new replica placement ? And, if so is the recommendation to “rebootstrap” to mainly avoid this humongous data movement ?
>
> The rationale behind ignore_rack flag makes sense, thanks. Maybe, we should document it better ?
>
> Thanks !
>
> *From:* Paulo Motta [mailto:pauloricard...@gmail.com]
> *Sent:* Wednesday, March 23, 2016 3:40 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rack aware question.
>
> > How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a node or just the primary range ? I am thinking its only primary. Can someone confirm ?
>
> The primary replica of id=1 is always 127.0.0.3. What changes when you change racks is that the secondary replica will move to the next replica from a different rack, either 127.0.0.1 or 127.0.0.2.
>
> > How come queries contact 127.0.0.1 ?
>
> in the last case, 127.0.0.1 is the next node after the primary replica from a different rack (R2), so it should be contacted
>
> > Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS Tables also show it). So, does this mean that the data actually gets moved around when racks change ?
>
> probably during some of your queries 127.0.0.3 (the primary replica) replicated data to 127.0.0.1 with read repair. There is no automatic data move when rack is changed (at least in OSS C*, not sure if DSE has this ability)
>
> > If we don’t want to support this ever, I’d think the ignore_rack flag should just be deprecated.
>
> ignore_rack flag can be useful if you move your data manually, with rsync or sstableloader.
>
> 2016-03-23 19:09 GMT-03:00 Anubhav Kale <anubhav.k...@microsoft.com>:
>
> Thanks for the pointer – appreciate it.
>
> My test is on the latest trunk and slightly different.
>
> I am not exactly sure if the behavior I see is expected (in which case, is the recommendation to re-bootstrap just to avoid data movement?) or is the behavior not expected and is a bug.
>
> If we don’t want to support this ever, I’d think the ignore_rack flag should just be deprecated.
>
> *From:* Robert Coli [mailto:rc...@eventbrite.com]
> *Sent:* Wednesday, March 23, 2016 2:54 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rack aware question.
>
> Actually, I believe you are seeing the behavior described in the ticket I meant to link to, with the detailed exploration :
>
> https://issues.apache.org/jira/browse/CASSANDRA-10238
>
> =Rob
>
> On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale <anubhav.k...@microsoft.com> wrote:
>
> Oh, and the query I ran was “select * from racktest.racktable where id=1”
>
> *From:* Anubhav Kale [mailto:anubhav.k...@microsoft.com]
> *Sent:* Wednesday, March 23, 2016 2:04 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Rack aware question.
>
> Thanks.
>
> To test what happens when rack of a node changes in a running cluster without doing a decommission, I did the following.
>
> The cluster looks like below (this was run through Eclipse, therefore the IP address hack)
>
> *IP*        *Rack*
> 127.0.0.1   R1
> 127.0.0.2   R1
> 127.0.0.3   R2
>
> A table was created and a row inserted as follows:
>
> Cqlsh 127.0.0.1
> >create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
> >create table racktest.racktable(id int, PRIMARY KEY(id));
> >insert into racktest.racktable(id) values(1);
>
> nodetool getendpoints racktest racktable 1
>
> 127.0.0.2
> 127.0.0.3
>
> Nodetool ring > ring_1.txt (attached)
>
> So far so good.
>
> Then I changed the racks to below and restarted DSE with -Dcassandra.ignore_rack=true.
>
> This option from my finding simply avoids the check on startup that compares the rack in system.local with the one in rack-dc.properties.
>
> *IP*        *Rack*
> 127.0.0.1   R1
> 127.0.0.2   R2
> 127.0.0.3   R1
>
> nodetool getendpoints racktest racktable 1
>
> 127.0.0.2
> 127.0.0.3
>
> So far so good, cqlsh returns the queries fine.
>
> Nodetool ring > ring_2.txt (attached)
>
> Now comes the interesting part.
>
> I changed the racks to below and restarted DSE.
>
> *IP*        *Rack*
> 127.0.0.1   R2
> 127.0.0.2   R1
> 127.0.0.3   R1
>
> nodetool getendpoints racktest racktable 1
>
> 127.0.0.*1*
> 127.0.0.3
>
> This is *very* interesting, cqlsh returns the queries fine. With tracing on, it’s clear that the 127.0.0.1 is being asked for data as well.
>
> Nodetool ring > ring_3.txt (attached)
>
> There is no change in token information in ring_* files. The token under question for id=1 (from select token(id) from racktest.racktable) is -4069959284402364209.
>
> So, few questions because things don’t add up:
>
> 1. How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a node or just the primary range ? I am thinking its only primary. Can someone confirm ?
> 2. How come queries contact 127.0.0.1 ?
> 3. Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS Tables also show it).
> 4. So, does this mean that the data actually gets moved around when racks change ?
>
> Thanks !
>
> *From:* Robert Coli [mailto:rc...@eventbrite.com]
> *Sent:* Wednesday, March 23, 2016 11:59 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rack aware question.
>
> On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <anubhav.k...@microsoft.com> wrote:
>
> Suppose we change the racks on VMs on a running cluster. (We need to do this while running on Azure, because sometimes when the VM gets moved its rack changes).
>
> In this situation, new writes will be laid out based on new rack info on appropriate replicas. What happens for existing data ? Is that data moved around as well and does it happen if we run repair or on its own ?
>
> First, you should understand this ticket if relying on rack awareness :
>
> https://issues.apache.org/jira/browse/CASSANDRA-3810
>
> Second, in general nodes cannot move between racks.
>
> https://issues.apache.org/jira/browse/CASSANDRA-10242
>
> Has some detailed explanations of what blows up if they do.
>
> Note that if you want to preserve any of the data on the node, you need to :
>
> 1) bring it and have it join the ring in its new rack (during which time it will serve incorrect reads due to missing data)
> 2) stop it
> 3) run cleanup
> 4) run repair
> 5) start it again
>
> Can't really say that I recommend this practice, but it's better than "rebootstrap it" which is the official advice. If you "rebootstrap it" you decrease unique replica count by 1, which has a nonzero chance of data-loss.
> The Coli Conjecture says that in practice you probably don't care about this nonzero chance of data loss if you are running your application in CL.ONE, which should be all cases where it matters.
>
> =Rob
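
For anyone reading this thread later, here is a rough sketch of the rack-aware placement rule Paulo describes: start at the primary replica, then walk the ring and prefer the next node in a rack that has not been used yet. It is a simplification and not Cassandra's actual code. The function name `place_replicas`, the `RING` order (127.0.0.3, then 127.0.0.2, then 127.0.0.1), and the single-token ring are all assumptions made for illustration; a real cluster with vnodes has many token ranges per node. The rack layouts and the replication factor of 2 are taken from the test in the thread.

```python
def place_replicas(ring_order, rack_of, rf):
    """Pick rf replicas, taking one node per rack before reusing any rack."""
    replicas = []
    seen_racks = set()
    # First pass: walk the ring and take the first node found in each new rack.
    for node in ring_order:
        if len(replicas) == rf:
            break
        if rack_of[node] not in seen_racks:
            replicas.append(node)
            seen_racks.add(rack_of[node])
    # Second pass: if there are fewer racks than rf, fill up with the nodes
    # that were skipped, still in ring order.
    for node in ring_order:
        if len(replicas) == rf:
            break
        if node not in replicas:
            replicas.append(node)
    return replicas


# Assumed ring order for the token of id=1, starting at the primary replica.
# This ordering is hypothetical; it was chosen because it is consistent with
# the three getendpoints results reported in the thread.
RING = ["127.0.0.3", "127.0.0.2", "127.0.0.1"]

# The three rack layouts from the test, in the order they were applied.
LAYOUTS = {
    "initial": {"127.0.0.1": "R1", "127.0.0.2": "R1", "127.0.0.3": "R2"},
    "second": {"127.0.0.1": "R1", "127.0.0.2": "R2", "127.0.0.3": "R1"},
    "third": {"127.0.0.1": "R2", "127.0.0.2": "R1", "127.0.0.3": "R1"},
}

for name, rack_of in LAYOUTS.items():
    print(name, sorted(place_replicas(RING, rack_of, rf=2)))
```

Under these assumptions, the first two layouts select 127.0.0.2 and 127.0.0.3, and the third selects 127.0.0.1 and 127.0.0.3, which matches the getendpoints output above. The sketch only computes which nodes are considered replicas; as Paulo notes, the data itself is not moved automatically when the rack changes, so 127.0.0.1 has nothing to serve for id=1 until read repair or a repair copies it over.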