Re: Rack aware question.

Jack Krupansky Wed, 23 Mar 2016 16:57:12 -0700

CL=ALL also means that you won't have HA (High Availability) - if even a
single node goes down, you're out of business. I mean, HA is the
fundamental reason for using the rack-aware policy - to assure that each
replica is on a separate power supply and network connection so that data
can be retrieved even when a rack-level failure occurs.


In short, if CL=ALL is acceptable, then you might as well dump the
rack-aware approach, which was how you got into this situation in the first
place.
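
To make the trade-off concrete, here is a rough sketch (plain Python, not Cassandra code) of how many replicas can be down before a request at a given consistency level fails. It assumes a single datacenter and uses the replication factor of 2 from the racktest keyspace quoted below; the function name is made up for illustration.

```python
def tolerated_replica_failures(rf: int, cl: str) -> int:
    """Number of replicas that may be down while a request at `cl` can still succeed."""
    # Replicas that must respond for each consistency level (single DC assumed).
    required = {
        "ONE": 1,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }[cl]
    return rf - required


if __name__ == "__main__":
    for cl in ("ONE", "QUORUM", "ALL"):
        print(cl, tolerated_replica_failures(2, cl))
```

With RF=2, only CL=ONE survives the loss of a replica; QUORUM and ALL both need every replica up, and ALL stays at zero tolerated failures no matter how high the replication factor goes.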

-- Jack Krupansky

On Wed, Mar 23, 2016 at 7:31 PM, Anubhav Kale <anubhav.k...@microsoft.com>
wrote:

> I ran into the following detail from:
> https://wiki.apache.org/cassandra/ReadRepair
>
>
>
> “If a lower ConsistencyLevel than ALL was specified, this is done in the
> background after returning the data from the closest replica to the client;
> otherwise, it is done before returning the data.”
>
>
>
> I set consistency to ALL, and now I can get data all the time.
>
>
>
> *From:* Anubhav Kale [mailto:anubhav.k...@microsoft.com]
> *Sent:* Wednesday, March 23, 2016 4:14 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* RE: Rack aware question.
>
>
>
> Thanks, Read repair is what I thought must be causing this, so I
> experimented some more with setting read_repair_chance and
> dc_local_read_repair_chance on the table to 0, and then 1.
>
>
>
> Unfortunately, the results were somewhat random depending on which node I
> ran the queries from. For example, with chance = 1, running the query from
> 127.0.0.3 would sometimes return 0 results and sometimes 1. I do see
> digest-mismatch-kicking-off-read-repair in the traces in both cases, so I am
> running out of ideas here. If anyone can shed light on why this could be
> happening, that would be great!
>
>
>
> That said, is it expected that “read repair” or a regular “nodetool
> repair” will shift the data around based on the new replica placement? And if
> so, is the recommendation to “rebootstrap” mainly meant to avoid this
> humongous data movement?
>
>
>
> The rationale behind ignore_rack flag makes sense, thanks. Maybe, we
> should document it better ?
>
>
>
> Thanks !
>
>
>
> *From:* Paulo Motta [mailto:pauloricard...@gmail.com]
> *Sent:* Wednesday, March 23, 2016 3:40 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rack aware question.
>
>
>
> > How come 127.0.0.1 is shown as an endpoint holding the ID when its token
> range doesn’t contain it? Does “nodetool ring” show all token ranges for
> a node or just the primary range? I am thinking it’s only the primary. Can
> someone confirm?
>
> The primary replica of id=1 is always 127.0.0.3. What changes when you
> change racks is that the secondary replica moves to the next node in the
> ring that is in a different rack, either 127.0.0.1 or 127.0.0.2.
>
> > How come queries contact 127.0.0.1 ?
>
> In the last case, 127.0.0.1 is the next node after the primary replica
> that is in a different rack (R2), so it should be contacted.
>
> > Is “getendpoints” acting odd here and the data really is on 127.0.0.2?
> To prove / disprove that, I stopped 127.0.0.2 and ran a query with
> CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1 indeed holds
> the data (SSTables also show it). So, does this mean that the data
> actually gets moved around when racks change?
>
> Probably during some of your queries 127.0.0.3 (the primary replica)
> replicated data to 127.0.0.1 via read repair. There is no automatic data
> movement when the rack is changed (at least in OSS C*; not sure if DSE has
> this ability).
>
> > If we don’t want to support this ever, I’d think the ignore_rack flag
> should just be deprecated.
>
> The ignore_rack flag can be useful if you move your data manually, with
> rsync or sstableloader.
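
A rough way to picture the placement rule described above is to walk the ring from the primary replica and prefer nodes whose rack has not been used yet. The sketch below is a simplified model, not the actual NetworkTopologyStrategy code: it ignores vnodes and multiple datacenters, the function name is hypothetical, and the ring order (127.0.0.3, then 127.0.0.2, then 127.0.0.1) is an assumption chosen because it matches the getendpoints output reported in this thread.

```python
def pick_replicas(ring_order, racks, rf):
    """Walk the ring from the primary replica, preferring racks not yet used.

    ring_order: nodes in ring order, starting at the primary for the token.
    racks:      mapping of node -> rack name.
    rf:         replication factor.
    """
    replicas, seen_racks, skipped = [], set(), []
    for node in ring_order:
        if len(replicas) == rf:
            break
        if racks[node] not in seen_racks:
            replicas.append(node)
            seen_racks.add(racks[node])
        else:
            # Same rack as an already chosen replica: keep as a fallback only.
            skipped.append(node)
    # Fewer distinct racks than rf: fill up from the nodes skipped earlier.
    for node in skipped:
        if len(replicas) == rf:
            break
        replicas.append(node)
    return replicas


# Assumed ring order for the token of id=1 (hypothetical, see above).
RING = ["127.0.0.3", "127.0.0.2", "127.0.0.1"]

# The three rack layouts tried in the experiment.
LAYOUTS = [
    {"127.0.0.1": "R1", "127.0.0.2": "R1", "127.0.0.3": "R2"},
    {"127.0.0.1": "R1", "127.0.0.2": "R2", "127.0.0.3": "R1"},
    {"127.0.0.1": "R2", "127.0.0.2": "R1", "127.0.0.3": "R1"},
]

if __name__ == "__main__":
    for racks in LAYOUTS:
        print(sorted(pick_replicas(RING, racks, rf=2)))
```

Under these assumptions the first two layouts select 127.0.0.2 and 127.0.0.3, and the third selects 127.0.0.1 and 127.0.0.3, which is the change in endpoints observed after the last rack swap. Note that the model only says where replicas are expected to live; it moves no data.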
>
>
>
> 2016-03-23 19:09 GMT-03:00 Anubhav Kale <anubhav.k...@microsoft.com>:
>
> Thanks for the pointer – appreciate it.
>
>
>
> My test is on the latest trunk and slightly different.
>
>
>
> I am not exactly sure whether the behavior I see is expected (in which case,
> is the recommendation to re-bootstrap just to avoid data movement?) or
> whether it is unexpected and therefore a bug.
>
>
>
> If we don’t want to support this ever, I’d think the ignore_rack flag
> should just be deprecated.
>
>
>
> *From:* Robert Coli [mailto:rc...@eventbrite.com]
> *Sent:* Wednesday, March 23, 2016 2:54 PM
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rack aware question.
>
>
>
> Actually, I believe you are seeing the behavior described in the ticket I
> meant to link to, with the detailed exploration :
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-10238
>
>
>
> =Rob
>
>
>
>
>
> On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale <anubhav.k...@microsoft.com>
> wrote:
>
> Oh, and the query I ran was “select * from racktest.racktable where id=1”
>
>
>
> *From:* Anubhav Kale [mailto:anubhav.k...@microsoft.com]
> *Sent:* Wednesday, March 23, 2016 2:04 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Rack aware question.
>
>
>
> Thanks.
>
>
>
> To test what happens when the rack of a node changes in a running cluster
> without doing a decommission, I did the following.
>
>
>
> The cluster looks like below (this was run through Eclipse, therefore the
> IP address hack)
>
>
>
> IP          Rack
> 127.0.0.1   R1
> 127.0.0.2   R1
> 127.0.0.3   R2
>
>
>
> A table was created and a row inserted as follows:
>
>
>
> cqlsh 127.0.0.1
>
> cqlsh> create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
>
> cqlsh> create table racktest.racktable(id int, PRIMARY KEY(id));
>
> cqlsh> insert into racktest.racktable(id) values(1);
>
>
>
> nodetool getendpoints racktest racktable 1
>
>
>
> 127.0.0.2
>
> 127.0.0.3
>
>
>
> nodetool ring > ring_1.txt (attached)
>
>
>
> So far so good.
>
>
>
> Then I changed the racks as shown below and restarted DSE with
> -Dcassandra.ignore_rack=true.
>
> From what I found, this option simply skips the startup check that
> compares the rack in system.local with the one in cassandra-rackdc.properties.
>
>
>
> IP          Rack
> 127.0.0.1   R1
> 127.0.0.2   R2
> 127.0.0.3   R1
>
>
>
> nodetool getendpoints racktest racktable 1
>
>
>
> 127.0.0.2
>
> 127.0.0.3
>
>
>
> So far so good, cqlsh returns the queries fine.
>
>
>
> nodetool ring > ring_2.txt (attached)
>
>
>
> Now comes the interesting part.
>
>
>
> I changed the racks as shown below and restarted DSE.
>
>
>
> IP          Rack
> 127.0.0.1   R2
> 127.0.0.2   R1
> 127.0.0.3   R1
>
>
>
> nodetool getendpoints racktest racktable 1
>
>
>
> 127.0.0.*1*
>
> 127.0.0.3
>
>
>
> This is *very* interesting: cqlsh returns the queries fine. With tracing
> on, it’s clear that 127.0.0.1 is being asked for data as well.
>
>
>
> nodetool ring > ring_3.txt (attached)
>
>
>
> There is no change in the token information in the ring_* files. The token
> in question for id=1 (from select token(id) from racktest.racktable) is
> -4069959284402364209.
>
>
>
> So, a few questions because things don’t add up:
>
>
>
>    1. How come 127.0.0.1 is shown as an endpoint holding the ID when its
>    token range doesn’t contain it? Does “nodetool ring” show all
>    token ranges for a node or just the primary range? I am thinking it’s
>    only the primary. Can someone confirm?
>    2. How come queries contact 127.0.0.1?
>    3. Is “getendpoints” acting odd here and the data really is on
>    127.0.0.2? To prove / disprove that, I stopped 127.0.0.2 and ran a query
>    with CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1 indeed
>    holds the data (SSTables also show it).
>    4. So, does this mean that the data actually gets moved around when
>    racks change?
>
>
>
> Thanks !
>
>
>
>
>
> *From:* Robert Coli [mailto:rc...@eventbrite.com]
> *Sent:* Wednesday, March 23, 2016 11:59 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rack aware question.
>
>
>
> On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <anubhav.k...@microsoft.com>
> wrote:
>
> Suppose we change the racks on VMs on a running cluster. (We need to do
> this while running on Azure, because sometimes when the VM gets moved its
> rack changes).
>
>
>
> In this situation, new writes will be laid out based on new rack info on
> appropriate replicas. What happens for existing data ? Is that data moved
> around as well and does it happen if we run repair or on its own ?
>
>
>
> First, you should understand this ticket if relying on rack awareness :
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-3810
>
>
>
> Second, in general nodes cannot move between racks.
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-10242
>
>
>
> Has some detailed explanations of what blows up if they do.
>
>
>
> Note that if you want to preserve any of the data on the node, you need to:
>
>
>
> 1) bring it up and have it join the ring in its new rack (during which time
> it will serve incorrect reads due to missing data)
>
> 2) stop it
>
> 3) run cleanup
>
> 4) run repair
>
> 5) start it again
>
>
>
> Can't really say that I recommend this practice, but it's better than
> "rebootstrap it" which is the official advice. If you "rebootstrap it" you
> decrease unique replica count by 1, which has a nonzero chance of
> data-loss. The Coli Conjecture says that in practice you probably don't
> care about this nonzero chance of data loss if you are running your
> application in CL.ONE, which should be all cases where it matters.
>
>
>
> =Rob
>
>
>
>
>
>
>
