Ok, I took this opportunity to look a bit more at this part of
the code. My reading of StorageProxy.fetchRows() and related code is
as follows, but please allow for others to say I'm wrong/missing
something (and sorry, this is more a stream of consciousness that is
probably more useful to me for learning the code than as an answer to
your question, but it's probably better to send it than to write it
and then just discard the e-mail - maybe it helps someone ;)):

The endpoints obtained are sorted by the snitch's sortByProximity(),
against the local address. If the closest endpoint (as determined by
that sorting) is the local address, the request is added directly to
the local READ stage.
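
In code, my reading is roughly the following (a sketch with made-up
helper names, not the actual StorageProxy internals):

import java.net.InetAddress;
import java.util.List;

// Very rough sketch of my reading of the data-read routing; method
// names are stand-ins of mine, not the actual Cassandra internals.
public class DataReadRoutingSketch
{
    void routeDataRead(List<InetAddress> endpoints, InetAddress local)
    {
        sortByProximity(local, endpoints); // snitch-dependent ordering

        InetAddress dataEndpoint = endpoints.get(0);
        if (dataEndpoint.equals(local))
            submitToLocalReadStage();          // no message sent; the read runs locally
        else
            sendDataReadMessage(dataEndpoint); // full data read goes to the closest replica

        // (digest reads to the remaining endpoints are covered further down)
    }

    void sortByProximity(InetAddress local, List<InetAddress> endpoints) {}
    void submitToLocalReadStage() {}
    void sendDataReadMessage(InetAddress to) {}
}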

In the case of the SimpleSnitch, sortByProximity is a no-op, so the
"sorted by proximity" should be the ring order. As per the comments to
the SimpleSnitch, the intent is to allow "non-read-repaired reads to
prefer a single endpoint, which improves cache locality".
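
As a sketch (a paraphrase, not the actual class), the SimpleSnitch
part boils down to:

import java.net.InetAddress;
import java.util.List;

// Sketch of SimpleSnitch's proximity sorting as I read it: a deliberate
// no-op, so the endpoint list stays in ring order and the "closest"
// replica is simply the first one on the ring.
public class SimpleSnitchSketch
{
    public void sortByProximity(InetAddress address, List<InetAddress> endpoints)
    {
        // Intentionally empty: keeping ring order means non-read-repaired
        // reads keep hitting the same replica, improving cache locality.
    }
}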

So my understanding is that in the case of the SimpleSnitch, ignoring
any effect of the dynamic snitch, you will *not* always read from the
local node, because the "closest" node (since ring order is used) is
just whatever node comes "first" on the ring within the replica set.

In the case of the NetworkTopologyStrategy, the sorting inherits the
implementation in AbstractNetworkTopologySnitch, which sorts by
AbstractNetworkTopologySnitch.compareEndPoints(). That comparison:

(1) Always prefers itself to any other node. So "myself" is always
"closest", no matter what.
(2) Else, always prefers a node in the same rack, to a node in a different rack.
(3) Else, always prefers a node in the same dc, to a node in a different dc.

So in the NTS case, I believe, *disregarding the dynamic snitch*, you
would in fact always read from the co-ordinator node if that node
happens to be part of the replica set for the row.

(There is no tie-breaking if none of 1, 2 or 3 above gives a
precedence, and the sorting is done with Collections.sort(), which
guarantees that the sort is stable. So for nodes where rack/dc
awareness does not come into play, it should result in the ring order
as with the SimpleSnitch.)
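
As a sketch of that comparison (my paraphrase, with
getRack()/getDatacenter() stubbed out; the real code obviously
consults the actual topology):

import java.net.InetAddress;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Paraphrase of the rack/dc-aware comparison described above; not the
// actual AbstractNetworkTopologySnitch code.
public class TopologyComparatorSketch
{
    String getRack(InetAddress endpoint) { return "rack1"; }     // stub
    String getDatacenter(InetAddress endpoint) { return "dc1"; } // stub

    public int compareEndpoints(InetAddress address, InetAddress a1, InetAddress a2)
    {
        // (1) Always prefer ourselves over any other node.
        if (address.equals(a1) && !address.equals(a2))
            return -1;
        if (address.equals(a2) && !address.equals(a1))
            return 1;

        // (2) Prefer a node in the same rack.
        String rack = getRack(address);
        if (rack.equals(getRack(a1)) && !rack.equals(getRack(a2)))
            return -1;
        if (rack.equals(getRack(a2)) && !rack.equals(getRack(a1)))
            return 1;

        // (3) Prefer a node in the same dc.
        String dc = getDatacenter(address);
        if (dc.equals(getDatacenter(a1)) && !dc.equals(getDatacenter(a2)))
            return -1;
        if (dc.equals(getDatacenter(a2)) && !dc.equals(getDatacenter(a1)))
            return 1;

        // No tie-breaking: Collections.sort() is stable, so ring order is kept.
        return 0;
    }

    public void sortByProximity(final InetAddress address, List<InetAddress> endpoints)
    {
        Collections.sort(endpoints, new Comparator<InetAddress>()
        {
            public int compare(InetAddress a1, InetAddress a2)
            {
                return compareEndpoints(address, a1, a2);
            }
        });
    }
}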

Now; so far this only determines the order of endpoints after
proximity sorting. fetchRows() will route to "itself" directly without
messaging if the closest node is itself. This determines from which
node we read the *data* (not digest).

Moving back to endpoint selection: after sorting by proximity, the
endpoint list is actually filtered by getReadCallback(). This is what
determines how many nodes will receive a request. If read repair
doesn't happen, it'll be whatever is implied by the consistency level
(so only one for CL.ONE). If read repair does happen, all endpoints
are included and so none is filtered out.
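
In pseudo-Java, that filtering amounts to something like this (names
are mine, not the actual getReadCallback() signature):

import java.net.InetAddress;
import java.util.List;

// Sketch of the endpoint filtering described above: keep everyone if
// read repair fires for this read, otherwise only as many replicas as
// the consistency level requires (a single one for CL.ONE).
public class ReadTargetFilterSketch
{
    List<InetAddress> filterEndpoints(List<InetAddress> sortedEndpoints,
                                      int requiredByConsistencyLevel,
                                      boolean doReadRepair)
    {
        if (doReadRepair)
            return sortedEndpoints; // everyone gets some kind of read
        int keep = Math.min(requiredByConsistencyLevel, sortedEndpoints.size());
        return sortedEndpoints.subList(0, keep);
    }
}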

Moving back out into fetchRows(), we're now past the sending or local
scheduling of the data read. It then loops over the remainder (1
through the last of handler.endpoints), submitting digest read
messages to each endpoint (either locally or remotely).
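
That loop is roughly the following (again with made-up helper names):

import java.net.InetAddress;
import java.util.List;

// Sketch of the digest fan-out: endpoint 0 already got the data read,
// endpoints 1..n-1 get digest-only reads, either locally or over the wire.
public class DigestFanoutSketch
{
    void sendDigestReads(List<InetAddress> targets, InetAddress local)
    {
        for (InetAddress endpoint : targets.subList(1, targets.size()))
        {
            if (endpoint.equals(local))
                submitLocalDigestRead();         // digest computed on this node
            else
                sendDigestReadMessage(endpoint); // digest-only request to a remote replica
        }
    }

    void submitLocalDigestRead() {}
    void sendDigestReadMessage(InetAddress to) {}
}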

We have now determined (1) which node to send the data request to,
and (2) which nodes, if any, to send digest reads to (regardless of
whether that is due to read repair or to consistency level
requirements).

Now fetchRows() proceeds to iterate over all the ReadCallbacks,
calling get() on each. This is where digest mismatch exceptions are
raised if relevant. CL.ONE seems special-cased in the sense that if
the number of responses to block/wait for is exactly 1, the data is
returned without resolving to check for digest mismatches (once
responses come back later on, read repair is triggered by
ReadCallback.maybeResolveForRepair).

In the case of CL > ONE, a digest mismatch can be raised immediately,
in which case fetchRows() triggers read repair.
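
The shape of that, as I read it (heavily simplified, with illustrative
names):

// Heavily simplified sketch of the response handling: with exactly one
// response to block for, data is returned without digest comparison;
// otherwise a mismatch surfaces immediately and the caller (fetchRows())
// falls back to a repairing read.
public class ResponseHandlingSketch
{
    static class DigestMismatchException extends Exception {}

    byte[] get(int blockFor, byte[] data, boolean digestsMatch) throws DigestMismatchException
    {
        if (blockFor == 1)
            return data; // CL.ONE fast path: no resolve here; repair may still
                         // be triggered later once more responses arrive
        if (!digestsMatch)
            throw new DigestMismatchException(); // caught by fetchRows(), which triggers read repair
        return data;
    }
}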

Now:

> However case (C) as I have described it does not allow for any notion of
> 'pinning' as mentioned for dynamic_snitch_badness_threshold:
>
> # if set greater than zero and read_repair_chance is < 1.0, this will allow
> # 'pinning' of replicas to hosts in order to increase cache capacity.
> # The badness threshold will control how much worse the pinned host has
> to be
> # before the dynamic snitch will prefer other replicas over it.  This is
> # expressed as a double which represents a percentage.  Thus, a value of
> # 0.2 means Cassandra would continue to prefer the static snitch values
> # until the pinned host was 20% worse than the fastest.

If you look at DynamicEndpointSnitch.sortByProximity(), it branches
into two main cases: if BADNESS_THRESHOLD is exactly 0 (it's not a
constant despite the caps; it's taken from the conf), it uses
sortByProximityWithScore(). Otherwise it uses
sortByProximityWithBadness().
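
Sketched out (with the two sort methods stubbed; BADNESS_THRESHOLD is
just a field read from the configuration, as noted):

import java.net.InetAddress;
import java.util.List;

// Sketch of the branch in DynamicEndpointSnitch.sortByProximity() as I
// read it; the two sort methods are stubbed.
public class DynamicSnitchBranchSketch
{
    double badnessThreshold; // dynamic_snitch_badness_threshold from the conf

    void sortByProximity(InetAddress address, List<InetAddress> endpoints)
    {
        if (badnessThreshold == 0)
            sortByProximityWithScore(address, endpoints);   // pure latency-score ordering
        else
            sortByProximityWithBadness(address, endpoints); // defer to the subsnitch until the threshold is crossed
    }

    void sortByProximityWithScore(InetAddress address, List<InetAddress> endpoints) {}
    void sortByProximityWithBadness(InetAddress address, List<InetAddress> endpoints) {}
}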

...withBadness() first asks the subsnitch (meaning normally either
the SimpleSnitch or NTS) to sort by proximity. Then it iterates through
the endpoints, and if any node is sufficiently good in comparison to
the closest-as-determined-by-subsnitch endpoint, it falls back to
sortByProximityWithScore(). "Sufficiently good" is where the actual
value of BADNESS_THRESHOLD comes in (if the "would-be" closest node is
sufficiently bad, that implies some other node is sufficiently good in
comparison to it...)
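
Roughly (my paraphrase; I'm assuming higher score means worse here,
and the exact comparison in the real code may differ in detail):

import java.net.InetAddress;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of my reading of sortByProximityWithBadness(): the subsnitch
// (SimpleSnitch or the topology-aware snitch) sorts first; only if some
// replica's score is more than badnessThreshold better than the
// subsnitch-closest one do we fall back to pure score-based sorting.
public class BadnessSketch
{
    double badnessThreshold;
    Map<InetAddress, Double> scores = new HashMap<InetAddress, Double>(); // higher = worse in this sketch

    void sortByProximityWithBadness(InetAddress address, List<InetAddress> endpoints)
    {
        subsnitchSortByProximity(address, endpoints);
        Double closest = scores.get(endpoints.get(0));
        if (closest == null)
            return;
        for (InetAddress endpoint : endpoints)
        {
            Double score = scores.get(endpoint);
            // "Sufficiently good": the subsnitch-closest node is more than
            // badnessThreshold (as a fraction) worse than this endpoint.
            if (score != null && (closest - score) / closest > badnessThreshold)
            {
                sortByProximityWithScore(address, endpoints);
                return;
            }
        }
    }

    void subsnitchSortByProximity(InetAddress address, List<InetAddress> endpoints) {}
    void sortByProximityWithScore(InetAddress address, List<InetAddress> endpoints) {}
}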

So, my reading of it is then that the comment is correct. By setting
it to >0, you're making the dynamic snitch essentially a no-op for the
purpose of proximity sorting *until* the badness threshold is reached,
at which point it uses its scoring algorithm. Because the behavior of
both NTS and the SimpleSnitch is to always prefer the same node (for a
given row key), that means pinning.

As far as I can tell, read-repair should not affect things either way
since it doesn't have anything to do with which node gets asked for
the data (as opposed to the digest).

One interesting aspect though: say you specifically *don't* want
pinning, and rather *do* want round-robin type behavior to keep caches
hot. If you're at CL.ONE with read repair turned off or very low, this
doesn't seem possible, except insofar as it may happen accidentally
through dynamic snitch balancing, depending on the performance
characteristics of the nodes.

-- 
/ Peter Schuller
