On 13 Dec 2016, at 0:21, Shyam Prasad N wrote:

> Hi,
>
> I have an openstack swift cluster with 2 nodes, and a replication count of
> 2.
> So, theoretically, during a PUT request, both replicas are updated
> synchronously. Only then the request will return a success. Please correct
> me if I'm wrong on this.
>
> I have a script that periodically does a PUT to a small object with some
> random data, and then immediately GETs the object. On some occasions, I'm
> getting older data during the GET.
>
> Is my expectation above correct? Or is there some other setting needed to
> make the replication synchronous?


This is an interesting case of both Swift and your expectations being correct. 
But wait! How can that be when they seem to be at odds? Therein lies the fun* 
of Distributed Systems. Yay.

(*actually, not that much fun)

Ok, here's how it works. I'm assuming you have more than one hard drive on each 
of your two servers. When Swift gets the PUT request, the proxy will determine 
where the object data is supposed to be in the cluster. It does this via 
hashing and ring lookups (this is deterministic, but the details of that 
process aren't important here). The proxy will look for <replica count> places 
to put the data. In your case, this is 2. Because of the way the ring works, it 
will look for one drive on each of your two servers first. It will not put the 
data on two drives on one server. So in the Happy Path, the client makes a PUT 
request, the proxy sends the data to both replicas, and after both have been 
fsync'd, the client gets a 201 Created response. [1]
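
To make the "deterministic" part concrete, here's a toy sketch in Python (the 
language Swift itself is written in). None of the names below come from Swift's 
real code; the part power, the drive lists, and the placement rule are made up 
just to show that the mapping from object name to drives is fixed by hashing:

    import hashlib

    PART_POWER = 10          # 2**10 = 1024 partitions (made-up number)
    server1_drives = [("server1", "sdb"), ("server1", "sdc")]
    server2_drives = [("server2", "sdb"), ("server2", "sdc")]

    def get_primary_nodes(account, container, obj):
        """Deterministically map an object name to its primary drives."""
        key = "/%s/%s/%s" % (account, container, obj)
        digest = hashlib.md5(key.encode("utf-8")).digest()
        # Use the top PART_POWER bits of the hash as the partition number.
        partition = int.from_bytes(digest, "big") >> (128 - PART_POWER)
        # One drive on each server, so replicas never share a machine.
        return partition, [server1_drives[partition % 2],
                           server2_drives[(partition // 2) % 2]]

The same object name always hashes to the same partition, and the same 
partition always maps to the same primary drives.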

This is all well and good, and the best part is that Swift can guarantee 
read-your-creates. That is, when you create a new object, you are
immediately able to read it. However, what you describe is slightly different. 
You're overwriting an existing object, and sometimes you're getting back the 
older version of the object on a subsequent read. This is normal and expected. 
Read on for why.

The above process is the Happy Path for when there are no failures in the 
system. A failure could be a hardware failure, but it could also be some part 
of the system being overloaded. Spinning drives have very real physical limits 
to the amount of data they can read and write per unit time. An overloaded hard 
drive can cause a read or write request to time out, thus becoming a "failure" 
in the cluster.

So when you overwrite an object in Swift, the exact same process happens: the 
proxy finds the right locations, sends the data to all those locations, and 
returns a success if a quorum successfully fsync'd the data to disk.

However, what happens if there's a failure?

When the proxy determines the correct location for the object, it chooses what 
we call "primary" nodes. These are the canonical locations where the data is 
supposed to be right now. All the other drives in the cluster are called 
"handoff" nodes. For a given object, some nodes (<replica count> of them, to be 
exact) are primary nodes, and all the rest in the cluster are handoffs. For 
another object, a different set of nodes will be primary, and all the rest in 
the cluster are handoffs. This is the same regardless of how many replicas 
you're using or how many drives you have in the cluster.
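
In Swift itself the proxy asks the ring for the primaries (get_nodes()) and 
then for handoffs (get_more_nodes()), but the idea is simple enough to show 
with the toy model from above (the drive names are still made up):

    ALL_DEVICES = [("server1", "sdb"), ("server1", "sdc"),
                   ("server2", "sdb"), ("server2", "sdc")]

    def get_handoff_nodes(primaries):
        # Every drive in the cluster that is not a primary for this
        # particular object is, by definition, a handoff for it.
        return [dev for dev in ALL_DEVICES if dev not in primaries]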

So when there's a failure in the cluster and a write request comes in, what 
happens? Again, the proxy finds the primary nodes for the object and it tries 
to connect to them. However, if one (or more) can't be connected to, then the 
proxy will start trying to connect to handoff nodes. After the proxy gets 
<replica count> successful connections, it sends the data to those storage 
nodes, the data is fsync'd, and the client gets a successful response code 
(assuming at least a quorum of them were fsync'd successfully). Note that in 
your case with two replicas, if one of the primary nodes is extra busy (e.g. 
serving other requests) or actually failing (drives do that, pretty often, in 
fact), then the proxy will choose a handoff location for that copy of the data. 
This means that even when the cluster has issues, your writes are still stored 
with full durability.[2]
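
In rough pseudo-Python, the write path looks something like this. The 
try_connect() and send_and_fsync() helpers are stand-ins I'm making up for the 
proxy's real connection handling; the shape of the loop is the point:

    import random

    def try_connect(node):
        # Stand-in for the proxy's connection attempt: pretend a node is
        # sometimes too busy (or broken) to answer before the timeout.
        return node if random.random() > 0.2 else None

    def send_and_fsync(conn, data):
        # Stand-in for streaming the object to the object server and
        # waiting for it to be fsync'd to disk.
        return True

    def put_object(primaries, handoffs, data, replicas=2, quorum=1):
        connections = []
        # Try the primaries first, then fall back to handoffs until we
        # have <replica count> connections.
        for node in list(primaries) + list(handoffs):
            if len(connections) == replicas:
                break
            conn = try_connect(node)
            if conn is not None:
                connections.append(conn)
        fsyncd = sum(1 for conn in connections if send_and_fsync(conn, data))
        return 201 if fsyncd >= quorum else 503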

The read request path is very similar: the primary nodes are looked up, one is 
selected at random, and if the data is there, it's returned. If the data isn't 
there, the next primary is tried, and so on.
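
In the same toy style, the read path is roughly this (on_disk is just a dict 
standing in for what's actually sitting on each drive):

    import random

    def get_object(primaries, on_disk):
        nodes = list(primaries)
        random.shuffle(nodes)              # one primary is picked at random
        for node in nodes:
            if node in on_disk:
                return 200, on_disk[node]  # whatever is there gets returned
        return 404, None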

Ok, we're finally able to get down to answering your question.

Let's assume you have a busy drive in the cluster. You (over)write your object, 
the proxy looks up the primary nodes, sees that one is busy (i.e. gets a 
timeout), chooses a handoff location, writes the data, and you get a 201 
response. Since this is an overwrite, you've got an old version of the object 
on one primary, a new version on the other primary, and a new version on a
handoff node. Now you do the immediate GET. The proxy finds the primary nodes, 
randomly chooses one, and oh no! it chose the one with the old data. Since 
there's data there, that version of the object gets returned to the client, and 
you see the older version of the data. Eventually, the background replication 
daemon will ensure that the old version on the primary is updated and the new 
version on the handoff is removed.
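
Plugging that scenario into the toy read path above shows exactly what your 
script is seeing (the drive names and versions are, again, made up):

    import random

    primaries = [("server1", "sdb"), ("server2", "sdc")]
    handoff = ("server2", "sdb")

    # server1/sdb timed out during the overwrite, so it still holds the old
    # copy; the other primary and the handoff hold the new one.
    on_disk = {
        ("server1", "sdb"): "old version",
        ("server2", "sdc"): "new version",
        handoff: "new version",
    }

    # The GET only considers the primaries and picks one at random.
    chosen = random.choice(primaries)
    print(on_disk[chosen])   # about half the time this prints "old version"

Until the replication daemon catches up, roughly half of your GETs will land on 
the stale primary.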

The way all of this works is a deliberate design choice in Swift. It means that 
when there are failures in the cluster, Swift will continue to provide 
durability and availability for your data, despite sometimes providing a stale 
view of what's there. The technical term for this is "eventual consistency". 
The other model you can have in distributed systems is called "strong 
consistency". In strong consistency, you get durability but not availability 
(i.e. you'd get an error instead of a success in the scenario above). However,
you also won't get stale data. Neither is particularly better than the other; 
it really comes down to the use case. In general, eventually consistent systems 
can scale larger than strongly consistent systems, and eventually consistent 
systems can more easily provide geographic dispersion of your data (which Swift 
can do). Basically, the difference comes down to "what happens when there are 
failures in the system?".

I hope that helps. Please ask if you have other questions.

[1] To generalize this to *any* number of replicas, the proxy concurrently 
writes to <replica count> places and returns a success after a quorum have 
successfully been written. It's important to note that the <replica count> 
number of replicas will be written at the time of the request; it's *not* a 
write-once-and-replicate-later process. A quorum is determined by the replica 
count: it's half plus one for odd replica counts and half for even replica 
counts. So in your case with 2 replicas, a quorum is 1. If you had 3 replicas, 
the quorum would be 2.
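
In code, the rule described above works out to something like:

    def quorum_size(replicas):
        # Half plus one for odd replica counts, half for even ones.
        if replicas % 2:
            return replicas // 2 + 1
        return replicas // 2

    assert quorum_size(2) == 1   # your cluster
    assert quorum_size(3) == 2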

[2] Thinking about this more, you'll see that this leads to a really cool 
scaling property of Swift: the system gets better (wrt performance, durability, 
and availability) as it gets bigger.


--John



>
> -- 
> -Shyam
