I hope we can all agree that CouchDB should inform the user when it is unable to satisfy the requested read "quorum".
Adam

> On Mar 31, 2015, at 3:20 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
>
> Sounds like there's a bit of confusion here.
>
> What Nathan is asking for is the ability to have Couch respond with some
> information on the actual number of replicas that responded to a read
> request. That way a user could tell that they issued an r=2 request when
> only r=1 was actually performed. Depending on your point of view, in an
> MVCC world this is either a bug or a feature. :)
>
> It was generally agreed that if we could return this information it would
> be beneficial. However, what happened when I started implementing this
> patch was that we could either return it in only a subset of the cases
> where it happens, return it inconsistently across the various responses,
> or break replication.
>
> There were three general methods for doing this: include a new "_r_met"
> key in the doc body, a boolean indicating whether the requested read
> quorum was actually met for the document; return a custom X-R-Met type
> header; or use the status code, as described.
>
> The _r_met member was thought to be the best, but unfortunately it breaks
> replication with older clients, because we throw an error rather than
> ignore an unknown underscore-prefixed field name. Thus anything that was
> dynamically injected into the document body was a non-starter.
> Unfortunately, if we don't inject into the document body, then we limit
> ourselves to the set of APIs where a single document is returned. This is
> due both to streaming semantics (we can't buffer an entire response in
> memory for large requests to _all_docs) and to multi-doc responses (a
> single boolean doesn't say which document may not have met R).
>
> On top of that, MVCC semantics make it unclear how to respond to
> documents with different revision histories. For instance, if we read two
> copies, we have technically met the r=2 requirement, but what should our
> response be if those two revisions are different? (Technically, in this
> case we wait for the third response, but the decision on what to return
> for the "r met" value is still unclear.)
>
> While I think everyone is in agreement that it'd be nice to return some
> of the information about the copies read, I think it's much less clear
> what should be returned, and how, in the multitude of cases where a value
> for R can be specified.
>
> While that doesn't offer a concrete path forward, hopefully it clarifies
> some of the issues at hand.
>
> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <rnew...@apache.org> wrote:
>
>> It's a testament to my friendship with Mike that we can disagree on such
>> things and remain friends. I am sorry he misled you, though.
>>
>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at all,
>> at least in the formal sense, which is the only one that matters; this
>> is unfortunately sloppy language in too many places to correct.
>>
>> The r= and w= parameters control only how many of the n possible
>> responses are collected before returning an HTTP response.
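>>
>> In spirit, the read side is just this (a toy sketch; nothing like the
>> actual fabric code, and every name in it is invented):
>>
>>     %% Gather copies of the doc from the n workers until R replies have
>>     %% arrived, then answer the HTTP request. Nothing below checks that
>>     %% all n copies exist, or that the R replies agree with each other.
>>     wait_for_r(_ReqId, R, Replies) when length(Replies) >= R ->
>>         {ok, Replies};                     %% respond to the client now
>>     wait_for_r(ReqId, R, Replies) ->
>>         receive
>>             {ReqId, Reply} ->
>>                 wait_for_r(ReqId, R, [Reply | Replies])
>>         after 60000 ->
>>             {timeout, Replies}             %% fewer than R replied in time
>>         end.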
>>
>> It's not true that returning 202 in the situation where one write is
>> made but fewer than 'r' writes are made means we've chosen availability
>> over consistency, since even if we returned a 500 or closed the
>> connection without responding, a subsequent GET could return the
>> document (a probability that increases over time as anti-entropy creates
>> the missing copies). A write attempt that returned a 409 could,
>> likewise, introduce a new edit branch into the document, which might
>> then 'win', altering the results of a subsequent GET.
>>
>> The essential thing to remember is this: the 'n' copies of your data are
>> completely independent when written/read by the clustered layer
>> (fabric). It is internal replication (anti-entropy) that converges those
>> copies, pair-wise, to the same eventual state. Fabric is converting the
>> 3 independent results into a single result as best it can. Older
>> versions did not expose the 201 vs 202 distinction, calling both of them
>> 201. I do agree with you that there's little value in the 202
>> distinction. About the only thing you could do is investigate your
>> cluster for connectivity issues or overloading if you get a sustained
>> period of 202s, as it would be an indicator that the system is
>> partitioned.
>>
>> In order to achieve your goals, CouchDB 2.0 would have to ensure that
>> the result of a write did not change after the fact. That is,
>> anti-entropy would need to be disabled, or would somehow have to agree
>> to roll forward or backward based on the initial circumstances. In
>> short, we'd have to introduce strong consistency (Paxos, Raft, or ZAB,
>> say). While this would be a great feature to add, it's not currently
>> present, and no amount of twiddling the status codes will achieve it.
>> We'd rather be honest about our position on the CAP triangle.
>>
>> B.
>>
>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <nate-li...@calftrail.com> wrote:
>>>
>>> A technical co-founder of Cloudant agreed that this was a bug when I
>>> first hit it a few years ago. I tracked down the original thread here —
>>> this is the discussion I was trying to recall in my OP:
>>>
>>> It sounds like perhaps there is a related issue tracked internally at
>>> Cloudant as a result of that conversation.
>>>
>>> JamesM, thanks for your support here and tracking this down. 203 seemed
>>> like the best status code to "steal" for this to me too. Best wishes in
>>> getting this fixed!
>>>
>>> regards,
>>> -natevw
>>>
>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <rnew...@apache.org> wrote:
>>>
>>>> 2.0 is explicitly an AP system; the behaviour you describe is not
>>>> classified as a bug.
>>>>
>>>> Anti-entropy is the main reason that you cannot get strong consistency
>>>> from the system: it will transform "failed" writes (those that
>>>> succeeded on one node but fewer than R nodes) into successes (N
>>>> copies) as long as the nodes have enough healthy uptime.
>>>>
>>>> True of Cloudant and 2.0.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On 24 Mar 2015, at 15:14, Mutton, James <jmut...@akamai.com> wrote:
>>>>>
>>>>> Funny you should mention it. I drafted an email in early February to
>>>>> queue up the same discussion whenever I could get involved again
>>>>> (which I promptly forgot about). What happens currently in 2.0
>>>>> appears unchanged from earlier versions. When R is not satisfied in
>>>>> fabric, fabric_doc_open:handle_message eventually responds with a
>>>>> {stop, …} but leaves the acc-state as the original r_not_met, which
>>>>> triggers a read_repair from the response handler. read_repair results
>>>>> in an {ok, …} with the only doc available, because no other docs are
>>>>> in the list. The final doc returned to chttpd_db:couch_doc_open, and
>>>>> thus to chttpd_db:db_doc_req, is simply {ok, Doc}, which has now lost
>>>>> the fact that the answer was not complete.
>>>>>
>>>>> This seems straightforward to fix by a change in
>>>>> fabric_doc_open:handle_response and read_repair. handle_response
>>>>> knows whether it has R met and could pass that forward, or allow
>>>>> read_repair to pass it forward if read_repair is able to satisfy
>>>>> acc.r. I can't speak for community interest in the behavior of
>>>>> sending a 202, but it's something I'd definitely like, for the same
>>>>> reasons you cite. Plus it just seems disconnected to do it on writes
>>>>> but not reads.
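>>>>>
>>>>> Very roughly, the shape I have in mind (a simplified #acc{} sketched
>>>>> from memory, illustrative only, not a tested patch):
>>>>>
>>>>>     -record(acc, {r, state, replies, q_reply}).  %% simplified
>>>>>
>>>>>     %% Tag the final reply with whether acc.r was actually satisfied,
>>>>>     %% instead of collapsing everything to a bare {ok, Doc}, so that
>>>>>     %% chttpd_db:db_doc_req can choose between 200 and 202.
>>>>>     handle_response(#acc{state = r_met, q_reply = QReply}) ->
>>>>>         {QReply, r_met};
>>>>>     handle_response(#acc{state = r_not_met} = Acc) ->
>>>>>         %% read_repair may still return the lone available copy; keep
>>>>>         %% the quorum shortfall alongside it rather than dropping it.
>>>>>         {read_repair(Acc), r_not_met}.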
>>>>>
>>>>> Cheers,
>>>>> </JamesM>
>>>>>
>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <nate-li...@calftrail.com> wrote:
>>>>>>
>>>>>> Sorry, I have not been following the CouchDB 2.0 roadmap, but I was
>>>>>> extending my fermata-couchdb plugin today and realized that perhaps
>>>>>> the Apache release of BigCouch as CouchDB 2.0 might provide an
>>>>>> opportunity to fix a serious issue I had using Cloudant's
>>>>>> implementation.
>>>>>>
>>>>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518
>>>>>> for some additional background/explanation, but my understanding is
>>>>>> that Cloudant for all practical purposes ignores the read durability
>>>>>> parameter. So you can write with ?w=N to attempt some level of
>>>>>> quorum, and get a 202 back if that quorum is unmet. _However_, when
>>>>>> you ?r=N it really doesn't matter if only <N nodes are available… if
>>>>>> even just a single available node has some version of the requested
>>>>>> document, you will get a successful response (!).
>>>>>>
>>>>>> So in practice, there's no way to actually use the quasi-Dynamo
>>>>>> features to dynamically _choose_ between consistency and
>>>>>> availability — when it comes time to read back a consistent result,
>>>>>> BigCouch instead just always gives you availability* regardless of
>>>>>> what a given request actually needs. (In my usage I ended up
>>>>>> treating a 202 write as a 500, rather than proceeding with no way of
>>>>>> ever knowing whether a write did NOT ACTUALLY conflict or just
>>>>>> hadn't YET, because $who_knows_how_many nodes were still down…)
>>>>>>
>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a
>>>>>> Cloudant engineer (or support personnel at least) but could not be
>>>>>> quickly fixed, as it could introduce backwards-compatibility
>>>>>> concerns. So…
>>>>>>
>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with
>>>>>> BigCouch? If so, could this read durability issue now be fixed
>>>>>> during the merge?
>>>>>>
>>>>>> thanks,
>>>>>> -natevw
>>>>>>
>>>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime
>>>>>> of *any* Couch fork…
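>>>>>>
>>>>>> P.S. For concreteness, the asymmetry above looks roughly like this
>>>>>> (a hypothetical httpc session against a 3-node cluster with two of
>>>>>> its nodes down; host, port, database, and doc names are invented,
>>>>>> and the status codes are simply what I'd expect given the behavior
>>>>>> described):
>>>>>>
>>>>>>     {ok, _} = application:ensure_all_started(inets),
>>>>>>     Put = {"http://127.0.0.1:5984/db/doc1?w=2",
>>>>>>            [], "application/json", "{}"},
>>>>>>     %% the write admits the quorum shortfall...
>>>>>>     {ok, {{_, 202, _}, _, _}} = httpc:request(put, Put, [], []),
>>>>>>     Get = {"http://127.0.0.1:5984/db/doc1?r=2", []},
>>>>>>     %% ...but the read reports plain success, even though only one
>>>>>>     %% copy can possibly have been consulted.
>>>>>>     {ok, {{_, 200, _}, _, _}} = httpc:request(get, Get, [], []).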