Re: Could CouchDB 2.0 fix actual read quorum?

Robert Samuel Newson Thu, 02 Apr 2015 02:37:44 -0700

Yeah, not a bad idea. An extra query arg (akin to open_revs=all, 
conflicts=true, etc) would avoid compatibility breaks and would clearly put the 
onus on those supplying it to tolerate the presence of the extra reserved field.


+1
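
A rough client-side sketch of that opt-in, in Python, under stated assumptions. The query argument name `r_met=true` and all three helper names are made up for illustration, and nothing in this thread settles them. Only the `_r_met` field name and the "r copies of the same doc rev" meaning come from the proposal quoted below. No CouchDB API is called here.

```python
from collections import Counter
from urllib.parse import quote, urlencode


def doc_url(base, db, doc_id, r, want_r_met=False):
    """Build a document GET URL.

    `r_met=true` is a HYPOTHETICAL opt-in flag name, not an existing
    CouchDB parameter. Clients that omit it would see no change.
    """
    params = {"r": r}
    if want_r_met:
        params["r_met"] = "true"
    return f"{base}/{quote(db, safe='')}/{quote(doc_id, safe='')}?{urlencode(params)}"


def r_met(replica_revs, r):
    """Return True if at least `r` replicas answered with the same revision.

    `replica_revs` is the list of revision strings collected before the
    timeout; this models the proposed meaning of `_r_met` only.
    """
    if not replica_revs:
        return False
    _, count = Counter(replica_revs).most_common(1)[0]
    return count >= r


def strip_r_met(doc):
    """Drop the reserved `_r_met` field before writing the doc back,
    since servers reject unknown underscore-prefixed fields."""
    return {key: value for key, value in doc.items() if key != "_r_met"}
```

Two replies with different revisions do not count as r=2 being met under this reading, which is the conflict case raised further down the thread.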


> On 2 Apr 2015, at 10:32, Benjamin Bastian <bbast...@apache.org> wrote:
> 
> What about adding an optional query parameter to indicate whether or not
> Couch should include the _r_met flag in the document body/bodies
> (defaulting to false)? That wouldn't break older clients and it'd work for
> the bulk API as well. As far as the case where there are conflicts, it
> seems like the most intuitive thing would be for the "r" in "_r_met" to
> have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait
> for r copies of the same doc rev until a timeout" and "_r_met" would mean
> "we got/didn't get r copies of the same doc rev within the timeout").
> 
> Just my two cents.
> 
> On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <rnew...@apache.org>
> wrote:
> 
>> 
>> Paul outlined his previous efforts to introduce this indication, and the
>> problems he faced doing so. Can we come up with an acceptable mechanism?
>> 
>> A different status code will break a lot of users. While the http spec
>> says you can treat any 2xx code as success, plenty of libraries, etc, only
>> recognise 201 / 202 as successful write and 200 (and maybe 204, 206) for
>> reads.
>> 
>> My preference is for a change that "can’t" break anyone, which I think
>> only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most
>> pleasant thing.
>> 
>> Suggestions?
>> 
>> B.
>> 
>> 
>>> On 1 Apr 2015, at 06:55, Mutton, James <jmut...@akamai.com> wrote:
>>> 
>>> For at least my part of it, I agree with Adam. Bigcouch has made an
>> effort to inform in the case of a failure to apply W. I've seen it lead to
>> confusion when the same logic was not applied on R.
>>> 
>>> I also agree that W and R are not binding contracts. There's no
>> agreement protocol to assure that W is met before being committed to disk.
>> But they are exposed as a blocking parameter of the request, so
>> notification being consistent appeared to me to be the best compromise (vs
>> straight up removal).
>>> 
>>> </JamesM>
>>> 
>>> 
>>>> On Mar 31, 2015, at 13:15, Robert Newson <rnew...@apache.org> wrote:
>>>> 
>>>> 
>>>> If a way can be found that doesn't break things that can be sent in all
>> or most cases, sure. It's what a user can really infer from that which I
>> focused on. Not as much, I think, as users that want that info really want.
>>>> 
>>>> 
>>>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>> 
>>>>> I hope we can all agree that CouchDB should inform the user when it is
>> unable to satisfy the requested read "quorum".
>>>>> 
>>>>> Adam
>>>>> 
>>>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <paul.joseph.da...@gmail.com>
>> wrote:
>>>>>> 
>>>>>> Sounds like there's a bit of confusion here.
>>>>>> 
>>>>>> What Nathan is asking for is the ability to have Couch respond with
>> some
>>>>>> information on the actual number of replicas that responded to a read
>>>>>> request. That way a user could tell that they issued an r=2 request
>> when
>>>>>> only r=1 was actually performed. Depending on your point of view in
>> an MVCC
>>>>>> world this is either a bug or a feature. :)
>>>>>> 
>>>>>> It was generally agreed upon that if we could return this information
>> it
>>>>>> would be beneficial. Although what happened when I started
>> implementing
>>>>>> this patch was that we are either only able to return it in a subset
>> of
>>>>>> cases where it happens, return it inconsistently between various
>> responses,
>>>>>> or break replication.
>>>>>> 
>>>>>> The three general methods for this would be to either include a new
>>>>>> "_r_met" key in the doc body that would be a boolean indicating if the
>>>>>> requested read quorum was actually met for the document. The second
>> was to
>>>>>> return a custom X-R-Met type header, and lastly was the status code as
>>>>>> described.
>>>>>> 
>>>>>> The _r_met member was thought to be the best, but unfortunately that
>> breaks
>>>>>> replication with older clients because we throw an error rather than
>> ignore
>>>>>> any unknown underscore prefixed field name. Thus having something
>> that was
>>>>>> just dynamically injected into the document body was a non-starter.
>>>>>> Unfortunately, if we don't inject into the document body then we limit
>>>>>> ourselves to only the set of APIs where a single document is
>> returned. This
>>>>>> is due to both streaming semantics (we can't buffer an entire
>> response in
>>>>>> memory for large requests to _all_docs) as well as multi-doc
>> responses (a
>>>>>> single boolean doesn't say which document may have not had a properly
>> met
>>>>>> R).
>>>>>> 
>>>>>> On top of that, the other confusing part of meeting the read quorum
>> is that
>>>>>> given MVCC semantics it becomes a bit confusing on how you respond to
>>>>>> documents with different revision histories. For instance, if we read
>> two
>>>>>> docs, we have technically made the r=2 requirement, but what should
>> our
>>>>>> response be if those two revisions are different (technically, in
>> this case
>>>>>> we wait for the third response, but the decision on what to return
>> for the
>>>>>> "r met" value is still unclear).
>>>>>> 
>>>>>> While I think everyone is in agreement that it'd be nice to return
>>>>>> some of the information about the copies read, I think it's much less
>>>>>> clear what and how it should be returned in the multitude of cases in
>>>>>> which we can specify a value for R.
>>>>>> 
>>>>>> While that doesn't offer a concrete path forward, hopefully it
>> clarifies
>>>>>> some of the issues at hand.
>>>>>> 
>>>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <
>> rnew...@apache.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> It’s testament to my friendship with Mike that we can disagree on
>> such
>>>>>>> things and remain friends. I am sorry he misled you, though.
>>>>>>> 
>>>>>>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at
>> all, at
>>>>>>> least in the formal sense, the only one that matters. This is
>> unfortunately
>>>>>>> sloppy language in too many places to correct.
>>>>>>> 
>>>>>>> The r= and w= parameters control only how many of the n possible
>> responses
>>>>>>> are collected before returning an http response.
>>>>>>> 
>>>>>>> It’s not true that returning 202 in the situation where one write is
>> made
>>>>>>> but fewer than 'w' writes are made means we’ve chosen availability
>> over
>>>>>>> consistency since even if we returned a 500 or closed the connection
>>>>>>> without responding, a subsequent GET could return the document (a
>>>>>>> probability that increases over time as anti-entropy makes the
>> missing
>>>>>>> copies). A write attempt that returned a 409 could, likewise,
>> introduce a
>>>>>>> new edit branch into the document, which might then 'win', altering
>> the
>>>>>>> results of a subsequent GET.
>>>>>>> 
>>>>>>> The essential thing to remember is this: the ’n’ copies of your data
>> are
>>>>>>> completely independent when written/read by the clustered layer
>> (fabric).
>>>>>>> It is internal replication (anti-entropy) that converges those
>> copies,
>>>>>>> pair-wise, to the same eventual state. Fabric is converting the 3
>>>>>>> independent results into a single result as best it can. Older
>> versions did
>>>>>>> not expose the 201 vs 202 distinction, calling both of them 201. I
>> do agree
>>>>>>> with you that there’s little value in the 202 distinction. About the
>> only
>>>>>>> thing you could do is investigate your cluster for connectivity
>> issues or
>>>>>>> overloading if you get a sustained period of 202’s, as it would be an
>>>>>>> indicator that the system is partitioned.
>>>>>>> 
>>>>>>> In order to achieve your goals, CouchDB 2.0 would have to ensure
>> that the
>>>>>>> result of a write did not change after the fact. That is,
>> anti-entropy
>>>>>>> would need to be disabled, or somehow agree to roll forward or
>> backward
>>>>>>> based on the initial circumstances. In short, we’d have to introduce
>> strong
>>>>>>> consistency (paxos or raft or zab, say). While this would be a great
>>>>>>> feature to add, it’s not currently present, and no amount of
>> twiddling the
>>>>>>> status codes will achieve it. We’d rather be honest about our
>> position on
>>>>>>> the CAP triangle.
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>> 
>>>>>>>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <
>> nate-li...@calftrail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> A technical co-founder of Cloudant agreed that this was a bug when I
>>>>>>> first hit it a few years ago. I tracked down the original thread here
>> — this
>>>>>>> is the discussion I was trying to recall in my OP:
>>>>>>>> It sounds like perhaps there is a related issue tracked internally
>> at
>>>>>>> Cloudant as a result of that conversation.
>>>>>>>> 
>>>>>>>> JamesM, thanks for your support here and tracking this down. 203
>> seemed
>>>>>>> like the best status code to "steal" for this to me too. Best wishes
>> in
>>>>>>> getting this fixed!
>>>>>>>> 
>>>>>>>> regards,
>>>>>>>> -natevw
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <rnew...@apache.org>
>> wrote:
>>>>>>>>> 
>>>>>>>>> 2.0 is explicitly an AP system, the behaviour you describe is not
>>>>>>> classified as a bug.
>>>>>>>>> 
>>>>>>>>> Anti-entropy is the main reason that you cannot get strong
>> consistency
>>>>>>> from the system, it will transform "failed" writes (those that
>> succeeded on
>>>>>>> one node but fewer than W nodes) into success (N copies) as long as
>> the
>>>>>>> nodes have enough healthy uptime.
>>>>>>>>> 
>>>>>>>>> True of cloudant and 2.0.
>>>>>>>>> 
>>>>>>>>> Sent from my iPhone
>>>>>>>>> 
>>>>>>>>>> On 24 Mar 2015, at 15:14, Mutton, James <jmut...@akamai.com>
>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Funny you should mention it.  I drafted an email in early
>> February to
>>>>>>> queue up the same discussion whenever I could get involved again
>> (which I
>>>>>>> promptly forgot about).  What happens currently in 2.0 appears
>> unchanged
>>>>>>> from earlier versions.  When R is not satisfied in fabric,
>>>>>>> fabric_doc_open:handle_message eventually responds with a {stop, …}
>> but
>>>>>>> leaves the acc-state as the original r_not_met which triggers a
>> read_repair
>>>>>>> from the response handler.  read_repair results in an {ok, …} with
>> the only
>>>>>>> doc available, because no other docs are in the list.  The final doc
>>>>>>> returned to chttpd_db:couch_doc_open and thusly to
>> chttpd_db:db_doc_req is
>>>>>>> simply {ok, Doc}, which has now lost the fact that the answer was not
>>>>>>> complete.
>>>>>>>>>> 
>>>>>>>>>> This seems straightforward to fix by a change in
>>>>>>> fabric_open_doc:handle_response and read_repair.  handle_response
>> knows
>>>>>>> whether it has R met and could pass that forward, or allow
>> read-repair to
>>>>>>> pass it forward if read_repair is able to satisfy acc.r.  I can’t
>> speak for
>>>>>>> community interest in the behavior of sending a 202, but it’s
>> something I’d
>>>>>>> definitely like for the same reasons you cite.  Plus it just seems
>>>>>>> disconnected to do it on writes but not reads.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> </JamesM>
>>>>>>>>>> 
>>>>>>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <
>>>>>>> nate-li...@calftrail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was
>>>>>>> extending my fermata-couchdb plugin today and realized that perhaps
>> the
>>>>>>> Apache release of BigCouch as CouchDB 2.0 might provide an
>> opportunity to
>>>>>>> fix a serious issue I had using Cloudant's implementation.
>>>>>>>>>>> 
>>>>>>>>>>> See
>>>>>>> https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518
>> for
>>>>>>> some additional background/explanation, but my understanding is that
>>>>>>> Cloudant for all practical purposes ignores the read durability
>> parameter.
>>>>>>> So you can write with ?w=N to attempt some level of quorum, and get
>> a 202
>>>>>>> back if that quorum is unmet. _However_ when you ?r=N it really
>> doesn't
>>>>>>> matter if only <N nodes are available…if even just a single
>> available node
>>>>>>> has some version of the requested document you will get a successful
>>>>>>> response (!).
>>>>>>>>>>> 
>>>>>>>>>>> So in practice, there's no way to actually use the quasi-Dynamo
>>>>>>> features to dynamically _choose_ between consistency or availability
>> — when
>>>>>>> it comes time to read back a consistent result, BigCouch instead just
>>>>>>> always gives you availability* regardless of what a given request
>> actually
>>>>>>> needs. (In my usage I ended up treating a 202 write as a 500, rather
>> than
>>>>>>> proceeding with no way of ever knowing whether a write did NOT
>> ACTUALLY
>>>>>>> conflict or just hadn't YET because $who_knows_how_many nodes were
>> still
>>>>>>> down…)
>>>>>>>>>>> 
>>>>>>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug
>> by a
>>>>>>> Cloudant engineer (or support personnel at least) but could not be
>> quickly
>>>>>>> fixed as it could introduce backwards-compatibility concerns. So…
>>>>>>>>>>> 
>>>>>>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with
>>>>>>> BigCouch? If true, could this read durability issue now be fixed
>> during the
>>>>>>> merge?
>>>>>>>>>>> 
>>>>>>>>>>> thanks,
>>>>>>>>>>> -natevw
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> * DISCLAIMER: this statement has not been endorsed by actual
>> uptime
>>>>>>> of *any* Couch fork…
>>>>> 
>> 
>>
