Re: Could CouchDB 2.0 fix actual read quorum?

Paul Davis Tue, 31 Mar 2015 12:22:59 -0700

Sounds like there's a bit of confusion here.

What Nathan is asking for is the ability to have Couch respond with some
information on the actual number of replicas that responded to a read
request. That way a user could tell that they issued an r=2 request when
only r=1 was actually performed. Depending on your point of view in an MVCC
world this is either a bug or a feature. :)


It was generally agreed that returning this information would be
beneficial. However, when I started implementing the patch I found that we
could either return it in only a subset of the cases where it happens,
return it inconsistently between various responses, or break replication.

There were three general approaches. The first was to include a new
"_r_met" key in the doc body, a boolean indicating whether the requested
read quorum was actually met for the document. The second was to return a
custom X-R-Met type header, and the last was the status code as described.
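
To make the three options concrete, here is a rough Python sketch of how a
client might check each one. The "_r_met" key and the X-R-Met header are
only the proposals discussed in this thread, and using 203 as the status
code is an assumption taken from the earlier messages. None of these exist
in CouchDB, and the function name is made up for illustration.

```python
def r_met_from_response(status, headers, body, not_met_status=203):
    """Return True/False if the response signals whether R was met, else None.

    All three signals are hypothetical proposals from this thread.
    """
    # Option 1: a boolean "_r_met" member injected into the doc body.
    if isinstance(body, dict) and "_r_met" in body:
        return bool(body["_r_met"])
    # Option 2: a custom response header (header names are case-insensitive).
    for name, value in headers.items():
        if name.lower() == "x-r-met":
            return value.strip().lower() == "true"
    # Option 3: a distinct status code meaning "R not met" (203 is assumed).
    if status == not_met_status:
        return False
    return None  # no signal present, so the client cannot tell
```

Note that only the first option could travel with each row of a multi-doc
response; the header and the status code describe the response as a whole.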

The _r_met member was thought to be the best option, but unfortunately it
breaks replication with older clients because we throw an error rather than
ignore any unknown underscore-prefixed field name. Thus anything that was
dynamically injected into the document body was a non-starter. However, if
we don't inject into the document body then we limit ourselves to the set
of APIs where a single document is returned. This is due both to streaming
semantics (we can't buffer an entire response in memory for large requests
to _all_docs) and to multi-doc responses (a single boolean doesn't say
which document may not have had its R properly met).

On top of that, the other confusing part of meeting the read quorum is
that, given MVCC semantics, it is unclear how to respond to documents with
different revision histories. For instance, if we read two copies of a
doc, we have technically met the r=2 requirement, but what should our
response be if those two revisions are different? (Technically, in this
case we wait for the third response, but the decision on what to return
for the "r met" value is still unclear.)

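As a sketch of why the "r met" value is ambiguous, here are two candidate
definitions in Python. Both function names and the list-of-revisions input
are assumptions for illustration; this is not how fabric represents
replies.

```python
from collections import Counter

def r_met_by_replies(revs, r):
    """R is met if at least r replicas replied, whatever they returned."""
    return len(revs) >= r

def r_met_by_agreement(revs, r):
    """R is met only if at least r replicas returned the same revision."""
    if not revs:
        return False
    return Counter(revs).most_common(1)[0][1] >= r

# Two replicas answered an r=2 read, but with different revisions.
replies = ["2-abc", "2-def"]
print(r_met_by_replies(replies, 2))    # True
print(r_met_by_agreement(replies, 2))  # False
```

The same pair of replies is "met" under one definition and "not met" under
the other, which is the decision that remains open.
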
While I think everyone agrees that it'd be nice to return some of the
information about the copies read, I think it's much less clear what should
be returned, and how, in the multitude of cases where we can specify a
value for R.

While that doesn't offer a concrete path forward, hopefully it clarifies
some of the issues at hand.

On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <rnew...@apache.org>
wrote:

>
> It’s testament to my friendship with Mike that we can disagree on such
> things and remain friends. I am sorry he misled you, though.
>
> CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at
> least in the formal sense, the only one that matters; this is unfortunately
> sloppy language that appears in too many places to correct.
>
> The r= and w= parameters control only how many of the n possible responses
> are collected before returning an http response.
>
> It’s not true that returning 202 in the situation where one write is made
> but fewer than 'w' writes are made means we’ve chosen availability over
> consistency since even if we returned a 500 or closed the connection
> without responding, a subsequent GET could return the document (a
> probability that increases over time as anti-entropy makes the missing
> copies). A write attempt that returned a 409 could, likewise, introduce a
> new edit branch into the document, which might then 'win', altering the
> results of a subsequent GET.
>
> The essential thing to remember is this: the ’n’ copies of your data are
> completely independent when written/read by the clustered layer (fabric).
> It is internal replication (anti-entropy) that converges those copies,
> pair-wise, to the same eventual state. Fabric is converting the 3
> independent results into a single result as best it can. Older versions did
> not expose the 201 vs 202 distinction, calling both of them 201. I do agree
> with you that there’s little value in the 202 distinction. About the only
> thing you could do is investigate your cluster for connectivity issues or
> overloading if you get a sustained period of 202’s, as it would be an
> indicator that the system is partitioned.
>
> In order to achieve your goals, CouchDB 2.0 would have to ensure that the
> result of a write did not change after the fact. That is, anti-entropy
> would need to be disabled, or somehow agree to roll forward or backward
> based on the initial circumstances. In short, we’d have to introduce strong
> consistency (paxos or raft or zab, say). While this would be a great
> feature to add, it’s not currently present, and no amount of twiddling the
> status codes will achieve it. We’d rather be honest about our position on
> the CAP triangle.
>
> B.
>
>
> > On 30 Mar 2015, at 22:37, Nathan Vander Wilt <nate-li...@calftrail.com>
> wrote:
> >
> > A technical co-founder of Cloudant agreed that this was a bug when I
> first hit it a few years ago. I found back the original thread here — this
> is the discussion I was trying to recall in my OP:
> > It sounds like perhaps there is a related issue tracked internally at
> Cloudant as a result of that conversation.
> >
> > JamesM, thanks for your support here and tracking this down. 203 seemed
> like the best status code to "steal" for this to me too. Best wishes in
> getting this fixed!
> >
> > regards,
> > -natevw
> >
> >
> > On Mar 25, 2015, at 4:49 AM, Robert Newson <rnew...@apache.org> wrote:
> >
> >> 2.0 is explicitly an AP system, the behaviour you describe is not
> classified as a bug.
> >>
> >> Anti-entropy is the main reason that you cannot get strong consistency
> from the system, it will transform "failed" writes (those that succeeded on
> one node but fewer than W nodes) into success (N copies) as long as the
> nodes have enough healthy uptime.
> >>
> >> True of cloudant and 2.0.
> >>
> >> Sent from my iPhone
> >>
> >>> On 24 Mar 2015, at 15:14, Mutton, James <jmut...@akamai.com> wrote:
> >>>
> >>> Funny you should mention it.  I drafted an email in early February to
> queue up the same discussion whenever I could get involved again (which I
> promptly forgot about).  What happens currently in 2.0 appears unchanged
> from earlier versions.  When R is not satisfied in fabric,
> fabric_doc_open:handle_message eventually responds with a {stop, …}  but
> leaves the acc-state as the original r_not_met which triggers a read_repair
> from the response handler.  read_repair results in an {ok, …} with the only
> doc available, because no other docs are in the list.  The final doc
> returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is
> simply {ok, Doc}, which has now lost the fact that the answer was not
> complete.
> >>>
> >>> This seems straightforward to fix by a change in
> fabric_doc_open:handle_response and read_repair.  handle_response knows
> whether it has R met and could pass that forward, or allow read-repair to
> pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for
> community interest in the behavior of sending a 202, but it’s something I’d
> definitely like for the same reasons you cite.  Plus it just seems
> disconnected to do it on writes but not reads.
> >>>
> >>> Cheers,
> >>> </JamesM>
> >>>
> >>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <
> nate-li...@calftrail.com> wrote:
> >>>>
> >>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was
> extending my fermata-couchdb plugin today and realized that perhaps the
> Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to
> fix a serious issue I had using Cloudant's implementation.
> >>>>
> >>>> See
> https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for
> some additional background/explanation, but my understanding is that
> Cloudant for all practical purposes ignores the read durability parameter.
> So you can write with ?w=N to attempt some level of quorum, and get a 202
> back if that quorum is unmet. _However_ when you ?r=N it really doesn't
> matter if only <N nodes are available…if even just a single available node
> has some version of the requested document you will get a successful
> response (!).
> >>>>
> >>>> So in practice, there's no way to actually use the quasi-Dynamo
> features to dynamically _choose_ between consistency or availability — when
> it comes time to read back a consistent result, BigCouch instead just
> always gives you availability* regardless of what a given request actually
> needs. (In my usage I ended up treating a 202 write as a 500, rather than
> proceeding with no way of ever knowing whether a write did NOT ACTUALLY
> conflict or just hadn't YET because $who_knows_how_many nodes were still
> down…)
> >>>>
> >>>> IIRC, this was both confirmed and acknowledged as a serious bug by a
> Cloudant engineer (or support personnel at least) but could not be quickly
> fixed as it could introduce backwards-compatibility concerns. So…
> >>>>
> >>>> Is CouchDB 2.0 already breaking backwards compatibility with
> BigCouch? If true, could this read durability issue now be fixed during the
> merge?
> >>>>
> >>>> thanks,
> >>>> -natevw
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> * DISCLAIMER: this statement has not been endorsed by actual uptime
> of *any* Couch fork…
> >>>
> >
>
>
