This is a really important topic; thanks Nick for bringing it up. Sorry I 
didn’t comment earlier. I think Mike neatly captures my perspective with this 
bit:

>> Our current behaviour seems extremely subtle and, I'd argue, unexpected. It 
>> is hard to reason about if you really need a particular guarantee.
>> 
>> Is there an opportunity to clarify behaviour here, such that we really _do_ 
>> guarantee point-in-time within _any_ single request, but only do this by 
>> leveraging FoundationDB's transaction isolation semantics and as such are 
>> only able to offer this based on the 5s timeout in place? The request 
>> boundary offers a very clear cut, user-visible boundary. This would 
>> obviously need to cover reads/writes of single docs and so on as well as 
>> probably needing further work w.r.t. bulk docs etc.
>> 
>> This restriction may naturally loosen as FoundationDB improves and the 5s 
>> timeout may be increased.

It’d be great if we could agree on serializable snapshot isolation under the 
hood for each response to a CouchDB API request (excepting _changes) as the 
optimal end state.
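To make the "one response, one snapshot" idea concrete, here is a minimal, hypothetical sketch in plain Python — not CouchDB's actual implementation. The `FakeDb` class and `stream_request` generator are invented for illustration; in the real proposal the snapshot would be a single FoundationDB read version, bounded by the ~5s transaction lifetime, rather than a deep copy.

```python
import copy

class FakeDb:
    """Toy stand-in for a database whose reads go through one snapshot.

    A deep copy of the data plays the role that a single FoundationDB
    read version would play in the real proposal.
    """

    def __init__(self):
        self.docs = {}

    def stream_request(self):
        # The snapshot is captured once, when streaming begins.
        snapshot = copy.deepcopy(self.docs)
        for key in sorted(snapshot):
            yield key, snapshot[key]

db = FakeDb()
db.docs["a"] = 1

stream = db.stream_request()
first = next(stream)      # snapshot taken here; "a" is emitted

# A "concurrent" writer removes "a" and adds "z" mid-request...
del db.docs["a"]
db.docs["z"] = 2

rest = list(stream)       # ...but the in-flight response never sees "z"
assert first == ("a", 1)
assert rest == []
```

The point of the sketch is the invariant footnote [4] describes: a client can never observe "a" and "z" in the same result set if they never coexisted in the primary data.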

Of course, we have this complicating factor of an existing API and a community 
of users running applications in production against that API :) As you can 
imagine from the above, I’d be opposed to A); I think that squanders a real 
opportunity that we have here with a new major version. I also think that the 
return on investment for F) is too low; a large portion of our production 
databases see a 24/7 write load so a code path that only activates when a DB is 
quiesced doesn’t get my vote.

When I look at the other options, I think it’s important to take a broader view 
and consider the user experience in the client libraries as well as the API. 
Our experience at IBM Cloud is that a large majority of API requests come from 
a well-defined set of client libraries, and as we consider non-trivial changes 
to the API we can look to those libraries as a way to smooth over the API 
breakage, and intelligently surface new capabilities even if the 
least-disruptive way to introduce them to the API is a bit janky.

As a concrete example, I would support an aggressive ceiling on `limit` and 
`skip` in the 4.0 API, while enhancing popular client libraries as needed to 
allow users to opt-in to automatic pagination through larger result sets.
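As a sketch of how that could look from the client's side — hypothetical names throughout: `query_page` stands in for a server endpoint with a capped `limit` that returns a bookmark and a completed flag, and `paginate_all` for an opt-in client-library helper; neither exists today:

```python
SERVER_MAX_LIMIT = 3  # stand-in for an aggressive ceiling on `limit`

def query_page(data, bookmark=None, limit=SERVER_MAX_LIMIT):
    """One request: at most `limit` rows, plus a bookmark and a completed flag."""
    keys = sorted(data)
    start = 0 if bookmark is None else keys.index(bookmark) + 1
    rows = keys[start:start + limit]
    return {
        "rows": rows,
        "bookmark": rows[-1] if rows else bookmark,
        "completed": start + limit >= len(keys),
    }

def paginate_all(data):
    """Client-library helper: transparently walk every page for opted-in users."""
    bookmark, done = None, False
    while not done:
        page = query_page(data, bookmark)
        yield from page["rows"]
        bookmark, done = page["bookmark"], page["completed"]

docs = {key: None for key in "abcdefg"}
assert list(paginate_all(docs)) == list("abcdefg")
```

Each `query_page` call maps to one bounded transaction on the server, while users of the helper see one logical result set — the API breakage stays visible only to those who call the raw endpoint.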

Nick rightly points out that we don’t have a good way to declare a read version 
timeout when we’ve already streamed a portion of the result set to the client, 
which is something we ought to consider even if we do apply the restrictions in 
E). I acknowledge that I may be opening a can of worms, but ... how much value 
do we derive from that streaming behavior if we aggressively limit the `limit`? 
We wouldn’t be holding that much data in memory on the CouchDB side, and I 
don’t think many of our clients are parsing half-completed JSON objects for 
anything beyond the _changes feed. Something to think about.
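To put a rough number on "not that much data" — both figures below are assumptions for illustration, not measurements of any real deployment:

```python
# Back-of-envelope: memory held per request if we buffer a whole capped
# response instead of streaming it.
max_limit = 200          # hypothetical ceiling on `limit`
avg_doc_bytes = 10_000   # assumed average document size (~10 KB)

buffered_bytes = max_limit * avg_doc_bytes
print(buffered_bytes)    # roughly 2 MB per in-flight request
```

At that scale, buffering the full response before writing the status line would also let us return a proper HTTP error on timeout, sidestepping the mid-stream failure problem entirely.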

Cheers, Adam

> On Feb 25, 2020, at 2:52 PM, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> 
> Hi Mike,
> 
> Good point about CouchDB not actually providing point-in-time
> snapshots. I missed those cases when thinking about it.
> 
> I wonder if that points to defaulting to option A, since it maintains
> API compatibility and doesn't loosen the current constraints
> anyway. At least it will un-break the current version of the branch
> until we figure out something better. Otherwise it's completely
> unusable for dbs with more than 200-300k documents.
> 
> I like the idea of returning a bookmark and a completed/not-completed
> flag. That is, it would be option D for _all_docs and map-reduce
> views, but instead of the complex continuation object it would be a
> base64-encoded, opaque object. Passing a bookmark back in as a
> parameter would be mutually exclusive with passing the start, end, skip,
> limit, and direction (forward/reverse) parameters. For _all_dbs and
> _dbs_info where we don't have a place for metadata rows, we might need
> a new API endpoint. And maybe that opens the door to expose more
> transactional features in the API in general...
> 
> Also, it seems B, C and F have too many corner cases and
> inconsistencies so they can probably be discarded, unless someone
> disagrees.
> 
> Configurable skips and limit maximums (E) may still be interesting.
> Though, they don't necessarily have to be related to transactions, but
> can instead be used to ensure streaming APIs are consumed in smaller
> chunks.
> 
> Cheers,
> -Nick
> 
> 
> 
> On Mon, Feb 24, 2020 at 7:26 AM Mike Rhodes <couc...@dx13.co.uk> wrote:
>> 
>> Nick,
>> 
>> Thanks for thinking this through, it's certainly subtle and very unclear 
>> what is a "good" solution :(
>> 
>> I have a couple of thoughts, firstly about the guarantees we currently offer 
>> and then wondering whether there is an opportunity to improve our API by 
>> offering a single guarantee across all request types rather than bifurcating 
>> guarantees.
>> 
>> ---
>> 
>> The first point is that, by my reasoning, CouchDB 2.x doesn't actually 
>> offer a point-in-time guarantee of the following sort. I read this as you 
>> saying Couch does offer this guarantee; apologies if I'm misreading:
>> 
>>> Document the API behavior change: the view of the
>>> data it presents may not be a point-in-time[4] snapshot of the
>>> DB.
>> ...
>>> [4] For example, an application may have a constraint that documents "a" and "z"
>>> cannot both be in the database at the same time. But when iterating
>>> it's possible that "a" was there at the start. Then by the end, "a"
>>> was removed and "z" added, so both "a" and "z" would appear in the
>>> emitted stream. Note that FoundationDB has APIs which exhibit the same
>>> "relaxed" constraints:
>>> https://apple.github.io/foundationdb/api-python.html#module-fdb.locality
>> 
>> I don't believe we offer this guarantee because different database shards 
>> will respond to the scatter-gather inherent to a single global query type 
>> request at different times. This means that, given the following sequence of 
>> events:
>> 
>> (1) The shard containing "a" may start returning at time N.
>> (2) "a" may be deleted at N+1, but (1) will still be streaming from time N.
>> (3) "z" may be written to a second shard at time N+2.
>> (4) that second shard may not start returning until time N+3.
>> 
>> By my reasoning, "a" and "z" could thus appear in the same result set in 
>> current CouchDB, even if they never actually appear in the primary data at 
>> the same time (regardless of latency of shard replicas coming into 
>> agreement), voiding [4].
>> 
>> By my reckoning, you have point-in-time across a query request when you are 
>> working with a single shard, meaning we do have point in time for two 
>> scenarios:
>> 
>> - Partitioned queries.
>> - Q=1 databases.
>> 
>> Albeit this guarantee is still talking about the point in time of a single 
>> shard's replica rather than all replicas, meaning that further requests may 
>> produce different results if the shards are not in agreement. Which can 
>> perhaps be fixed by using stable=true.
>> 
>> I _think_ the working here is correct, but I'd welcome corrections in my 
>> understanding!
>> 
>> ---
>> 
>> Our current behaviour seems extremely subtle and, I'd argue, unexpected. It 
>> is hard to reason about if you really need a particular guarantee.
>> 
>> Is there an opportunity to clarify behaviour here, such that we really _do_ 
>> guarantee point-in-time within _any_ single request, but only do this by 
>> leveraging FoundationDB's transaction isolation semantics and as such are 
>> only able to offer this based on the 5s timeout in place? The request 
>> boundary offers a very clear cut, user-visible boundary. This would 
>> obviously need to cover reads/writes of single docs and so on as well as 
>> probably needing further work w.r.t. bulk docs etc.
>> 
>> This restriction may naturally loosen as FoundationDB improves and the 5s 
>> timeout may be increased.
>> 
>> In this approach, my preference would be to add a closing line to the result 
>> stream containing both a bookmark (perhaps based on the FoundationDB key 
>> rather than the index key itself, to avoid problems with skip/limit?) and 
>> a complete/not-complete boolean to enable clients to avoid the extra HTTP 
>> round-trip for completed result sets that Nick mentions.
>> 
>> ---
>> 
>> For option (F), I feel that the "it sometimes works and sometimes doesn't" 
>> effect of checking the update-seq to see if we can continue streaming will 
>> be a confusing experience. I also find something similar with option (A) 
>> where a single request covers potentially many points in time and so feels 
>> hard to reason about, although it's a bit less subtle than today.
>> 
>> Footnote [2] seems quite a major problem, however, with the single 
>> transaction approach and as Nick says, it is hard to pick "good" maximums 
>> for skip -- perhaps users need to just avoid use of these in the new system 
>> given its behaviour? It feels like there's a definite "against the grain" 
>> aspect to these.
>> 
>> --
>> Mike.
>> 
>> On Wed, 19 Feb 2020, at 22:39, Nick Vatamaniuc wrote:
>>> Hello everyone,
>>> 
>>> I'd like to discuss the shape and behavior of streaming APIs for CouchDB 4.x
>>> 
>>> By "streaming APIs" I mean APIs which stream data row by row as it gets
>>> read from the database. These are the endpoints I was thinking of:
>>> 
>>> _all_docs, _all_dbs, _dbs_info and query results
>>> 
>>> I want to focus on what happens when FoundationDB transactions
>>> time out after 5 seconds. Currently, all those APIs except the
>>> _changes[1] feed will crash or freeze. The reason is that the
>>> transaction_too_old error at the end of 5 seconds is retry-able by
>>> default, so the request handlers run again and end up shoving the
>>> whole request down the socket again, headers and all, which is
>>> obviously broken and not what we want.
>>> 
>>> There are a few alternatives discussed in the couchdb-dev channel. I'll
>>> present some behaviors but feel free to add more. Some ideas might
>>> have been discounted in the IRC discussion already but I'll present
>>> them anyway in case it sparks further conversation:
>>> 
>>> A) Do what the _changes[1] feed does. Start a new transaction and continue
>>> streaming the data from the next key after the last one emitted in the
>>> previous transaction. Document the API behavior change: the view of
>>> the data it presents may not be a point-in-time[4] snapshot of the
>>> DB.
>>> 
>>> - Keeps the API shape the same as CouchDB <4.0. Client libraries
>>> don't have to change to continue using these CouchDB 4.0 endpoints
>>> - This is the easiest to implement since it would re-use the
>>> implementation for _changes feed (an extra option passed to the fold
>>> function).
>>> - Breaks API behavior if users relied on having a point-in-time[4]
>>> snapshot view of the data.
>>> 
>>> B) Simply end the stream. Let users pass a `?transaction=true`
>>> param which indicates they are aware the stream may end early and so
>>> would have to paginate from the last emitted key with skip=1. This
>>> will keep the request bodies the same as current CouchDB. However, if
>>> users got all the data in one request, they will end up wasting
>>> another request to see if there is more data available. If they didn't
>>> get any data they might have too large a skip value (see [2]) and so
>>> would have to guess different values for start/end keys. Or we could
>>> impose a max limit on the `skip` parameter.
>>> 
>>> C) End the stream and add a final metadata row like "transaction":
>>> "timeout" at the end. That will let the user know to keep paginating
>>> from the last key onward. This won't work for `_all_dbs` and
>>> `_dbs_info`[3]. Maybe let those two endpoints behave like the _changes
>>> feed and only use this for views and _all_docs? If we like this
>>> choice, let's think through what happens for those, as I couldn't come
>>> up with anything decent there.
>>> 
>>> D) Same as C but, to solve the issue with skips[2], emit a bookmark
>>> "key" of where the iteration stopped and the current "skip" and
>>> "limit" params, which would keep decreasing. Then the user would pass
>>> those as "start_key=..." in the next request along with the limit and
>>> skip params. So something like "continuation":{"skip":599, "limit":5,
>>> "key":"..."}. This has the same issue with array results for
>>> `_all_dbs` and `_dbs_info`[3].
>>> 
>>> E) Enforce low `limit` and `skip` parameters. Enforce maximum values
>>> there such that the response time is likely to fit in one transaction.
>>> This could be tricky as different runtime environments will have
>>> different characteristics. Also, if the timeout happens there isn't
>>> a nice way to send an HTTP error since we already sent the 200
>>> response. The downside is that this might break how some users use the
>>> API, if, say, they are using large skips and limits already. Perhaps here
>>> we do both B and D, such that if users want transactional behavior,
>>> they specify the `transaction=true` param and only then do we enforce
>>> low limit and skip maximums.
>>> 
>>> F) At least for `_all_docs`, it seems providing a point-in-time
>>> snapshot view doesn't necessarily need to be tied to transaction
>>> boundaries. We could check the update sequence of the database at the
>>> start of the next transaction and, if it hasn't changed, we can continue
>>> emitting a consistent view. This can apply to C and D and would just
>>> determine when the stream ends. If there are no writes happening to
>>> the db, this could potentially stream all the data just like option A
>>> would. Not entirely sure if this would work for views.
>>> 
>>> So what do we think? I can see different combinations of options here,
>>> maybe even different for each API point. For example `_all_dbs`,
>>> `_dbs_info` are always A, and `_all_docs` and views default to A but
>>> have parameters to do F, etc.
>>> 
>>> Cheers,
>>> -Nick
>>> 
>>> Some footnotes:
>>> 
>>> [1] The _changes feed is the only one that works currently. It behaves as
>>> per the RFC:
>>> https://github.com/apache/couchdb-documentation/blob/master/rfcs/003-fdb-seq-index.md#access-patterns.
>>> That is, we continue streaming the data by resetting the transaction
>>> object and restarting from the last emitted key (db sequence in this
>>> case). However, because the transaction restarts, if a document is
>>> updated while the streaming takes place it may appear in the _changes
>>> feed twice. That's a behavior difference from CouchDB < 4.0 and we'd
>>> have to document it, since previously we presented a point-in-time
>>> snapshot of the database from when we started streaming.
>>> 
>>> [2] Our streaming APIs have both skips and limits. Since FDB doesn't
>>> currently support efficient offsets for key selectors
>>> (https://apple.github.io/foundationdb/known-limitations.html#dont-use-key-selectors-for-paging)
>>> we implemented skip by iterating over the data. This means that a skip
>>> of say 100000 could keep timing out the transaction without yielding
>>> any data.
>>> 
>>> [3] _all_dbs and _dbs_info return a JSON array so they don't have an
>>> obvious place to insert a last metadata row.
>>> 
>>> [4] For example, an application may have a constraint that documents "a" and "z"
>>> cannot both be in the database at the same time. But when iterating
>>> it's possible that "a" was there at the start. Then by the end, "a"
>>> was removed and "z" added, so both "a" and "z" would appear in the
>>> emitted stream. Note that FoundationDB has APIs which exhibit the same
>>> "relaxed" constraints:
>>> https://apple.github.io/foundationdb/api-python.html#module-fdb.locality
>>> 
