This is a really important topic; thanks, Nick, for bringing it up. Sorry I didn't comment earlier. I think Mike neatly captures my perspective with this bit:
>> Our current behaviour seems extremely subtle and, I'd argue, unexpected. It is hard to reason about if you really need a particular guarantee.
>>
>> Is there an opportunity to clarify behaviour here, such that we really _do_ guarantee point-in-time within _any_ single request, but only do this by leveraging FoundationDB's transaction isolation semantics and as such are only able to offer this based on the 5s timeout in place? The request boundary offers a very clear-cut, user-visible boundary. This would obviously need to cover reads/writes of single docs and so on as well as probably needing further work w.r.t. bulk docs etc.
>>
>> This restriction may naturally loosen as FoundationDB improves and the 5s timeout may be increased.

It'd be great if we could agree on this use of serializable snapshot isolation under the hood for each response to a CouchDB API request (excepting _changes) as an optimal state. Of course, we have this complicating factor of an existing API and a community of users running applications in production against that API :)

As you can imagine from the above, I'd be opposed to A); I think that squanders a real opportunity that we have here with a new major version. I also think that the return on investment for F) is too low; a large portion of our production databases see a 24/7 write load, so a code path that only activates when a DB is quiesced doesn't get my vote.

When I look at the other options, I think it's important to take a broader view and consider the user experience in the client libraries as well as the API. Our experience at IBM Cloud is that a large majority of API requests come from a well-defined set of client libraries, and as we consider non-trivial changes to the API we can look to those libraries as a way to smooth over the API breakage, and intelligently surface new capabilities even if the least-disruptive way to introduce them to the API is a bit janky.

As a concrete example, I would support an aggressive ceiling on `limit` and `skip` in the 4.0 API, while enhancing popular client libraries as needed to allow users to opt in to automatic pagination through larger result sets (rough sketch below). Nick rightly points out that we don't have a good way to declare a read version timeout when we've already streamed a portion of the result set to the client, which is something we ought to consider even if we do apply the restrictions in E).

I acknowledge that I may be opening a can of worms, but ... how much value do we derive from that streaming behavior if we aggressively limit the `limit`? We wouldn't be holding that much data in memory on the CouchDB side, and I don't think many of our clients are parsing half-completed JSON objects for anything beyond the _changes feed. Something to think about.
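To make the client-library idea concrete, here is the rough shape of the opt-in pagination helper I have in mind. This is purely illustrative: the function name, the page size, and the use of Python/requests are just for the sketch, not anything our libraries do today; the only real API behaviour it relies on is the long-standing startkey + skip=1 pagination recipe against _all_docs.

    import json
    import requests

    def iterate_all_docs(base_url, db, page_size=200):
        """Yield _all_docs rows one page at a time, so each HTTP request
        stays comfortably under a server-side ceiling on limit/skip."""
        session = requests.Session()
        params = {"limit": page_size}
        while True:
            resp = session.get(f"{base_url}/{db}/_all_docs", params=params)
            resp.raise_for_status()
            rows = resp.json()["rows"]
            for row in rows:
                yield row
            if len(rows) < page_size:
                break  # short (or empty) page: nothing left to fetch
            # Resume just past the last key we saw; skip=1 drops that key itself.
            params = {
                "limit": page_size,
                "startkey": json.dumps(rows[-1]["key"]),
                "skip": 1,
            }

A library-level iterator like this keeps every individual request small enough to fit whatever ceiling we pick, and a bookmark/continuation row (discussed further down-thread) would let it drop the startkey/skip bookkeeping entirely.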
Cheers,
Adam

> On Feb 25, 2020, at 2:52 PM, Nick Vatamaniuc <vatam...@gmail.com> wrote:
>
> Hi Mike,
>
> Good point about CouchDB not actually providing point-in-time snapshots. I missed those cases when thinking about it.
>
> I wonder if that points to defaulting to option A since it maintains the API compatibility and doesn't loosen the current constraints anyway. At least it will un-break the current version of the branch until we figure out something better. Otherwise it's completely unusable for dbs with more than 200-300k documents.
>
> I like the idea of returning a bookmark and a completed/not-completed flag. That is, it would be option D for _all_docs and map-reduce views, but instead of the complex continuation object it would be a base64-encoded, opaque object. Passing a bookmark back in as a parameter would be exclusive with passing in start, end, skip, limit, and direction (forward/reverse) parameters. For _all_dbs and _dbs_info, where we don't have a place for metadata rows, we might need a new API endpoint. And maybe that opens the door to exposing more transactional features in the API in general...
>
> Also, it seems B, C and F have too many corner cases and inconsistencies, so they can probably be discarded, unless someone disagrees.
>
> Configurable skip and limit maximums (E) may still be interesting. Though they don't necessarily have to be related to transactions; they can instead be used to ensure streaming APIs are consumed in smaller chunks.
>
> Cheers,
> -Nick
>
> On Mon, Feb 24, 2020 at 7:26 AM Mike Rhodes <couc...@dx13.co.uk> wrote:
>>
>> Nick,
>>
>> Thanks for thinking this through, it's certainly subtle and very unclear what a "good" solution is :(
>>
>> I have a couple of thoughts, firstly about the guarantees we currently offer and then wondering whether there is an opportunity to improve our API by offering a single guarantee across all request types rather than bifurcating guarantees.
>>
>> ---
>>
>> The first point is that, by my reasoning, CouchDB 2.x doesn't actually offer a point-in-time guarantee of the following sort currently. I read this as your saying Couch does offer this guarantee, apologies if I'm misreading:
>>
>>> Document the API behavior change that it may present a view of the data that is never a point-in-time[4] snapshot of the DB.
>> ...
>>> [4] For example, an application may have a constraint that documents "a" and "z" cannot both be in the database at the same time. But when iterating it's possible that "a" was there at the start. Then by the end, "a" was removed and "z" added, so both "a" and "z" would appear in the emitted stream. Note that FoundationDB has APIs which exhibit the same "relaxed" constraints: https://apple.github.io/foundationdb/api-python.html#module-fdb.locality
>>
>> I don't believe we offer this guarantee because different database shards will respond to the scatter-gather inherent to a single global query type request at different times. This means that, given the following sequence of events:
>>
>> (1) The shard containing "a" may start returning at time N.
>> (2) "a" may be deleted at N+1, but (1) will still be streaming from time N.
>> (3) "z" may be written to a second shard at time N+2.
>> (4) That second shard may not start returning until time N+3.
>>
>> By my reasoning, "a" and "z" could thus appear in the same result set in current CouchDB, even if they never actually appear in the primary data at the same time (regardless of latency of shard replicas coming into agreement), voiding [4].
>>
>> By my reckoning, you have point-in-time across a query request when you are working with a single shard, meaning we do have point-in-time behaviour for two scenarios:
>>
>> - Partitioned queries.
>> - Q=1 databases.
>>
>> Albeit this guarantee is still talking about the point in time of a single shard's replica rather than all replicas, meaning that further requests may produce different results if the shards are not in agreement, which can perhaps be fixed by using stable=true.
>>
>> I _think_ the working here is correct, but I'd welcome corrections in my understanding!
>>
>> ---
>>
>> Our current behaviour seems extremely subtle and, I'd argue, unexpected. It is hard to reason about if you really need a particular guarantee.
>>
>> Is there an opportunity to clarify behaviour here, such that we really _do_ guarantee point-in-time within _any_ single request, but only do this by leveraging FoundationDB's transaction isolation semantics and as such are only able to offer this based on the 5s timeout in place? The request boundary offers a very clear-cut, user-visible boundary. This would obviously need to cover reads/writes of single docs and so on as well as probably needing further work w.r.t. bulk docs etc.
>>
>> This restriction may naturally loosen as FoundationDB improves and the 5s timeout may be increased.
>>
>> In this approach, my preference would be to add a closing line to the result stream to contain both a bookmark (based on the FoundationDB key perhaps, rather than the index key itself, to avoid problems with skip/limit?) and a complete/not-complete boolean to enable clients to avoid the extra HTTP round-trip for completed result sets that Nick mentions.
>>
>> ---
>>
>> For option (F), I feel that the "it sometimes works and sometimes doesn't" effect of checking the update-seq to see if we can continue streaming will be a confusing experience. I also find something similar with option (A), where a single request covers potentially many points in time and so feels hard to reason about, although it's a bit less subtle than today.
>>
>> Footnote [2] seems quite a major problem with the single transaction approach, however, and as Nick says, it is hard to pick "good" maximums for skip -- perhaps users need to just avoid use of these in the new system given its behaviour? It feels like there's a definite "against the grain" aspect to these.
>>
>> --
>> Mike.
>>
>> On Wed, 19 Feb 2020, at 22:39, Nick Vatamaniuc wrote:
>>> Hello everyone,
>>>
>>> I'd like to discuss the shape and behavior of streaming APIs for CouchDB 4.x.
>>>
>>> By "streaming APIs" I mean APIs which stream data row by row as it gets read from the database. These are the endpoints I was thinking of:
>>>
>>> _all_docs, _all_dbs, _dbs_info and query results
>>>
>>> I want to focus on what happens when FoundationDB transactions time out after 5 seconds. Currently, all of those APIs except _changes[1] feeds will crash or freeze. The reason is that the transaction_too_old error at the end of 5 seconds is retry-able by default, so the request handlers run again and end up shoving the whole request down the socket again, headers and all, which is obviously broken and not what we want.
>>>
>>> There are a few alternatives discussed in the couchdb-dev channel. I'll present some behaviors but feel free to add more. Some ideas might have been discounted in the IRC discussion already, but I'll present them anyway in case it sparks further conversation:
>>>
>>> A) Do what _changes[1] feeds do. Start a new transaction and continue streaming the data from the next key after the last one emitted in the previous transaction. Document the API behavior change that it may present a view of the data that is never a point-in-time[4] snapshot of the DB.
>>>
>>> - Keeps the API shape the same as CouchDB <4.0. Client libraries don't have to change to continue using these CouchDB 4.0 endpoints.
>>> - This is the easiest to implement since it would re-use the implementation for the _changes feed (an extra option passed to the fold function).
>>> - Breaks API behavior if users relied on having a point-in-time[4] snapshot view of the data.
>>>
>>> B) Simply end the stream. Let the users pass a `?transaction=true` param which indicates they are aware the stream may end early and so would have to paginate from the last emitted key with a skip=1. This will keep the response bodies the same as current CouchDB. However, if the users got all the data in one request, they will end up wasting another request to see if there is more data available. If they didn't get any data they might have too large of a skip value (see [2]), so they would have to guess different values for start/end keys. Or impose a max limit for the `skip` parameter.
>>>
>>> C) End the stream and add a final metadata row like a "transaction": "timeout" at the end. That will let the user know to keep paginating from the last key onward. This won't work for `_all_dbs` and `_dbs_info`[3]. Maybe let those two endpoints behave like _changes feeds and only use this for views and _all_docs? If we like this choice, let's think about what happens for those, as I couldn't come up with anything decent there.
>>>
>>> D) Same as C but, to solve the issue with skips[2], emit a bookmark "key" of where the iteration stopped and the current "skip" and "limit" params, which would keep decreasing. Then the user would pass those in "start_key=..." in the next request along with the limit and skip params. So something like "continuation":{"skip":599, "limit":5, "key":"..."} (see the sketch below the list of options). This has the same issue with array results for `_all_dbs` and `_dbs_info`[3].
>>>
>>> E) Enforce low `limit` and `skip` parameters. Enforce maximum values there such that the response time is likely to fit in one transaction. This could be tricky as different runtime environments will have different characteristics. Also, if the timeout happens there isn't a nice way to send an HTTP error since we already sent the 200 response. The downside is that this might break how some users use the API, if, say, they are using large skips and limits already. Perhaps here we do both B and D, such that if users want transactional behavior, they specify that `transaction=true` param and only then we enforce low limit and skip maximums.
>>>
>>> F) At least for `_all_docs` it seems providing a point-in-time snapshot view doesn't necessarily need to be tied to transaction boundaries. We could check the update sequence of the database at the start of the next transaction and if it hasn't changed we can continue emitting a consistent view. This can apply to C and D and would just determine when the stream ends. If there are no writes happening to the db, this could potentially stream all the data just like option A would do. Not entirely sure if this would work for views.
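>>>
>>> To make C and D a little more concrete: the tail of a timed-out response might carry one final row along these lines (the field names and the example key are purely illustrative, not a settled format):
>>>
>>>   {"continuation": {"key": "doc-0401", "skip": 599, "limit": 5}}
>>>
>>> and the client would then resume with something like
>>>
>>>   GET /dbname/_all_docs?start_key="doc-0401"&skip=599&limit=5
>>>
>>> where skip and limit are whatever remained when the stream stopped. In C's simpler form the final row would just flag the timeout, e.g. {"transaction": "timeout"}, and the client would compute the start key from the last row it saw.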
>>>
>>> So what do we think? I can see different combinations of options here, maybe even different ones for each API endpoint. For example, `_all_dbs` and `_dbs_info` are always A, and `_all_docs` and views default to A but have parameters to do F, etc.
>>>
>>> Cheers,
>>> -Nick
>>>
>>> Some footnotes:
>>>
>>> [1] The _changes feed is the only one that works currently. It behaves as per the RFC https://github.com/apache/couchdb-documentation/blob/master/rfcs/003-fdb-seq-index.md#access-patterns. That is, we continue streaming the data by resetting the transaction object and restarting from the last emitted key (the db sequence in this case). However, because the transaction restarts, if a document is updated while the streaming takes place it may appear in the _changes feed twice. That's a behavior difference from CouchDB < 4.0 and we'd have to document it, since previously we presented a point-in-time snapshot of the database from when we started streaming.
>>>
>>> [2] Our streaming APIs have both skips and limits. Since FDB doesn't currently support efficient offsets for key selectors (https://apple.github.io/foundationdb/known-limitations.html#dont-use-key-selectors-for-paging) we implemented skip by iterating over the data. This means that a skip of, say, 100000 could keep timing out the transaction without yielding any data.
>>>
>>> [3] _all_dbs and _dbs_info return a JSON array so they don't have an obvious place to insert a last metadata row.
>>>
>>> [4] For example, an application may have a constraint that documents "a" and "z" cannot both be in the database at the same time. But when iterating it's possible that "a" was there at the start. Then by the end, "a" was removed and "z" added, so both "a" and "z" would appear in the emitted stream. Note that FoundationDB has APIs which exhibit the same "relaxed" constraints: https://apple.github.io/foundationdb/api-python.html#module-fdb.locality
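>>>
>>> (To illustrate [2]: because key-selector offsets aren't efficient, a skip effectively means reading and discarding rows inside the transaction. Roughly the shape below, written against the FDB Python bindings as a sketch only, not our actual code; a large skip can burn the whole 5s budget without yielding anything.)
>>>
>>>   import fdb
>>>   fdb.api_version(620)
>>>   db = fdb.open()
>>>
>>>   @fdb.transactional
>>>   def read_with_skip(tr, begin, end, skip, limit):
>>>       # Read skip + limit rows and throw the first `skip` away.
>>>       rows = []
>>>       seen = 0
>>>       for kv in tr.get_range(begin, end, limit=skip + limit):
>>>           seen += 1
>>>           if seen > skip:
>>>               rows.append(kv)
>>>       return rows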