Addendum: by “directory aliasing” I meant aliasing within a document (either the actual Directory layer or something equivalent of our own making). Directory aliasing for each database is a good way to reduce key size without significant cost, though if Redwood lands in time even this would become a needless obfuscation.
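To make the aliasing concrete, here is a minimal sketch using the FoundationDB Directory layer, assuming the Python `fdb` bindings and a running cluster; the Directory layer allocates a short prefix per database, so document keys never repeat the full database name:

```python
import fdb

fdb.api_version(600)
db = fdb.open()  # assumes a running FoundationDB cluster

# One top-level directory holds all CouchDB databases; each database is a
# subdirectory whose allocated prefix is a few bytes, however long its name.
couch = fdb.directory.create_or_open(db, ('couchdb',))
mydb = couch.create_or_open(db, ('a-database-with-a-rather-long-name',))

# Per-database subspaces, as in the data model transcribed later in the thread.
by_id = mydb['by_id']
by_seq = mydb['by_seq']

# Keys are packed under the short directory prefix, not the database name.
key = by_id.pack(('foo', 'owner'))
print(len(key), key)

# _all_dbs is then just a directory listing.
print(couch.list(db))
```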
> On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnew...@apache.org> wrote:
>
> Interesting suggestion, though obviously the details might be the wrong kind of fun.
>
> Somewhere above I suggested this would be something we could change over time, and even use different approaches for different documents within the same database. This is the long way of saying there are multiple ways to do this, each with advantages and none without disadvantages.
>
> I don’t think adding a layer of abstraction is the right move just yet; I think we should continue to find consensus on one answer to this question (and the related ones in other threads) for the first release. It’s easy to say “we can change it later”, of course. We can, though it would be a chunk of work in the context of something that already works, and I’ve rarely seen anyone sign up for that.
>
> I’m fine with the first proposal from Adam, where the keys are tuples of key parts pointing at terminal values. To make it easier for the first version, I would exclude optimisations like deduplication, the Directory aliasing, or the schema thing that I suggested and that Ilya incorporated a variant of in a follow-up post. We’d accept that there are limits on the sizes of documents, including the awkward-to-express one about property depth.
>
> Stepping back, I’m not seeing any essential improvement over Adam’s original proposal besides the few corrections and clarifications made by various authors. Could we start an RFC based on Adam’s original proposal for document body, revision tree and index storage? We could then have PRs against that for each additional optimisation (one person’s optimisation is another person’s needless complication).
>
> If I’ve missed some genuine advance on the original proposal in this long thread, please call it out for me.
>
> B.
>
>> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjie...@apache.org> wrote:
>>
>> As is evident from the length of this thread, there's a pretty big design space to cover here, and it seems unlikely we'll have arrived at a "correct" solution even by the time this thing ships. Perhaps it would be worthwhile to treat the in-FDB representation of data as a first-class abstraction and support multiple representations simultaneously?
>>
>> Obviously there's no such thing as a zero-cost abstraction - and I've not thought very hard about how far up the stack the document representation would need to leak - but supporting different layouts (primarily, as Adam points out, of the document body itself) might prove interesting and useful. I'm sure there are folks interested in a column-shaped CouchDB, for example.
>>
>> --
>> b
>>
>> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> wrote:
>>>
>>> Good points on the revtree; I agree with you that we should store it intelligently to gain the benefits you mentioned.
>>>
>>> --
>>> Robert Samuel Newson
>>> rnew...@apache.org
>>>
>>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
>>>> I do not think we should store the revtree as a blob. The design where each edit branch is its own KV should save on network IO and CPU cycles for normal updates. We’ve performed too many heroics to keep couch_key_tree from stalling entire databases when trying to update a single document with a wide revision tree; I would much prefer to ignore other edit branches entirely when all we’re doing is extending one of them.
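For illustration, a sketch of the per-branch layout described above, using plain Python tuples in place of FDB tuple keys; the concrete key shapes appear later in the thread under “Revision Metadata”, and the numeric “live” flag here is my own stand-in for the deleted status so that the winner sorts last:

```python
# One KV per edit branch: (doc_id, "_meta", live_flag, rev_pos, rev_hash) -> path to root.
# live_flag is 1 for live branches and 0 for deleted ones, so with lexicographic key
# ordering a reverse, limit=1 range read over the "_meta" prefix returns the winner.
revs = {
    ("foo", "_meta", 1, 1, "def"): [],
    ("foo", "_meta", 1, 4, "bif"): ["3-baz", "2-bar", "1-foo"],   # winner
    ("foo", "_meta", 0, 3, "abc"): ["2-bar", "1-foo"],            # deleted branch
}

# Winner selection: the last key in the range, i.e. a reverse limit=1 range read.
winner = max(revs)
assert winner == ("foo", "_meta", 1, 4, "bif")

# Extending the winning branch rewrites only that branch's single KV; the other
# edit branches are neither read nor written.
_, _, live, pos, rev = winner
revs[("foo", "_meta", live, pos + 1, "new")] = ["%d-%s" % (pos, rev)] + revs.pop(winner)
print(sorted(revs))
```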
>>>>
>>>> I also do not think we should store JSON documents as blobs, but it’s a closer call. Some of my reasoning for preferring the exploded path design:
>>>>
>>>> - it lends itself nicely to sub-document operations, for which Jan crafted an RFC last year: https://github.com/apache/couchdb/issues/1559
>>>> - it optimizes the creation of Mango indexes on existing databases, since we only need to retrieve the value(s) we want to index
>>>> - it optimizes Mango queries that use field selectors
>>>> - anyone who wanted to try their hand at GraphQL will find it very handy: https://github.com/apache/couchdb/issues/1499
>>>> - looking further ahead, it lets us play with smarter leaf value types like Counters (yes, I’m still on the CRDT bandwagon, sorry)
>>>>
>>>> A few comments on the thread:
>>>>
>>>>>>> * Most document bodies are probably going to be smaller than 100k. So in the majority of cases it would be one write / one read to update and fetch the document body.
>>>>
>>>> We should test, but I expect reading 50KB of data in a range query is almost as efficient as reading a single 50KB value. Similarly, writes to a contiguous set of keys should be quite efficient.
>>>>
>>>> I am concerned about the overhead of the repeated field paths in the keys with the exploded path option in the absence of key prefix compression. That would be my main reason to acquiesce and throw away all the document structure.
>>>>
>>>> Adam
>>>>
>>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> wrote:
>>>>>
>>>>> I like the idea that we'd reuse the same pattern (but perhaps not the same _code_) for doc bodies, revtree and attachments.
>>>>>
>>>>> I hope we still get to delete couch_key_tree.erl, though.
>>>>>
>>>>> --
>>>>> Robert Samuel Newson
>>>>> rnew...@apache.org
>>>>>
>>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
>>>>>> I like the idea from a “trying a simple thing first” perspective, but Nick’s points below are especially convincing reasons to go with this for now.
>>>>>>
>>>>>> Best
>>>>>> Jan
>>>>>> —
>>>>>>
>>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Sorry for jumping in so late, I was mostly following from the sidelines. A lot of good discussion is happening and I am excited about the possibilities here.
>>>>>>>
>>>>>>> I do like the simpler "chunking" approach for a few reasons:
>>>>>>>
>>>>>>> * Most document bodies are probably going to be smaller than 100k. So in the majority of cases it would be one write / one read to update and fetch the document body.
>>>>>>>
>>>>>>> * We could reuse the chunking code for attachment handling and possibly revision key trees. So it's the general pattern of uploading chunks to some prefix and, when finished, flipping an atomic toggle to make it current.
>>>>>>>
>>>>>>> * Do the same thing with revision trees and we could re-use the revision tree manipulation logic. That is, the key tree in most cases would be small enough to fit in 100k, but if it got huge it would get chunked. This would allow us to reuse all the battle-tested couch_key_tree code mostly as is.
>>>>>>> We even have property tests for it: https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
>>>>>>>
>>>>>>> * It removes the need to explain the max exploded path length limitation to customers.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> -Nick
>>>>>>>
>>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> An alternative storage model that we should seriously consider is to follow our current approach in couch_file et al. Specifically, the document _body_ is stored as an uninterpreted binary value. This would be much like the obvious plan for attachment storage; a key prefix that identifies the database and document, with the final item of that key tuple being an incrementing integer. Each of those keys has a binary value of up to 100k. Fetching all values with that key prefix, in fdb's natural ordering, will yield the full document body, which can be JSON decoded for further processing.
>>>>>>>>
>>>>>>>> I like this idea, and I like Adam's original proposal to explode documents into property paths. I have a slight preference for the simplicity of the idea in the previous paragraph, not least because it's close to what we do today. I also think it will be possible to migrate to alternative storage models in future, and FoundationDB's transaction support means we can do this migration seamlessly should we come to it.
>>>>>>>>
>>>>>>>> I'm very interested in knowing if anyone else is interested in going this simple, or considers it a wasted opportunity relative to the 'exploded' path.
>>>>>>>>
>>>>>>>> B.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Samuel Newson
>>>>>>>> rnew...@apache.org
>>>>>>>>
>>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
>>>>>>>>> I've been remiss here in not posting the data model ideas that IBM worked up while we were thinking about using FoundationDB, so I'm posting them now. This is Adam Kocoloski's original work, I am just transcribing it, and this is the context that the folks from the IBM side came in with, for full disclosure.
>>>>>>>>>
>>>>>>>>> Basics
>>>>>>>>>
>>>>>>>>> 1. All CouchDB databases are inside a Directory
>>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
>>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so `_all_dbs` is the list of directories from 1.
>>>>>>>>> 4. Each Directory representing a CouchDB database has several Subspaces:
>>>>>>>>> 4a. by_id/ doc subspace: actual document contents
>>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
>>>>>>>>> 4c. index_definitions, indexes, ...
>>>>>>>>>
>>>>>>>>> JSON Mapping
>>>>>>>>>
>>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>   “owner”: “bob”,
>>>>>>>>>   “mylist”: [1,3,5],
>>>>>>>>>   “mymap”: {
>>>>>>>>>     “blue”: “#0000FF”,
>>>>>>>>>     “red”: “#FF0000”
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> maps to
>>>>>>>>>
>>>>>>>>> (“foo”, “owner”) = “bob”
>>>>>>>>> (“foo”, “mylist”, 0) = 1
>>>>>>>>> (“foo”, “mylist”, 1) = 3
>>>>>>>>> (“foo”, “mylist”, 2) = 5
>>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
>>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000”
>>>>>>>>>
>>>>>>>>> NB: this means that the 100KB limit applies to individual leaves in the JSON object, not the entire doc
>>>>>>>>>
>>>>>>>>> Edit Conflicts
>>>>>>>>>
>>>>>>>>> We need to account for the presence of conflicts in various levels of the doc due to replication.
>>>>>>>>>
>>>>>>>>> The proposal is to create a special value indicating that the subtree below our current cursor position is in an unresolvable conflict, then add additional KV pairs below it to describe the conflicting entries.
>>>>>>>>>
>>>>>>>>> The KV data model allows us to store these efficiently and minimize duplication of data:
>>>>>>>>>
>>>>>>>>> A document with these two conflicts:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>   “_rev”: “1-abc”,
>>>>>>>>>   “owner”: “alice”,
>>>>>>>>>   “active”: true
>>>>>>>>> }
>>>>>>>>> {
>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>   “_rev”: “1-def”,
>>>>>>>>>   “owner”: “bob”,
>>>>>>>>>   “active”: true
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> could be stored thus:
>>>>>>>>>
>>>>>>>>> (“foo”, “active”) = true
>>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
>>>>>>>>>
>>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree, this representation can handle conflicts of different data types as well.
>>>>>>>>>
>>>>>>>>> Missing fields need to be handled explicitly:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>   “_rev”: “1-abc”,
>>>>>>>>>   “owner”: “alice”,
>>>>>>>>>   “active”: true
>>>>>>>>> }
>>>>>>>>> {
>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>   “_rev”: “1-def”,
>>>>>>>>>   “owner”: {
>>>>>>>>>     “name”: “bob”,
>>>>>>>>>     “email”: “b...@example.com”
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> could be stored thus:
>>>>>>>>>
>>>>>>>>> (“foo”, “active”) = kCONFLICT
>>>>>>>>> (“foo”, “active”, “1-abc”) = true
>>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
>>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
>>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ...
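A runnable sketch of the mapping above, assuming plain Python types rather than the real FDB tuple encoding; `explode` is a hypothetical helper, not code from the proposal:

```python
import json

def explode(doc_id, body):
    """Flatten a JSON document body into (key tuple, scalar value) pairs."""
    kvs = []

    def walk(prefix, node):
        if isinstance(node, dict):
            for field, value in node.items():
                walk(prefix + (field,), value)
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(prefix + (i,), value)
        else:
            kvs.append((prefix, node))  # scalar leaf becomes one KV pair

    walk((doc_id,), body)
    return kvs

body = json.loads('{"owner": "bob", "mylist": [1, 3, 5], '
                  '"mymap": {"blue": "#0000FF", "red": "#FF0000"}}')

for key, value in explode("foo", body):
    print(key, "=", value)
# ('foo', 'owner') = bob, ('foo', 'mylist', 0) = 1, ..., matching the example above.
```

Reading the document back is then a single range read over the ("foo",) prefix, and the kCONFLICT marker above is just another leaf value placed at the top of a conflicting subtree.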
>>>>>>>>>
>>>>>>>>> Revision Metadata
>>>>>>>>>
>>>>>>>>> * CouchDB uses a hash history for revisions
>>>>>>>>> ** Each edit is identified by the hash of the content of the edit, including the base revision against which it was applied
>>>>>>>>> ** Individual edit branches are bounded in length, but the number of branches is potentially unbounded
>>>>>>>>>
>>>>>>>>> * Size limits preclude us from storing the entire key tree as a single value; in pathological situations the tree could exceed 100KB (each entry is > 16 bytes)
>>>>>>>>>
>>>>>>>>> * Store each edit branch as a separate KV, including deleted status, in a special subspace
>>>>>>>>>
>>>>>>>>> * Structure the key representation so that the “winning” revision can be automatically retrieved in a limit=1 key range operation
>>>>>>>>>
>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”, “2-bar”, “1-foo”] <-- winner
>>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
>>>>>>>>>
>>>>>>>>> Changes Feed
>>>>>>>>>
>>>>>>>>> * FDB supports a concept called a versionstamp — a 10-byte, unique, monotonically (but not sequentially) increasing value for each committed transaction. The first 8 bytes are the committed version of the database. The last 2 bytes are monotonic in the serialization order for transactions.
>>>>>>>>>
>>>>>>>>> * A transaction can specify a particular index into a key where the following 10 bytes will be overwritten by the versionstamp at commit time
>>>>>>>>>
>>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes feed
>>>>>>>>>
>>>>>>>>> by_seq subspace
>>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”)
>>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”)
>>>>>>>>>
>>>>>>>>> by_id subspace
>>>>>>>>> (“bar”, “_vsn”) = “versionstamp4”
>>>>>>>>> ...
>>>>>>>>> (“foo”, “_vsn”) = “versionstamp1”
>>>>>>>>>
>>>>>>>>> JSON Indexes
>>>>>>>>>
>>>>>>>>> * “Mango” JSON indexes are defined by
>>>>>>>>> ** a list of field names, each of which may be nested,
>>>>>>>>> ** an optional partial_filter_selector which constrains the set of docs that contribute,
>>>>>>>>> ** an optional name defined by the ddoc field (the name is auto-generated if not supplied)
>>>>>>>>>
>>>>>>>>> * Store index definitions in a single subspace to aid query planning
>>>>>>>>> ** ((person,name), title, email) = (“name-title-email”, “{“student”: true}”)
>>>>>>>>> ** Store the values for each index in a dedicated subspace, adding the document ID as the last element in the tuple
>>>>>>>>> *** (“rosie revere”, “engineer”, “ro...@example.com”, “foo”) = null
>>>>>>>>>
>>>>>>>>> B.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Robert Samuel Newson
>>>>>>>>> rnew...@apache.org
>>>>>>>>>
>>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
>>>>>>>>>>
>>>>>>>>>> I want to fix previous mistakes. I made two mistakes in the previous calculations:
>>>>>>>>>> - I used 1Kb as the base size for calculating the expansion factor (although we don't know the exact size of the original document)
>>>>>>>>>> - The expansion factor calculation included the number of revisions (it shouldn't)
>>>>>>>>>>
>>>>>>>>>> I'll focus on the flattened JSON docs model.
>>>>>>>>>>
>>>>>>>>>> The following formula was used in the previous calculation:
>>>>>>>>>> storage_size_per_document = mapping_table_size*number_of_revisions + depth*number_of_paths*number_of_revisions + number_of_paths*value_size*number_of_revisions
>>>>>>>>>>
>>>>>>>>>> To clarify things a little bit, I want to calculate the space requirement for a single revision this time:
>>>>>>>>>> mapping_table_size = number_of_field_names*(field_name_length + 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
>>>>>>>>>> storage_size_per_document_per_revision_per_replica = mapping_table_size + depth*number_of_paths + value_size*number_of_paths = 2400 bytes + 10*1000 + 100*1000 = 112400 bytes ~= 110 Kb
>>>>>>>>>>
>>>>>>>>>> We can definitely reduce the mapping table requirement by adopting rnewson's idea of a schema.
>>>>>>>>>>
>>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
>>>>>>>>>>>>
>>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
>>>>>>>>>>>>
>>>>>>>>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
>>>>>>>>>>>
>>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app local cache misses.
>>>>>>>>>>>>
>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k per segment.
>>>>>>>>>>>>
>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel query each bucket while reconstructing the document.
>>>>>>>>>>>
>>>>>>>>>>> Interesting idea. Let's try to think it through to see if we can make it viable. Let's go through a hypothetical example.
>>>>>>>>>>> Input data for the example:
>>>>>>>>>>> - 1M documents
>>>>>>>>>>> - each document is around 10Kb
>>>>>>>>>>> - each document consists of 1K unique JSON paths
>>>>>>>>>>> - each document has 100 unique JSON field names
>>>>>>>>>>> - every scalar value is 100 bytes
>>>>>>>>>>> - 10% of the unique JSON paths for every document are already stored in the database under a different doc or a different revision of the current one
>>>>>>>>>>> - we assume 3 independent copies for every key-value pair in FDB
>>>>>>>>>>> - our hash key size is 32 bytes
>>>>>>>>>>> - let's assume we can determine whether a key is already in storage without doing a query
>>>>>>>>>>> - 1% of paths is in cache (an unrealistic value; in real life the percentage is lower)
>>>>>>>>>>> - every JSON field name is 20 bytes
>>>>>>>>>>> - every JSON path is 10 levels deep
>>>>>>>>>>> - document key prefix length is 50
>>>>>>>>>>> - every document has 10 revisions
>>>>>>>>>>> Let's estimate the storage requirements and the size of data we need to transmit. The calculations are not exact.
>>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since we don't know how FDB stores it)
>>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 3 = 1140 Kb (11x)
>>>>>>>>>>> 2. number of independent keys to retrieve on document read (non-range queries) per document
>>>>>>>>>>> - 1K - (1K * 1%) = 990
>>>>>>>>>>> 3. number of range queries: 0
>>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 Kb (10x)
>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
>>>>>>>>>>> - sequential: 990*2ms = 1980ms
>>>>>>>>>>> - range: 0
>>>>>>>>>>> Let's compare these numbers with the initial proposal (flattened JSON docs without a global schema and without a cache):
>>>>>>>>>>> 1. storage_size_per_document
>>>>>>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>>>>>>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
>>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
>>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
>>>>>>>>>>> 3. number of range queries: 1 (1001 keys in the result)
>>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)
>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>>>>>>>>>>> - range read performance: given read performance of about 305,000 reads/second and range performance of 3,600,000 keys/second, we estimate range performance to be 11.8x read performance. If read performance is 2ms, then range performance is 0.169ms (which is hard to believe).
>>>>>>>>>>> - sequential: 2 * 2 = 4ms
>>>>>>>>>>> - range: 0.169
>>>>>>>>>>>
>>>>>>>>>>> It looks like we are dealing with a tradeoff:
>>>>>>>>>>> - map every distinct occurrence of a key/value instance through a crypto hash:
>>>>>>>>>>>   - 5.39x more disk space efficient
>>>>>>>>>>>   - 474x slower
>>>>>>>>>>> - flattened JSON model:
>>>>>>>>>>>   - 5.39x less efficient in disk space
>>>>>>>>>>>   - 474x faster
>>>>>>>>>>>
>>>>>>>>>>> In any case, this unscientific exercise was very helpful, since it uncovered the high cost in terms of disk space. 59.3x of the original disk size is too much IMO.
>>>>>>>>>>>
>>>>>>>>>>> Are there any ways we can make Michael's model more performant?
>>>>>>>>>>>
>>>>>>>>>>> Also, I don't quite understand a few aspects of the global hash table proposal:
>>>>>>>>>>>
>>>>>>>>>>> 1. > - Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
>>>>>>>>>>> I think we are talking only about scalar values here? I.e. `"#/foo.bar.baz": 123`, since I don't know how we can make it work for all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
>>>>>>>>>>> - foo
>>>>>>>>>>> - foo.bar
>>>>>>>>>>> - foo.bar.baz
>>>>>>>>>>>
>>>>>>>>>>> 2. How to delete documents?
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> ILYA
>>>>>>>>>>>
>>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net> wrote:
>>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, that’s definitely one way to address storage of edit conflicts. I think there are other, more compact representations that we can explore if we have this “exploded” data model where each scalar value maps to an individual KV pair.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree. As I mentioned on the original thread, I see a scheme that handles both conflicts and revisions where you only have to store the most recent change to a field. Like you suggested, multiple revisions can share a key. Which, in my mind's eye, further invites the conflicts/revisions discussion along with the working-within-the-limits discussion, because it seems to me they are all intrinsically related as a "feature".
>>>>>>>>>>>>
>>>>>>>>>>>> Saying 'we'll break documents up into roughly 80k segments', then trying to overlay some kind of field-sharing scheme for revisions/conflicts, doesn't seem like it will work.
>>>>>>>>>>>>
>>>>>>>>>>>> I probably should have left out the trivial fix proposal, as I don't think it's a feasible solution to actually use.
>>>>>>>>>>>>
>>>>>>>>>>>> The comment is more to point out that I do not see how this thread can escape including how to store/retrieve conflicts/revisions.
>>>>>>>>>>>>
>>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends itself to value sharing across multiple documents (and I don't just mean revisions of the same doc, I mean the same key/value instance could be shared for every document).
>>>>>>>>>>>> However, that's not really relevant if we're not considering the amount of shared information across documents in the storage scheme.
>>>>>>>>>>>>
>>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some kind of compressed binary representation) to deal with that FDB limit seems fine. The only reason to consider doing something else is because of its impact on indexing, searches, reduce functions, revisions, on-disk size, etc.
>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the document into an array and then request the value of each key as multiple parallel queries against FDB at once
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying to design a model which allows the retrieval of a document with a single range read, which is a good goal in my opinion.
>>>>>>>>>>>>
>>>>>>>>>>>> I am not sure I agree.
>>>>>>>>>>>>
>>>>>>>>>>>> Think of BitTorrent: a single range read should pull back the structure of the document (the pieces to fetch), but not necessarily the whole document.
>>>>>>>>>>>>
>>>>>>>>>>>> What if you already have a bunch of pieces in common with other documents locally (a repeated header/footer or type, for example), and you only need to get a few pieces of data you don't already have?
>>>>>>>>>>>>
>>>>>>>>>>>> The real goal I see for Couch is to treat your document set like the collection of structured information that it is, in some respects like an extension of your application's heap space for structured objects, and to efficiently query that collection to get back subsets of the data.
>>>>>>>>>>>>
>>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file system plus a fancy grep/find-like feature...
>>>>>>>>>>>>
>>>>>>>>>>>> The best way I see to unlock more features/power is to move towards a more granular and efficient way to store and retrieve the scalar values...
>>>>>>>>>>>>
>>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
>>>>>>>>>>>>
>>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
>>>>>>>>>>>>
>>>>>>>>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
>>>>>>>>>>>>
>>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app local cache misses.
>>>>>>>>>>>>
>>>>>>>>>>>> A document's value is then a series of {val_id} arrays, up to 100k per segment.
>>>>>>>>>>>>
>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel-query each bucket while reconstructing the document.
>>>>>>>>>>>>
>>>>>>>>>>>> The values returned from the bucket queries are the key/value strings required to reassemble this document.
>>>>>>>>>>>>
>>>>>>>>>>>> ----------
>>>>>>>>>>>> I put this forward primarily to highlight the idea that trying to match the storage representation of documents to FDB keys in a straightforward way, to reduce query count, might not be the most performance-oriented approach.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd much prefer a storage approach that reduced data duplication and enabled fast sub-document queries.
>>>>>>>>>>>>
>>>>>>>>>>>> This clearly falls in the realm of what people want the 'use case' of Couch to be/become. By giving Couch more access to sub-document queries, I could eventually see queries as complicated as GraphQL submitted to Couch, pulling back ad hoc aggregated data across multiple documents in a single application-layer request.
>>>>>>>>>>>>
>>>>>>>>>>>> Hehe - one way to look at the database of Couch documents is that they are all conflict revisions of the single root empty document. What I mean by this is: consider thinking of the entire document store as one giant DAG of key/value pairs, where even separate documents are still typically related to each other. For most applications there is a tremendous amount of data redundancy between docs, and especially between revisions of those docs...
>>>>>>>>>>>>
>>>>>>>>>>>> And all this is a long way of saying "I think there could be a lot of value in assuming documents are 'assembled' from multiple queries to FDB, with local caching, instead of simply retrieved".
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
>>>>>>>>>>>>
>>>>>>>>>>>> Mike :-)
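Finally, a small self-contained sketch of the content-addressed idea Michael describes, assuming Python, SHA-256, and an in-memory dict standing in for FDB; the bucket here is a trivial hash prefix rather than the frequency-based bucketing he proposes:

```python
import hashlib
import json

store = {}  # {val_id: (path, value)} - stands in for the shared value space in FDB

def val_id(path, value):
    """Hash one key/value instance; a 1-byte hash prefix stands in for the bucket id."""
    digest = hashlib.sha256(json.dumps([path, value]).encode()).digest()
    return (digest[:1], digest)  # ({bucket_id}, {cryptohash})

def write_doc(docs, doc_id, flat_pairs):
    """Store a document as a list of val_ids; identical pairs are stored only once."""
    ids = []
    for path, value in flat_pairs:
        vid = val_id(path, value)
        store.setdefault(vid, (path, value))  # dedup across docs and revisions
        ids.append(vid)
    docs[doc_id] = ids

def read_doc(docs, doc_id):
    """Reassemble a document from its val_ids (per-bucket parallel reads in FDB)."""
    return [store[vid] for vid in docs[doc_id]]

docs = {}
write_doc(docs, "a", [(["owner"], "bob"), (["active"], True)])
write_doc(docs, "b", [(["owner"], "bob"), (["active"], False)])  # shares one pair with "a"
print(len(store))        # 3 distinct key/value instances stored, not 4
print(read_doc(docs, "b"))
```

The deduplication gain and the extra per-value reads on retrieval are exactly the tradeoff Ilya quantifies earlier in the thread.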