> I don’t think adding a layer of abstraction is the right move just yet, I 
> think we should continue to find consensus on one answer to this question

Agree that the theorycrafting stage is not optimal for making
abstraction decisions, but I suspect it would be worthwhile somewhere
between prototyping and releasing. Adam's proposal does seem to me the
most appealing approach on the surface, and I don't see anyone signing
up to do the work to deliver an alternative concurrently.

--
ba

On Tue, Feb 19, 2019 at 1:43 PM Robert Samuel Newson <rnew...@apache.org> wrote:
>
> Addendum: By “directory aliasing” I meant within a document (either the 
> actual Directory thing or something equivalent of our own making). The 
> directory aliasing for each database is a good way to reduce key size without 
> a significant cost. Though if Redwood lands in time, even this would become 
> an inutile obfuscation.
>
> > On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnew...@apache.org> wrote:
> >
> > Interesting suggestion, though obviously the details might turn out to be
> > the wrong kind of fun.
> >
> > Somewhere above I suggested this would be something we could change over 
> > time and even use different approaches for different documents within the 
> > same database. This is the long way of saying there are multiple ways to do 
> > this, each with advantages and none without disadvantages.
> >
> > I don’t think adding a layer of abstraction is the right move just yet, I 
> > think we should continue to find consensus on one answer to this question 
> > (and the related ones in other threads) for the first release. It’s easy to 
> > say “we can change it later”, of course. We can, though it would be a chunk 
> > of work in the context of something that already works, and I’ve rarely
> > seen anyone sign up for that.
> >
> > I’m fine with the first proposal from Adam, where the keys are tuples of 
> > key parts pointing at terminal values. To make it easier for the first 
> > version, I would exclude optimisations like deduplication, the Directory
> > aliasing, or the schema idea that I suggested and that Ilya incorporated a
> > variant of in a follow-up post. We’d accept that there are limits on the
> > sizes of documents, including the awkward-to-express one about property 
> > depth.
> >
> > Stepping back, I’m not seeing any essential improvement over Adam’s 
> > original proposal besides the few corrections and clarifications made by 
> > various authors. Could we start an RFC based on Adam’s original proposal on 
> > document body, revision tree and index storage? We could then have PRs
> > against that for each additional optimisation (one person’s optimisation is
> > another person’s needless complication).
> >
> > If I’ve missed some genuine advance on the original proposal in this long 
> > thread, please call it out for me.
> >
> > B.
> >
> >> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjie...@apache.org> wrote:
> >>
> >> As is evident by the length of this thread, there's a pretty big
> >> design space to cover here, and it seems unlikely we'll have arrived
> >> at a "correct" solution even by the time this thing ships. Perhaps it
> >> would be worthwhile to treat the in-FDB representation of data as a
> >> first-class abstraction and support multiple representations
> >> simultaneously?
> >>
> >> Obviously there's no such thing as a zero-cost abstraction - and I've
> >> not thought very hard about how far up the stack the document
> >> representation would need to leak - but supporting different layouts
> >> (primarily, as Adam points out, on the document body itself) might
> >> prove interesting and useful. I'm sure there are folks interested in a
> >> column-shaped CouchDB, for example.
> >>
> >> --
> >> b
> >>
> >> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> wrote:
> >>>
> >>> Good points on the revtree; I agree we should store it intelligently to
> >>> gain the benefits you mentioned.
> >>>
> >>> --
> >>> Robert Samuel Newson
> >>> rnew...@apache.org
> >>>
> >>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
> >>>> I do not think we should store the revtree as a blob. The design where
> >>>> each edit branch is its own KV should save on network IO and CPU cycles
> >>>> for normal updates. We’ve performed too many heroics to keep
> >>>> couch_key_tree from stalling entire databases when trying to update a
> >>>> single document with a wide revision tree; I would much prefer to ignore
> >>>> other edit branches entirely when all we’re doing is extending one of
> >>>> them.
> >>>>
> >>>> I also do not think we should store JSON documents as blobs, but it’s a
> >>>> closer call. Some of my reasoning for preferring the exploded path
> >>>> design:
> >>>>
> >>>> - it lends itself nicely to sub-document operations, for which Jan
> >>>> crafted an RFC last year: https://github.com/apache/couchdb/issues/1559
> >>>> - it optimizes the creation of Mango indexes on existing databases since
> >>>> we only need to retrieve the value(s) we want to index
> >>>> - it optimizes Mango queries that use field selectors
> >>>> - anyone who wanted to try their hand at GraphQL will find it very
> >>>> handy: https://github.com/apache/couchdb/issues/1499
> >>>> - looking further ahead, it lets us play with smarter leaf value types
> >>>> like Counters (yes I’m still on the CRDT bandwagon, sorry)
> >>>>
> >>>> A few comments on the thread:
> >>>>
> >>>>>>> * Most document bodies are probably going to be smaller than 100k.
> >>>>>>> So in the majority of cases it would be one write / one read to
> >>>>>>> update and fetch the document body.
> >>>>
> >>>> We should test, but I expect reading 50KB of data in a range query is
> >>>> almost as efficient as reading a single 50 KB value. Similarly, writes
> >>>> to a contiguous set of keys should be quite efficient.
> >>>>
> >>>> I am concerned about the overhead of the repeated field paths in the
> >>>> keys with the exploded path option in the absence of key prefix
> >>>> compression. That would be my main reason to acquiesce and throw away
> >>>> all the document structure.
> >>>>
> >>>> Adam
> >>>>
> >>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> wrote:
> >>>>>
> >>>>> I like the idea that we'd reuse the same pattern (but perhaps not the 
> >>>>> same _code_) for doc bodies, revtree and attachments.
> >>>>>
> >>>>> I hope we still get to delete couch_key_tree.erl, though.
> >>>>>
> >>>>> --
> >>>>> Robert Samuel Newson
> >>>>> rnew...@apache.org
> >>>>>
> >>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
> >>>>>> I like the idea from a “trying a simple thing first” perspective, but
> >>>>>> Nick’s points below are especially convincing to go with this for now.
> >>>>>>
> >>>>>> Best
> >>>>>> Jan
> >>>>>> —
> >>>>>>
> >>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Sorry for jumping in so late; I was following from the sidelines
> >>>>>>> mostly. A lot of good discussion is happening and I am excited about
> >>>>>>> the possibilities here.
> >>>>>>>
> >>>>>>> I do like the simpler "chunking" approach for a few reasons:
> >>>>>>>
> >>>>>>> * Most document bodies are probably going to be smaller than 100k.
> >>>>>>> So in the majority of cases it would be one write / one read to
> >>>>>>> update and fetch the document body.
> >>>>>>>
> >>>>>>> * We could reuse the chunking code for attachment handling and
> >>>>>>> possibly revision key trees. It's the general pattern of uploading
> >>>>>>> chunks to some prefix and, when finished, flipping an atomic toggle
> >>>>>>> to make it current.
> >>>>>>>
> >>>>>>> * Do the same thing with revision trees and we could re-use the
> >>>>>>> revision tree manipulation logic. That is, the key tree in most
> >>>>>>> cases would be small enough to fit in 100k, but if it gets huge it
> >>>>>>> would get chunked. This would allow us to reuse all the
> >>>>>>> battle-tested couch_key_tree code mostly as is.
> >>>>>>> We even have property tests for it:
> >>>>>>> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
> >>>>>>>
> >>>>>>> * It removes the need to explain the max exploded path length
> >>>>>>> limitation to customers.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> -Nick
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> 
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> An alternative storage model that we should seriously consider is
> >>>>>>>> to follow our current approach in couch_file et al. Specifically,
> >>>>>>>> the document _body_ is stored as an uninterpreted binary value.
> >>>>>>>> This would be much like the obvious plan for attachment storage: a
> >>>>>>>> key prefix that identifies the database and document, with the
> >>>>>>>> final item of that key tuple being an incrementing integer. Each of
> >>>>>>>> those keys has a binary value of up to 100k. Fetching all values
> >>>>>>>> with that key prefix, in fdb's natural ordering, will yield the
> >>>>>>>> full document body, which can be JSON decoded for further
> >>>>>>>> processing.
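> >>>>>>>>
> >>>>>>>> A rough sketch of that chunking idea, for illustration only (plain
> >>>>>>>> Python, with a sorted dict standing in for an FDB subspace; the key
> >>>>>>>> layout and names here are placeholders, not a proposed schema):
> >>>>>>>>
> >>>>>>>> CHUNK = 100_000  # stay under FDB's 100KB value limit
> >>>>>>>>
> >>>>>>>> def write_doc_body(kv, dbname, docid, body_bytes):
> >>>>>>>>     # One KV per chunk; the final key element is the chunk index.
> >>>>>>>>     for i in range(0, len(body_bytes), CHUNK):
> >>>>>>>>         kv[(dbname, "docs", docid, i // CHUNK)] = body_bytes[i:i + CHUNK]
> >>>>>>>>
> >>>>>>>> def read_doc_body(kv, dbname, docid):
> >>>>>>>>     # Stands in for a single FDB range read over the key prefix,
> >>>>>>>>     # relying on the natural ordering of the chunk index.
> >>>>>>>>     prefix = (dbname, "docs", docid)
> >>>>>>>>     return b"".join(v for k, v in sorted(kv.items()) if k[:3] == prefix)
> >>>>>>>>
> >>>>>>>> store = {}
> >>>>>>>> body = b'{"owner":"bob","mylist":[1,3,5]}' * 8000   # ~256KB
> >>>>>>>> write_doc_body(store, "db1", "foo", body)
> >>>>>>>> assert read_doc_body(store, "db1", "foo") == body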
> >>>>>>>>
> >>>>>>>> I like this idea, and I like Adam's original proposal to explode
> >>>>>>>> documents into property paths. I have a slight preference for the
> >>>>>>>> simplicity of the idea in the previous paragraph, not least because
> >>>>>>>> it's close to what we do today. I also think it will be possible to
> >>>>>>>> migrate to alternative storage models in future, and foundationdb's
> >>>>>>>> transaction support means we can do this migration seamlessly
> >>>>>>>> should we come to it.
> >>>>>>>>
> >>>>>>>> I'm very interested to know whether anyone else wants to go this
> >>>>>>>> simple, or considers it a wasted opportunity relative to the
> >>>>>>>> 'exploded' path.
> >>>>>>>>
> >>>>>>>> B.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Robert Samuel Newson
> >>>>>>>> rnew...@apache.org
> >>>>>>>>
> >>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
> >>>>>>>>> I've been remiss here in not posting the data model ideas that IBM
> >>>>>>>>> worked up while we were thinking about using FoundationDB, so I'm
> >>>>>>>>> posting it now. This is Adam Kocoloski's original work (I am just
> >>>>>>>>> transcribing it), and this is the context that the folks from the
> >>>>>>>>> IBM side came in with, for full disclosure.
> >>>>>>>>>
> >>>>>>>>> Basics
> >>>>>>>>>
> >>>>>>>>> 1. All CouchDB databases are inside a Directory
> >>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
> >>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so
> >>>>>>>>> `_all_dbs` is the list of subdirectories of the Directory from 1.
> >>>>>>>>> 4. Each Directory representing a CouchDB database has several
> >>>>>>>>> Subspaces;
> >>>>>>>>> 4a. by_id/ doc subspace: actual document contents
> >>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
> >>>>>>>>> 4c. index_definitions, indexes, ...
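> >>>>>>>>>
> >>>>>>>>> For illustration, a minimal sketch of that layout using the
> >>>>>>>>> FoundationDB Python bindings (the directory and subspace names
> >>>>>>>>> here are placeholders, not the proposed naming):
> >>>>>>>>>
> >>>>>>>>> import fdb
> >>>>>>>>> fdb.api_version(620)
> >>>>>>>>> db = fdb.open()
> >>>>>>>>>
> >>>>>>>>> # One root Directory for all CouchDB databases (1), one child
> >>>>>>>>> # Directory per database (2).
> >>>>>>>>> root = fdb.directory.create_or_open(db, ("couchdb",))
> >>>>>>>>> dbdir = root.create_or_open(db, ("mydb",))
> >>>>>>>>>
> >>>>>>>>> # _all_dbs is just a listing of the root Directory (3).
> >>>>>>>>> print(root.list(db))
> >>>>>>>>>
> >>>>>>>>> # Per-database subspaces (4a-4c).
> >>>>>>>>> by_id = dbdir["by_id"]
> >>>>>>>>> by_seq = dbdir["by_seq"]
> >>>>>>>>> index_definitions = dbdir["index_definitions"]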
> >>>>>>>>>
> >>>>>>>>> JSON Mapping
> >>>>>>>>>
> >>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs in 
> >>>>>>>>> FDB:
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>> “_id”: “foo”,
> >>>>>>>>> “owner”: “bob”,
> >>>>>>>>> “mylist”: [1,3,5],
> >>>>>>>>> “mymap”: {
> >>>>>>>>>     “blue”: “#0000FF”,
> >>>>>>>>>     “red”: “#FF0000”
> >>>>>>>>> }
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> maps to
> >>>>>>>>>
> >>>>>>>>> (“foo”, “owner”) = “bob”
> >>>>>>>>> (“foo”, “mylist”, 0) = 1
> >>>>>>>>> (“foo”, “mylist”, 1) = 3
> >>>>>>>>> (“foo”, “mylist”, 2) = 5
> >>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
> >>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000”
> >>>>>>>>>
> >>>>>>>>> NB: this means that the 100KB limit applies to individual leaves
> >>>>>>>>> in the JSON object, not to the entire doc
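> >>>>>>>>>
> >>>>>>>>> A purely illustrative sketch of that flattening (plain Python, not
> >>>>>>>>> actual layer code; the _id becomes the key prefix):
> >>>>>>>>>
> >>>>>>>>> def explode(docid, body):
> >>>>>>>>>     """Flatten a JSON body into (tuple_key, scalar_value) pairs."""
> >>>>>>>>>     out = []
> >>>>>>>>>     def walk(prefix, node):
> >>>>>>>>>         if isinstance(node, dict):
> >>>>>>>>>             for k, v in node.items():
> >>>>>>>>>                 walk(prefix + (k,), v)
> >>>>>>>>>         elif isinstance(node, list):
> >>>>>>>>>             for i, v in enumerate(node):
> >>>>>>>>>                 walk(prefix + (i,), v)
> >>>>>>>>>         else:
> >>>>>>>>>             out.append((prefix, node))
> >>>>>>>>>     walk((docid,), body)
> >>>>>>>>>     return out
> >>>>>>>>>
> >>>>>>>>> doc = {"owner": "bob", "mylist": [1, 3, 5],
> >>>>>>>>>        "mymap": {"blue": "#0000FF", "red": "#FF0000"}}
> >>>>>>>>> for key, value in explode("foo", doc):
> >>>>>>>>>     print(key, "=", value)   # ("foo", "owner") = bob, and so on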
> >>>>>>>>>
> >>>>>>>>> Edit Conflicts
> >>>>>>>>>
> >>>>>>>>> We need to account for the presence of conflicts at various levels
> >>>>>>>>> of the doc due to replication.
> >>>>>>>>>
> >>>>>>>>> The proposal is to create a special value indicating that the
> >>>>>>>>> subtree below our current cursor position is in an unresolvable
> >>>>>>>>> conflict, then add additional KV pairs below it to describe the
> >>>>>>>>> conflicting entries.
> >>>>>>>>>
> >>>>>>>>> KV data model allows us to store these efficiently and minimize
> >>>>>>>>> duplication of data:
> >>>>>>>>>
> >>>>>>>>> A document with these two conflicts:
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>> “_id”: “foo”,
> >>>>>>>>> “_rev”: “1-abc”,
> >>>>>>>>> “owner”: “alice”,
> >>>>>>>>> “active”: true
> >>>>>>>>> }
> >>>>>>>>> {
> >>>>>>>>> “_id”: “foo”,
> >>>>>>>>> “_rev”: “1-def”,
> >>>>>>>>> “owner”: “bob”,
> >>>>>>>>> “active”: true
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> could be stored thus:
> >>>>>>>>>
> >>>>>>>>> (“foo”, “active”) = true
> >>>>>>>>> (“foo”, “owner”) = kCONFLICT
> >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> >>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
> >>>>>>>>>
> >>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree 
> >>>>>>>>> this
> >>>>>>>>> representation can handle conflicts of different data types as well.
> >>>>>>>>>
> >>>>>>>>> Missing fields need to be handled explicitly:
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>> “_id”: “foo”,
> >>>>>>>>> “_rev”: “1-abc”,
> >>>>>>>>> “owner”: “alice”,
> >>>>>>>>> “active”: true
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>> “_id”: “foo”,
> >>>>>>>>> “_rev”: “1-def”,
> >>>>>>>>> “owner”: {
> >>>>>>>>> “name”: “bob”,
> >>>>>>>>> “email”: “b...@example.com”
> >>>>>>>>> }
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> could be stored thus:
> >>>>>>>>>
> >>>>>>>>> (“foo”, “active”) = kCONFLICT
> >>>>>>>>> (“foo”, “active”, “1-abc”) = true
> >>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
> >>>>>>>>> (“foo”, “owner”) = kCONFLICT
> >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> >>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
> >>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ...
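> >>>>>>>>>
> >>>>>>>>> A rough sketch of deriving that combined form from the edit
> >>>>>>>>> branches (illustrative Python; kCONFLICT/kMISSING are stand-in
> >>>>>>>>> sentinels, and a real implementation would push the conflict
> >>>>>>>>> marker down to the deepest point where the branches diverge rather
> >>>>>>>>> than stopping at top-level fields):
> >>>>>>>>>
> >>>>>>>>> kCONFLICT, kMISSING = "kCONFLICT", "kMISSING"
> >>>>>>>>>
> >>>>>>>>> def flatten(prefix, node, out):
> >>>>>>>>>     if isinstance(node, dict):
> >>>>>>>>>         for k, v in node.items():
> >>>>>>>>>             flatten(prefix + (k,), v, out)
> >>>>>>>>>     else:
> >>>>>>>>>         out.append((prefix, node))
> >>>>>>>>>
> >>>>>>>>> def merge_branches(docid, revs):
> >>>>>>>>>     """revs: {rev_id: body_dict} -> list of (tuple_key, value)."""
> >>>>>>>>>     out = []
> >>>>>>>>>     for f in sorted(set().union(*(b.keys() for b in revs.values()))):
> >>>>>>>>>         per_rev = {r: b.get(f, kMISSING) for r, b in revs.items()}
> >>>>>>>>>         if len({repr(v) for v in per_rev.values()}) == 1:
> >>>>>>>>>             # All branches agree: store the value once, unqualified.
> >>>>>>>>>             flatten((docid, f), next(iter(per_rev.values())), out)
> >>>>>>>>>         else:
> >>>>>>>>>             out.append(((docid, f), kCONFLICT))
> >>>>>>>>>             for r, v in per_rev.items():
> >>>>>>>>>                 flatten((docid, f, r), v, out)
> >>>>>>>>>     return out
> >>>>>>>>>
> >>>>>>>>> revs = {"1-abc": {"owner": "alice", "active": True},
> >>>>>>>>>         "1-def": {"owner": {"name": "bob", "email": "..."}}}
> >>>>>>>>> for k, v in merge_branches("foo", revs):
> >>>>>>>>>     print(k, "=", v)   # ("foo", "owner") = kCONFLICT, and so on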
> >>>>>>>>>
> >>>>>>>>> Revision Metadata
> >>>>>>>>>
> >>>>>>>>> * CouchDB uses a hash history for revisions
> >>>>>>>>> ** Each edit is identified by the hash of the content of the edit
> >>>>>>>>> including the base revision against which it was applied
> >>>>>>>>> ** Individual edit branches are bounded in length but the number of
> >>>>>>>>> branches is potentially unbounded
> >>>>>>>>>
> >>>>>>>>> * Size limits preclude us from storing the entire key tree as a
> >>>>>>>>> single value; in pathological situations the tree could exceed
> >>>>>>>>> 100KB (each entry is > 16 bytes)
> >>>>>>>>>
> >>>>>>>>> * Store each edit branch as a separate KV including deleted status 
> >>>>>>>>> in a
> >>>>>>>>> special subspace
> >>>>>>>>>
> >>>>>>>>> * Structure the key representation so that the “winning” revision
> >>>>>>>>> can be automatically retrieved with a limit=1 key range operation
> >>>>>>>>>
> >>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
> >>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) =
> >>>>>>>>>     [“3-baz”, “2-bar”, “1-foo”]   <-- winner
> >>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
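> >>>>>>>>>
> >>>>>>>>> A sketch of the winner-selection idea (illustrative Python; the
> >>>>>>>>> integer encoding of the deleted flag is an assumption, chosen so
> >>>>>>>>> the winner sorts last and a reverse, limit=1 range read in FDB
> >>>>>>>>> returns it directly):
> >>>>>>>>>
> >>>>>>>>> # (deleted, branch position, rev hash) for each edit branch
> >>>>>>>>> branches = [(False, 1, "def"), (False, 4, "bif"), (True, 3, "abc")]
> >>>>>>>>>
> >>>>>>>>> def branch_key(docid, deleted, pos, rev):
> >>>>>>>>>     # Encode not-deleted as 1 so a deleted leaf can never sort
> >>>>>>>>>     # above a live one; then longest branch, then rev hash.
> >>>>>>>>>     return (docid, "_meta", 0 if deleted else 1, pos, rev)
> >>>>>>>>>
> >>>>>>>>> keys = sorted(branch_key("foo", d, p, r) for d, p, r in branches)
> >>>>>>>>> winner = keys[-1]   # what a reverse, limit=1 range read would return
> >>>>>>>>> print(winner)       # ('foo', '_meta', 1, 4, 'bif')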
> >>>>>>>>>
> >>>>>>>>> Changes Feed
> >>>>>>>>>
> >>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte, unique,
> >>>>>>>>> monotonically (but not sequentially) increasing value for each 
> >>>>>>>>> committed
> >>>>>>>>> transaction. The first 8 bytes are the committed version of the
> >>>>>>>>> database. The last 2 bytes are monotonic in the serialization order 
> >>>>>>>>> for
> >>>>>>>>> transactions.
> >>>>>>>>>
> >>>>>>>>> * A transaction can specify a particular index into a key where the
> >>>>>>>>> following 10 bytes will be overwritten by the versionstamp at commit
> >>>>>>>>> time
> >>>>>>>>>
> >>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes feed
> >>>>>>>>>
> >>>>>>>>> by_seq subspace
> >>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”)
> >>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”)
> >>>>>>>>>
> >>>>>>>>> by_id subspace
> >>>>>>>>> (“bar”, “_vsn”) = “versionstamp4”
> >>>>>>>>> ...
> >>>>>>>>> (“foo”, “_vsn”) = “versionstamp1”
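> >>>>>>>>>
> >>>>>>>>> For illustration, a sketch of how the two subspaces stay in sync
> >>>>>>>>> on each edit (plain Python; a counter stands in for the
> >>>>>>>>> versionstamp, which in a real layer would be filled in at commit
> >>>>>>>>> time via the set_versionstamped_key mutation):
> >>>>>>>>>
> >>>>>>>>> import itertools
> >>>>>>>>>
> >>>>>>>>> next_stamp = itertools.count(1)   # stand-in for the versionstamp
> >>>>>>>>> by_seq, by_id = {}, {}
> >>>>>>>>>
> >>>>>>>>> def record_edit(doc_id, rev):
> >>>>>>>>>     stamp = next(next_stamp)
> >>>>>>>>>     # Clear the doc's previous entry so it appears once in _changes.
> >>>>>>>>>     old = by_id.pop((doc_id, "_vsn"), None)
> >>>>>>>>>     if old is not None:
> >>>>>>>>>         by_seq.pop(old, None)
> >>>>>>>>>     by_seq[stamp] = (doc_id, rev)
> >>>>>>>>>     by_id[(doc_id, "_vsn")] = stamp
> >>>>>>>>>
> >>>>>>>>> def changes(since=0):
> >>>>>>>>>     # A range read over the by_seq subspace, in key order.
> >>>>>>>>>     return [kv for kv in sorted(by_seq.items()) if kv[0] > since]
> >>>>>>>>>
> >>>>>>>>> record_edit("foo", "1-abc")
> >>>>>>>>> record_edit("bar", "4-def")
> >>>>>>>>> record_edit("foo", "2-xyz")
> >>>>>>>>> print(changes())   # [(2, ('bar', '4-def')), (3, ('foo', '2-xyz'))]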
> >>>>>>>>>
> >>>>>>>>> JSON Indexes
> >>>>>>>>>
> >>>>>>>>> * “Mango” JSON indexes are defined by
> >>>>>>>>> ** a list of field names, each of which may be nested,
> >>>>>>>>> ** an optional partial_filter_selector which constrains the set of 
> >>>>>>>>> docs
> >>>>>>>>> that contribute
> >>>>>>>>> ** an optional name defined by the ddoc field (the name is auto-
> >>>>>>>>> generated if not supplied)
> >>>>>>>>>
> >>>>>>>>> * Store index definitions in a single subspace to aid query planning
> >>>>>>>>> ** ((person,name), title, email) = (“name-title-email”,
> >>>>>>>>>    “{“student”: true}”)
> >>>>>>>>> ** Store the values for each index in a dedicated subspace, adding
> >>>>>>>>> the document ID as the last element in the tuple
> >>>>>>>>> *** (“rosie revere”, “engineer”, “ro...@example.com", “foo”) = null
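> >>>>>>>>>
> >>>>>>>>> A sketch of emitting one such index entry (illustrative Python;
> >>>>>>>>> field extraction is simplified and the partial_filter_selector is
> >>>>>>>>> ignored here):
> >>>>>>>>>
> >>>>>>>>> def index_entry(index_fields, doc_id, doc):
> >>>>>>>>>     """key = (indexed values ..., doc_id), value = null."""
> >>>>>>>>>     def get(node, dotted):
> >>>>>>>>>         for part in dotted.split("."):
> >>>>>>>>>             node = node[part]
> >>>>>>>>>         return node
> >>>>>>>>>     return (tuple(get(doc, f) for f in index_fields) + (doc_id,), None)
> >>>>>>>>>
> >>>>>>>>> doc = {"person": {"name": "rosie revere"},
> >>>>>>>>>        "title": "engineer", "email": "ro...@example.com"}
> >>>>>>>>> print(index_entry(["person.name", "title", "email"], "foo", doc))
> >>>>>>>>> # (('rosie revere', 'engineer', 'ro...@example.com', 'foo'), None)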
> >>>>>>>>>
> >>>>>>>>> B.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Robert Samuel Newson
> >>>>>>>>> rnew...@apache.org
> >>>>>>>>>
> >>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
> >>>>>>>>>>
> >>>>>>>>>> I want to fix previous mistakes. I made two mistakes in the
> >>>>>>>>>> previous calculations:
> >>>>>>>>>> - I used 1Kb as the base size for calculating the expansion factor
> >>>>>>>>>> (although we don't know the exact size of the original document)
> >>>>>>>>>> - The expansion factor calculation included the number of
> >>>>>>>>>> revisions (it shouldn't)
> >>>>>>>>>>
> >>>>>>>>>> I'll focus on the flattened JSON docs model.
> >>>>>>>>>>
> >>>>>>>>>> The following formula was used in the previous calculation:
> >>>>>>>>>> storage_size_per_document = mapping_table_size*number_of_revisions
> >>>>>>>>>> + depth*number_of_paths*number_of_revisions
> >>>>>>>>>> + number_of_paths*value_size*number_of_revisions
> >>>>>>>>>>
> >>>>>>>>>> To clarify things a little bit, I want to calculate the space
> >>>>>>>>>> requirement for a single revision this time.
> >>>>>>>>>> mapping_table_size = number_of_field_names * (field_name_length +
> >>>>>>>>>> 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
> >>>>>>>>>> storage_size_per_document_per_revision_per_replica =
> >>>>>>>>>> mapping_table_size + depth*number_of_paths +
> >>>>>>>>>> value_size*number_of_paths =
> >>>>>>>>>> 2400 bytes + 10*1000 + 100*1000 = 112400 bytes ~= 110 Kb
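> >>>>>>>>>>
> >>>>>>>>>> Spelled out (same assumptions as above: 100 field names of 20
> >>>>>>>>>> bytes, 1K paths, depth 10, 100-byte values):
> >>>>>>>>>>
> >>>>>>>>>> field_names, name_len, int_size = 100, 20, 4
> >>>>>>>>>> paths, depth, value_size = 1000, 10, 100
> >>>>>>>>>> mapping_table = field_names * (name_len + int_size)   # 2,400
> >>>>>>>>>> key_bytes = depth * paths                              # 10,000
> >>>>>>>>>> value_bytes = value_size * paths                       # 100,000
> >>>>>>>>>> print(mapping_table + key_bytes + value_bytes)   # 112,400 (~110 Kb)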
> >>>>>>>>>>
> >>>>>>>>>> We can definitely reduce the requirement for the mapping table by
> >>>>>>>>>> adopting rnewson's idea of a schema.
> >>>>>>>>>>
> >>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
> >>>>>>>>>>> Hi Michael,
> >>>>>>>>>>>
> >>>>>>>>>>>> For example, hears a crazy thought:
> >>>>>>>>>>>> Map every distinct occurence of a key/value instance through a
> >>>>>>>> crypto hash
> >>>>>>>>>>>> function to get a set of hashes.
> >>>>>>>>>>>>
> >>>>>>>>>>>> These can be be precomputed by Couch without any lookups in FDB.
> >>>>>>>> These
> >>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend
> >>>>>>>> themselves to
> >>>>>>>>>>>> range search well.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So what you do is index them for frequency of occurring in the
> >>>>>>>> same set.
> >>>>>>>>>>>> In essence, you 'bucket them' statistically, and that bucket id
> >>>>>>>> becomes a
> >>>>>>>>>>>> key prefix. A crypto hash value can be copied into more than one
> >>>>>>>> bucket.
> >>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
> >>>>>>>>>>>
> >>>>>>>>>>>> When writing a document, Couch submits the list/array of
> >>>>>>>> cryptohash values
> >>>>>>>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
> >>>>>>>> id with
> >>>>>>>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
> >>>>>>>> always a
> >>>>>>>>>>>> lot of app local cache misses.
> >>>>>>>>>>>>
> >>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k
> >>>>>>>> per
> >>>>>>>>>>>> segment.
> >>>>>>>>>>>>
> >>>>>>>>>>>> When retrieving a document, you get the val_ids, find the 
> >>>>>>>>>>>> distinct
> >>>>>>>> buckets
> >>>>>>>>>>>> and min/max entries for this doc, and then parallel query each
> >>>>>>>> bucket while
> >>>>>>>>>>>> reconstructing the document.
> >>>>>>>>>>>
> >>>>>>>>>>> Interesting idea. Let's try to think it through to see if we can
> >>>>>>>>>>> make it viable.
> >>>>>>>>>>> Let's go through a hypothetical example. Input data for the
> >>>>>>>>>>> example:
> >>>>>>>>>>> - 1M of documents
> >>>>>>>>>>> - each document is around 10Kb
> >>>>>>>>>>> - each document consists of 1K of unique JSON paths
> >>>>>>>>>>> - each document has 100 unique JSON field names
> >>>>>>>>>>> - every scalar value is 100 bytes
> >>>>>>>>>>> - 10% of the unique JSON paths for every document are already
> >>>>>>>>>>> stored in the database under a different doc or a different
> >>>>>>>>>>> revision of the current one
> >>>>>>>>>>> - we assume 3 independent copies for every key-value pair in FDB
> >>>>>>>>>>> - our hash key size is 32 bytes
> >>>>>>>>>>> - let's assume we can determine whether a key is already in
> >>>>>>>>>>> storage without doing a query
> >>>>>>>>>>> - 1% of paths are in cache (an unrealistic value; in real life
> >>>>>>>>>>> the percentage is lower)
> >>>>>>>>>>> - every JSON field name is 20 bytes
> >>>>>>>>>>> - every JSON path is 10 levels deep
> >>>>>>>>>>> - document key prefix length is 50
> >>>>>>>>>>> - every document has 10 revisions
> >>>>>>>>>>> Let's estimate the storage requirements and size of data we need
> >>>>>>>>>>> to transmit. The calculations are not exact.
> >>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers
> >>>>>>>>>>> since we don't know how FDB stores it)
> >>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) =
> >>>>>>>>>>> 38Kb * 10 * 3 = 1140 Kb (11x)
> >>>>>>>>>>> 2. number of independent keys to retrieve on document read
> >>>>>>>>>>> (non-range queries) per document
> >>>>>>>>>>> - 1K - (1K * 1%) = 990
> >>>>>>>>>>> 3. number of range queries: 0
> >>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32
> >>>>>>>>>>> bytes) = 102 Kb (10x)
> >>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
> >>>>>>>>>>> https://apple.github.io/foundationdb/performance.html)
> >>>>>>>>>>> - sequential: 990*2ms = 1980ms
> >>>>>>>>>>> - range: 0
> >>>>>>>>>>> Let's compare these numbers with the initial proposal (flattened
> >>>>>>>>>>> JSON docs without global schema and without cache)
> >>>>>>>>>>> 1. storage_size_per_document
> >>>>>>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
> >>>>>>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
> >>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 =
> >>>>>>>>>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> >>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on
> >>>>>>>>>>> index structure)
> >>>>>>>>>>> 3. number of range queries: 1 (1001 keys in the result)
> >>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb
> >>>>>>>>>>> (2.4x)
> >>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
> >>>>>>>>>>> https://apple.github.io/foundationdb/performance.html and
> >>>>>>>>>>> estimate range read performance based on numbers from
> >>>>>>>>>>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
> >>>>>>>>>>> - range read performance: Given read performance is about 305,000
> >>>>>>>>>>> reads/second and range performance 3,600,000 keys/second, we
> >>>>>>>>>>> estimate range performance to be 11.8x compared to read
> >>>>>>>>>>> performance. If read performance is 2ms then range performance is
> >>>>>>>>>>> 0.169ms (which is hard to believe).
> >>>>>>>>>>> - sequential: 2 * 2 = 4ms
> >>>>>>>>>>> - range: 0.169
> >>>>>>>>>>>
> >>>>>>>>>>> It looks like we are dealing with a tradeoff:
> >>>>>>>>>>> - Map every distinct occurrence of a key/value instance through
> >>>>>>>>>>>   a crypto hash:
> >>>>>>>>>>>   - 5.39x more disk space efficient
> >>>>>>>>>>>   - 474x slower
> >>>>>>>>>>> - flattened JSON model:
> >>>>>>>>>>>   - 5.39x less efficient in disk space
> >>>>>>>>>>>   - 474x faster
> >>>>>>>>>>>
> >>>>>>>>>>> In any case this unscientific exercise was very helpful, since
> >>>>>>>>>>> it uncovered the high cost in terms of disk space. 59.3x of the
> >>>>>>>>>>> original disk size is too much IMO.
> >>>>>>>>>>>
> >>>>>>>>>>> Are there any ways we can make Michael's model more performant?
> >>>>>>>>>>>
> >>>>>>>>>>> Also, I don't quite understand a few aspects of the global hash
> >>>>>>>>>>> table proposal:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. > - Map every distinct occurence of a key/value instance 
> >>>>>>>>>>> through
> >>>>>>>> a crypto hash function to get a set of hashes.
> >>>>>>>>>>> I think we are talking only about scalar values here? I.e.
> >>>>>>>> `"#/foo.bar.baz": 123`
> >>>>>>>>>>> Since I don't know how we can make it work for all possible JSON
> >>>>>>>> paths `{"foo": {"bar": {"size": 12, "baz": 123}}}":
> >>>>>>>>>>> - foo
> >>>>>>>>>>> - foo.bar
> >>>>>>>>>>> - foo.bar.baz
> >>>>>>>>>>>
> >>>>>>>>>>> 2. how to delete documents
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> ILYA
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net>
> >>>>>>>> wrote:
> >>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski 
> >>>>>>>>>>>> <kocol...@apache.org
> >>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Michael,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes that’s definitely one way to address storage of edit
> >>>>>>>> conflicts. I
> >>>>>>>>>>>>> think there are other, more compact representations that we can
> >>>>>>>> explore if
> >>>>>>>>>>>>> we have this “exploded” data model where each scalar value maps
> >>>>>>>> to an
> >>>>>>>>>>>>> individual KV pair.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree, as I mentioned on the original thread, I see a scheme,
> >>>>>>>> that
> >>>>>>>>>>>> handles both conflicts and revisions, where you only have to 
> >>>>>>>>>>>> store
> >>>>>>>> the most
> >>>>>>>>>>>> recent change to a field.  Like you suggested, multiple revisions
> >>>>>>>> can share
> >>>>>>>>>>>> a key.  Which in my mind's eye further begs the 
> >>>>>>>>>>>> conflicts/revisions
> >>>>>>>>>>>> discussion along with the working within the limits discussion
> >>>>>>>> because it
> >>>>>>>>>>>> seems to me they are all intrinsically related as a "feature".
> >>>>>>>>>>>>
> >>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k segments', then
> >>>>>>>> trying to
> >>>>>>>>>>>> overlay some kind of field sharing scheme for revisions/conflicts
> >>>>>>>> doesn't
> >>>>>>>>>>>> seem like it will work.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I probably should have left out the trivial fix proposal as I
> >>>>>>>> don't think
> >>>>>>>>>>>> it's a feasible solution to actually use.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The comment is more regarding that I do not see how this thread
> >>>>>>>> can escape
> >>>>>>>>>>>> including how to store/retrieve conflicts/revisions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends 
> >>>>>>>>>>>> itself
> >>>>>>>> to value
> >>>>>>>>>>>> sharing across mutiple documents (and I don't just mean revisions
> >>>>>>>> of the
> >>>>>>>>>>>> same doc, I mean the same key/value instance could be shared for
> >>>>>>>> every
> >>>>>>>>>>>> document).
> >>>>>>>>>>>> However that's not really relevant if we're not considering the
> >>>>>>>> amount of
> >>>>>>>>>>>> shared information across documents in the storage scheme.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some kind 
> >>>>>>>>>>>> of
> >>>>>>>>>>>> compressed binary representation) to deal with that FDB limit
> >>>>>>>> seems fine.
> >>>>>>>>>>>> The only reason to consider doing something else is because of 
> >>>>>>>>>>>> its
> >>>>>>>> impact
> >>>>>>>>>>>> to indexing, searches, reduce functions, revisions, on-disk size
> >>>>>>>> impact,
> >>>>>>>>>>>> etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the
> >>>>>>>> document into
> >>>>>>>>>>>>> an array and then request the value of each key as multiple
> >>>>>>>> parallel
> >>>>>>>>>>>>> queries against FDB at once
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying
> >>>>>>>> to design a
> >>>>>>>>>>>>> model which allows the retrieval of a document with a single
> >>>>>>>> range read,
> >>>>>>>>>>>>> which is a good goal in my opinion.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am not sure I agree.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Think of bitTorrent, a single range read should pull back the
> >>>>>>>> structure of
> >>>>>>>>>>>> the document (the pieces to fetch), but not necessarily the whole
> >>>>>>>> document.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What if you already have a bunch of pieces in common with other
> >>>>>>>> documents
> >>>>>>>>>>>> locally (a repeated header/footer/ or type for example); and you
> >>>>>>>> only need
> >>>>>>>>>>>> to get a few pieces of data you don't already have?
> >>>>>>>>>>>>
> >>>>>>>>>>>> The real goal to Couch I see is to treat your document set like 
> >>>>>>>>>>>> the
> >>>>>>>>>>>> collection of structured information that it is.  In some 
> >>>>>>>>>>>> respects
> >>>>>>>> like an
> >>>>>>>>>>>> extension of your application's heap space for structured objects
> >>>>>>>> and
> >>>>>>>>>>>> efficiently querying that collection to get back subsets of the
> >>>>>>>> data.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file system plus
> >>>>>>>> a fancy
> >>>>>>>>>>>> grep/find like feature...
> >>>>>>>>>>>>
> >>>>>>>>>>>> The best way I see to unlock more features/power is to a move
> >>>>>>>> towards a
> >>>>>>>>>>>> more granular and efficient way to store and retrieve the scalar
> >>>>>>>> values...
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> For example, here's a crazy thought:
> >>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a
> >>>>>>>>>>>> crypto hash function to get a set of hashes.
> >>>>>>>>>>>>
> >>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB.
> >>>>>>>>>>>> These will be spread all over kingdom come in FDB and not lend
> >>>>>>>>>>>> themselves to range search well.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So what you do is index them for frequency of occurring in the
> >>>>>>>> same set.
> >>>>>>>>>>>> In essence, you 'bucket them' statistically, and that bucket id
> >>>>>>>> becomes a
> >>>>>>>>>>>> key prefix. A crypto hash value can be copied into more than one
> >>>>>>>> bucket.
> >>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
> >>>>>>>>>>>>
> >>>>>>>>>>>> When writing a document, Couch submits the list/array of
> >>>>>>>> cryptohash values
> >>>>>>>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
> >>>>>>>> id with
> >>>>>>>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
> >>>>>>>> always a
> >>>>>>>>>>>> lot of app local cache misses.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k
> >>>>>>>> per
> >>>>>>>>>>>> segment.
> >>>>>>>>>>>>
> >>>>>>>>>>>> When retrieving a document, you get the val_ids, find the 
> >>>>>>>>>>>> distinct
> >>>>>>>> buckets
> >>>>>>>>>>>> and min/max entries for this doc, and then parallel query each
> >>>>>>>> bucket while
> >>>>>>>>>>>> reconstructing the document.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The values returned from the buckets query are the key/value
> >>>>>>>> strings
> >>>>>>>>>>>> required to reassemble this document.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> ----------
> >>>>>>>>>>>> I put this forward primarily to highlight the idea that trying
> >>>>>>>>>>>> to match the storage representation of documents in a
> >>>>>>>>>>>> straightforward way to FDB keys to reduce query count might not
> >>>>>>>>>>>> be the most performance-oriented approach.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'd much prefer a storage approach that reduced data duplication
> >>>>>>>> and
> >>>>>>>>>>>> enabled fast sub-document queries.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> This clearly falls in the realm of what people want the 'use 
> >>>>>>>>>>>> case'
> >>>>>>>> of Couch
> >>>>>>>>>>>> to be/become.  By giving Couch more access to sub-document
> >>>>>>>> queries, I could
> >>>>>>>>>>>> eventually see queries as complicated as GraphQL submitted to
> >>>>>>>> Couch and
> >>>>>>>>>>>> pulling back ad-hoc aggregated data across multiple documents in 
> >>>>>>>>>>>> a
> >>>>>>>> single
> >>>>>>>>>>>> application layer request.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hehe - one way to look at the database of Couch documents is
> >>>>>>>>>>>> that they are all conflict revisions of the single root empty
> >>>>>>>>>>>> document. What I mean by this is: consider thinking of the
> >>>>>>>>>>>> entire document store as one giant DAG of key/value pairs, and
> >>>>>>>>>>>> how even separate documents are still typically related to each
> >>>>>>>>>>>> other. For most applications there is a tremendous amount of
> >>>>>>>>>>>> data redundancy between docs and especially between revisions
> >>>>>>>>>>>> of those docs...
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> And all this is a long way of saying "I think there could be a
> >>>>>>>>>>>> lot of value in assuming documents are 'assembled' from multiple
> >>>>>>>>>>>> queries to FDB, with local caching, instead of simply retrieved".
> >>>>>>>>>>>> local caching, instead of simply retrieved"
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Mike :-)
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >
>
