Heya Mike,

This is excellent input and exactly the type of stuff we want to nail down in
subsequent discussions.

Best
Jan
—

> On 24. Jan 2019, at 13:46, Michael Fair <mich...@daclubhouse.net> wrote:
> 
> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <rnew...@apache.org>
> wrote:
> 
>> 
>> We’d expand each document into a series of key-value pairs, where the key
>> is the full path into the object and the value is the scalar value. E.g.,
>> 
>> {"foo": 12, "bar": {"baz": 13}}
>> 
>> Would be
>> 
>> foo => 12
>> bar.baz => 13
> 
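To make the expansion above concrete, here is a minimal sketch in Python. The function name `flatten` and the use of "." as the path separator are assumptions for illustration only; arrays are treated as opaque values, and keys that themselves contain "." would need escaping, neither of which the thread has settled.

```python
from typing import Any, Iterator, Tuple


def flatten(doc: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf of a nested JSON object.

    Assumption: only objects (dicts) are descended into; everything else,
    including arrays, is emitted as a single value.
    """
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten(value, path)
    else:
        yield prefix, doc


pairs = dict(flatten({"foo": 12, "bar": {"baz": 13}}))
print(pairs)  # {'foo': 12, 'bar.baz': 13}
```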
> 
> I realize this will quickly need its own thread for later discussion, but
> I wanted to ask: by interning the path strings, or by applying some kind
> of deterministic hash algorithm such as SHA-256 (or something faster) to
> the "key path", couldn't you turn every variable-length path string into
> a fixed-size integer field id?
> 
> This eliminates the concern about the length of the path string and keeps
> every document field as a plain three-part key:
> docid.revisionid.fieldid => [removed?, value]
> 
> where:
> * docid is the unique document identifier
> * revisionid is obvious
> * fieldid is the id of the path string (if a deterministic hash is used,
> it's computed; if indexed, it's looked up/retrieved)
> 
> This idea assumes that the "path.string" <-> fieldid correlation is also
> managed by interning those strings somewhere.
> 
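A rough sketch of the two options for deriving a fieldid, to show what each would need. Everything here is hypothetical: the names `hashed_field_id`, `PathInterner` and `field_key`, the truncation to 64 bits, and the in-memory table standing in for wherever the interned strings would really live. Note that a truncated hash can collide, and that either way a path <-> fieldid mapping has to be stored to get back from an id to its path.

```python
import hashlib
from typing import Dict, List, Tuple


def hashed_field_id(path: str) -> int:
    """Option 1: compute the id. SHA-256 truncated to 64 bits (an assumption)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class PathInterner:
    """Option 2: look the id up in an intern table (in-memory stand-in)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._paths: List[str] = []

    def intern(self, path: str) -> int:
        # Assign the next free integer the first time a path is seen.
        if path not in self._ids:
            self._ids[path] = len(self._paths)
            self._paths.append(path)
        return self._ids[path]

    def lookup(self, field_id: int) -> str:
        return self._paths[field_id]


def field_key(docid: str, revisionid: int, fieldid: int) -> Tuple[str, int, int]:
    """The three-part key docid.revisionid.fieldid, modelled as a tuple."""
    return (docid, revisionid, fieldid)


interner = PathInterner()
key = field_key("doc1", 1, interner.intern("bar.baz"))
print(key, interner.lookup(key[2]))
```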
> By adding the removed bit flag, a document becomes simply the aggregation,
> for each distinct fieldid, of the latest entry whose revisionid is no
> higher than the revisionid requested; this eliminates all duplicate
> storage for fields that have not changed.
> 
> When a document update comes in, it breaks the document down into its
> constituent fields, and only needs to add an entry if the state of a field
> has somehow changed from its previous revision.
> 
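The write and read paths described above could look roughly like the sketch below. It rests on assumptions that are not in the proposal: revision ids are plain increasing integers (real CouchDB revisions are not), a dict stands in for the key-value store, and path strings are used directly as fieldids for readability. `write_revision` and `read_revision` are made-up names.

```python
from typing import Any, Dict, Tuple

Key = Tuple[str, int, str]   # (docid, revisionid, fieldid)
Entry = Tuple[bool, Any]     # (removed?, value)


def read_revision(store: Dict[Key, Entry], docid: str, revisionid: int) -> Dict[str, Any]:
    """Aggregate, per fieldid, the latest entry at or below the requested revision."""
    latest: Dict[str, Tuple[int, Entry]] = {}
    for (d, rev, fieldid), entry in store.items():
        if d != docid or rev > revisionid:
            continue
        if fieldid not in latest or rev > latest[fieldid][0]:
            latest[fieldid] = (rev, entry)
    # Fields whose latest entry carries the removed flag are left out.
    return {f: value for f, (_, (removed, value)) in latest.items() if not removed}


def write_revision(store: Dict[Key, Entry], docid: str, revisionid: int,
                   fields: Dict[str, Any]) -> None:
    """Add entries only for fields whose state changed since the previous revision."""
    previous = read_revision(store, docid, revisionid - 1)
    for fieldid, value in fields.items():
        if fieldid not in previous or previous[fieldid] != value:
            store[(docid, revisionid, fieldid)] = (False, value)
    for fieldid in previous.keys() - fields.keys():
        store[(docid, revisionid, fieldid)] = (True, None)  # mark as removed


store: Dict[Key, Entry] = {}
write_revision(store, "doc1", 1, {"foo": 12, "bar.baz": 13})
write_revision(store, "doc1", 2, {"foo": 12, "bar.baz": 14})  # only bar.baz is stored
write_revision(store, "doc1", 3, {"bar.baz": 14})             # foo gets a removed entry
print(read_revision(store, "doc1", 2))
```

A real store would use an ordered range read per docid rather than scanning every key; the scan here only keeps the sketch short.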
> It seems like this whole idea might be optimally and transparently handled
> directly inside FDB if FDB were aware of this revisionid "idea". I'm of
> course not sure which system is expected to handle the described document
> deconstruction.
> 
> 
> ======
> This "fieldid hash" idea is also related to how the IPLD project creates
> "pointers" to JSON documents inside its distributed p2p system to
> hierarchically link portions of different documents together.
> 
> Since a particular docid.revisionid represents a fixed point/state of a
> document in the database, they use that reference as the "value" of a
> special JSON Object that wants to "include"/"point to" the referenced
> document.
> The special JSON Object they used to create a "document link" looks like
> this: {"/": "documenthashid"}
> 
> The uploading document must explicitly put that reference in its own
> document where it wants the system to link in the referenced document.
> This hijacks this form of a JSON Object for this specific purpose and
> prevents all higher level applications of IPLD from using it for any other
> purpose.
> 
> If desirable, the equivalent idea for CouchDB might be: {"_/":
> "docid.revisionid.fieldid"}
> 
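For illustration only, recognising such a link object might look like this. The `"_/"` key and the `docid.revisionid.fieldid` layout come from the suggestion above; the name `parse_link`, and the assumption that revisionid and fieldid contain no "." (so the string can be split from the right), are mine.

```python
from typing import Any, Optional, Tuple

LINK_KEY = "_/"


def parse_link(value: Any) -> Optional[Tuple[str, str, str]]:
    """Return (docid, revisionid, fieldid) if value is a link object, else None."""
    # A link is an object with exactly one member, whose key is "_/".
    if not (isinstance(value, dict) and len(value) == 1 and LINK_KEY in value):
        return None
    target = value[LINK_KEY]
    if not isinstance(target, str):
        return None
    # Split from the right so a docid containing "." stays intact.
    parts = target.rsplit(".", 2)
    if len(parts) != 3:
        return None
    return (parts[0], parts[1], parts[2])


print(parse_link({"_/": "doc1.2.42"}))  # ('doc1', '2', '42')
print(parse_link({"foo": 12}))          # None
```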
> ======
> 
> I'm not saying any of this is a good idea, simply that (1) the string
> length concerns could be eliminated by using interned strings (which likely
> would also improve performance); and (2) this field level storage in FDB
> could enable a basis for adding "document pointers" which I'm sure many
> people would appreciate.
> 
> 
> Mike

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
