Heya Mike, this is excellent input and exactly the type of stuff we want to nail down in subsequent discussions.
Best
Jan
—

> On 24. Jan 2019, at 13:46, Michael Fair <mich...@daclubhouse.net> wrote:
>
> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <rnew...@apache.org>
> wrote:
>
>>
>> We'd expand each document into a series of key-value pairs, where the key
>> is the full path into the object and the value is the scalar value. E.g.,
>>
>> {"foo": 12, "bar": {"baz": 13}}
>>
>> Would be
>>
>> foo => 12
>> bar.baz => 13
>
>
> I realize this quickly belongs in its own thread for later discussion, but
> I wanted to point out/ask that by "interning the path strings" or using
> some kind of deterministic hash algorithm, like SHA256 (or something
> faster), on the "key path", couldn't you turn all variable-length string
> paths into a fixed-size, integer-typed field id?
>
> This eliminates the "length" of the path string concern and keeps every
> document field a straight three-entry path:
> docid.revisionid.fieldid => [removed?, value]
>
> where:
> * docid is the unique document identifier
> * revisionid is obvious
> * fieldid is the id of the path string (if a deterministic hash is used,
> it's computed; if indexed, it's looked up/retrieved)
>
> This idea assumes that the "path.string" <-> fieldid correlation is also
> managed by interning those strings somewhere.
>
> By adding the removed bit flag, a document becomes simply the aggregation
> of all the latest revisionids for each distinct fieldid lower than the
> revisionid requested, eliminating all duplicate storage requirements for
> non-changing fields.
>
> When a document update comes in, it breaks the document down into its
> constituent fields, and only needs to add an entry if the state of a field
> has somehow changed from its previous revision.
>
> It seems like this whole idea might be optimally and transparently handled
> directly inside FDB if FDB were aware of this revisionid "idea". I'm of
> course not sure which system is expected to handle the described document
> deconstruction.
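
To make the proposal above concrete for later discussion, here is a minimal sketch in Python. It is not CouchDB or FoundationDB code: a plain dict stands in for the key-value store, and `flatten`, `field_id`, and `FieldStore` are hypothetical names. It also assumes integer revision ids written in increasing order, an 8-byte truncation of SHA-256 as the field id, arrays treated as scalar values, and "latest entry at or below the requested revision" as the read rule.

```python
import hashlib


def flatten(doc, prefix=""):
    """Expand a nested dict into {"full.path": scalar} pairs."""
    out = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out


def field_id(path):
    """Fixed-size integer id for a path: first 8 bytes of its SHA-256 digest.

    The 8-byte truncation is an assumption of this sketch, not part of the proposal.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class FieldStore:
    """In-memory stand-in for the key-value store (hypothetical)."""

    def __init__(self):
        self.rows = {}   # (docid, revisionid, fieldid) -> (removed, value)
        self.paths = {}  # fieldid -> path string (the "interned" strings)

    def get(self, docid, rev=None):
        """Assemble a document from the latest entry per fieldid at or below rev."""
        latest = {}
        # Sorting the keys visits revisions in ascending order, so later
        # revisions overwrite earlier ones for the same fieldid.
        for (d, r, f), entry in sorted(self.rows.items()):
            if d == docid and (rev is None or r <= rev):
                latest[f] = entry
        return {self.paths[f]: value
                for f, (removed, value) in latest.items() if not removed}

    def put(self, docid, rev, doc):
        """Write only the fields whose state changed since the previous revision."""
        old = self.get(docid, rev)
        new = flatten(doc)
        for path, value in new.items():
            fid = field_id(path)
            self.paths[fid] = path
            if path not in old or old[path] != value:
                self.rows[(docid, rev, fid)] = (False, value)
        for path in old.keys() - new.keys():
            # Field disappeared: record it with the "removed" flag set.
            self.rows[(docid, rev, field_id(path))] = (True, None)


store = FieldStore()
store.put("doc1", 1, {"foo": 12, "bar": {"baz": 13}})
store.put("doc1", 2, {"foo": 12, "bar": {"baz": 14}})  # only bar.baz is written
store.put("doc1", 3, {"foo": 12})                      # bar.baz is marked removed
print(store.get("doc1", 2))
print(len(store.rows))  # 4 rows for three revisions
```

The full scan in `get` is only for brevity; with ordered keys, a real store could presumably serve the same lookup with a range read over the docid prefix. Hash collisions between paths are not handled here and would need the interning table to detect them.
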
>
>
> ======
> This "fieldid hash" idea is also related to how the IPLD project creates
> "pointers" to JSON documents inside its distributed p2p system to
> hierarchically link portions of different documents together.
>
> Since a particular docid.revisionid represents a fixed point/state of a
> document in the database, they use that reference as the "value" of a
> special JSON Object that wants to "include"/"point to" the referenced
> document.
> The special JSON Object they used to create a "document link" looks like
> this: {"/": "documenthashid"}
>
> The uploading document must explicitly put that reference in its own
> document where it wants the system to link in the referenced document.
> This hijacks this form of a JSON Object for this specific purpose and
> prevents all higher level applications of IPLD from using it for any other
> purpose.
>
> If desirable, the equivalent idea for CouchDB might be:
> {"_/": "docid.revisionid.fieldid"}
>
> ======
>
> I'm not saying any of this is a good idea, simply that (1) the string
> length concerns could be eliminated by using interned strings (which likely
> would also improve performance); and (2) this field level storage in FDB
> could enable a basis for adding "document pointers" which I'm sure many
> people would appreciate.
>
>
> Mike

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/