Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

Ilya Khlopotov Wed, 30 Jan 2019 11:54:49 -0800

Hi Mike,

> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
This doesn't solve the issue of scalar values exceeding the limits
FoundationDB can support.
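
To make the value-size concern concrete, here is a minimal Python sketch of the multipart layout Adam suggests further down the thread. It is only an illustration under stated assumptions: a plain dict stands in for FoundationDB, keys are tuples rather than packed byte strings, and the `K_MULTIPART` sentinel value and the `put_scalar` / `get_scalar` helper names are hypothetical, not an agreed design.

```python
from typing import Dict, Tuple

VALUE_LIMIT = 100_000            # value size limit quoted in this thread (100 kB)
K_MULTIPART = b"\x00kMULTIPART"  # hypothetical sentinel; a real design needs a
                                 # marker that cannot collide with user data

Key = Tuple[object, ...]


def put_scalar(kv: Dict[Key, bytes], key: Key, value: bytes,
               limit: int = VALUE_LIMIT) -> None:
    """Store value under key, splitting it into parts when it exceeds limit."""
    if len(value) <= limit:
        kv[key] = value
        return
    # {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} = kMULTIPART
    kv[key] = K_MULTIPART
    # {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX} = one chunk each
    for part_idx, start in enumerate(range(0, len(value), limit)):
        kv[key + (part_idx,)] = value[start:start + limit]


def get_scalar(kv: Dict[Key, bytes], key: Key) -> bytes:
    """Read a value back, joining the parts if the marker is present."""
    value = kv[key]
    if value != K_MULTIPART:
        return value
    # A real store would do a range read over the key prefix instead of a scan.
    parts = sorted((k[-1], v) for k, v in kv.items() if k[:-1] == key)
    return b"".join(chunk for _, chunk in parts)
```

Whatever is used as DOC_KEY, including DOCID/REVISIONID, a long string still needs this kind of splitting (or an imposed limit), because the limit applies to each value.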


Best regards,
iilyak

On 2019/01/30 19:00:15, Michael Fair <mich...@daclubhouse.net> wrote: 
> I know the intent was to avoid the "revisions" and "conflicts" discussion in
> this thread, but isn't that unavoidable?
> 
> In scheme #1 you have multiple keys with the same DOCID/PART_IDX but
> different data.
> In schemes #2 / #3 you have multiple copies of the JSON_PATH but different
> values.
> 
> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> 
> Mike
> 
> On Wed, Jan 30, 2019 at 9:53 AM Ilya Khlopotov <iil...@apache.org> wrote:
> 
> > The FoundationDB Record Layer uses a global schema for JSON documents. It
> > also has a nice way of creating indexes and supports schema evolution.
> > However, this support comes at the cost of extra lookups in a different
> > subspace. With a local mapping table we are almost certain (except for a
> > corner case) that the schema and JSON fields would be colocated on a single
> > node, due to the common prefix.
> >
> > Best regards,
> > iilyak
> > On 2019/01/30 17:05:01, Jan Lehnardt <j...@apache.org> wrote:
> > > Ah sure, if we store the *cough* schema per doc, then it's not that
> > easy. An iteration of this proposal could store paths globally with ids
> > that the k/v store then uses for keys, which would enable what I described,
> > but happy to ignore this for the time being. :)
> > >
> > > Cheers
> > > Jan
> > > —
> > >
> > > > On 30. Jan 2019, at 17:58, Adam Kocoloski <kocol...@apache.org> wrote:
> > > >
> > > > Jan, I don’t think it does have that "fun property #2", as the mapping
> > is created separately for each document. In this proposal the field name
> > “foo” could map to 2 in one document and 42 in another.
> > > >
> > > > Thanks for the proposal Ilya. Personally I wonder if the 10KB limit on
> > field paths is anything more than a theoretical concern. It’s hard for me
> > to imagine a useful schema that would get anywhere near that deep, but
> > maybe I’m insufficiently creative :) There’s certainly a storage overhead
> > from repeating the upper portion of a path over and over again, but that’s
> > also something the storage engine can optimize away through prefix elision.
> > The current production storage engine in FoundationDB does not do this
> > elision, but the new one in development does.
> > > >
> > > > The value size limit is probably not so theoretical. I think as a
> > project we could choose to impose a 100KB size limit on scalar values - a
> > user who had a string longer than 100KB could chunk it up into an array of
> > strings pretty easily to work around that limit. But let’s say we don’t
> > want to impose that limit. In your design, how do I distinguish {PART_IDX}
> > from the elements of the {JSON_PATH}? I was kind of expecting to see some
> > magic value indicating that the subsequent set of keys with the same prefix
> > are all elements of a “multi-part object”:
> > > >
> > > > {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}  = kMULTIPART
> > > > {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}  = “First 100 KB …"
> > > > ...
> > > >
> > > > You might have figured out something more efficient that saves a KV
> > here but I can’t quite grok it.
> > > >
> > > > Cheers, Adam
> > > >
> > > >
> > > >> On Jan 30, 2019, at 8:24 AM, Jan Lehnardt <j...@apache.org> wrote:
> > > >>
> > > >>
> > > >>
> > > >>> On 30. Jan 2019, at 14:22, Jan Lehnardt <j...@apache.org> wrote:
> > > >>>
> > > >>> Thanks Ilya for getting this started!
> > > >>>
> > > >>> Two quick notes on this one:
> > > >>>
> > > >>> 1. note that JSON does not guarantee object key order and that
> > CouchDB has never guaranteed it either, and with say emit(doc.foo,
> > doc.bar), if either emit() parameter was an object, the
> > undefined-sort-order of SpiderMonkey would mix things up. While worth
> > bringing up, this is not a BC break.
> > > >>>
> > > >>> 2. This would have the fun property of being able to rename a key
> > inside all docs that have that key.
> > > >>
> > > >> …in one short operation.
> > > >>
> > > >> Best
> > > >> Jan
> > > >> —
> > > >>>
> > > >>> Best
> > > >>> Jan
> > > >>> —
> > > >>>
> > > >>>> On 30. Jan 2019, at 14:05, Ilya Khlopotov <iil...@apache.org>
> > wrote:
> > > >>>>
> > > >>>> # First proposal
> > > >>>>
> > > >>>> In order to overcome FoundationDB limitations on key size (10 kB)
> > and value size (100 kB) we could use the following approach.
> > > >>>>
> > > >>>> Below, the paths use slashes for illustration purposes only. We
> > can use nested subspaces, tuples, directories or something else.
> > > >>>>
> > > >>>> - Store documents in a subspace or directory  (to keep prefix for a
> > key short)
> > > >>>> - When we store the document we would enumerate all field names (0
> > and 1 are reserved) and store the mapping table in a key which looks like:
> > > >>>> ```
> > > >>>> {DB_DOCS_NS} / {DOC_KEY} / 0
> > > >>>> ```
> > > >>>> - Flatten the JSON document (convert it into key value pairs where
> > the key is `JSON_PATH` and value is `SCALAR_VALUE`)
> > > >>>> - Replace elements of JSON_PATH with integers from mapping table we
> > constructed earlier
> > > >>>> - When we have an array, use `1 / {array_idx}`
> > > >>>> - Store scalar values in the keys which look like the following (we
> > use `JSON_PATH` with integers).
> > > >>>> ```
> > > >>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}
> > > >>>> ```
> > > >>>> - If the scalar value exceeds 100kB we would split it and store
> > every part under key constructed as:
> > > >>>> ```
> > > >>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}
> > > >>>> ```
> > > >>>>
> > > >>>> Since all parts of the documents are stored under a common
> > `{DB_DOCS_NS} / {DOC_KEY}` they will be stored on the same server most of
> > the time. The document can be retrieved by using range query
> > (`txn.get_range("{DB_DOCS_NS} / {DOC_KEY} / 0", "{DB_DOCS_NS} / {DOC_KEY} /
> > 0xFF")`). We can reconstruct the document since the mapping is returned as
> > well.
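
A minimal Python sketch of the flatten / reconstruct round trip described above, under stated assumptions: the `{DB_DOCS_NS} / {DOC_KEY}` prefix is omitted, a dict with tuple keys stands in for the key-value store, field ids are handed out in first-seen order starting at 2, and the helper names `flatten` and `unflatten` are hypothetical. Empty objects and arrays are not represented in this sketch, so they would be lost; a real encoding would need a marker for them.

```python
from typing import Any, Dict, Tuple

MAPPING_SLOT = 0  # reserved: key that holds the per-document mapping table
ARRAY_SLOT = 1    # reserved: the next path element is an array index

Path = Tuple[int, ...]


def flatten(doc: dict) -> Dict[Path, Any]:
    """Turn one JSON object into {integer_path: scalar} plus the mapping table."""
    names: Dict[str, int] = {}
    rows: Dict[Path, Any] = {}

    def field_id(name: str) -> int:
        # Ids start at 2 because 0 and 1 are reserved.
        return names.setdefault(name, len(names) + 2)

    def walk(node: Any, path: Path) -> None:
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, path + (field_id(name),))
        elif isinstance(node, list):
            for idx, child in enumerate(node):
                walk(child, path + (ARRAY_SLOT, idx))
        else:
            rows[path] = node

    walk(doc, ())
    rows[(MAPPING_SLOT,)] = {fid: name for name, fid in names.items()}
    return rows


def unflatten(rows: Dict[Path, Any]) -> dict:
    """Rebuild the document from the rows a range read would return."""
    mapping = rows[(MAPPING_SLOT,)]
    root: dict = {}
    for path in sorted(rows):  # sorted like keys coming back from a range read
        if path == (MAPPING_SLOT,):
            continue
        # Decode the integer path into field names (str) and array indexes (int).
        steps, i = [], 0
        while i < len(path):
            if path[i] == ARRAY_SLOT:
                steps.append(path[i + 1])
                i += 2
            else:
                steps.append(mapping[path[i]])
                i += 1
        node: Any = root
        for pos, step in enumerate(steps):
            if pos == len(steps) - 1:
                child = rows[path]
            else:
                child = [] if isinstance(steps[pos + 1], int) else {}
            if isinstance(step, int):
                # Sorted order means array indexes arrive in ascending order.
                if step == len(node):
                    node.append(child)
                node = node[step]
            else:
                node = node.setdefault(step, child)
    return root
```

Because the mapping is built per document, the same field name can get different ids in different documents, and the key order of the rebuilt object follows the ids rather than the order in the original JSON text.
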
> > > >>>>
> > > >>>> The downside of this approach is we wouldn't be able to ensure the
> > same order of keys in the JSON object. Currently the `jiffy` JSON encoder
> > respects order of keys.
> > > >>>> ```
> > > >>>> 4> jiffy:encode({[{bbb, 1}, {aaa, 12}]}).
> > > >>>> <<"{\"bbb\":1,\"aaa\":12}">>
> > > >>>> 5> jiffy:encode({[{aaa, 12}, {bbb, 1}]}).
> > > >>>> <<"{\"aaa\":12,\"bbb\":1}">>
> > > >>>> ```
> > > >>>>
> > > >>>> Best regards,
> > > >>>> iilyak
> > > >>>>
> > > >>>>> On 2019/01/30 13:02:57, Ilya Khlopotov <iil...@apache.org> wrote:
> > > >>>>> As you might already know the FoundationDB has a number of
> > limitations which influences the way we might store JSON documents. The
> > limitations are:
> > > >>>>>
> > > >>>>> | limitation            | recommended value | recommended max | absolute max |
> > > >>>>> |-----------------------|------------------:|----------------:|-------------:|
> > > >>>>> | transaction duration  |                   |                 |        5 sec |
> > > >>>>> | transaction data size |                   |                 |        10 MB |
> > > >>>>> | key size              |          32 bytes |            1 kB |        10 kB |
> > > >>>>> | value size            |                   |           10 kB |       100 kB |
> > > >>>>>
> > > >>>>> In order to fit the JSON document into 100 kB we would have to
> > partition it in some way. There are three ways of partitioning the document:
> > > >>>>> 1. store multiple binary blobs (parts) in different keys
> > > >>>>> 2. flatten the JSON structure and store every path leading to a scalar
> > value under its own key
> > > >>>>> 3. measure the size of different branches of the tree representing
> > the JSON document (while we parse) and use another key for the branch when
> > we are about to exceed the limit
> > > >>>>>
> > > >>>>> - The first approach is the simplest, but it wouldn't allow us to
> > access parts of the document.
> > > >>>>> - The downsides of the second approach are:
> > > >>>>>   - the flattened JSON structure would have long paths, which means
> > longer keys
> > > >>>>>   - a scalar value cannot be larger than 100 kB (unless we split it
> > as well)
> > > >>>>> - The third approach falls short when the structure of the
> > document doesn't allow a clean cut-off of branches:
> > > >>>>>   - complex rules are needed to handle all corner cases
> > > >>>>> The goals of this thread are:
> > > >>>>> - to collect ideas on how to encode and store the JSON document
> > > >>>>> - to comment on the collected ideas
> > > >>>>>
> > > >>>>> Non goals:
> > > >>>>> - the storage of metadata for the document would be discussed
> > elsewhere
> > > >>>>> - tombstones
> > > >>>>> - edit conflicts
> > > >>>>> - revisions
> > > >>>>>
> > > >>>>> Best regards,
> > > >>>>> iilyak
> > > >>>>>
> > > >>>
> > > >>> --
> > > >>> Professional Support for Apache CouchDB:
> > > >>> https://neighbourhood.ie/couchdb-support/
> > > >>>
> > > >>
> > > >> --
> > > >> Professional Support for Apache CouchDB:
> > > >> https://neighbourhood.ie/couchdb-support/
> > >
> > >
> >
>
