On Mon, Jan 4, 2021 at 8:09 PM ChrisLu <chris...@gmail.com> wrote:

> Hi,
>
> For a list of json objects, the key names are usually repeated.
> e.g., {aaaa:1, bbbb:2},{aaaa:2, bbbb:3}, ...
>
> The key names, "aaaa" and "bbbb" for the above example, could be very long.
> Is there any existing library already encode json objects via a dictionary?
>
> This is a JSON-specific compression. Would be good to see the compression
> ratio compared to gzip, which has a general dictionary encoding.
>
>
It is a bit old, but take a look at Transit. It's a JSON to JSON encoding
format where there's a concept of a rolling implicit cache on the objects,
much like you would see in a compression system through a dictionary
method. One particularly interesting case is that decoding Transit into a
language representation can often be faster than decoding the exact same
gzipped JSON. The reason being that once un-gzipped, the parser has to work
on more bytes than in the Transit case.

My usual approach is often to just use Protobuf 3 nowadays. Especially if
you have a good grasp on the structure of data and it's not very volatile.
If you have very large amounts of data, and data is highly tabular, you
could look into ORC (Optimized Row Columnar) or Parquet since the columnar
formats tend to win out. The usual key trick in those formats are two-fold:
decouple the description from the data, and store data that are the same
locally close to each other by looking at columns rather than rows.

For a completely different approach, Joe Armstrong's UBF[0]. Data is
encoded as a program that can be executed by a simple stack machine with
registers for caching. Decoding amounts to executing said stack machine in
order to produce the language specific representation. For instance

'person'>p # {p "Joe" 123} & {p 'fred' 3~abc~} & $

will store the string 'person' in the p register and then proceed to use it
in the following structure definitions. A format like these are interesting
hybrids in that they are self-describing (as JSON), but also allow for
compression in order to avoid the repetition of something like JSON. So
they fall somewhere in between Protobuf and columnar formats.

[0] Universal Binary Format: http://ubf.github.io/ubf/ubf-user-guide.en.html




-- 
J.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAGrdgiUKZMPiavDsHBuU3B20A6n6APo2LO-QkZ7b_Rb0hzNYOQ%40mail.gmail.com.

Reply via email to