On Mon, Jan 4, 2021 at 8:09 PM ChrisLu <chris...@gmail.com> wrote:

> Hi,
>
> For a list of json objects, the key names are usually repeated.
> e.g., {aaaa:1, bbbb:2},{aaaa:2, bbbb:3}, ...
>
> The key names, "aaaa" and "bbbb" for the above example, could be very long.
> Is there any existing library already encode json objects via a dictionary?
>
> This is a JSON-specific compression. Would be good to see the compression
> ratio compared to gzip, which has a general dictionary encoding.
>

It is a bit old, but take a look at Transit. It's a JSON-to-JSON encoding format with a rolling implicit cache over the objects, much like the dictionary you would see in a compression scheme. One particularly interesting property is that decoding Transit into a language representation can often be faster than decoding the same data as gzipped JSON. The reason is that once the JSON is un-gzipped, the parser has to work through more bytes than it would in the Transit case.
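
To make the dictionary idea concrete, here is a minimal sketch in Go. It is not Transit's actual wire format. The Packed type and the Pack and Unpack functions are hypothetical names made up for this illustration. Each key name is stored once in a table, and every object refers to its keys by index into that table.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Packed is a hypothetical container (not part of Transit or any library):
// key names appear once in Keys, and each row maps a key index to a value.
type Packed struct {
	Keys []string              `json:"k"`
	Rows []map[int]interface{} `json:"r"`
}

// Pack replaces repeated key names with indexes into a shared key table.
func Pack(objs []map[string]interface{}) Packed {
	index := map[string]int{}
	p := Packed{Keys: []string{}, Rows: make([]map[int]interface{}, 0, len(objs))}
	for _, obj := range objs {
		// Sort the names so index assignment is deterministic.
		names := make([]string, 0, len(obj))
		for k := range obj {
			names = append(names, k)
		}
		sort.Strings(names)

		row := make(map[int]interface{}, len(obj))
		for _, k := range names {
			i, ok := index[k]
			if !ok {
				i = len(p.Keys)
				index[k] = i
				p.Keys = append(p.Keys, k)
			}
			row[i] = obj[k]
		}
		p.Rows = append(p.Rows, row)
	}
	return p
}

// Unpack restores the original key names from the key table.
func Unpack(p Packed) ([]map[string]interface{}, error) {
	out := make([]map[string]interface{}, 0, len(p.Rows))
	for _, row := range p.Rows {
		obj := make(map[string]interface{}, len(row))
		for i, v := range row {
			if i < 0 || i >= len(p.Keys) {
				return nil, fmt.Errorf("key index %d out of range", i)
			}
			obj[p.Keys[i]] = v
		}
		out = append(out, obj)
	}
	return out, nil
}

func main() {
	objs := []map[string]interface{}{
		{"aaaa": 1, "bbbb": 2},
		{"aaaa": 2, "bbbb": 3},
	}
	plain, _ := json.Marshal(objs)
	packed, _ := json.Marshal(Pack(objs))
	fmt.Printf("plain:  %d bytes\npacked: %d bytes\n", len(plain), len(packed))
	fmt.Println(string(packed))
}
```

With only two objects and four-character keys, the key table overhead probably outweighs the saving. The approach should only pay off when key names are long and there are many rows, and it would still need to be measured against plain gzip on real data.
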
My usual approach nowadays is often to just use Protobuf 3, especially if you have a good grasp of the structure of the data and it's not very volatile. If you have very large amounts of data, and the data is highly tabular, you could look into ORC (Optimized Row Columnar) or Parquet, since the columnar formats tend to win out. The key trick in those formats is twofold: decouple the description from the data, and store similar data close together by organizing it by column rather than by row.

For a completely different approach, look at Joe Armstrong's UBF[0]. Data is encoded as a program that can be executed by a simple stack machine with registers for caching. Decoding amounts to executing that stack machine in order to produce the language-specific representation. For instance

    'person'>p # {p "Joe" 123} & {p 'fred' 3~abc~} & $

will store the string 'person' in the p register and then proceed to use it in the following structure definitions.

Formats like this are interesting hybrids in that they are self-describing (like JSON), but also allow for compression in order to avoid the repetition you get in something like JSON. So they fall somewhere in between Protobuf and the columnar formats.

[0] Universal Binary Format: http://ubf.github.io/ubf/ubf-user-guide.en.html

-- 
J.