On 5/28/21 6:35 AM, Tomas Vondra wrote:
>
>> IMO the main benefit of having different dictionaries is that you
>> could have a small dictionary for small and very structured JSONB
>> fields (e.g. some time-series data), and a large one for large /
>> unstructured JSONB fields, without having the significant performance
>> impact of having that large and varied dictionary on the
>> small&structured field. Although a binary search is log(n) and thus
>> still quite cheap even for large dictionaries, the extra size is
>> certainly not free, and you'll be touching more memory in the process.
>>
> I'm sure we can think of various other arguments for allowing separate
> dictionaries. For example, what if you drop a column? With one huge
> dictionary you're bound to keep the data forever. With per-column dicts
> you can just drop the dict and free disk space / memory.
>
> I also find it hard to believe that no one needs 2**16 strings. I mean,
> 65k is not that much, really. To give an example, I've been toying with
> storing bitcoin blockchain in a database - one way to do that is storing
> each block as a single JSONB document. But each "item" (eg. transaction)
> is identified by a unique hash, so that means (tens of) thousands of
> unique strings *per document*.
>
> Yes, it's a bit silly and extreme, and maybe the compression would not
> help much in this case. But it shows that 2**16 is damn easy to hit.
>
> In other words, this seems like a nice example of survivor bias, where
> we only look at cases for which the existing limitations are acceptable,
> ignoring the (many) remaining cases eliminated by those limitations.
>
I don't think we should lightly discard the use of 2-byte keys, though.
Maybe we could use a scheme similar to what we use for text lengths,
where the first bit indicates whether we have a 1-byte or 4-byte length
indicator. Many dictionaries will have fewer than 2^15-1 entries, so
they would use exclusively the smaller keys.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com