On 02/11/2018 10:06 PM, Thomas Munro wrote:
> On Mon, Feb 12, 2018 at 12:24 PM, Andrew Dunstan
> <andrew.duns...@2ndquadrant.com> wrote:
>> On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Andrew Kane <and...@chartkick.com> writes:
>>>> A better option could be a new "dynamic enum" type, which would have
>>>> similar storage requirements to an enum, but instead of labels being
>>>> declared ahead of time, they would be added as data is inserted.
>>>
>>> You realize, of course, that it's possible to add labels to an enum type
>>> today.  (Removing them is another story.)
>>>
>>> You haven't explained exactly what you have in mind that is going to be
>>> able to duplicate the advantages of the current enum implementation
>>> without its disadvantages, so it's hard to evaluate this proposal.
>>
>> This sounds rather like the idea I have been tossing around in my head
>> for a while, and in sporadic discussions with a few people, for a
>> dictionary object. The idea is to have an append-only list of labels
>> which would not obey transactional semantics, and would thus help us
>> avoid the pitfalls of enums - there wouldn't be any rollback of an
>> addition. The use case would be a jsonb representation which would
>> replace object keys with the oid value of the corresponding dictionary
>> entry, rather like enums now. We could have a per-table dictionary
>> which in most typical json use cases would be very small, and we know
>> from some experimental data that the space savings from such a change
>> would often be substantial.
>>
>> This would have to be modifiable dynamically, rather than requiring
>> explicit additions to the dictionary, to be of practical use for the
>> jsonb case, I believe.
>>
>> I hadn't thought about this as a sort of super enum that was usable
>> directly by users, but it makes sense.
>>
>> I have no idea how hard or even possible it would be to implement.
>
> I have had thoughts over the years about something similar, but going
> the other way and hiding it from the end user. If you could declare a
> column to have a special compressed property (independently of the
> type) then it could either automatically maintain a dictionary, or at
> least build a new dictionary for you when you next run some kind of
> COMPRESS operation. There would be no user-visible difference except
> footprint. In ancient DB2 they had a column property along those
> lines called "VALUE COMPRESSION" (they also have a row-level version,
> and now they have much more advanced kinds of adaptive compression
> that I haven't kept up with). In some ways it'd be a bit like TOAST
> with shared entries, but I haven't seriously looked into how such a
> thing might be implemented.
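For anyone following along, here is a minimal sketch of what the existing
enum machinery already allows at the SQL level (the "mood" type and its
labels are made up for illustration):

  -- Labels can be appended to an existing enum type today; removing one
  -- is not supported.
  CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
  ALTER TYPE mood ADD VALUE 'ecstatic' AFTER 'happy';

  -- The labels are stored once in the catalog; table rows carry only the
  -- enum value's internal OID, which is where the space saving comes from.
  SELECT enumlabel, enumsortorder
    FROM pg_enum
   WHERE enumtypid = 'mood'::regtype
   ORDER BY enumsortorder;

The dictionary idea sketched above would, as I understand it, effectively
do the ADD VALUE step implicitly at insert time, append-only and outside
transactional semantics, rather than requiring an explicit ALTER (which,
at least in releases current as of this thread, also cannot be run inside
a transaction block).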
For what it is worth, there is a similar concept in R called "factors".
When categorical data is stored in a data.frame (the R-language
equivalent of a relation) it is transparently and automatically
converted. I believe this is done both for storage compression and to
facilitate some of the analytics. In R you can also explicitly specify
*not* to convert strings to factors as a performance optimization,
because that conversion has a noticeable impact during ingestion and is
not always needed.

I can also envision this type of mechanism being useful in other,
security-related scenarios.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development