Re: A space-efficient, user-friendly way to store categorical data

Thomas Munro Sun, 11 Feb 2018 22:08:29 -0800

On Mon, Feb 12, 2018 at 12:24 PM, Andrew Dunstan
<andrew.duns...@2ndquadrant.com> wrote:
> On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Andrew Kane <and...@chartkick.com> writes:
>>> A better option could be a new "dynamic enum" type, which would have
>>> similar storage requirements as an enum, but instead of labels being
>>> declared ahead of time, they would be added as data is inserted.
>>
>> You realize, of course, that it's possible to add labels to an enum type
>> today.  (Removing them is another story.)
>>
>> You haven't explained exactly what you have in mind that is going to be
>> able to duplicate the advantages of the current enum implementation
>> without its disadvantages, so it's hard to evaluate this proposal.
>>
>
>
> This sounds rather like the idea I have been tossing around in my head
> for a while, and in sporadic discussions with a few people, for a
> dictionary object. The idea is to have an append-only list of labels
> which would not obey transactional semantics, and would thus help us
> avoid the pitfalls of enums - there wouldn't be any rollback of an
> addition.  The use case would be for a jsonb representation which
> would replace object keys with the oid value of the corresponding
> dictionary entry rather like enums now. We could have a per-table
> dictionary which in most typical json use cases would be very small,
> and we know from some experimental data that the compression in space
> used from such a change would often be substantial.
>
> This would have to be modifiable dynamically rather than requiring
> explicit additions to the dictionary, to be of practical use for the
> jsonb case, I believe.
>
> I hadn't thought about this as a sort of super enum that was usable
> directly by users, but it makes sense.
>
> I have no idea how hard or even possible it would be to implement.


I have had thoughts over the years about something similar, but going
the other way and hiding it from the end user.  If you could declare a
column to have a special compressed property (independently of the
type) then it could either automatically maintain a dictionary, or at
least build a new dictionary for you when you next run some kind of
COMPRESS operation.  There would be no user visible difference except
footprint.  In ancient DB2 they had a column property along those
lines called "VALUE COMPRESSION" (they also have a row-level version,
and now they have much more advanced kinds of adaptive compression
that I haven't kept up with).  In some ways it'd be a bit like TOAST
with shared entries, but I haven't seriously looked into how such a
thing might be implemented.
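
To make the idea concrete, here is a toy sketch of a dictionary-encoded
column in Python.  It is only an illustration of the general technique,
not a proposal for how PostgreSQL would do it; the class and attribute
names (DictionaryColumn, labels, codes) are made up for this example.
Each distinct value is stored once in an append-only list, and each row
holds only a small integer code pointing into that list.

```python
class DictionaryColumn:
    """Toy dictionary-encoded column (hypothetical, for illustration only)."""

    def __init__(self):
        self.labels = []          # code -> value; append-only
        self.code_by_label = {}   # value -> code, for fast lookup on insert
        self.codes = []           # one small integer per stored row

    def append(self, value):
        """Store a value, adding it to the dictionary on first sight."""
        code = self.code_by_label.get(value)
        if code is None:
            code = len(self.labels)
            self.labels.append(value)
            self.code_by_label[value] = code
        self.codes.append(code)

    def __getitem__(self, row):
        """Decode a row back to its original value."""
        return self.labels[self.codes[row]]

    def __len__(self):
        return len(self.codes)


col = DictionaryColumn()
for v in ["red", "green", "red", "blue", "red"]:
    col.append(v)
```

Reading col[i] returns the original value, so the caller sees no
difference other than footprint: repeated values cost one integer per
row plus a single copy of the label.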

-- 
Thomas Munro
http://www.enterprisedb.com
