Re: A space-efficient, user-friendly way to store categorical data

Andrew Kane Mon, 12 Feb 2018 17:26:44 -0800

Thanks everyone for the feedback. The current enum implementation requires
you to create a new type and add labels outside a transaction prior to an
insert.


-- on table creation
CREATE TYPE city AS ENUM ();
CREATE TABLE "users" ("city" city);

-- on insert
ALTER TYPE city ADD VALUE IF NOT EXISTS 'Chicago';
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

What would be ideal:

-- on table creation
CREATE TABLE "users" ("city" dynamic_enum);

-- on insert
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

Since enums have a fixed number of labels, this type of feature may be
better off as a property you could add to text columns (as Thomas
mentions). This would avoid issues with hitting the max number of labels.

- Andrew

On Mon, Feb 12, 2018 at 4:06 PM, Mark Dilger <hornschnor...@gmail.com>
wrote:

>
> > On Feb 10, 2018, at 7:46 PM, Andrew Kane <and...@chartkick.com> wrote:
> >
> > Hi,
> >
> > I'm hoping to get feedback on an idea for a new data type to allow for
> efficient storage of text values while keeping reads and writes
> user-friendly. Suppose you want to store categorical data like current city
> for users. There will be a long list of cities, and many users will have
> the same city. Some options are:
> >
> > - Use a text column
> > - Use an enum column - saves space, but labels must be set ahead of time
> > - Create another table for cities (normalize) - saves space, but
> complicates reads and writes
> >
> > A better option could be a new "dynamic enum" type, which would have
> similar storage requirements as an enum, but instead of labels being
> declared ahead of time, they would be added as data is inserted.
> >
> > It'd be great to hear what others think of this (or if I'm missing
> something). Another direction could be to deduplicate values for TOAST-able
> data types.
> >
>
> I have already done this and use it in production.  My implementation is
> specific to my needs, and would have to be made more general purpose if
> it were ever to be accepted by this community, but I'll lay out the
> basic design for your consideration.
>
> I maintain multiple mappings of text to/from Oid.
>
> Some of the names, typically the most common ones that I am likely to
> encounter, are known at compile time.  I have a quick hardcoded lookup
> that maps these names to their compile time assigned Oid.  Conversions
> from the name to Oid or visa versa happen without entailing any shared
> lock overhead.
>
> For each mapping, I maintain a separate catalog table.  The compile time
> known values are inserted into the catalog with DATA lines, per the
> usual mechanism for populating catalog tables during bootstrap.
>
> For now, keeping the functions that map Oids to and from compile time
> known strings synchronized with the DATA lines is a manual PITA process,
> except that I almost never need to extend the mappings, and as such have
> not yet bothered to write a perl script for managing it.
>
> For each mapping, I also have a cache.  See syscache.h.  I have
> functions defined that convert between names and Oids that check the
> hardcoded values, then the cache, then the full catalog table if not
> found in the cache.  It all works transactionally and plays well with
> other processes that are adding or looking up values simultaneously.
> There are more than just 'get' and 'set' functions defined, because
> sometimes when you 'get' a value, you want it to be assigned an entry if
> it does not exist yet, and sometimes you don't.
>
> For a few of the mappings, where I know the list will never get too
> large, I map from text to one or two bytes rather than to a full four
> byte Oid.  This helps, because I have other tables with multiple columns
> that are mappings of this sort, such that I can pack two or three
> mapping columns into just 4 bytes.  Whether you'd get a similar payoff
> depends a lot on your table structure.  Since I'm using syscache, I have
> to store a full four bytes there, but the tables that reference that
> don't have to spend a full four bytes.  To make the conversion easier, I
> have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid
> range above 10000 that is reserved for their use, with a mapping to/from
> the lower range [0..255] or [0..65535] automatically performed when
> casting to/from Oid.
>
> The main reason I prefer this over the built-in enum type is that for
> the data distribution I am handling, I have greater than 90% of the
> lookups succeed without looking beyond the compile-time known values.
> (For some mappings, it's really more like 100%, with the possibility
> of run-time added values being a theoretical possibility that I don't
> want to exclude, but as yet has never happened in practice.)
>
> The primary drawback of my implementation from the point of view of
> making it general purpose is that you need to know something about your
> data distribution before you compile postgres.  AFAIK, most postgres
> users don't compile their own copy, but use an RPM or some such, and
> that puts my type of solution out of reach for them.
>
> The second most obvious problem with my implementation is that it lacks
> any means for removing values from the mappings.  Once added, they are
> never removed.  For me, this is a good thing, because it means I don't
> need to create indexes referencing a master table of mappings to
> guarantee data integrity, and for my workload, none of the mappings are
> the kind of thing where mappings data would ever get expired.
>
> I am in a really small niche situation that would likely not be
> applicable for most other users.
>
> Extending this to help the community as a whole in the form of something
> general purpose seems difficult, which is why I never tried to
> contribute it.
>
> Could you perhaps tell me how similar your situation is to mine, and
> whether my approach is anything like what you were contemplating?  I am
> curious to hear an answer to Tom's question, upthread, about why you
> would not just use the enumeration type that postgres already supplies.
>
> Mark Dilger
>
>
>

Re: A space-efficient, user-friendly way to store categorical data

Reply via email to