Thanks everyone for the feedback. The current enum implementation requires you to create a new type and add labels outside a transaction prior to an insert.
-- on table creation CREATE TYPE city AS ENUM (); CREATE TABLE "users" ("city" city); -- on insert ALTER TYPE city ADD VALUE IF NOT EXISTS 'Chicago'; BEGIN; INSERT INTO "users" ("city") VALUES ('Chicago'); COMMIT; What would be ideal: -- on table creation CREATE TABLE "users" ("city" dynamic_enum); -- on insert BEGIN; INSERT INTO "users" ("city") VALUES ('Chicago'); COMMIT; Since enums have a fixed number of labels, this type of feature may be better off as a property you could add to text columns (as Thomas mentions). This would avoid issues with hitting the max number of labels. - Andrew On Mon, Feb 12, 2018 at 4:06 PM, Mark Dilger <hornschnor...@gmail.com> wrote: > > > On Feb 10, 2018, at 7:46 PM, Andrew Kane <and...@chartkick.com> wrote: > > > > Hi, > > > > I'm hoping to get feedback on an idea for a new data type to allow for > efficient storage of text values while keeping reads and writes > user-friendly. Suppose you want to store categorical data like current city > for users. There will be a long list of cities, and many users will have > the same city. Some options are: > > > > - Use a text column > > - Use an enum column - saves space, but labels must be set ahead of time > > - Create another table for cities (normalize) - saves space, but > complicates reads and writes > > > > A better option could be a new "dynamic enum" type, which would have > similar storage requirements as an enum, but instead of labels being > declared ahead of time, they would be added as data is inserted. > > > > It'd be great to hear what others think of this (or if I'm missing > something). Another direction could be to deduplicate values for TOAST-able > data types. > > > > I have already done this and use it in production. My implementation is > specific to my needs, and would have to be made more general purpose if > it were ever to be accepted by this community, but I'll lay out the > basic design for your consideration. > > I maintain multiple mappings of text to/from Oid. > > Some of the names, typically the most common ones that I am likely to > encounter, are known at compile time. I have a quick hardcoded lookup > that maps these names to their compile time assigned Oid. Conversions > from the name to Oid or visa versa happen without entailing any shared > lock overhead. > > For each mapping, I maintain a separate catalog table. The compile time > known values are inserted into the catalog with DATA lines, per the > usual mechanism for populating catalog tables during bootstrap. > > For now, keeping the functions that map Oids to and from compile time > known strings synchronized with the DATA lines is a manual PITA process, > except that I almost never need to extend the mappings, and as such have > not yet bothered to write a perl script for managing it. > > For each mapping, I also have a cache. See syscache.h. I have > functions defined that convert between names and Oids that check the > hardcoded values, then the cache, then the full catalog table if not > found in the cache. It all works transactionally and plays well with > other processes that are adding or looking up values simultaneously. > There are more than just 'get' and 'set' functions defined, because > sometimes when you 'get' a value, you want it to be assigned an entry if > it does not exist yet, and sometimes you don't. > > For a few of the mappings, where I know the list will never get too > large, I map from text to one or two bytes rather than to a full four > byte Oid. This helps, because I have other tables with multiple columns > that are mappings of this sort, such that I can pack two or three > mapping columns into just 4 bytes. Whether you'd get a similar payoff > depends a lot on your table structure. Since I'm using syscache, I have > to store a full four bytes there, but the tables that reference that > don't have to spend a full four bytes. To make the conversion easier, I > have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid > range above 10000 that is reserved for their use, with a mapping to/from > the lower range [0..255] or [0..65535] automatically performed when > casting to/from Oid. > > The main reason I prefer this over the built-in enum type is that for > the data distribution I am handling, I have greater than 90% of the > lookups succeed without looking beyond the compile-time known values. > (For some mappings, it's really more like 100%, with the possibility > of run-time added values being a theoretical possibility that I don't > want to exclude, but as yet has never happened in practice.) > > The primary drawback of my implementation from the point of view of > making it general purpose is that you need to know something about your > data distribution before you compile postgres. AFAIK, most postgres > users don't compile their own copy, but use an RPM or some such, and > that puts my type of solution out of reach for them. > > The second most obvious problem with my implementation is that it lacks > any means for removing values from the mappings. Once added, they are > never removed. For me, this is a good thing, because it means I don't > need to create indexes referencing a master table of mappings to > guarantee data integrity, and for my workload, none of the mappings are > the kind of thing where mappings data would ever get expired. > > I am in a really small niche situation that would likely not be > applicable for most other users. > > Extending this to help the community as a whole in the form of something > general purpose seems difficult, which is why I never tried to > contribute it. > > Could you perhaps tell me how similar your situation is to mine, and > whether my approach is anything like what you were contemplating? I am > curious to hear an answer to Tom's question, upthread, about why you > would not just use the enumeration type that postgres already supplies. > > Mark Dilger > > >