> On Feb 10, 2018, at 7:46 PM, Andrew Kane <and...@chartkick.com> wrote: > > Hi, > > I'm hoping to get feedback on an idea for a new data type to allow for > efficient storage of text values while keeping reads and writes > user-friendly. Suppose you want to store categorical data like current city > for users. There will be a long list of cities, and many users will have the > same city. Some options are: > > - Use a text column > - Use an enum column - saves space, but labels must be set ahead of time > - Create another table for cities (normalize) - saves space, but complicates > reads and writes > > A better option could be a new "dynamic enum" type, which would have similar > storage requirements as an enum, but instead of labels being declared ahead > of time, they would be added as data is inserted. > > It'd be great to hear what others think of this (or if I'm missing > something). Another direction could be to deduplicate values for TOAST-able > data types. >
I have already done this and use it in production. My implementation is specific to my needs, and would have to be made more general purpose if it were ever to be accepted by this community, but I'll lay out the basic design for your consideration. I maintain multiple mappings of text to/from Oid. Some of the names, typically the most common ones that I am likely to encounter, are known at compile time. I have a quick hardcoded lookup that maps these names to their compile time assigned Oid. Conversions from the name to Oid or visa versa happen without entailing any shared lock overhead. For each mapping, I maintain a separate catalog table. The compile time known values are inserted into the catalog with DATA lines, per the usual mechanism for populating catalog tables during bootstrap. For now, keeping the functions that map Oids to and from compile time known strings synchronized with the DATA lines is a manual PITA process, except that I almost never need to extend the mappings, and as such have not yet bothered to write a perl script for managing it. For each mapping, I also have a cache. See syscache.h. I have functions defined that convert between names and Oids that check the hardcoded values, then the cache, then the full catalog table if not found in the cache. It all works transactionally and plays well with other processes that are adding or looking up values simultaneously. There are more than just 'get' and 'set' functions defined, because sometimes when you 'get' a value, you want it to be assigned an entry if it does not exist yet, and sometimes you don't. For a few of the mappings, where I know the list will never get too large, I map from text to one or two bytes rather than to a full four byte Oid. This helps, because I have other tables with multiple columns that are mappings of this sort, such that I can pack two or three mapping columns into just 4 bytes. Whether you'd get a similar payoff depends a lot on your table structure. Since I'm using syscache, I have to store a full four bytes there, but the tables that reference that don't have to spend a full four bytes. To make the conversion easier, I have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid range above 10000 that is reserved for their use, with a mapping to/from the lower range [0..255] or [0..65535] automatically performed when casting to/from Oid. The main reason I prefer this over the built-in enum type is that for the data distribution I am handling, I have greater than 90% of the lookups succeed without looking beyond the compile-time known values. (For some mappings, it's really more like 100%, with the possibility of run-time added values being a theoretical possibility that I don't want to exclude, but as yet has never happened in practice.) The primary drawback of my implementation from the point of view of making it general purpose is that you need to know something about your data distribution before you compile postgres. AFAIK, most postgres users don't compile their own copy, but use an RPM or some such, and that puts my type of solution out of reach for them. The second most obvious problem with my implementation is that it lacks any means for removing values from the mappings. Once added, they are never removed. For me, this is a good thing, because it means I don't need to create indexes referencing a master table of mappings to guarantee data integrity, and for my workload, none of the mappings are the kind of thing where mappings data would ever get expired. I am in a really small niche situation that would likely not be applicable for most other users. Extending this to help the community as a whole in the form of something general purpose seems difficult, which is why I never tried to contribute it. Could you perhaps tell me how similar your situation is to mine, and whether my approach is anything like what you were contemplating? I am curious to hear an answer to Tom's question, upthread, about why you would not just use the enumeration type that postgres already supplies. Mark Dilger