A space-efficient, user-friendly way to store categorical data

2018-02-11 Thread Andrew Kane
Hi,

I'm hoping to get feedback on an idea for a new data type to allow for
efficient storage of text values while keeping reads and writes
user-friendly. Suppose you want to store categorical data like current city
for users. There will be a long list of cities, and many users will have
the same city. Some options are:

- Use a text column
- Use an enum column - saves space, but labels must be set ahead of time
- Create another table for cities (normalize) - saves space, but
complicates reads and writes

A better option could be a new "dynamic enum" type, which would have
similar storage requirements as an enum, but instead of labels being
declared ahead of time, they would be added as data is inserted.
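The proposal is essentially dictionary encoding with labels assigned at insert time. A minimal Python sketch of the concept (illustrative only; `DynamicEnum` and its methods are hypothetical, not Postgres internals):

```python
import struct

class DynamicEnum:
    """Toy dictionary encoding: each new label gets the next 4-byte id."""

    def __init__(self):
        self.label_to_id = {}   # text -> id, grown at insert time
        self.id_to_label = []   # id -> text

    def encode(self, label):
        # Unlike CREATE TYPE ... AS ENUM, no label needs declaring up front.
        if label not in self.label_to_id:
            self.label_to_id[label] = len(self.id_to_label)
            self.id_to_label.append(label)
        return struct.pack("<I", self.label_to_id[label])  # fixed 4 bytes

    def decode(self, blob):
        return self.id_to_label[struct.unpack("<I", blob)[0]]

enum = DynamicEnum()
rows = [enum.encode(c) for c in ["Chicago", "Boston", "Chicago"]]
```

Each stored value costs 4 bytes regardless of label length, and repeated labels share one dictionary entry, which is where the space saving over a text column comes from.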

It'd be great to hear what others think of this (or if I'm missing
something). Another direction could be to deduplicate values for TOAST-able
data types.

Thanks,
Andrew


Re: A space-efficient, user-friendly way to store categorical data

2018-02-12 Thread Andrew Kane
Thanks everyone for the feedback. The current enum implementation requires
you to create a new type and add labels outside a transaction prior to an
insert.

-- on table creation
CREATE TYPE city AS ENUM ();
CREATE TABLE "users" ("city" city);

-- on insert
ALTER TYPE city ADD VALUE IF NOT EXISTS 'Chicago';
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

What would be ideal:

-- on table creation
CREATE TABLE "users" ("city" dynamic_enum);

-- on insert
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

Since enums have a fixed number of labels, this type of feature may be
better off as a property you could add to text columns (as Thomas
mentions). This would avoid issues with hitting the max number of labels.

- Andrew

On Mon, Feb 12, 2018 at 4:06 PM, Mark Dilger 
wrote:

>
> > On Feb 10, 2018, at 7:46 PM, Andrew Kane  wrote:
> >
> > Hi,
> >
> > I'm hoping to get feedback on an idea for a new data type to allow for
> efficient storage of text values while keeping reads and writes
> user-friendly. Suppose you want to store categorical data like current city
> for users. There will be a long list of cities, and many users will have
> the same city. Some options are:
> >
> > - Use a text column
> > - Use an enum column - saves space, but labels must be set ahead of time
> > - Create another table for cities (normalize) - saves space, but
> complicates reads and writes
> >
> > A better option could be a new "dynamic enum" type, which would have
> similar storage requirements as an enum, but instead of labels being
> declared ahead of time, they would be added as data is inserted.
> >
> > It'd be great to hear what others think of this (or if I'm missing
> something). Another direction could be to deduplicate values for TOAST-able
> data types.
> >
>
> I have already done this and use it in production.  My implementation is
> specific to my needs, and would have to be made more general purpose if
> it were ever to be accepted by this community, but I'll lay out the
> basic design for your consideration.
>
> I maintain multiple mappings of text to/from Oid.
>
> Some of the names, typically the most common ones that I am likely to
> encounter, are known at compile time.  I have a quick hardcoded lookup
> that maps these names to their compile time assigned Oid.  Conversions
> from the name to Oid or vice versa happen without entailing any shared
> lock overhead.
>
> For each mapping, I maintain a separate catalog table.  The compile time
> known values are inserted into the catalog with DATA lines, per the
> usual mechanism for populating catalog tables during bootstrap.
>
> For now, keeping the functions that map Oids to and from compile time
> known strings synchronized with the DATA lines is a manual PITA process,
> except that I almost never need to extend the mappings, and as such have
> not yet bothered to write a perl script for managing it.
>
> For each mapping, I also have a cache.  See syscache.h.  I have
> functions defined that convert between names and Oids that check the
> hardcoded values, then the cache, then the full catalog table if not
> found in the cache.  It all works transactionally and plays well with
> other processes that are adding or looking up values simultaneously.
> There are more than just 'get' and 'set' functions defined, because
> sometimes when you 'get' a value, you want it to be assigned an entry if
> it does not exist yet, and sometimes you don't.
>
> For a few of the mappings, where I know the list will never get too
> large, I map from text to one or two bytes rather than to a full four
> byte Oid.  This helps, because I have other tables with multiple columns
> that are mappings of this sort, such that I can pack two or three
> mapping columns into just 4 bytes.  Whether you'd get a similar payoff
> depends a lot on your table structure.  Since I'm using syscache, I have
> to store a full four bytes there, but the tables that reference that
> don't have to spend a full four bytes.  To make the conversion easier, I
> have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid
> range above 1 that is reserved for their use, with a mapping to/from
> the lower range [0..255] or [0..65535] automatically performed when
> casting to/from Oid.
>
> The main reason I prefer this over the built-in enum type is that for
> the data distribution I am handling, I have greater than 90% of the
> lookups succeed without looking beyond the compile-time known values.
> (For some mappings, it's 

Re: A space-efficient, user-friendly way to store categorical data

2018-02-12 Thread Andrew Kane
They'd refer to separate enums.

I originally thought an enum was a good comparison for this feature, but
I'm no longer sure that it is. A text-based ordering would be desired
rather than the label index.

A better comparison may be a two-column lookup table:

-- create
CREATE TABLE cities (id bigserial primary key, name text);
CREATE UNIQUE INDEX ON cities (name);
CREATE TABLE users (city_id bigint);

-- insert
BEGIN;
INSERT INTO cities (name) VALUES ('Chicago') ON CONFLICT (name) DO NOTHING
RETURNING id;
INSERT INTO users (city_id) VALUES ();
COMMIT;

-- select
SELECT * FROM users INNER JOIN cities ON cities.id =
users.city_id WHERE name = 'Chicago';


Ideally, the lookup table could be maintained by Postgres to make reads and
writes easier.

-- create
CREATE TABLE users (city text DEDUPED);

-- insert
INSERT INTO users (city) VALUES ('Chicago');

-- query
SELECT * FROM users WHERE city = 'Chicago';

I'm not really sure of the best place to store this lookup table.
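One way to picture the proposed DEDUPED behavior: values are stored as ids into a hidden lookup table, but comparison and ordering still go through the text, matching the point above that a text-based ordering is wanted rather than the label index. A rough Python model (hypothetical semantics, not a proposal for the internals):

```python
class DedupedColumn:
    """Toy model of a deduplicated text column: ids inside, text outside."""

    def __init__(self):
        self._lookup = []   # hidden "cities"-style table: id -> text
        self._ids = {}      # text -> id
        self.rows = []      # what the user table effectively stores (ids)

    def insert(self, value):
        # Roughly INSERT ... ON CONFLICT (name) DO NOTHING on the lookup table.
        if value not in self._ids:
            self._ids[value] = len(self._lookup)
            self._lookup.append(value)
        self.rows.append(self._ids[value])

    def select_ordered(self):
        # ORDER BY the text value, not by assignment order of the ids.
        return sorted(self._lookup[i] for i in self.rows)

col = DedupedColumn()
for city in ["Chicago", "Austin", "Chicago"]:
    col.insert(city)
```

Note that "Austin" sorts before "Chicago" here even though it was assigned a later id, which an enum-style label-index ordering would get wrong.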

- Andrew

On Mon, Feb 12, 2018 at 7:11 PM, Mark Dilger 
wrote:

>
> > On Feb 12, 2018, at 6:35 PM, Tom Lane  wrote:
> >
> > Andrew Kane  writes:
> >> Thanks everyone for the feedback. The current enum implementation
> requires
> >> you to create a new type and add labels outside a transaction prior to
> an
> >> insert.
> >
> > Right ...
> >
> >> Since enums have a fixed number of labels, this type of feature may be
> >> better off as a property you could add to text columns (as Thomas
> >> mentions). This would avoid issues with hitting the max number of
> labels.
> >
> > ... but you're not saying how you'd avoid the need for prior commit of
> the
> > labels.  The sticking point for enums is that once a value has gotten
> into
> > a btree index, we can't ever lose the ability to compare that value to
> > others, or the index will be broken.  So inserting an uncommitted value
> > into user tables has to be prevented.
> >
> > Maybe there's a way to assign the labels so that they can be compared
> > without reference to any outside data, but it's not clear to me how
> > that would work.
>
> When I implemented this, I wrote the comparators to work on the Oid for
> the value, not the string representation.  That works fine.  If you want to
> sort the data on the stringified version, cast to text first.  That works
> well
> enough for me, since I'm typically not interested in what sort order is
> used,
> as long as it is deterministic and works for indexing, group by, and so
> forth.
>
>


Exporting float_to_shortest_decimal_buf(n) with Postgres 17 on Windows

2024-09-13 Thread Andrew Kane
Hi,

With Postgres 17 RC1 on Windows, `float_to_shortest_decimal_buf` and
`float_to_shortest_decimal_bufn` are no longer exported. This causes
`unresolved external symbol` linking errors for extensions that rely on
these functions (like pgvector). Can these functions be exported like
previous versions of Postgres?

Thanks,
Andrew


Re: Exporting float_to_shortest_decimal_buf(n) with Postgres 17 on Windows

2024-09-13 Thread Andrew Kane
> Probably a Windows thing?

Correct, it's only on Windows.

> I do see a fair amount of special handling for f2s.c in the build files.  I
> wonder if something got broken for Windows in the switch from the MSVC
> scripts to meson.

This was my hunch as well since none of the source files changed. Also,
neither function is present with `dumpbin /EXPORTS /SYMBOLS
lib\postgres.lib`, which led me to believe it may need to be addressed
upstream.

- Andrew

On Fri, Sep 13, 2024 at 2:41 PM Nathan Bossart 
wrote:

> On Fri, Sep 13, 2024 at 04:58:20PM -0400, Tom Lane wrote:
> > Andrew Kane  writes:
> >> With Postgres 17 RC1 on Windows, `float_to_shortest_decimal_buf` and
> >> `float_to_shortest_decimal_bufn` are no longer exported. This causes
> >> `unresolved external symbol` linking errors for extensions that rely on
> >> these functions (like pgvector). Can these functions be exported like
> >> previous versions of Postgres?
> >
> > AFAICS it's in the exact same place it was in earlier versions.
> > You might need to review your linking commands.
>
> I do see a fair amount of special handling for f2s.c in the build files.  I
> wonder if something got broken for Windows in the switch from the MSVC
> scripts to meson.
>
> --
> nathan
>


ORDER BY operator index scans and filtering

2024-09-28 Thread Andrew Kane
Hi,

Currently, index scans that order by an operator (for instance, `location
<-> POINT(0, 0)`) and have a filter for the same expression (`location <->
POINT(0, 0) < 2`) can end up scanning much more of the index than is
necessary.

Here's a complete example:

CREATE TABLE stores (location point);
INSERT INTO stores SELECT POINT(0, i) FROM generate_series(1, 100000) i;
CREATE INDEX ON stores USING gist (location);
EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM stores WHERE location <->
POINT(0, 0) < 2 ORDER BY location <-> POINT(0, 0) LIMIT 10;

Once the second tuple returned from the index has a distance >= 2, the scan
should be able to end (as it's an ascending order scan). Instead, it scans
the entire index, filtering out the next 99,998 rows.

 Limit (actual time=0.166..32.573 rows=1 loops=1)
   ->  Index Only Scan using stores_location_idx on stores (actual
time=0.165..32.570 rows=1 loops=1)
 Order By: (location <-> '(0,0)'::point)
 Filter: ((location <-> '(0,0)'::point) < '2'::double precision)
 Rows Removed by Filter: 99999

This can be especially costly for vector index scans (this was found while
working on an upcoming feature for pgvector [1]).
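Outside Postgres, the missed optimization is easy to state: when tuples arrive in ascending distance order, a filter on the same distance expression is a stop condition, not a per-row check. A sketch in Python (illustrative; not planner or GiST code; points and counts are a scaled-down version of the example above):

```python
def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def index_scan(points, origin):
    # Yield points in ascending distance order, as the ordered index scan does.
    return sorted(points, key=lambda p: dist(p, origin))

def naive_filter(points, origin, limit):
    # Current behavior: examine every tuple, filtering row by row.
    examined, out = 0, []
    for p in index_scan(points, origin):
        examined += 1
        if dist(p, origin) < 2:
            out.append(p)
    return out[:limit], examined

def early_stop(points, origin, limit):
    # Desired behavior: the first tuple with distance >= 2 ends the scan.
    examined, out = 0, []
    for p in index_scan(points, origin):
        examined += 1
        if dist(p, origin) >= 2:
            break
        out.append(p)
        if len(out) == limit:
            break
    return out, examined

points = [(0, i) for i in range(1, 1001)]
```

Both strategies return the same rows; the difference is how much of the "index" each one touches before stopping.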

- Andrew

[1] https://github.com/pgvector/pgvector/issues/678


Re: Restore support for USE_ASSERT_CHECKING in extensions only

2025-01-10 Thread Andrew Kane
Thank you both!


Re: Restore support for USE_ASSERT_CHECKING in extensions only

2025-01-10 Thread Andrew Kane
I've updated the patch to make verify_compact_attribute a no-op.

The extension sets USE_ASSERT_CHECKING, which is why the macro approach
doesn't work (it won't take that path).

Also, it looks like it fails when creating the extension / loading the
shared library (on Ubuntu), not when linking (as I misstated earlier).
From 90aab02fd882b0fd7fb8df96c80022bf64a720c5 Mon Sep 17 00:00:00 2001
From: Andrew Kane 
Date: Fri, 10 Jan 2025 15:41:42 -0800
Subject: [PATCH v2] Restore support for USE_ASSERT_CHECKING in extensions only

---
 src/backend/access/common/tupdesc.c | 5 ++---
 src/include/access/tupdesc.h| 2 --
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index cc94074219..2e4666c469 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -118,7 +118,6 @@ populate_compact_attribute(TupleDesc tupdesc, int attnum)
 	populate_compact_attribute_internal(src, dst);
 }
 
-#ifdef USE_ASSERT_CHECKING
 /*
  * verify_compact_attribute
  *		In Assert enabled builds, we verify that the CompactAttribute is
@@ -132,6 +131,7 @@ populate_compact_attribute(TupleDesc tupdesc, int attnum)
 void
 verify_compact_attribute(TupleDesc tupdesc, int attnum)
 {
+#ifdef USE_ASSERT_CHECKING
 	CompactAttribute *cattr = &tupdesc->compact_attrs[attnum];
 	Form_pg_attribute attr = TupleDescAttr(tupdesc, attnum);
 	CompactAttribute tmp;
@@ -150,9 +150,8 @@ verify_compact_attribute(TupleDesc tupdesc, int attnum)
 
 	/* Check the freshly populated CompactAttribute matches the TupleDesc's */
 	Assert(memcmp(&tmp, cattr, sizeof(CompactAttribute)) == 0);
-}
 #endif
-
+}
 
 /*
  * CreateTemplateTupleDesc
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index aee871b0e8..504ce22250 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -158,9 +158,7 @@ TupleDescAttr(TupleDesc tupdesc, int i)
 
 #undef TupleDescAttrAddress
 
-#ifdef USE_ASSERT_CHECKING
 extern void verify_compact_attribute(TupleDesc, int attnum);
-#endif
 
 /*
  * Accessor for the i'th CompactAttribute element of tupdesc.
-- 
2.45.0



Restore support for USE_ASSERT_CHECKING in extensions only

2025-01-10 Thread Andrew Kane
Prior to 6f3820f, extensions could be compiled with -DUSE_ASSERT_CHECKING
whether or not the Postgres installation was configured with
--enable-cassert (to enable at least some assertion checking). However,
after 6f3820f, linking fails with `undefined symbol:
verify_compact_attribute`. I'm not sure if this use case is officially
supported, but it has been helpful for developing pgvector.

The attached patch fixes it, but there may be a better way to do this.


v1-0001-Restore-support-for-USE_ASSERT_CHECKING-in-extens.patch
Description: Binary data