Functionally dependent columns in SELECT DISTINCT

2024-09-12 Thread Willow Chargin
Hello! Postgres lets us omit columns from a GROUP BY clause if they are
functionally dependent on a grouped key, which is a nice quality-of-life
feature. I'm wondering if a similar relaxation could be permitted for
the SELECT DISTINCT list?
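
Concretely, with the `items` table from the example below, today's
rules give:

-- Accepted: create_time can be omitted from GROUP BY because it is
-- functionally dependent on the primary key "id":
SELECT id, create_time FROM items GROUP BY id;

-- Rejected: the ORDER BY expression must appear in the SELECT DISTINCT
-- list, even though it is functionally dependent on "id":
SELECT DISTINCT id FROM items ORDER BY create_time;  -- ERROR

It's that second shape that I wish were accepted.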

I have a query where I want to find the most recent few items from a
table that match some complex condition, where the condition involves
joining other tables. Here's an example, with two approaches:

-- Store some data: an "item" has one or more "parts".
CREATE TABLE items(id int PRIMARY KEY, create_time timestamptz);
CREATE TABLE parts(part_id int PRIMARY KEY, item_id int);
INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => i)
    FROM generate_series(1, 1000000) s(i);
INSERT INTO parts(item_id, part_id)
    SELECT items.id, 2 * items.id + delta
    FROM items, (VALUES (0), (1)) delta(delta);
CREATE INDEX ON items(create_time DESC);
CREATE INDEX ON parts(item_id);
ANALYZE items, parts;

-- Suppose we want to find the most recent few items with a part
-- whose part ID is threeven. Two approaches:

-- SELECT DISTINCT: fast, but superfluous column:
EXPLAIN ANALYZE
SELECT DISTINCT items.id, create_time
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
-- 4ms:
--   parallel index scan on items_create_time_idx
--   -> nested loop index scan parts_item_id_idx
--   -> incremental sort -> gather merge -> unique -> limit

-- GROUP BY: slow, but functional dependency recognized:
EXPLAIN ANALYZE
SELECT items.id
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
GROUP BY items.id
ORDER BY create_time DESC
LIMIT 5;
-- 400ms:
--   parallel seq scan on parts
--   -> parallel hash join on item_id via seq scan on items
--   -> sort -> group -> gather merge -> group -> sort -> limit

These timings are Postgres 14.5 on a Linux i7-1165G7. With Postgres 16.3
on an Apple M3 Pro, the shape is the same: the GROUP BY is about 300ms,
and the SELECT DISTINCT is way faster still, at 0.07ms. (It declines to
parallelize, which seems to help.)

I want to use the faster approach, and it works without issue so far.
But that extra column in the SELECT list is a bit inconvenient.
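
(The column is easy enough to hide with a wrapper query, something
like:

SELECT id FROM (
    SELECT DISTINCT items.id, create_time
    FROM items JOIN parts ON items.id = parts.item_id
    WHERE part_id % 3 = 0
    ORDER BY create_time DESC
    LIMIT 5
) q
ORDER BY create_time DESC;

but the inner select list still has to carry it, which is the
redundancy I'd like to avoid.)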

My questions are:

  - Do I understand right that these kinds of queries are equivalent?

  - If so, does the SQL standard permit Postgres to recognize functional
dependency in this case, so that users may omit the ordering column
from the `SELECT DISTINCT` list? (I don't have a copy of the
standard to check myself.)

  - Might future versions of Postgres allow this?

thanks!
~Willow




Re: Functionally dependent columns in SELECT DISTINCT

2024-09-13 Thread Willow Chargin
On Thu, Sep 12, 2024 at 11:13 PM  wrote:
>
> What about using DISTINCT ON () ?
> SELECT DISTINCT ON (items.id) items.*
> FROM items
>   JOIN parts ON items.id = parts.item_id
> WHERE part_id % 3 = 0
> ORDER BY items.id,items.create_time DESC
> LIMIT 5;
>
> This gives me this plan: https://explain.depesz.com/s/QHr6 on 16.2  (Windows, 
> i7-1260P)

Ordering by items.id changes the answer, though. In the example I gave,
items.id and items.create_time happened to be in the same order, but
that needn't hold. In reality I really do want the IDs of the
*most recent* items.

You can see the difference if you build the test dataset a bit
differently:

INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => random() * 1e6)
    FROM generate_series(1, 1000000) s(i);

The returned create_times should still all be recent, but the IDs
should now look roughly random.
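
(For what it's worth, I think the DISTINCT ON idea can be salvaged by
de-duplicating first and re-sorting outside -- an untested sketch:

SELECT id FROM (
    SELECT DISTINCT ON (items.id) items.id, items.create_time
    FROM items JOIN parts ON items.id = parts.item_id
    WHERE part_id % 3 = 0
    ORDER BY items.id
) deduped
ORDER BY create_time DESC
LIMIT 5;

But the de-duplication then happens before the sort and limit, so I'd
expect it to cost about as much as the GROUP BY form.)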




Re: Functionally dependent columns in SELECT DISTINCT

2024-09-13 Thread Willow Chargin
Thanks both for your suggestions so far.

On Fri, Sep 13, 2024 at 8:43 AM David G. Johnston
 wrote:
>
> On Friday, September 13, 2024, Willow Chargin  wrote:
>>
>> In reality I really do want the IDs of the
>> *most recent* items.
>
>
> Use a window function to rank them and pull out rank=1

Hmm, like this? Note that it's rank <= 5, not rank = 1:

-- 1. rank all item-part combinations, densely since an item may
--    have multiple parts
-- 2. limit by rank, still retaining multiple copies of each item
-- 3. de-duplicate IDs
SELECT DISTINCT id FROM (
    SELECT id, dense_rank FROM (
        SELECT
            items.id,
            dense_rank() OVER (ORDER BY create_time DESC)
        FROM items JOIN parts ON items.id = parts.item_id
        WHERE part_id % 3 = 0
    ) q
    WHERE dense_rank <= 5
) q;

I've done this before, but my experience is that it's usually far slower
because the rank is computed eagerly even for rows that don't match the
rank bound. And indeed here it takes 20% longer than even the slower
GROUP BY from before: https://explain.depesz.com/s/mQIi

> or use a lateral subquery to surgically (fetch first 1) retrieve the first 
> row when sorted by recency descending.

I'm not sure that I see how to apply this when I need top-k, not top-1.
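
The closest generalization that I can see is to chain those top-1
lateral probes through a recursive CTE -- an untested sketch, which
breaks create_time ties by id and would want an index on
items(create_time DESC, id DESC) to make each probe cheap:

WITH RECURSIVE top_items(id, create_time, n) AS (
    -- most recent qualifying item
    (SELECT items.id, items.create_time, 1
     FROM items
     WHERE EXISTS (SELECT 1 FROM parts
                   WHERE parts.item_id = items.id
                     AND parts.part_id % 3 = 0)
     ORDER BY items.create_time DESC, items.id DESC
     LIMIT 1)
    UNION ALL
    -- next most recent qualifying item, strictly after the last one
    (SELECT next_item.id, next_item.create_time, top_items.n + 1
     FROM top_items,
          LATERAL (
              SELECT items.id, items.create_time
              FROM items
              WHERE (items.create_time, items.id)
                        < (top_items.create_time, top_items.id)
                AND EXISTS (SELECT 1 FROM parts
                            WHERE parts.item_id = items.id
                              AND parts.part_id % 3 = 0)
              ORDER BY items.create_time DESC, items.id DESC
              LIMIT 1
          ) next_item
     WHERE top_items.n < 5)
)
SELECT id FROM top_items;

Five index probes instead of a whole scan, in principle -- but that's
a lot of ceremony compared to the DISTINCT query.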




Re: re-novice coming back to pgsql: porting an SQLite update statement to postgres

2024-09-15 Thread Willow Chargin
On Sun, Sep 15, 2024 at 4:23 AM Alban Hertroys  wrote:
>
> > On 15 Sep 2024, at 11:07, Dan Kortschak  wrote:
> >
> > I have come to hopefully my last stumbling point.
> >
> > I am unable to see a way to express something like this SQLite syntax
> >
> > select json_group_array(json_replace(value,
> >  '$.a', case
> >when json_extract(value, '$.a') > 2 then
> >  2
> >else
> >  json_extract(value, '$.a')
> >end,
> >  '$.b', case
> >when json_extract(value, '$.b') < -2 then
> >  -2
> >else
> >  json_extract(value, '$.b')
> >end
> > ))
> > from
> >  json_each('[{"a":1, "b":-3},{"a":2, "b":-2},{"a":3, "b":-1}]');
>
> [...]
>
> I see basically two approaches. One is to take the objects apart [...]
>
> with t as (
> select jsonb($$[{"a":1, "b":-3, "c":1},{"a":2, "b":-2, "c":2},{"a":3, 
> "b":-1, "c":3},{"a":3, "b":-3, "c":4}]$$) arr
> )
> select jsonb_agg(jsonb_build_object(
> 'a', case when records.a > 2 then 2 else records.a end
> ,   'b', case when records.b < -2 then -2 else records.b end
> ,   'c', c
> ))
> from t
> cross join lateral jsonb_to_recordset(t.arr) records(a int, b int, c int)
> ;
>
> [...]
>
> The drawback is that you have to specify all fields and types, but you don’t 
> need to cast the values all the time either.

Here is a variant of Alban's first method that does not require
specifying all fields and types, and so works with heterogeneous values:

WITH t AS (
    SELECT jsonb($$[
        {"a": 1, "b": -3, "c": 1},
        {"a": 2, "b": -2, "c": 2},
        {"a": 3, "b": -1, "c": 3},
        {"a": 3, "b": -3, "c": 4}
    ]$$) arr
)
SELECT jsonb_agg(new_element ORDER BY idx) new_arr
FROM t, LATERAL (
    SELECT idx, jsonb_object_agg(key, CASE
        WHEN key = 'a'
            THEN least(old_value::numeric, 2)::text::jsonb
        WHEN key = 'b'
            THEN greatest(old_value::numeric, -2)::text::jsonb
        ELSE old_value
    END)
    FROM jsonb_array_elements(arr)
             WITH ORDINALITY old_elements(old_element, idx),
         jsonb_each(old_element) each(key, old_value)
    GROUP BY idx
) new_elements(idx, new_element);
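
Against the sample array this should produce (hand-evaluated, so worth
double-checking):

[{"a": 1, "b": -2, "c": 1},
 {"a": 2, "b": -2, "c": 2},
 {"a": 2, "b": -1, "c": 3},
 {"a": 2, "b": -2, "c": 4}]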

I also took the liberty of using `least` / `greatest` to simplify the
clamping operations, and of using `WITH ORDINALITY` / `ORDER BY` on the
array scan and re-aggregation to make the element ordering explicit
rather than relying on the query engine not to re-order the rows.

https://www.postgresql.org/docs/16/functions-conditional.html#FUNCTIONS-GREATEST-LEAST
https://www.postgresql.org/docs/16/queries-table-expressions.html#QUERIES-TABLEFUNCTIONS