On 8/21/24 10:29 AM, Robert Haas wrote:

I went ahead and committed these patches. I know there's some debate
over whether we want to show the # of disabled nodes and if so whether
it should be controlled by COSTS, and I suspect I haven't completely
allayed David's concerns about the initial_cost_XXX functions although
I think that I did the right thing. But, I don't have the impression
that anyone is desperately opposed to the basic concept, so I think it
makes sense to put these into the tree and see what happens. We have
quite a bit of time left in this release cycle to uncover bugs, hear
from users or other developers, etc. about what problems there may be
with this. If we end up deciding to reverse course or need to fix a
bunch of stuff, so be it, but let's see what the feedback is.

We hit an issue with pgvector[0] where a regular `SELECT count(*) FROM table`[1] is attempting to scan the index on the vector column when `enable_seqscan` is disabled. Credit to Andrew Kane (CC'd) for flagging it.

I was able to trace this back to e2225346. Here is a reproducer:

Setup
=====

CREATE EXTENSION vector;

CREATE OR REPLACE FUNCTION public.generate_random_normalized_vector(dim integer)
RETURNS vector
LANGUAGE SQL
AS $$
    SELECT public.l2_normalize(array_agg(random()::real)::vector)
    FROM generate_series(1, $1);
$$;

CREATE TABLE test (id int, embedding vector(128));
INSERT INTO test
  SELECT n, public.generate_random_normalized_vector(128)
  FROM generate_series(1,5) n;

CREATE INDEX ON test USING hnsw (embedding vector_cosine_ops);

Test
====

SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT count(*) FROM test;

Before e2225346:
----------------

Aggregate (cost=10000041965.00..10000041965.01 rows=1 width=8) (actual time=189.864..189.864 rows
=1 loops=1)
-> Seq Scan on test (cost=10000000000.00..10000040715.00 rows=5 width=0) (actual time=0.01
8..168.294 rows=5 loops=1)
(4 rows)

With e2225346:
-------------
ERROR:  cannot scan hnsw index without order

Some things to note with the ivfflat/hnsw index AMs[3] in pgvector are that they're used for "ORDER BY" scans exclusively. They currently don't support index only scans (noting as I tried reproducing the issue with GIST and couldn't do so because of that), but we wouldn't want to do a full table "count(*)" on a IVFFlat/HNSW index anyway as it'd be more expensive than just a full table scan.

Thanks,

Jonathan

[0] https://github.com/pgvector/pgvector
[1] https://github.com/pgvector/pgvector/actions/runs/10519052945
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e2225346
[3] https://github.com/pgvector/pgvector/blob/master/src/hnsw.c#L192

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature

Reply via email to