On Fri, Aug 23, 2024 at 11:17 AM Jonathan S. Katz <jk...@postgresql.org> wrote:
> We hit an issue with pgvector[0] where a regular `SELECT count(*) FROM
> table`[1] is attempting to scan the index on the vector column when
> `enable_seqscan` is disabled. Credit to Andrew Kane (CC'd) for flagging it.
>
> I was able to trace this back to e2225346. Here is a reproducer:
If I change EXPLAIN ANALYZE in this test to just EXPLAIN, I get this:

Aggregate  (cost=179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00..179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00 rows=1 width=8)
  ->  Index Only Scan using test_embedding_idx on test  (cost=179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00..179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00 rows=5 width=0)

It took me a moment to wrap my head around this: the cost estimate is 312 decimal digits long. Apparently hnswcostestimate() just returns DBL_MAX when there are no scan keys because it really, really doesn't want to do that. Before e2225346, that kept this plan from being generated because it was (much) larger than disable_cost. But now it doesn't, because 1 disabled node makes a path more expensive than any possible non-disabled path. Since that was the whole point of the patch, I don't feel too bad about it.
I find it a little weird that hnsw thinks itself able to return all the tuples in an order the user chooses, but unable to return all of the tuples in an arbitrary order. In core, we have precedent for index types that can't return individual tuples at all (gin, brin) but not one that is able to return tuples in concept but has a panic attack if you don't know how you want them sorted.

I don't quite see why you couldn't just treat that case the same as ORDER BY the_first_column_of_the_index, or any other arbitrary rule that you want to make up. Sure, it might be more expensive than a sequential scan, but the user said they didn't want a sequential scan. I'm not quite sure why pgvector thinks it gets to decide that it knows better than the user, or the rest of the optimizer. I don't even think I really believe it would always be worse: I've seen cases where a table was badly bloated and mostly empty but its indexes were not bloated, and in that case an index scan can be a HUGE winner even though it would normally be a lot worse than a sequential scan.

If you don't want to fix hnsw to work the way the core optimizer thinks it should, or if there's some reason it can't be done, alternatives might include (1) having the cost estimate function hack the count of disabled nodes and (2) adding some kind of core support for an index cost estimator refusing a path entirely. I haven't tested (1) so I don't know for sure that there are no issues, but I think we have to do all of our cost estimating before we can think about adding the path so I feel like there's a decent chance it would do what you want.

Also, while I did take the initiative to download pgvector and compile it and hook up a debugger and figure out what was going on here, I'm not really too sure that's my job.
I do think I have a responsibility to help maintainers of out-of-core extensions who have problems as a result of my commits, but I also think it's fair to hope that those maintainers will try to minimize the amount of time that I need to spend trying to read code that I did not write and do not maintain. Fortunately, this wasn't hard to figure out, but in a way that's kind of the point. That DBL_MAX hack was put there by somebody who must've understood that they were trying to use a very large cost to disable a certain path shape completely, and it seems to me that if that person had studied this case and the commit message for e2225346, they would have likely understood what had happened pretty quickly.

Do you think that's an unfair feeling on my part?

--
Robert Haas
EDB: http://www.enterprisedb.com