On Fri, Nov 28, 2025 at 6:34 PM Tomas Vondra <[email protected]> wrote:
> True. I started working on this with two assumptions: (a) we can detect
> cases when this is guaranteed to be the optimal join order, and (b) it
> would be cheap to do so. Maybe one of those assumptions is not correct.
Probably needs testing, at least. I wonder if there's any way that we could rule out the presence of row-count-inflating joins, or at least identify the joins that are definitely not row-count-inflating. I have a general feeling that we postpone some of the planner's estimation work for too long, and that moving some of it earlier would allow better decision-making.

It's very common for a query to involve a driving table and then a bunch of joins to side tables that are all essentially independent of each other. That is, each of those joins will multiply the row count from the other side of the join by a constant that does not depend very much on the join order. I don't really know exactly what we could do that would enable us to notice that pattern and do something about it, but it feels like we're duplicating a lot of work.

For instance, consider a 4-way join between A, B, C, and D, where A is connected to the other three tables by join clauses, and those are the only join clauses. Let's say that the join to B is going to multiply the row count by 2, the join to C is going to multiply it by 1 (i.e. no change), and the join to D is going to multiply it by 0.1 (i.e. eliminate 90% of rows). Well, we're going to consider the A-B join and discover that it has 2 times the cardinality of A, and then later we'll consider the (A-C)-B join and discover that it has 2 times the cardinality of A-C, and then later we'll consider the (A-D)-B join and discover that it has 2 times the cardinality of A-D, and then later we'll consider the (A-C-D)-B join and discover that it has 2 times the cardinality of A-C-D. As the number of tables grows, the number of times we rediscover the effect of joining to a certain table grows, I believe, exponentially.

Of course, in theory, there's no reason why it has to be true that any particular join multiplies the row count by a constant independent of the join order. For instance, if the A-B join duplicates exactly the rows that the A-D join eliminates, then you can't set a fixed multiplier for either one. But in practice, even when that's the case, we're not aware of it: when estimating a clause that references a join column in, say, the output of the A-B join, we use the statistics for the underlying column of A or B, without any adjustment reflecting what the A-B join did to the distribution. So I think that our estimates will behave like join-order-independent multipliers even when the reality is otherwise.

I'm not at all sure that I'm totally right about this, and I sort of expect you or Tom to point out why I'm totally weak in the head for thinking that it's the case, but I feel like there might be the kernel of an idea here even if the details are wrong. Because I think determining the optimal join order is known to be horrifically non-polynomial in general, but there are lots of real-world queries where a human being would try to do it in linear time, by first choosing the driving table, and then deciding what to join to it first, second, third, and so on. And I think that's the kind of case that you're trying to pursue with this optimization, so it feels intuitively logical that something should be possible. That said, I don't know whether from a theoretical perspective it really is possible to do something like this in a reliable way, making it a pure optimization, or whether you would be doomed to always get some cases wrong, making it more like an optional planner mode or something.
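
To put a rough number on the rediscovery point, here's a toy sketch (Python, purely illustrative, nothing to do with the actual join_search_one_level() machinery) for a star query over a driving table A and n side tables with independent multipliers. Assuming we never build the clauseless joinrels that leave A out, every joinrel is {A} plus some nonempty subset of the side tables, and each one's size estimate re-applies the multiplier of every side table it contains:

# Count how many joinrels' size estimates re-apply each side table's
# multiplier in an exhaustive search over {A} + subsets of side tables.
# Toy model only, not the planner's actual bookkeeping.
from itertools import combinations

def rediscoveries_per_table(n):
    side = [chr(ord('B') + i) for i in range(n)]
    counts = {t: 0 for t in side}
    for k in range(1, n + 1):
        for subset in combinations(side, k):
            for t in subset:
                counts[t] += 1      # t's multiplier gets re-derived here
    return counts

for n in (3, 5, 10):
    print(n, rediscoveries_per_table(n)['B'])   # 4, 16, 512 -- i.e. 2**(n-1)

So B's x2 factor gets re-derived 2^(n-1) times, which is the exponential growth I was hand-waving about.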
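
And here's the corresponding sketch of the linear-time thing I claim a human would do, again under the (big) assumption that each side table's multiplier really is independent of the join order: pick the driving table, then join side tables in ascending multiplier order. In this toy model that minimizes every intermediate row count, because the k-th intermediate result is just the driving table's rows times the k smallest multipliers.

def greedy_order(driving_rows, multipliers):
    # multipliers is a hypothetical pre-computed per-table factor,
    # e.g. {'B': 2.0, 'C': 1.0, 'D': 0.1} for the example above
    order = sorted(multipliers, key=multipliers.get)    # most selective first
    rows, intermediates = driving_rows, []
    for t in order:
        rows *= multipliers[t]
        intermediates.append((t, rows))
    return order, intermediates

print(greedy_order(1000, {'B': 2.0, 'C': 1.0, 'D': 0.1}))
# joins D first, so no intermediate result ever exceeds 200 rows

Obviously the hard part is knowing the multipliers up front and knowing when the independence assumption is safe; the sort itself is trivial.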
Either way, I can't shake the feeling that determining essentially every important fact about every join only when we perform the main join search makes it a lot harder to imagine ever avoiding the main join search.

--
Robert Haas
EDB: http://www.enterprisedb.com
