Threads like [1] and [2] have gotten me thinking that there may be some value in storing statistics about joins.
For the sake of argument, assume a table t1 with a column t2id which references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, t2c3. In such a situation I can envision the following statistics being collected: * The % of values rows in t2 are referenced at least once in t1 * The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, but associated with t1 and weighted according to the frequency of that row being referenced, which means that values of unreferenced rows are filtered out entirely. * That's about it for direct statistics, but I could see creating extended statistics for correlations between a local column value and a remote column, or expressions on the remote columns, etc. The storage feels like it would be identical to pg_statistic but with a "starefrelid" field that identifies the referencing table. That much seems straightforward. A bigger problem is how we'd manage to collect these statistics. We could (as Jeff Davis has suggested) keep our tablesamples, but that wouldn't necessarily help in this case because the rows referenced, and their relative weightings would change since the last sampling. In a worst-case scenario, We would have to sample the joined-to tables as well,and that's an additional burden on an already IO intensive operation. In theory, we could do some of this without any additional stats collection. If the ndistinct of t1.t2id is, say, at least 75+% of the ndistinct of t2.t2id, we could just peek at the attribute stats on t2 and use them for estimates. However, that makes some assumptions that the stats on t2 are approximately as fresh as the stats on t1, and I don't think that will be the case most of the time. CCing people who have wondered out loud about this topic within earshot of me. Thoughts? [1] https://www.postgresql.org/message-id/flat/6fdc4dc5-8881-4987-9858-a9b484953185%40joeconway.com#5a93cd7a730691843a7700c770397baf [2] https://www.postgresql.org/message-id/flat/tencent_3018762E7D4C9BC470C821C829C1BF2F650A%40qq.com
