On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.mu...@gmail.com> wrote: > I took this for a quick spin today. The DISTINCT ON support is nice > and I think it will be very useful. I've signed up to review it and > will have more to say later. But today I had a couple of thoughts > after looking into how src/backend/optimizer/plan/planagg.c works and > wondering how to do some more skipping tricks with the existing > machinery. > > 1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or > AVG(DISTINCT ...) or any other aggregate). Right now you get a seq > scan, with the sort/unique logic inside the Aggregate node. If you > write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get > a skip scan that is much faster in good cases. I suppose you could > have a process_distinct_aggregates() in planagg.c that recognises > queries of the right form and generates extra paths a bit like > build_minmax_path() does. I think it's probably better to consider > that in the grouping planner proper instead. I'm not sure.
I think to make that happen we'd need to do a bit of an overhaul in nodeAgg.c to allow it to make use of presorted results instead of having the code blindly sort rows for each aggregate that has a DISTINCT or ORDER BY. The planner would also then need to start requesting paths with pathkeys that suit the aggregate and also probably dictate the order the AggRefs should be evaluated to allow all AggRefs to be evaluated that can be for each sort order. Once that part is done then the aggregates could then also request paths with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd need to be pretty careful with that one since there may be other parts of the query that require that all rows are fed in, not just 1 row per value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't just feed through 1 row for each "i" value, since we need only the ones that have "z > 0". Getting the first part of this solved is much more important than making skip scans work here, I'd say. I think we need to be able to walk before we can run with DISTINCT / ORDER BY aggs. > 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if > you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP > BY i if you're allowed to go backwards. Those queries are equivalent > to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC] > (though as Floris noted, the backwards version gives the wrong answers > with v20). That does seem like a much more specific thing applicable > only to MIN and MAX, and I think preprocess_minmax_aggregates() could > be taught to handle that sort of query, building an index only scan > path with skip scan in build_minmax_path(). For the MIN query you just need a path with Pathkeys: { i ASC, j ASC }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC. The more I think about these UniqueKeys, the more I think they need to be a separate concept to PathKeys. For example, UniqueKeys: { x, y } should be equivalent to { y, x }, but with PathKeys, that's not the case, since the order of each key matters. UniqueKeys equivalent version of pathkeys_contained_in() would not care about the order of individual keys, it would say things like, { a, b, c } is contained in { b, a }, since if the path is unique on columns { b, a } then it must also be unique on { a, b, c }. [1] https://www.postgresql.org/message-id/cakjs1f86fgoduunhiq25rkeues4qtqenxm1qbqjwrbozxvg...@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services