Thank you Kevin and Jeff for the responses. These are very helpful.
On Fri, Sep 6, 2013 at 10:48 PM, Jeff Janes <jeff.ja...@gmail.com> wrote: > On Friday, September 6, 2013, pg noob wrote: > >> >> Hi all, >> >> I'm curious about some of the query estimates that I'm seeing with >> queries that use DISTINCT. >> I am using postgres 8.4.13 >> >> I did a couple of quick tests, and found that PostgreSQL seems to do some >> expensive work to >> return DISTINCT rows. This is contrary to what I was expecting because I >> expected that >> if it knows that it is building a distinct result set that it would >> maintain >> some kind of ordering as it built the results to be able to quickly throw >> away duplicates without any real overhead to the query. >> > > There is no cost-free way of doing that. You either have to sort, or > hash, or walk in index order, and none of those things are free. > > ... > > >> But it gets a bit stranger than that. >> According to the docs, >> “DISTINCT ON column will eliminate all duplicates in the specified >> column; this is equivalent to using GROUP BY column.” >> > > DISTINCT ON is not the same thing as DISTINCT, which is not the same as > count(distinct ...) > > > > >> >> Note that it says that using distinct is EQUIVALENT to using GROUP BY. >> And yet, look at the execution time difference of DISTINCT vs. GROUP BY: >> >> db=# explain analyze select count(distinct(column1)) from table1; >> QUERY >> PLAN >> >> ------------------------------------------------------------------------------------------------------------------------------ >> Aggregate (cost=631397.32..631397.33 rows=1 width=8) (actual >> time=18596.606..18596.607 rows=1 loops=1) >> -> Seq Scan on table1 (cost=0.00..591411.65 rows=15994265 width=8) >> (actual time=0.003..2391.644 rows=16151368 loops=1) >> Total runtime: 18596.631 ms >> (3 rows) >> >> db=# explain analyze select count(foo.column1) from (select column1 from >> table1 group by column1) as foo; >> > > Compare that to: > > explain analyze select count(*) from (select DISTINCT column1 from table1) > as foo; > > >> >> Any ideas why this is? Or what I am doing wrong? >> > > The code that executes "count(distinct ....)" has never been taught how to > use hash aggregation, whereas DISTINCT and GROUP BY have been. It always > sorts, which is often much slower than a hash, and also often much less > memory efficient. I speculate that this is because the implementation of > count(distinct ...) is really ancient code and never been brought up to > date, either about hashing or about more thorough EXPLAIN estimates--I have > been meaning to dig into it but haven't gotten around to it yet. > > You can get the increased performance by just spelling it in the more > verbose way, but be aware that count (distinct ...) deals with NULL > differently than select count(*) from (select DISTINCT...) does, so they > are not exactly identical. > > Cheers, > > Jeff >