Thank you Kevin and Jeff for the responses.
These are very helpful.

On Fri, Sep 6, 2013 at 10:48 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Friday, September 6, 2013, pg noob wrote:
>
>>
>> Hi all,
>>
>> I'm curious about some of the query estimates that I'm seeing with
>> queries that use DISTINCT.
>> I am using postgres 8.4.13
>>
>> I did a couple of quick tests, and found that PostgreSQL seems to do some
>> expensive work to
>> return DISTINCT rows.  This is contrary to what I was expecting because I
>> expected that
>> if it knows that it is building a distinct result set that it would
>> maintain
>> some kind of ordering as it built the results to be able to quickly throw
>> away duplicates without any real overhead to the query.
>>
>
> There is no cost-free way of doing that.   You either have to sort, or
> hash, or walk in index order, and none of those things are free.
>
> ...
>
>
>> But it gets a bit stranger than that.
>> According to the docs,
>> “DISTINCT ON column will eliminate all duplicates in the specified
>> column; this is equivalent to using GROUP BY column.”
>>
>
> DISTINCT ON is not the same thing as DISTINCT, which is not the same as
> count(distinct ...)
>
>
>
>
>>
>> Note that it says that using distinct is EQUIVALENT to using GROUP BY.
>> And yet, look at the execution time difference of DISTINCT vs. GROUP BY:
>>
>> db=# explain analyze select count(distinct(column1)) from table1;
>>                                                           QUERY
>> PLAN
>>
>> ------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual
>> time=18596.606..18596.607 rows=1 loops=1)
>>    ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8)
>> (actual time=0.003..2391.644 rows=16151368 loops=1)
>>  Total runtime: 18596.631 ms
>> (3 rows)
>>
>> db=# explain analyze select count(foo.column1) from (select column1 from
>> table1 group by column1) as foo;
>>
>
> Compare that to:
>
> explain analyze select count(*) from (select DISTINCT column1 from table1)
> as foo;
>
>
>>
>> Any ideas why this is?  Or what I am doing wrong?
>>
>
> The code that executes "count(distinct ....)" has never been taught how to
> use hash aggregation, whereas DISTINCT and GROUP BY have been.  It always
> sorts, which is often much slower than a hash, and also often much less
> memory efficient.   I speculate that this is because the implementation of
> count(distinct ...) is really ancient code and never been brought up to
> date, either about hashing or about more thorough EXPLAIN estimates--I have
> been meaning to dig into it but haven't gotten around to it yet.
>
> You can get the increased performance by just spelling it in the more
> verbose way, but be aware that count (distinct ...) deals with NULL
> differently than select count(*) from (select DISTINCT...)  does, so they
> are not exactly identical.
>
> Cheers,
>
> Jeff
>

Reply via email to