date:20231115

Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou


Hello list,

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or
not. But I get different query plans, I get an extra HashAggregate node
in the case of SELECT DISTINCT. Any idea why?

Reading the docs, I understand that postgres does by default
EXCEPT DISTINCT, but I don't see the same for the 1st clause,
SELECT [DISTINCT].


Steps to reproduce:

CREATE TABLE t1(i INTEGER NOT NULL);
CREATE TABLE t2(i INTEGER NOT NULL);

INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
INSERT INTO t2 VALUES (3);

SELECT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)


EXPLAIN SELECT * FROM t1 EXCEPT SELECT * FROM t2;
   QUERY PLAN
-
 HashSetOp Except  (cost=0.00..160.25 rows=200 width=8)
   ->  Append  (cost=0.00..147.50 rows=5100 width=8)
 ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

EXPLAIN SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
   QUERY PLAN
-
 HashSetOp Except  (cost=41.88..127.50 rows=200 width=8)
   ->  Append  (cost=41.88..120.62 rows=2750 width=8)
 ->  Subquery Scan on "*SELECT* 1"  (cost=41.88..45.88 rows=200 width=8)
   ->  HashAggregate  (cost=41.88..43.88 rows=200 width=4)
 Group Key: t1.i
 ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(8 rows)



Regards,
Dimitris

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Erik Wienhold

On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote:
> SELECT [DISTINCT] ... EXCEPT ...
> 
> In this query I get the same results regardless of including DISTINCT or
> not. But I get different query plans, I get an extra HashAggregate node
> in the case of SELECT DISTINCT. Any idea why?

As Tom Lane recently wrote[1] EXCEPT is not optimized and will operate
on the subqueries which are planned independently.

[1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us

-- 
Erik

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou


On Wed, 15 Nov 2023, Erik Wienhold wrote:


On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote:

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or
not. But I get different query plans, I get an extra HashAggregate node
in the case of SELECT DISTINCT. Any idea why?


As Tom Lane recently wrote[1] EXCEPT is not optimized and will operate
on the subqueries which are planned independently.

[1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us


Heh, as he wrote to me even. :-) I just wanted to make sure that this is
indeed a missing optimisation of the planner, and that the queries are
effectively the same. Thank you for clarifying.

As mentioned, the docs don't make it clear if the SELECT DISTINCT part is
implied or not, only the EXCEPT DISTINCT part is clearly on by default.


Dimitris

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Erik Wienhold

On 2023-11-15 12:12 +0100, Dimitrios Apostolou wrote:
> On Wed, 15 Nov 2023, Erik Wienhold wrote:
> 
> > On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote:
> > > SELECT [DISTINCT] ... EXCEPT ...
> > > 
> > > In this query I get the same results regardless of including DISTINCT or
> > > not. But I get different query plans, I get an extra HashAggregate node
> > > in the case of SELECT DISTINCT. Any idea why?
> > 
> > As Tom Lane recently wrote[1] EXCEPT is not optimized and will operate
> > on the subqueries which are planned independently.
> > 
> > [1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us
> 
> Heh, as he wrote to me even. :-) I just wanted to make sure that this is
> indeed a missing optimisation of the planner, and that the queries are
> effectively the same. Thank you for clarifying.
> 
> As mentioned, the docs don't make it clear if the SELECT DISTINCT part is
> implied or not, only the EXCEPT DISTINCT part is clearly on by default.

SELECT ALL is the default as spelled out in [1].  DISTINCT as the
default for UNION/EXCEPT/INTERSECT makes sense because those are set
operators.  I guess SELECT ALL is the default because SQL allows
duplicate rows (contrary to the relation model) and the user should
instead be explicit about wanting distinct rows which requires
additional computation.

But when combining subqueries with the default UNION/EXCEPT/INTERSECT
you effectively get SELECT DISTINCT ... UNION SELECT DISTINCT ... when
it comes to the result.

[1] https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT

-- 
Erik

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Laurenz Albe

On Wed, 2023-11-15 at 10:57 +0100, Dimitrios Apostolou wrote:
> SELECT [DISTINCT] ... EXCEPT ...
> 
> In this query I get the same results regardless of including DISTINCT or
> not. But I get different query plans, I get an extra HashAggregate node
> in the case of SELECT DISTINCT. Any idea why?

The DISTINCT is superfluous, because EXCEPT already removes duplicate rows.
However, the planner does not invest extra processing cycles to detect
that you wrote a superfluous DISTINCT, and it does not remove it.
As a consequence, you end up with a pointless extra execution plan node
that does not achieve anything except slowing down the query.

Remove the DISTINCT.

Yours,
Laurenz Albe

db size difference on primary and standby

2023-11-15 Thread Eric Wong

Hi postgres guru out there:

We discovered a disk usage difference between our primary and standby node.

After much investigation, the disk usage difference is caused by some
extra files within the database directory 16432.  The extra files are
only found on the standby node, not present on the primary.

In the data directory ("15/data/base/16432"), the extra files do not
seem to belong to any table. Normally, the files correspond to the
column "relfilenode" in the "pg_class" system table; however, when we
compare the entries in that column with the file names, around 600 of
the files do not have a corresponding entry in the "pg_class" table.
The same applies to the files in the "san" tablespace
("pg_tblspc/san/PG_15_202209061/16432").

Has anyone seen this behavior before?  Also, we also want to know if
it is safe to remove all those extra files that do not have an entry
in the "pg_class" table?

Thanks in advance
Eric

Re: db size difference on primary and standby

2023-11-15 Thread Laurenz Albe

On Thu, 2023-11-16 at 14:49 +0800, Eric Wong wrote:
> We discovered a disk usage difference between our primary and standby node.
> 
> After much investigation, the disk usage difference is caused by some
> extra files within the database directory 16432.  The extra files are
> only found on the standby node, not present on the primary.
> 
> In the data directory ("15/data/base/16432"), the extra files do not
> seem to belong to any table. Normally, the files correspond to the
> column "relfilenode" in the "pg_class" system table; however, when we
> compare the entries in that column with the file names, around 600 of
> the files do not have a corresponding entry in the "pg_class" table.
> The same applies to the files in the "san" tablespace
> ("pg_tblspc/san/PG_15_202209061/16432").
> 
> Has anyone seen this behavior before?  Also, we also want to know if
> it is safe to remove all those extra files that do not have an entry
> in the "pg_class" table?

If the files only exist on the standby, you can safely remove them.
Be careful, however: if you make a mistake, you'll corrupt your database.

Orphaned files like that can be left behind after a database crash.
Did you experience crashes on the standby?

Yours,
Laurenz Albe

Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

db size difference on primary and standby

Re: db size difference on primary and standby

7 matches

Site Navigation

Mail list logo

Footer information