Optimizing execution of expensive subqueries

2018-07-11 Thread Mathieu Fenniak
Hi pgsql-general!

I'm currently looking at a query that is generally selecting a bunch of
simple columns from a table, and also performing some subqueries to
aggregate related data, and then sorting by one of the simple columns and
paginating the result.

eg.

SELECT
  tbl.field1, tbl.field2, tbl.field3, ...,
  (SELECT SUM(Duration) FROM anothertbl
   WHERE anothertbl.UserId = tbl.UserId AND anothertbl.ThingyId = 1) AS Thingy1Sum,
  ... repeat for multiple thingies ...
FROM
  tbl
ORDER BY tbl.field1 LIMIT 20

I'm finding that if "tbl" contains hundreds of thousands of rows, the
subqueries are executed hundreds of thousands of times.  Because of the
sorting and pagination, this appears to be unnecessary, and the result is
slow performance.  (PostgreSQL 9.5.9 server)

I've only found one solution so far, which is to perform the sort &
pagination in a CTE, and the subqueries externally.  Are there any other
approaches that can be taken to optimize this and prevent the unnecessary
computation?

CTE rewrite:

WITH cte AS (
SELECT
  tbl.field1, tbl.field2, tbl.field3, tbl.UserId
FROM
  tbl
ORDER BY tbl.field1 LIMIT 20
)
SELECT cte.*,
  (SELECT SUM(Duration) FROM anothertbl
   WHERE anothertbl.UserId = cte.UserId AND anothertbl.ThingyId = 1) AS Thingy1Sum,
  ... repeat for multiple thingies ...
FROM cte;
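
Note for anyone reading this on a newer server: the fence above relies on
CTEs always being materialized, which is the case on 9.5.  On PostgreSQL 12
and later a single-reference CTE is inlined by default, so the fence has to
be made explicit.  A sketch using the same hypothetical names:

WITH cte AS MATERIALIZED (  -- MATERIALIZED (PG 12+) keeps the sort/limit ahead of the subqueries
  SELECT tbl.field1, tbl.field2, tbl.field3, tbl.UserId
  FROM tbl
  ORDER BY tbl.field1 LIMIT 20
)
SELECT cte.*,
  (SELECT SUM(Duration) FROM anothertbl
   WHERE anothertbl.UserId = cte.UserId AND anothertbl.ThingyId = 1) AS Thingy1Sum
FROM cte;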

Thanks for any thoughts you have,

Mathieu Fenniak


Re: Optimizing execution of expensive subqueries

2018-07-13 Thread Mathieu Fenniak
Hi Hellmuth,

Thanks for the response and the new approach; a LATERAL join is new to me.
Unfortunately it seems to have the same performance characteristics and the
same query plan.  The aggregation in the lateral join still executes for
every row (e.g. if my base query has 50 rows, I get "Aggregate (...
loops=50)" in the plan), unaffected by the LIMIT node that comes later in
the plan.

The CTE approach seems to be the only one I can use to improve performance
right now, but requires significant application code changes.
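
For the archives, one more variant that seems equivalent to the CTE: do the
sort and limit in a derived table, and lateral-join the aggregate against
that.  A sketch with the same hypothetical names (untested against our real
schema):

SELECT t.*, b.Thingy1Sum
FROM (
  -- sort/paginate first, so the aggregate below only runs for the 20 rows kept
  SELECT tbl.field1, tbl.field2, tbl.field3, tbl.UserId
  FROM tbl
  ORDER BY tbl.field1 LIMIT 20
) AS t
LEFT JOIN LATERAL (
  SELECT SUM(Duration) AS Thingy1Sum
  FROM anothertbl
  WHERE anothertbl.UserId = t.UserId AND anothertbl.ThingyId = 1
) AS b ON true
ORDER BY t.field1;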

Mathieu


On Wed, Jul 11, 2018 at 1:55 PM Hellmuth Vargas  wrote:

> Hi
>
> Try this way:
>
> SELECT
>   tbl.field1, tbl.field2, tbl.field3, ...,
>   b.Thingy1Sum,
>   ... repeat for multiple thingies ...
> FROM
>   tbl
>   LEFT JOIN LATERAL (
>     SELECT anothertbl.UserId, SUM(Duration) AS Thingy1Sum
>     FROM anothertbl
>     WHERE anothertbl.UserId = tbl.UserId AND anothertbl.ThingyId = 1
>     GROUP BY 1) AS b ON tbl.UserId = b.UserId
> ORDER BY tbl.field1 LIMIT 20
>
>
> --
> Cordially,
>
> Ing. Hellmuth I. Vargas S.
>
>
>


Complex filters -> Bad row estimates -> bad query plan

2019-08-21 Thread Mathieu Fenniak
Hi all,

I have a database query with a number of "simple" WHERE clauses, a number
of "complex" subquery-based WHERE clauses, and one NOT EXISTS clause; it
looks something like this:

SELECT ...some fields... FROM Table1
WHERE
  Field1 IN (1, 2, 3, 4, 21, 24) AND  -- simple filter
  Field2 <> 1 AND  -- simple filter
  Field3 >= '2019-07-08' AND  -- simple filter
  Field3 <= '2019-08-18' AND  -- simple filter
  NOT EXISTS (SELECT 1 FROM Table2
    WHERE Table2.Status = 2 AND Table2.SomeId = Table1.Id) AND  -- anti-join
  COALESCE(
    (SELECT Status FROM Table3 WHERE Table3.SomeId = Table1.Id),
    (SELECT Status FROM Table4 WHERE Table4.SomeId = Table1.Id)) = 2  -- "complex" condition

The problem I'm encountering is that I've observed degraded performance in
some cases where the anti-join for the NOT EXISTS clause is planned based
upon poor row estimates for Table1.  All of the other filters, and the
SubPlan filter(s) for the complex clauses, lead the query planner to
estimate that only 1 row from Table1 will result, so a Nested Loop Anti
Join is used and the RHS of that nested loop is a seqscan on Table2.  The
reality is that many thousands of records match all the conditions; a Merge
Anti Join or Hash Anti Join would be a better query plan.

I've tested the query planner with just the simpler conditions, and it
makes pretty reasonable estimates about the row count (+/- 10%).  Adding
the NOT EXISTS results in a Merge Anti Join, and performance is great.

Adding the more "complex" conditions (there are potentially several of
these subquery-based conditions) results in the estimated row count
dropping to 1, and performance dives.

I know there are no "query hints" in PostgreSQL... any thoughts on
alternative approaches here?  The only option I've used in the past for
this is creating specialized indexes, which can provide more targeted
statistics; but it's not applicable here since the "complex" conditions use
data from another table in a subquery.

Appreciate any thoughts, theories, or directions. :-)  Thanks,

Mathieu


Re: Complex filters -> Bad row estimates -> bad query plan

2019-08-21 Thread Mathieu Fenniak
Thanks Michael.

I'll give some join alternatives a shot first... but, that's cool.

What about OFFSET 0 makes this approach work?  I'm thinking the OFFSET 0
creates an optimization barrier that prevents the planner from collapsing
that sub-query into the top query, and so enforces the evaluation order?
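
For anyone finding this thread later, here is a sketch of how I read that
suggestion as applying to the query above (untested; it assumes Table1.Id
is selected in the inner query so the outer conditions can reference it):

SELECT * FROM (
  SELECT Table1.Id, ...some fields... FROM Table1
  WHERE
    Field1 IN (1, 2, 3, 4, 21, 24) AND
    Field2 <> 1 AND
    Field3 >= '2019-07-08' AND
    Field3 <= '2019-08-18' AND
    NOT EXISTS (SELECT 1 FROM Table2
      WHERE Table2.Status = 2 AND Table2.SomeId = Table1.Id)
  OFFSET 0  -- fence: keeps the planner from pulling this sub-query up into the outer query
) AS sub
WHERE COALESCE(
    (SELECT Status FROM Table3 WHERE Table3.SomeId = sub.Id),
    (SELECT Status FROM Table4 WHERE Table4.SomeId = sub.Id)) = 2;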

I appreciate your thoughts; thank you very much for the feedback.

Mathieu


On Wed, Aug 21, 2019 at 12:08 PM Michael Lewis  wrote:

> If those conditions that are throwing off the stats are expected to be
> minimally impactful/filtering few rows, then you can use the one
> tried-and-true optimizer hint (aside from materialized CTEs, stylized
> indexes, etc) --- OFFSET 0 at the end of a sub-query.
>
> SELECT * FROM ( [your existing query without the sub-selects that are
> complicated and produce bad estimates] OFFSET 0 ) AS sub WHERE [your other
> conditions that don't produce good estimates]
>
> If there is correlation between field1 and field2, you might also look at
> CREATE STATISTICS assuming you are on PG 10 or 11.
>
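> A minimal sketch of that (assuming PG 10+, and assuming the correlated
> columns are Field1 and Field2 on Table1; the statistics name is made up):
>
> -- teach the planner about the functional dependency between the two columns
> CREATE STATISTICS table1_field1_field2 (dependencies) ON Field1, Field2 FROM Table1;
> ANALYZE Table1;
>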
> Before I do any of that, I would try LEFT JOIN for Table3 and Table4 then
> use the where condition "AND 2 = COALESCE( Table3.Status, Table4.Status )"
> and see if the optimizer likes that option better.
>
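> A rough sketch of that rewrite (untested; it assumes Table3 and Table4
> each have at most one row per SomeId, so the LEFT JOINs cannot multiply
> rows from Table1):
>
> SELECT ...some fields... FROM Table1
> LEFT JOIN Table3 ON Table3.SomeId = Table1.Id
> LEFT JOIN Table4 ON Table4.SomeId = Table1.Id
> WHERE
>   Field1 IN (1, 2, 3, 4, 21, 24) AND
>   Field2 <> 1 AND
>   Field3 >= '2019-07-08' AND
>   Field3 <= '2019-08-18' AND
>   NOT EXISTS (SELECT 1 FROM Table2
>     WHERE Table2.Status = 2 AND Table2.SomeId = Table1.Id) AND
>   2 = COALESCE(Table3.Status, Table4.Status)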