array_agg() does not stop aggregating according to HAVING clause

2024-08-17 Thread Dimitrios Apostolou

Hello list,

I have a query that goes through *billions* of rows and, for each
infrequent "datatag" (HAVING count(datatag_n) < 10), collects all the
IDs of its entries (array_agg(run_n)). Here is the full query:


INSERT INTO infrequent_datatags_in_this_chunk
  SELECT datatag, datatags.datatag_n, array_agg(run_n)
FROM runs_raw
JOIN datatags USING(datatag_n)
WHERE workitem_n >= 295
  AND workitem_n <  714218
  AND datatag IS NOT NULL
GROUP BY datatags.datatag_n
HAVING  count(datatag_n) < 10
  AND   count(datatag_n) > 0  -- Not really needed because of the JOIN above
;

The runs_raw table has run_n as its primary key, and an index on
workitem_n. The datatags table is a key-value store with datatag_n as its
primary key.
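
(For concreteness, a minimal reconstruction of that schema; the column
types are guesses, only the keys and the index are stated above:)

-- Guessed reconstruction of the two tables; types are assumptions.
CREATE TABLE datatags (
    datatag_n  integer PRIMARY KEY,
    datatag    text
);

CREATE TABLE runs_raw (
    run_n       bigint PRIMARY KEY,
    workitem_n  integer,
    datatag_n   integer REFERENCES datatags
);

CREATE INDEX ON runs_raw (workitem_n);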

The problem is that this is extremely slow (5 hours), most likely because
it creates tens of gigabytes of temporary files, as I see in the logs. I
suspect that it is writing to disk the array_agg(run_n) of all entries,
not only those HAVING count(datatag_n) < 10. (I might be wrong, though;
this is only an assumption based on the amount of data written, and I
don't know of any way to examine the temporary files.) While this query
goes through billions of rows, the rows with infrequent datatags are
maybe 10M.
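
(For what it's worth, the server does expose cumulative temp-file
counters per database, though not the files' contents; a quick check:)

-- Temporary-file statistics for the current database, cumulative
-- since the last stats reset: number of files and total bytes.
SELECT temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
WHERE datname = current_database();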

How do I tell postgres to stop aggregating when count>=10?

Thank you in advance,
Dimitris




Re: array_agg() does not stop aggregating according to HAVING clause

2024-08-17 Thread Tom Lane
Dimitrios Apostolou writes:
> I have a query that goes through *billions* of rows and, for each
> infrequent "datatag" (HAVING count(datatag_n) < 10), collects all the
> IDs of its entries (array_agg(run_n)). Here is the full query:

> INSERT INTO infrequent_datatags_in_this_chunk
>SELECT datatag, datatags.datatag_n, array_agg(run_n)
>  FROM runs_raw
>  JOIN datatags USING(datatag_n)
>  WHERE workitem_n >= 295
>AND workitem_n <  714218
>AND datatag IS NOT NULL
>  GROUP BY datatags.datatag_n
>  HAVING  count(datatag_n) < 10
>    AND   count(datatag_n) > 0  -- Not really needed because of the JOIN above
> ;

> The problem is that this is extremely slow (5 hours), most likely because
> it creates tens of gigabytes of temporary files as I see in the logs. I
> suspect that it is writing to disk the array_agg(run_n) of all entries and
> not only those HAVING count(datatag_n)<10.

Well, yes: the two aggregates (array_agg and count) are computed
concurrently in a single Aggregate plan node scanning the output
of the JOIN.  There's no way to apply the HAVING filter until
after the aggregation is finished.

I think this approach is basically forced by the SQL standard's
semantics for grouping/aggregation.
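
In other words, the query behaves roughly like this illustrative rewrite
(not something the planner can shortcut):

-- Every group's array is fully built before the count filter runs.
SELECT datatag, datatag_n, runs
FROM (
    SELECT datatag, datatags.datatag_n,
           array_agg(run_n) AS runs,
           count(datatag_n) AS n
    FROM runs_raw
    JOIN datatags USING (datatag_n)
    WHERE workitem_n >= 295
      AND workitem_n <  714218
      AND datatag IS NOT NULL
    GROUP BY datatag, datatags.datatag_n
) AS agg
WHERE n < 10;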

> How do I tell postgres to stop aggregating when count>=10?

The only way to do this would be to do two separate passes of
aggregation in separate sub-queries.  Perhaps like

WITH rare AS (
  SELECT datatag_n
  FROM runs_raw
  WHERE workitem_n >= 295
    AND workitem_n <  714218
    AND datatag_n IS NOT NULL  -- datatag itself lives in datatags; filter on the key here
  GROUP BY datatag_n
  HAVING  count(datatag_n) < 10
    AND   count(datatag_n) > 0
)
INSERT INTO infrequent_datatags_in_this_chunk
   SELECT datatag, datatags.datatag_n, array_agg(run_n)
 FROM runs_raw
 JOIN datatags USING(datatag_n)
 JOIN rare USING(datatag_n)
 GROUP BY datatags.datatag_n
;

I can't tell from what you said which level the workitem_n and
datatag conditions go at, so this is just a draft-quality
query.  But I think the structure is basically okay, given
that you said datatag_n is unique in datatags (so there's no
need to join it in the WITH step).
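
For instance, if the workitem_n window is meant to restrict both passes,
it can simply be repeated in the outer query; a sketch under that
assumption:

WITH rare AS (
  SELECT datatag_n
  FROM runs_raw
  WHERE workitem_n >= 295
    AND workitem_n <  714218
    AND datatag_n IS NOT NULL
  GROUP BY datatag_n
  HAVING count(datatag_n) < 10
  -- count(datatag_n) > 0 omitted: an all-NULL datatag_n group
  -- cannot join through USING(datatag_n) below anyway
)
INSERT INTO infrequent_datatags_in_this_chunk
SELECT datatag, datatags.datatag_n, array_agg(run_n)
FROM runs_raw
JOIN datatags USING (datatag_n)
JOIN rare USING (datatag_n)
WHERE workitem_n >= 295      -- assumed to bound this pass as well
  AND workitem_n <  714218
GROUP BY datatags.datatag_n;

The cheap counting pass shrinks the input of the expensive array_agg()
pass to only the rows whose datatag_n turned out to be infrequent.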

regards, tom lane




Re: array_agg() does not stop aggregating according to HAVING clause

2024-08-17 Thread Dimitrios Apostolou

On Sat, 17 Aug 2024, Tom Lane wrote:


Well, yes: the two aggregates (array_agg and count) are computed
concurrently in a single Aggregate plan node scanning the output
of the JOIN.  There's no way to apply the HAVING filter until
after the aggregation is finished.

I think this approach is basically forced by the SQL standard's
semantics for grouping/aggregation.


FWIW I also tried:

HAVING  array_length(array_agg(run_n), 1) < 10;

but I saw the same amount of temp files, at least in the short duration of
my test run. In hindsight that makes sense: the array still has to be
fully built before its length can be tested.

Thank you, I will split this into two passes like you suggested. It's just
that I'm doing another 3 passes over this table for different things I
calculate (different GROUP BY, different WHERE clauses) and I was hoping
to minimize the time spent. But avoiding the array_agg() over everything
is my top priority ATM so I'll definitely try.
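
(If those extra passes differ only in their GROUP BY, GROUPING SETS could
compute several groupings in one scan of the table; a hypothetical sketch,
since the other queries aren't shown here:)

-- Hypothetical: one scan of runs_raw feeding two different groupings.
SELECT workitem_n, datatag_n, count(*) AS n
FROM runs_raw
WHERE workitem_n >= 295
  AND workitem_n <  714218
GROUP BY GROUPING SETS ((workitem_n), (datatag_n));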

Regards,
Dimitris




Re: Column type modification in big tables

2024-08-17 Thread Greg Sabino Mullane
On Thu, Aug 15, 2024 at 4:41 PM Lok P wrote:

> Additionally, if we are okay with the 7.5 hrs of downtime, is my
> calculation/extrapolation of total time consumption based on a sample
> table, for a direct alter, accurate? Because in that case, I was thinking
> it's less complex and also less error-prone to just do it in a single
> alter command rather than going for multiple steps of detach, alter,
> attach partition.
>

Well, it's meant to get you a ballpark figure, but yes, it seems as though
you will probably be okay. But for something this critical, involving
production downtime, I would try out the exact command and see how long it
takes on a test system. Restore a backup (you have backups, I hope) or use
pg_basebackup to make a copy of your prod system somewhere.
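
For example, on the restored copy (table and column names below are just
placeholders):

-- In psql: enable per-statement timing, then run the exact command
-- planned for production. Names are placeholders.
\timing on
ALTER TABLE big_table ALTER COLUMN id TYPE bigint;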

Cheers,
Greg