A way to optimize SQL for the last temporal row

2024-06-27 Thread aghart...@gmail.com

Hello everyone,
Sorry to bother you, but I have a query that is driving me crazy.

I need the last valid record at the temporal level for each value of a
specific parameter.


First some data:
Linux Rocky 8.10 environment, minimal installation (on a KVM VM under
Fedora 40).

PostgreSQL 16.3, installed following the official PostgreSQL guide.
effective_cache_size = '1000 MB';
shared_buffers = '500 MB';
work_mem = '16MB';
The configuration changes are deliberately minimal so that anyone can
reproduce the problem.


Table script:
CREATE TABLE test_table
(
    pk_id int NOT NULL,
    integer_field_1 int,
    integer_field_2 int,
    datetime_field_1 timestamp,
    primary key (pk_id)
);

-- insert 4M records
insert into test_table(pk_id) select generate_series(1,4000000,1);

-- now set some random data, distributed between specific ranges (as in my production table)

update test_table set
    datetime_field_1 = timestamp '2000-01-01 00:00:00' + random() *
        (timestamp '2024-05-31 23:59:59' - timestamp '2000-01-01 00:00:00'),
    integer_field_1 = floor(random() * (6-1+1) + 1)::int,
    integer_field_2 = floor(random() * (20-1+1) + 1)::int;
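
A quick sanity check on the generated data (optional; the expected
ranges follow from the floor(random() ...) expressions above):

-- should report integer_field_1 in 1..6 and integer_field_2 in 1..20
select min(integer_field_1), max(integer_field_1),
       min(integer_field_2), max(integer_field_2)
from test_table;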


-- indexes
CREATE INDEX idx_test_table_integer_field_1 ON test_table (integer_field_1);
CREATE INDEX xtest_table_datetime_field_1 ON test_table (datetime_field_1 desc);
CREATE INDEX idx_test_table_integer_field_2 ON test_table (integer_field_2);


-- vacuum
vacuum full test_table;
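
Since vacuum full rewrites the table but does not refresh planner
statistics, an explicit ANALYZE after the bulk update is probably
worthwhile:

analyze test_table;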



Now the query:
explain (verbose, buffers, analyze)
with last_table_ids as materialized (
    select xx from (
        select LAST_VALUE(pk_id) over (partition by integer_field_2
                   order by datetime_field_1
                   RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) xx
        from test_table
        where integer_field_1 = 1
          and datetime_field_1 <= CURRENT_TIMESTAMP
    ) ww group by ww.xx
),
last_row_per_ids as (
    select tt.* from last_table_ids lt
    inner join test_table tt on (tt.pk_id = lt.xx)
)
select * /* or count(*) */ from last_row_per_ids;


This query, on my PC, takes 46 seconds!!!
I was expecting about 2-3 seconds (in line with my other queries on this
table), but it seems that the xtest_table_datetime_field_1 index is not
being used.
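
As a side check, beyond reading the EXPLAIN output, the
pg_stat_user_indexes view shows whether an index ever gets scanned;
something like:

select indexrelname, idx_scan
from pg_stat_user_indexes
where relname = 'test_table';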


Do you think there is a way to optimize the query?

Thanks so much for the support,

Agharta





Re: A way to optimize SQL for the last temporal row

2024-06-27 Thread aghart...@gmail.com

Hi,

You are right. Too quick a copy-and-paste on my part :-)

I take this opportunity to add a NOT insignificant detail.

Before executing the select query I clear the cache:

systemctl stop postgresql-16 && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql-16


I need a meaningful performance result even when the data is not in cache.
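
One way to see how much of the runtime is cold-cache I/O (a sketch;
enabling track_io_timing needs superuser or the corresponding SET
privilege):

set track_io_timing = on;  -- per-session; requires privilege
explain (analyze, buffers) select count(*) from test_table;
-- plan nodes then report I/O read timings alongside the buffer counts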


My best regards,

Agharta


On 27/06/24 5:27 PM, Ron Johnson wrote:
On Thu, Jun 27, 2024 at 11:20 AM aghart...@gmail.com wrote:

[snip]

-- insert 4M records
insert into test_table(pk_id) select generate_series(1,4000000,1);

-- now set some random data, distributed between specific ranges (as in my production table)
update test_table set
    datetime_field_1 = timestamp '2000-01-01 00:00:00' + random() *
        (timestamp '2024-05-31 23:59:59' - timestamp '2000-01-01 00:00:00'),
    integer_field_1 = floor(random() * (6-1+1) + 1)::int,
    integer_field_2 = floor(random() * (20-1+1) + 1)::int;


-- indexes
CREATE INDEX idx_test_table_integer_field_1 ON test_table (integer_field_1);
CREATE INDEX xtest_table_datetime_field_1 ON test_table (datetime_field_1 desc);
CREATE INDEX idx_test_table_integer_field_2 ON test_table (integer_field_2);


Off-topic: save some resources by vacuuming before creating indices.
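
That is, order the script roughly like this (a sketch; the elided
statements are the ones above):

insert into test_table(pk_id) select generate_series(1,4000000,1);
update test_table set ... ;   -- the random-data update above
vacuum full test_table;       -- rewrite the table while it has no indexes
create index ... ;            -- create the three indexes last

VACUUM FULL rebuilds every index on the table, so creating the indexes
first means building them twice.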


Re: A way to optimize SQL for the last temporal row

2024-06-27 Thread aghart...@gmail.com

Hi,

Thanks for your reply.

About the syntax you're right, but I couldn't think of anything better :(((
I'm here for that too, to solve the problem in a fancy way, with your
great support.


In practice, I need to get back a dataset with the last association (the
most recent record by datetime) for each distinct value of
integer_field_2, given the filter integer_field_1 = 1.
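
In SQL terms this is the classic one-row-per-group problem; a
row_number() formulation of the same intent (equivalent in meaning to my
CTE version, though not necessarily faster) would be:

select *
from (
    select t.*,
           row_number() over (partition by integer_field_2
                              order by datetime_field_1 desc) as rn
    from test_table t
    where integer_field_1 = 1
      and datetime_field_1 <= CURRENT_TIMESTAMP
) ranked
where rn = 1;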


As said in another reply, the query needs to be performant even if the
data is not in cache (systemctl stop postgresql-16 && sync && echo 3 >
/proc/sys/vm/drop_caches && systemctl start postgresql-16).


Many thanks for your support.

Agharta









On 27/06/24 5:33 PM, David G. Johnston wrote:
On Thursday, June 27, 2024, aghart...@gmail.com wrote:



Now the query:
explain (verbose, buffers, analyze)
with last_table_ids as materialized (
    select xx from (
        select LAST_VALUE(pk_id) over (partition by integer_field_2
                   order by datetime_field_1
                   RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) xx
        from test_table
        where integer_field_1 = 1
          and datetime_field_1 <= CURRENT_TIMESTAMP
    ) ww group by ww.xx
),
last_row_per_ids as (
    select tt.* from last_table_ids lt
    inner join test_table tt on (tt.pk_id = lt.xx)
)
select * /* or count(*) */ from last_row_per_ids;


Do you think there is a way to optimize the query?


Write a lateral subquery to pick the first row of a descending ordered
query? Using GROUP BY to select ranked rows is both semantically wrong
and potentially optimization-blocking.
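
Roughly this shape (untested; the subquery producing the distinct
integer_field_2 values is an assumption about where your groups come
from):

select picked.*
from (select distinct integer_field_2 from test_table) g
cross join lateral (
    select t.*
    from test_table t
    where t.integer_field_2 = g.integer_field_2
      and t.integer_field_1 = 1
      and t.datetime_field_1 <= CURRENT_TIMESTAMP
    order by t.datetime_field_1 desc
    limit 1
) picked;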


I’m going by the general query form and the “last row” aspect of the
question. I haven’t confirmed that your specific query can benefit from
this approach. The window expression does give me pause.


David J.


Re: A way to optimize SQL for the last temporal row

2024-06-28 Thread aghart...@gmail.com

HOO-HA! This is HUGE!

Only 2.2 seconds on my data. Amazing!

distinct on (field) followed by "*" is a hidden gem!

Thank you so much, and thanks to everyone who helped me! Thank you very
much!!


Cheers,

Agharta



On 27/06/24 6:16 PM, David Rowley wrote:

On Fri, 28 Jun 2024, 3:20 am aghart...@gmail.com wrote:



Now the query:
explain (verbose, buffers, analyze)
with last_table_ids as materialized (
    select xx from (
        select LAST_VALUE(pk_id) over (partition by integer_field_2
                   order by datetime_field_1
                   RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) xx
        from test_table
        where integer_field_1 = 1
          and datetime_field_1 <= CURRENT_TIMESTAMP
    ) ww group by ww.xx
),
last_row_per_ids as (
    select tt.* from last_table_ids lt
    inner join test_table tt on (tt.pk_id = lt.xx)
)
select * /* or count(*) */ from last_row_per_ids;


This query, on my PC, takes 46 seconds!!!


(Away from laptop and using my phone)

Something like:

select distinct on (integer_field_2) *
from test_table
where integer_field_1 = 1
  and datetime_field_1 <= CURRENT_TIMESTAMP
order by integer_field_2, datetime_field_1 desc;


Might run a bit faster. However, if it's slow due to I/O then maybe not
much faster. Your version took about 5 seconds on my phone and my
version ran in 1.5 seconds.


It's difficult for me to check the results match with each query from 
my phone. A quick scan of the first 10 or so records looked good.


If the updated query is still too slow on cold cache then faster disks 
might be needed.
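
Also untested from here: a composite index along these lines (the name
is made up) might cut the reads by matching the filter and the sort
order directly:

create index idx_test_table_f1_f2_dt
    on test_table (integer_field_1, integer_field_2, datetime_field_1 desc);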


David