The actual thing that might be good to see is the query plan (explain).
Selecting '*' is commonly regarded as an issue: in many cases only a subset of
the columns is needed, but I don't know your exact case.
If only a limited number of columns is actually needed from the table, it might
help to create an index that contains all of those columns, either as key
columns or as INCLUDE columns.
This is called a covering index, and it can remove the need to read the actual
table, which shows up in the plan as the row source 'Index Only Scan'.
But that potential can only be assessed by looking at the explain output.
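As a sketch, using the table and predicate from your query below (the column
names in the SELECT list and INCLUDE clause are placeholders; substitute the
columns your query actually reads), the covering index and the plan check could
look like:

```sql
-- Partial covering index: the key column the query filters/orders on, plus
-- the payload columns it reads (some_column/another_column are hypothetical).
CREATE INDEX my_table_checked_verified_idx
    ON my_table (insert_timestamp DESC)
    INCLUDE (some_column, another_column)
    WHERE hasbeenchecked = true AND hasbeenverified = true;

-- Inspect the plan: look for "Index Only Scan" and a low "Heap Fetches" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT some_column, another_column
FROM my_table
WHERE hasbeenchecked = true
  AND hasbeenverified = true
  AND insert_timestamp <= '2023-09-01 00:00:00.000'
LIMIT 1000;
```

Note that an index-only scan only pays off if the query stops selecting '*',
since the index cannot cover columns it does not contain.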

A covering index needs the visibility map to be up to date for the blocks
involved, otherwise the table must still be visited to get the latest tuple
state. Vacuuming keeps the visibility map current.
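For example, you could vacuum the table before the daily run and check how much
of it is marked all-visible:

```sql
-- Vacuum updates the visibility map, so index-only scans can skip heap visits.
VACUUM my_table;

-- Pages marked all-visible versus total pages: the closer relallvisible is
-- to relpages, the fewer heap fetches an index-only scan needs.
SELECT relallvisible, relpages
FROM pg_class
WHERE relname = 'my_table';
```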

When your query is as efficient as it can be, there are two things left.
One is that blocks in the database buffer cache that are not frequently
accessed will age out in favour of blocks that are accessed more recently.
The same mechanism takes place at the operating system level: postgres reads
data buffered, which means the operating system caches the IOs for the
database blocks too.

This means that if you query data stored in blocks that have not been used
recently, those blocks will be present neither in the database cache nor in
the operating system cache, and thus require physical IO from disk. If the
number of blocks is modest relative to the cache sizes, another execution of
the same SQL can take advantage of the now-cached blocks, and thus run with
much lower latency.

You describe the query as using a timestamp. If the timestamp moves forward
in time and the amount of data stays equal over time, the latency for the
two scenarios should remain stable.
If the amount of data increases over time, so that more blocks need to be
read because more rows have to be scanned to get a result, the latency will
increase.

Frits Hoogland




> On 22 Sep 2023, at 10:35, Koen De Groote <kdg....@gmail.com> wrote:
> 
> Alright.
> 
> So, if I want to speed up the query, apart from trying to vacuum it 
> beforehand, I suspect I've hit the limit of what this query can do?
> 
> Because, the table is just going to keep growing. And it's usually a query 
> that runs one time per day, so it's a cold run each time.
> 
> Is this just going to get slower and slower and there's nothing that can be 
> done about it?
> 
> Regards,
> Koen De Groote
> 
> 
> 
> On Thu, Sep 21, 2023 at 9:30 PM Laurenz Albe <laurenz.a...@cybertec.at 
> <mailto:laurenz.a...@cybertec.at>> wrote:
>> On Thu, 2023-09-21 at 17:05 +0200, Koen De Groote wrote:
>> > I'm doing the following query:
>> > select * from my_table where hasbeenchecked = true and hasbeenverified = 
>> > true and insert_timestamp <= '2023-09-01 00:00:00.000' limit 1000;
>> > 
>> > The date is an example, it is the format that is used in the query.
>> > 
>> > The table has 81M rows. Is 50GB in size. And the index is 34MB
>> > 
>> > The index is as follows:
>> > btree (insert_timestamp DESC) WHERE hasbeenchecked = true AND 
>> > hasbeenverified = true
>> > 
>> > I'm seeing a slow query first, then a fast one, and if I move the date, a 
>> > slow query again.
>> > 
>> > What I'm seeing is:
>> > Attempt 1:
>> > Hit: 5171(40MB)
>> > Read: 16571(130MB)
>> > Dirtied: 3940(31MB)
>> > 
>> > Attempt 2:
>> > Hit: 21745 (170MB)
>> > Read: Nothing
>> > Dirtied: Nothing.
>> > 
>> > It's slow once, then consistently fast, and then slow again if I move the 
>> > date around.
>> > And by slow I mean: around 60 seconds. And fast is below 1 second.
>> 
>> That's normal behavior: after the first execution, the data are cached, so 
>> the query
>> becomes much faster.
>> 
>> Dirtying pages happens because the first reader has to set hint bits, which 
>> is an extra
>> chore.  You can avoid that if you VACUUM the table before you query it.
>> 
>> Yours,
>> Laurenz Albe
