Re: Corrupt btree index includes rows that don't match

Erik Johnston Fri, 04 Jul 2025 06:49:59 -0700

Hi, a quick update:

- We have discovered that the corruption was present from before the libicu update.
- We ran `pg_amcheck --index state_groups_state_type_idx --heapallindexed matrix`, which returned nothing.
- We believe that means (and it matches what we see when sampling) that the index has gained extra entries, i.e. that for a given state group it does return all the relevant rows in the table /plus/ extra rows.

We are also seeing old state groups starting to point at rows that have only just been inserted. For example, querying for 353864583 on the primary returns that row plus four rows that have been inserted today, but on the backup from last week an index only scan for 353864583 returns only one row. This makes it feel like the corruption is ongoing? Nothing should have modified that state group in the interim (they are generally immutable).

This naively feels like when inserting a new row we sometimes add the row to the index twice: once pointing from the correct state group to the new row, and once from an old state group to the new row?



Thanks,
Erik

On 03/07/2025 18:07, Erik Johnston wrote:

Hello,
We're looking into a problem with our application and have tracked it down to index corruption, whereby we have many index rows pointing to the wrong tuples in the heap.
Our table looks like:


```

           Table "matrix.state_groups_state"
   Column    |  Type  | Collation | Nullable | Default
-------------+--------+-----------+----------+---------
 state_group | bigint |           |          |
 room_id     | text   |           |          |
 type        | text   |           |          |
 state_key   | text   |           |          |
 event_id    | text   |           |          |
Indexes:
    "state_groups_state_room_id_idx" brin (room_id) WITH (pages_per_range='1')
    "state_groups_state_type_idx" btree (state_group, type, state_key), tablespace "postgres_second"
Triggers:
    check_state_groups_state_deletion_trigger AFTER DELETE ON state_groups_state DEFERRABLE INITIALLY DEFERRED FOR EACH ROW EXECUTE FUNCTION check_state_groups_state_deletion()
```
The symptoms we are noticing are that a DELETE or SELECT query includes rows that don't match the condition, as long as we issue a query that results in an Index Scan (not Index Only Scan):
For example, including `ctid` in the query is enough to make the planner use an Index Scan:
```
SELECT ctid, state_group FROM state_groups_state WHERE state_group = 483128098;
      ctid      | state_group
----------------+-------------
 (16669607,1)   |   483128098
 (424940858,20) |   963361875
 (16669606,53)  |   483128098
(3 rows)

```


But with an Index Only Scan:


```

SELECT state_group FROM state_groups_state WHERE state_group = 483128098;
 state_group
-------------
   483128098
   483128098
   483128098
(3 rows)

```
Since including `ctid` in the SELECT columns causes the query to use an Index Scan (fetching tuples from the heap), this inconsistency leads us to believe that our index and heap disagree.
Forcing a sequential scan with that same query returns only two rows matching that state group, which suggests that the index thinks there are more rows in the table than there actually are. (We do not believe anything can have deleted a row with state group 483128098.) Also interestingly, querying (with the index re-enabled) for 963361875 returns the same row as returned above, so the row is in the index twice.
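As a rough sketch of that comparison (the planner settings below are standard PostgreSQL options, but this is not a transcript of the exact session we ran):

```sql
BEGIN;

-- Discourage every index-based plan for this transaction only,
-- so the planner falls back to a sequential scan of the heap.
SET LOCAL enable_indexscan = off;
SET LOCAL enable_indexonlyscan = off;
SET LOCAL enable_bitmapscan = off;

-- Same query as above; the result now comes from the heap alone.
SELECT ctid, state_group
FROM matrix.state_groups_state
WHERE state_group = 483128098;

-- Nothing was modified; ending the transaction also discards the SET LOCALs.
ROLLBACK;
```

On a table of this size a sequential scan is slow, so this is only practical for a handful of sampled state groups.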
Another example state group (147961623) should only have a single row associated with it, and yet the index returns nearly 7000 rows (including the one we expect). The unexpected state groups are all in the range 794390760–794393085 (except one, 794411694), and also have ctids in the range (93454823,48) – (93455621,49). The fact that these are reasonably tight ranges feels suspicious. Note that the state group is a simple incrementing ID here.
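One way to summarise the unexpected rows for a sampled state group is sketched below. It assumes the inner query is planned as an Index Scan (selecting `ctid` has been enough for us), and it relies on `OFFSET 0` to stop the planner from merging the outer filter into the inner query. We have not checked this against every plan shape, so it is worth confirming the plan with `EXPLAIN` first.

```sql
-- Inner query: fetch via the index; the heap supplies the real state_group.
-- Outer query: keep only rows whose heap value disagrees with the key we asked for.
SELECT min(state_group) AS min_unexpected,
       max(state_group) AS max_unexpected,
       count(*)         AS unexpected_rows
FROM (
    SELECT ctid, state_group
    FROM matrix.state_groups_state
    WHERE state_group = 147961623
    OFFSET 0  -- optimisation fence: keeps the subquery from being flattened
) AS via_index
WHERE state_group <> 147961623;
```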
This table is quite large (about 6 TB) but we have sampled a few small ranges of it and found many instances of this type of corruption, in the first (approximate) half of the key range (0..561M out of 0..1034M).
For historical reasons, the table and the index are on different tablespaces, but the same filesystem.
We have sampled the table on our secondary server, and we see the same sort of corruption going on (though given the size of the data we don’t actually know if it's exactly the same).
One coincidence is that we started seeing the first symptoms of this around the same time as libicu was updated with a security patch. However, postgres hasn’t been restarted and doesn’t reference the new version in its process maps. Plus state groups are integers anyway. We also use the C locale, not ICU.
We’re currently running “pg_amcheck --index state_groups_state_type_idx --heapallindexed” on our secondary to see what it says, but we expect that to take a long time to complete.
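As far as we understand it, this is roughly equivalent to calling the `amcheck` extension's `bt_index_check` function directly. A sketch, assuming the extension can be created on the primary (a hot standby is read-only, so it cannot be created there):

```sql
-- Run once on the primary; the extension then replicates to the standby.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Second argument is heapallindexed: in addition to the structural checks,
-- verify that every heap tuple has a corresponding index entry.
-- As far as we can tell it does not check the reverse direction,
-- i.e. that every index entry points at a matching heap tuple.
SELECT bt_index_check('matrix.state_groups_state_type_idx'::regclass, true);
```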
Thankfully, we have database backups so hopefully we should be able to restore the data. However, any thoughts on how this happened or where to look next would be greatly appreciated. Thoughts on how to check our other indexes for corruption would also be very welcome.
Thanks,

Erik



Further details of our setup:

  * 2 servers in physical replication (one primary, one secondary as a
    hot standby)
      o both servers display the corruption
  * ECC RAM
  * 8 NVME SSD, raid10 (mdraid), LVM, ext4 filesystem.
      o smartctl and mdadm report healthy disks
  * Debian, postgres installed via apt.
  * Postgres version: PostgreSQL 14.11 (Debian 14.11-1.pgdg120+1) on
    x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
  * Kernel version 6.1.0-22-amd64, GLIBC 2.36-9+deb12u10
Copyright © 2025 Element - All rights reserved. The Element name, logo and device are registered trademarks of New Vector Ltd. Registered number: 10873661. Registered in England and Wales. Registered address: 10 Queen Street Place, London, United Kingdom, EC4R 1AG.
This message is intended for the addressee only and may contain private and confidential information or material which may be privileged. If this message has come to you in error please delete it immediately and do not copy it or show it to any other person.

