Hi, a quick update:
- We have discovered that the corruption was present from before libicu
update.
- We ran `pg_amcheck --index state_groups_state_type_idx
--heapallindexed matrix`, which returned nothing
- We believe that means that (and matches what we see sampling) the
index has gained extra entries, i.e. that for a given state group it
does return all the relevant rows in the table /plus/ extra rows.
We are also seeing old state groups starting to point at rows that have
only just been inserted. For example, querying for 353864583 on the
primary it returns that row plus four rows that have been inserted
today, but on the backup from last week an index only scan for 353864583
only returns one row. This makes it feel like the corruption is ongoing?
Nothing should have modified that state group in the interim (they are
generally immutable).
This naively feels like when inserting a new row we sometimes add the
row to the index twice: once pointing from the correct state group to
the new row, and once from an old state group to the new row?
Thanks,
Erik
On 03/07/2025 18:07, Erik Johnston wrote:
Hello,
We're looking into a problem with our application and have tracked it
down to index corruption, whereby we have many index rows pointing to
the wrong tuples in the heap.
Our table looks like:
```
Table "matrix.state_groups_state"
Column | Type | Collation | Nullable | Default
-------------+--------+-----------+----------+---------
state_group | bigint | | |
room_id | text | | |
type | text | | |
state_key | text | | |
event_id | text | | |
Indexes:
"state_groups_state_room_id_idx" brin (room_id) WITH
(pages_per_range='1')
"state_groups_state_type_idx" btree (state_group, type,
state_key), tablespace "postgres_second"
Triggers:
check_state_groups_state_deletion_trigger AFTER DELETE ON
state_groups_state DEFERRABLE INITIALLY DEFERRED FOR EACH ROW EXECUTE
FUNCTION check_state_groups_state_deletion()
```
The symptoms we are noticing are that a DELETE or SELECT query
includes rows that don't match the condition, as long as we issue a
query that results in an Index Scan (not Index Only Scan):
For example, including `ctid` in the query is enough to make the
planner use an Index Scan:
```
SELECT ctid, state_group FROM state_groups_state WHERE state_group =
483128098;
ctid | state_group
----------------+-------------
(16669607,1) | 483128098
(424940858,20) | 963361875
(16669606,53) | 483128098
(3 rows)
```
But with an Index Only Scan:
```
SELECT state_group FROM state_groups_state WHERE state_group = 483128098;
state_group
-------------
483128098
483128098
483128098
(3 rows)
```
Since including `ctid` in the SELECT columns causes the query to use
an Index Scan (fetching tuples from the heap), this inconsistency
leads us to believe that our index and heap disagree.
Forcing a sequential scan with that same query only returns two rows
matching that state group, which suggests that the index thinks there
are more rows in the table than there actually are. (We do not believe
anything can have deleted a row with state group 483128098). Also
interestingly, querying (with the index re-enabled) for 963361875
returns the same row as returned above, so the row is in the index twice.
Another example state group (147961623) should only have a single row
associated with it, and yet the index returns nearly 7000 rows
(including the one we expect). The unexpected state groups are all in
the range 794390760–794393085 (except one in 794411694), and also have
ctids in range (93454823,48) – (93455621,49). The fact that these are
reasonably tight ranges feels suspicious. Note that the state group is
a simple incrementing ID here.
This table is quite large (about 6 TB) but we have sampled a few small
ranges of it and found many instances of this type of corruption, in
the first (approximate) half of the key range (0..561M out of 0..1034M).
For historical reasons, the table and the index are on different
tablespaces, but the same filesystem.
We have sampled the table on our secondary server, and we see the same
sort of corruption going on (though given the size of the data we
don’t actually know if it's exactly the same).
One coincidence is that we started seeing the first symptoms of this
around the same time as libicu was updated with a security patch.
However, postgres hasn’t been restarted and doesn’t reference the new
version in its process maps. Plus state groups are integers anyway. We
also use the C locale, not ICU.
We’re currently running “pg_amcheck --index
state_groups_state_type_idx --heapallindexed” on our secondary to see
what it says, but we expect that to take a long time to complete.
Thankfully, we have database backups so hopefullywe should be able to
restore the data. However, any thoughts on how this happened or where
to look next would be greatly appreciated. Thoughts on how to check
our other indexes for corruption would also be very welcome.
Thanks,
Erik
Further details of our setup:
* 2 servers in physical replication (one primary, one secondary as a
hot standby)
o both servers display the corruption
* ECC RAM
* 8 NVME SSD, raid10 (mdraid), LVM, ext4 filesystem.
o smartctl and mdadm report healthy disks
* Debian, postgres installed via apt.
* Postgres version: PostgreSQL 14.11 (Debian 14.11-1.pgdg120+1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
* Kernel version 6.1.0-22-amd64, GLIBC 2.36-9+deb12u10
Copyright © 2025 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered
number: 10873661. Registered in England and Wales. Registered address:
10 Queen Street Place, London, United Kingdom, EC4R 1AG.
This message is intended for the addressee only and may contain
private and confidential information or material which may be
privileged. If this message has come to you in error please delete it
immediately and do not copy it or show it to any other person.
--
Copyright © 2025 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered number:
10873661. Registered in England and Wales. Registered address: 10 Queen
Street Place, London, United Kingdom, EC4R 1AG.
This message is intended
for the addressee only and may contain private and confidential information
or material which may be privileged. If this message has come to you in
error please delete it immediately and do not copy it or show it to any
other person.