Hi hackers,

In recently, I discovered a postgres bug, and I hope I can ask you for the best 
solution.
The problem is as follows:

postgres=# explain analyze select * from xxx where a=500;
ERROR: could not open relation with OID 25989
The structure of my table is as follows:
postgres=# \d xxx
 Table "public.xxx"
 Column | Type  | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a | integer |  | |
 b | text |  | |

postgres=# select count(*) from xxx;
 count 
--------
 800000
(1 row)

postgres=# select * from xxx limit 3;

 a | b
---+----------------------------------
 1 | 203c51477570aa517cfa317155dcc52c
 2 | b58da31baa5c78fee4642fd398bd5909
 3 | c7c475bf0a3ca2fc2afc4812a4d44c58

I opened the log file and saw that the index of table xxx was deleted,

postgres=# drop index CONCURRENTLY idx_xxx ;
DROP INDEX

In order to reproduce this bug, I created and deleted the index again and again 
on the master.
What is hard to understand is that this bug cannot be repeated 100%.
I wrote a script that loops over the master and runs the following two 
sentences.

postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
...
...
...
At the same time, I started two clients in the standby, 
respectively execute the following sql on the table xxx:

postgres=# explain analyze select * from xxx where a=500;
postgres=# \watch 0.1

After a few minutes, the bug will appear.

I finally confirmed my guess, I used an index scan in the standby query,
but deleted the index on the master at the same time.
Curious, I went to read the source code of Postgres. I found that
 regular DROP INDEX commands imposes a AccessExclusiveLock on the table,
 while drop index concurrently commands only used ShareUpdateExclusiveLock.

As we all know, only AccessExclusiveLock and  AccessShareLock ,a select's  lock 
,
are mutually exclusive, and AccessShareLock can't block 
ShareUpdateExclusiveLock.
This is very weird and not desirable.

This is of course, developers must have thought of this, so we can see in the 
source 
code, before the drop index concurrently, will wait for all transactions using 
this
 index to wait for detection.

 But this only exists on the master, my query is executed on the standby.
 I use the pg_waldump tool to parse the wal file, and analyze the stantup 
process,
 I found that there is no similar operation on the standby, so it will appear 
that 
 when I execute the query on the standby, the index will be deleted by others.


I think this is a bug that will affect the user's experience. we need to fix it.
 I have imagined that the logic that detects the query transaction and
  waits for it to end is implemented on the standby,but this may increase the
 log application delay and the delay is exacerbated that cause the master and 
backup. 
This is not desirable if the query concurrency is large.

All in all, I expect that you can provide a solution that can use drop index 
concurrently 
without affecting the master-slave delay.

Sincerely look forward to your reply and thanks.

adger



Reply via email to