On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <n...@leadboat.com> wrote:
>
> Further testing showed it was a file location problem, not a deletion problem.
> The worker tried to open
> base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
> were the files actually existing:
>
> [nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find
> src/test/subscription/tmp_check -name '*sharedfileset*')
> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
> total 408
> drwx------ 2 nm usr 256 Dec 08 03:20 .
> drwx------ 4 nm usr 256 Dec 08 03:20 ..
> -rw------- 1 nm usr 207806 Dec 08 03:20 16393-510.changes.0
>
> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
> total 0
> drwx------ 2 nm usr 256 Dec 08 03:20 .
> drwx------ 4 nm usr 256 Dec 08 03:20 ..
> -rw------- 1 nm usr 0 Dec 08 03:20 16393-511.changes.0
>
> > > I have executed "make check" in the loop with only this file. I have
> > > repeated it 5000 times but no failure, I am wondering shall we try to
> > > execute in the same machine in a loop where it failed once?
> >
> > Yes, that might help. Noah, would it be possible for you to try that
>
> The problem is xidhash using strcmp() to compare keys; it needs memcmp(). For
> this to matter, xidhash must contain more than one element. Existing tests
> rarely exercise the multi-element scenario. Under heavy load, on this system,
> the test publisher can have two active transactions at once, in which case it
> does exercise multi-element xidhash. (The publisher is sensitive to timing,
> but the subscriber is not; once WAL contains interleaved records of two XIDs,
> the subscriber fails every time.) This would be much harder to reproduce on a
> little-endian system, where strcmp(&xid, &xid_plus_one)!=0. On big-endian,
> every small XID has zero in the first octet; they all look like empty strings.
>
Your analysis is correct.

> The attached patch has the one-line fix and some test suite changes that make
> this reproduce frequently on any big-endian system. I'm currently planning to
> drop the test suite changes from the commit, but I could keep them if folks
> like them. (They'd need more comments and timeout handling.)
>

I think it is better to keep this test which can always test multiple
streams on the subscriber.

Thanks for working on this.

--
With Regards,
Amit Kapila.