On Fri, Nov 19, 2021 at 09:18:23PM -0800, Noah Misch wrote:
> On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
> > On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> > > Noah Misch <n...@leadboat.com> writes:
> > > > Each of the three failures happened on a sparc64 Debian+gcc machine.  I 
> > > > had
> > > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > > > without reproducing this.
> > 
> > > #                   'pgbench: error: client 0 script 1 aborted in command 
> > > 4 query 0: ERROR:  could not read two-phase state from WAL at 0/159EF88: 
> > > unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 
> > > 5890048
> > > [1] 
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
> > 
> > Two others:
> > ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid 
> > record length at 0/16F1850: wanted 24, got 0
> > -- 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
> > ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect 
> > resource manager data checksum in record at 0/1668020
> > -- 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52

> > I don't have a great theory, but here are candidates to examine next:
> > 
> > - Run with wal_debug=on to confirm logged write location matches read 
> > location.
> > - Run 
> > "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
> >   pg_waldump -s 0/01000000" at the end of the test.
> > - Dump WAL page binary image at the point of failure.
> > - Log which branches in XLogReadRecord() are taken.
> 
> Tom Turelinckx, are you able to provide remote access to kittiwake or
> tadarida?  I'd use it to attempt the above things.  All else being equal,
> kittiwake is more relevant since it's still supported upstream.

Thanks for setting up access.  Summary: this kernel has a bug in I/O syscalls.
How practical is it to update that kernel?  (Userland differs across these
animals, but the kernel does not.)  The kernel on buildfarm member thorntail
doesn't exhibit the bug.

For specifics of the kernel bug, see the attached test program.  In brief, the
bug arises if one process is write()ing or pwrite()ing a file at about the
same time that another process is read()ing or pread()ing the same.  POSIX
says the reader should see the data as it existed before the write or the
newly-written data.  On this kernel, the reader can see zeros instead.  That
leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The writers
aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
the kernel bug yield the failure.  We could opt to work around that by writing
only the not-already-written portion of a WAL block, but I doubt that's
worthwhile unless it happens to be a performance win anyway.

Separately, while I don't know of relevance to PostgreSQL, I was interested to
see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.

> The setup of your buildfarm animals provides a clue.  I understand kittiwake
> is the same as ibisbill except for build options, and tadarida is the same as
> mussurana except for build options.  ibisbill and mussurana haven't failed, so
> one of these is likely needed to reproduce:
> 
>   absence of --enable-cassert
>   CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
>   CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
>   LDFLAGS='-Wl,-z,relro -Wl,-z,now'

That was a red herring.  ibisbill and mussurana don't use --with-tap-tests.
Adding --with-tap-tests has been enough to make their configurations witness
the same kinds of failures.

nm
/*
 * Stress-test pread(), pwrite(), read(), and write() to detect a few problems
 * with their handling of regular files:
 *
 * - Lack of atomicity.
 *   
https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
 *   requires atomicity.  (An implementation may always return <= 1; if it
 *   chooses to return higher values, it must maintain atomicity.)
 *
 * - Transiently making readers see zeros when they read concurrently with a
 *   write, even if the file had no zero at that offset before or after the
 *   write.
 *
 * By default, this program will print "mismatch" messages if pwrite() is
 * non-atomic.  Build with other options to test other behaviors:
 * -DCHANGE_CONTENT=0 tests the zeros bug instead of plain atomicity
 * -DUSE_SEEK=1 tests write()/read() instead of pwrite()/pread()
 * -DREOPEN=0 reuses the same file descriptor across iterations
 * -DXLOG_BLCKSZ=32 tests a different byte count
 *    high values may require "ulimit -Ss" changes
 *
 *
 * Observed behaviors:
 *
 * Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009):
 * pwrite/pread is non-atomic if count>16 (no -D switches)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 *
 * Linux 4.9.0-13-sparc64-smp (Debian):
 * pwrite/pread is non-atomic if count>4 (no -D switches)
 * write/read is non-atomic if count>4 (-DUSE_SEEK)
 * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
 * pwrite/pread has zeros bug for count>127 (-DCHANGE_CONTENT=0)
 * pwrite/pread w/o REOPEN also has zeros bug for count>127 (-DCHANGE_CONTENT=0 
-DREOPEN=0)
 * write/read has zeros bug for count>127 (-DUSE_SEEK -DCHANGE_CONTENT=0)
 * write/read w/o REOPEN is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 
-DREOPEN=0)
 *
 * Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
 * pwrite/pread is atomic (-DXLOG_BLCKSZ=8192000)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DCHANGE_CONTENT=0)
 */

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Size of block written in a loop and read in another process's loop. */
#ifndef XLOG_BLCKSZ
#define XLOG_BLCKSZ 8192
#endif
/*
 * Offset of that block within file.  Probably not important.  This was a
 * value seen in a PostgreSQL failure.
 */
#ifndef OFFSET
#define OFFSET 0x72E0A0
#endif
/* Replace pwrite() w/ lseek()+write()?  Likewise pread(). */
#ifndef USE_SEEK
#define USE_SEEK 0
#endif
/*
 * Write a different payload to the block each time?  Disable this to detect
 * the transient-zeros bug.  Keep enabled to detect mere lack of atomicity.
 */
#ifndef CHANGE_CONTENT
#define CHANGE_CONTENT 1
#endif
/* Close and reopen the file before each read or write operation? */
#ifndef REOPEN
#define REOPEN 1
#endif


#if USE_SEEK
static ssize_t
my_pread(int fd, void *buf, size_t count, off_t offset)
{
        if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
                perror("lseek");
        return read(fd, buf, count);
}
#else
static ssize_t
my_pread(int fd, void *buf, size_t count, off_t offset)
{
        return pread(fd, buf, count, offset);
}
#endif

#if USE_SEEK
static ssize_t
my_pwrite(int fd, const void *buf, size_t count, off_t offset)
{
        if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
                perror("lseek");
        return write(fd, buf, count);
}
#else
static ssize_t
my_pwrite(int fd, const void *buf, size_t count, off_t offset)
{
        return pwrite(fd, buf, count, offset);
}
#endif

#if REOPEN
static int
reopen(void)
{
        int                     fd = open("io-rectitude.tmp", O_RDWR);

        if (fd == -1)
                perror("open");
        return fd;
}
#endif

static void
writer(int fd)
{
        unsigned char byte_val = 1;

        for (;;)
        {
                char            buf[XLOG_BLCKSZ];

                memset(buf, byte_val, sizeof(buf));
#if REOPEN
                close(fd);
                fd = reopen();
#endif
                errno = 0;
                if (my_pwrite(fd, buf, sizeof(buf), OFFSET) != sizeof(buf))
                        perror("pwrite");       /* not a defect */
#if CHANGE_CONTENT
                byte_val++;
#endif
        }
}

static void
reader(int fd)
{
        for (;;)
        {
                unsigned char buf[XLOG_BLCKSZ];
                int                     i;
#if REOPEN
                close(fd);
                fd = reopen();
#endif
                errno = 0;
                if (my_pread(fd, buf, sizeof(buf), OFFSET) != sizeof(buf))
                        perror("pread");        /* not a defect */
                else
                {
                        /* writer() installs the same value in every position. 
*/
                        for (i = 0; i < sizeof(buf); ++i)
                        {
                                if (buf[i] != buf[0])
                                {
                                        fprintf(stderr, "mismatch at %d: %d vs. 
%d\n",
                                                        i, buf[i], buf[0]);
                                        break;
                                }
                        }
                }
        }
}

int
main(int argc, char **argv)
{
        int                     fd;
        unsigned char buf[OFFSET + XLOG_BLCKSZ];
        pid_t           pid;

        fd = open("io-rectitude.tmp", O_RDWR | O_CREAT | O_TRUNC, 0666);
        if (fd == -1)
        {
                perror("open");
                return 1;
        }
        /* initialize file contents */
        memset(buf, 1, sizeof(buf));
        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
        {
                perror("write");
                return 1;
        }

        pid = fork();
        if (pid == -1)
        {
                perror("fork");
                return 1;
        }
        else if (pid == 0)
                writer(fd);
        else
                reader(fd);

        return 0;                                       /* unreachable */
}

Reply via email to