Hi Hackers,

Attachment: postgresql-writeall.patch
Description: Binary data

I run a number of PostgreSQL installations on NFS, and on the whole I find this 
to be very reliable.  I have, however, run into a few issues when there are 
concurrent writes from multiple processes.

I see errors such as the following:

2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: 
user=postgres database=postgres
2011-07-31 22:13:35 EST    ERROR:  could not write block 1 of relation 
global/2671: wrote only 4096 of 8192 bytes
2011-07-31 22:13:35 EST    HINT:  Check free disk space.
2011-07-31 22:13:35 EST    CONTEXT:  writing block 1 of relation global/2671
2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: 
host=[local]

I have also seen similar errors coming out of the WAL writer; however, they 
are raised at PANIC level, which is a little more drastic.

After spending some time with debug logging turned on and even more time 
staring at strace, I believe this occurs when one process is writing to a data 
file and receives a SIGINT from another process, e.g. (these logs are from 
another, similar run):

[pid  1804] <... fsync resumed> )       = 0
[pid 10198] kill(1804, SIGINT <unfinished ...>
[pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
[pid  1804] write(3, 
"c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 
<unfinished ...>
[pid 10198] <... kill resumed> )        = 0
[pid  1804] <... write resumed> )       = 4096
[pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
[pid  1804] rt_sigreturn(0x2)           = 4096
[pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
[pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
[pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
[pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid  1804] tgkill(1804, 1804, SIGABRT) = 0
[pid  1802] read(5,  <unfinished ...>
[pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
Process 1804 detached

After finding this, I came up with the following test case which easily 
replicated our issue:

#!/bin/bash
# Repeatedly create a role and a database owned by it until a PANIC
# shows up in the server log.

name=$1
number=1
while true; do
  /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
  /usr/bin/createdb -E UNICODE -O "$name$number" "$name$number"
  # Stop as soon as the server has logged a PANIC.
  if grep -q PANIC /data/postgresql/data/pg_log/*; then
    exit
  fi
  number=$((number + 1))
done

When I run a single copy of this script I have no issues; however, when I start 
up a few more copies to hit the DB simultaneously, it crashes quite quickly, 
usually within 20 or 30 seconds.

After looking through the code I found that when postgres calls write() it 
doesn't retry.  To address the PANIC in the WAL writer I set the sync method 
to o_sync, which solved the issue in that part of the code; however, I was 
still seeing failures in other areas of the code (such as the FileWrite 
function).  Following this, I spoke to an NFS guru who pointed out that writes 
under Linux are not guaranteed to complete unless you open the file with 
O_SYNC or similar.  I had a look in the libc docs and found this:

http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html

----
The write function writes up to size bytes from buffer to the file with 
descriptor filedes. The data in buffer is not necessarily a character string 
and a null character is output like any other character.

The return value is the number of bytes actually written. This may be size, but 
can always be smaller. Your program should always call write in a loop, 
iterating until all the data is written.
----

After finding this, I checked a number of other pieces of software that we see 
no issues with on NFS (such as the JVM) for their usage of write().  I 
confirmed they write in a while loop and set about patching the postgres source.
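
For readers unfamiliar with the pattern, the kind of loop the libc manual 
describes might look roughly like the sketch below.  This is not the code from 
the attached patch: write_all is a hypothetical helper name, and treating a 
zero-byte write as ENOSPC is an assumption made only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_all (hypothetical helper, not taken from the patch):
 * keep calling write() until all len bytes have been written or a
 * real error occurs.  Returns len on success, -1 on failure with
 * errno set.
 */
ssize_t
write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t      remaining = len;

    while (remaining > 0)
    {
        ssize_t n = write(fd, p, remaining);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;          /* genuine error; errno was set by write() */
        }
        if (n == 0)
        {
            /* No progress and no error: assume the device is full. */
            errno = ENOSPC;
            return -1;
        }

        /* Short write: advance past what was written and try again. */
        p += n;
        remaining -= (size_t) n;
    }

    return (ssize_t) len;
}
```

With a loop like this, the short write in the strace output above (4096 of 
32768 bytes) would be followed by a second write() for the remaining bytes 
rather than being reported as an error.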

I have made this patch against 8.4.8 and confirmed that it fixes the issue we 
see on our systems.  I have also checked that make check still passes. 

As my C is terrible, I would welcome any comments on the implementation of this 
patch.

Best regards,

George





 
