[ Another excruciatingly long post. Sorry. ]
Folks,
I think I may finally have the answer to the riddle of (at least one
kind of) the dreaded unexpected EOF. One caveat though: I don't have
(easy) access to the server to observe it directly; I'm going
strictly on truss/snoop output from the client. But everything I'm
seeing on the client bears out the following scenario.
Again this is with 2.4.6, daemonised, on Solaris 2.6. It looks as
though what is happening is roughly as follows:
server:0: daemonises and waits for connections
client:0: starts via start_socket_client()
server:0: accepts and forks server:1, returns to wait for more conns
server:1: starts via start_daemon()
client:0: sends module, drops into client_run()
server:1: gets module, drops into rsync_module()
server:1: drops into start_server()
server:1: recv_exclude_list()
client:0: send_exclude_list()
server:1: drops into do_server_sender()
client:0: recv_file_list()
server:1: send_file_list()
server:1: send_files()
client:0: get_local_name()
client:0: drops into do_recv()
client:0: delete_files()
client:0: forks client:1 as recv proc, client:0 becomes gen proc
client:0: generate_files()
client:1: recv_files()
... (and away we [should] go)
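For reference, my mental picture of the client:0/client:1 split in
do_recv() is roughly the sketch below. This is a paraphrase of the
structure, not the actual 2.4.6 code; the error_pipe name and details
are from memory and may be off.

    /* My paraphrase of the do_recv() split -- NOT the actual 2.4.6
     * code; the error_pipe name and details may be wrong. */
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int do_recv_sketch(int f_in, int f_out)
    {
        int error_pipe[2];
        pid_t pid;

        /* pipe the receiver uses to pass errors to the generator */
        if (pipe(error_pipe) < 0)
            exit(1);

        pid = fork();
        if (pid == 0) {
            /* client:1 -- the receiver */
            close(error_pipe[0]);
            /* recv_files(f_in, ...): first thing it does is block
             * in read_int() waiting for the -1 handshake */
            (void)f_in;
            _exit(0);
        }

        /* client:0 -- the generator */
        close(error_pipe[1]);
        /* io_start_buffering();
         * generate_files(f_out, ...);  output buffered from here on
         * io_flush();  eventually select()s on f_out AND error_pipe[0]
         */
        (void)f_out;
        return 0;
    }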
The problem appears to be this: if the file list is very extensive,
the time needed on the server to compile and transmit it, compounded
by the time the client needs to receive it, run delete_files(), and
run generate_files() (at least up to the point where an io_flush() is
needed), can exceed the io_timeout on the server, which is waiting
for a -1 handshake at the read_int() at the start of send_files().
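For anyone who hasn't looked at io.c lately: the shape of a read
guarded by io_timeout is just the usual select()-with-timeout idiom,
something like the following (illustrative only; rsync's own code
differs in detail):

    /* The usual select()-with-timeout idiom -- illustrative only,
     * not rsync's actual implementation.  Returns -1 on timeout
     * or error. */
    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    static ssize_t read_with_timeout(int fd, char *buf, size_t len,
                                     int io_timeout)
    {
        fd_set fds;
        struct timeval tv;

        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        tv.tv_sec = io_timeout;
        tv.tv_usec = 0;

        /* if the peer says nothing for io_timeout seconds, give up;
         * --timeout=0 means wait forever */
        if (select(fd + 1, &fds, NULL, NULL,
                   io_timeout ? &tv : NULL) <= 0)
            return -1;

        return read(fd, buf, len);
    }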
So the server times out. No error message would be printed because it
is both a server and a daemon (which disables the rprintf in
check_timeout()). Then server:1 calls exit_cleanup(RERR_TIMEOUT),
causing a socket close at the client. [N.B. this is all a guess on my
part since I can't directly monitor the server, but it seems
reasonable from an inspection of the code.]
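Put another way, I believe the server-side timeout path reduces to
something like the sketch below. This is my reconstruction of
check_timeout()/exit_cleanup(), not the real code; the condition, the
names, and the error code value are all guesses.

    /* My reconstruction of the server-side timeout path -- NOT the
     * real check_timeout(); names, the exact condition and the error
     * code value are guesses.  exit() stands in for exit_cleanup(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define RERR_TIMEOUT 30             /* value is a guess */

    static int am_server = 1, am_daemon = 1, io_timeout = 600;
    static time_t last_io;

    static void check_timeout_sketch(void)
    {
        time_t now = time(NULL);

        if (io_timeout && last_io && now - last_io >= io_timeout) {
            /* both server and daemon: stay silent ... */
            if (!(am_server && am_daemon))
                fprintf(stderr, "io timeout - exiting\n");
            /* ... but still bail out, closing the socket */
            exit(RERR_TIMEOUT);
        }
    }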
The socket close is noted immediately by client:1 when it arrives at
the start of recv_files(), waiting on a -1 handshake via read_int().
Thus, the first unexpected EOF message is generated (ever notice how
they always seem to appear in pairs?).
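The message itself is, I think, just the usual end-of-stream check
somewhere under read_int()/read_loop(): a read() that returns zero
bytes. Schematically (the names and the message text here are mine,
not necessarily rsync's):

    /* How a peer close turns into "unexpected EOF": a read() that
     * comes back with zero bytes.  Names and message text are mine,
     * not necessarily rsync's. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void read_loop_sketch(int fd, char *buf, size_t len)
    {
        while (len) {
            ssize_t n = read(fd, buf, len);
            if (n == 0) {
                /* the other end (server:1) has closed the socket */
                fprintf(stderr, "unexpected EOF\n");
                exit(1);        /* first of the pair of messages */
            }
            if (n < 0)
                exit(1);
            buf += n;
            len -= n;
        }
    }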
Meanwhile client:0 merrily churns through generate_files(), which if
almost everything is up to date, doesn't need to do very much I/O.
The few bytes that it does need to write are buffered in any case
because this part of the code is in a region where
io_start_buffering() is active. Thus, until the buffer becomes full
(or processing completes) and io_flush() is called, no actual calls
are made to writefd_unbuffered().
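To be concrete, the buffering I mean works roughly like the sketch
below (simplified; the real io.c differs in detail, and the buffer
size is made up):

    /* Simplified sketch of the generator's output buffering: bytes
     * pile up in a local buffer and only hit the wire when the buffer
     * fills or io_flush() is called.  Size and names are mine. */
    #include <string.h>
    #include <unistd.h>

    #define IO_BUFFER_SIZE 4096

    static char io_buffer[IO_BUFFER_SIZE];
    static size_t io_buffer_count;

    static void io_flush_sketch(int fd)
    {
        /* this is where writefd_unbuffered() -- and its select() --
         * would come in */
        if (io_buffer_count)
            (void)write(fd, io_buffer, io_buffer_count);
        io_buffer_count = 0;
    }

    static void buffered_write_sketch(int fd, const char *buf,
                                      size_t len)
    {
        while (len) {
            size_t room = IO_BUFFER_SIZE - io_buffer_count;
            size_t n = len < room ? len : room;

            memcpy(io_buffer + io_buffer_count, buf, n);
            io_buffer_count += n;
            buf += n;
            len -= n;

            if (io_buffer_count == IO_BUFFER_SIZE)
                io_flush_sketch(fd);    /* only now is real I/O done */
        }
    }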
Eventually, though, writefd_unbuffered() will be called. The last
recv_generator() log message will vary depending on how many files
need updates (i.e. how many bytes are written to the buffer). The
call to writefd_unbuffered() will trigger a call to select() with both
a write mask (with f_out) and also a read mask (with io_error_fd).
Since the sibling client:1 has exited (closing its end of the error
pipe) there will be a pending EOF on io_error_fd. This will trigger a
call to read_error_fd(), which calls read_loop(), which calls
read_unbuffered() on the io_error_fd. Which is EOF. Hence unexpected
EOF number two. [FYI, I've captured a stack trace of this situation.]
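The crucial detail is that select() reports the closed error pipe as
readable before any write is even attempted; schematically (again a
sketch with my own names, not the real writefd_unbuffered()):

    /* Sketch of the select() inside writefd_unbuffered(): we want to
     * write to f_out, but we also watch io_error_fd.  The exited
     * receiver leaves EOF pending on the pipe, so the read mask fires
     * first and we never get to the write.  Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/select.h>
    #include <unistd.h>

    static void writefd_unbuffered_sketch(int f_out, int io_error_fd,
                                          const char *buf, size_t len)
    {
        fd_set r_fds, w_fds;
        int maxfd = (f_out > io_error_fd ? f_out : io_error_fd) + 1;
        char c;

        FD_ZERO(&r_fds);
        FD_ZERO(&w_fds);
        FD_SET(io_error_fd, &r_fds);    /* messages from client:1 */
        FD_SET(f_out, &w_fds);          /* room to write to server */

        if (select(maxfd, &r_fds, &w_fds, NULL, NULL) <= 0)
            exit(1);

        if (FD_ISSET(io_error_fd, &r_fds)) {
            /* read_error_fd() -> read_loop() -> read_unbuffered() */
            if (read(io_error_fd, &c, 1) == 0) {
                /* pipe closed by the dead receiver: EOF number two */
                fprintf(stderr, "unexpected EOF\n");
                exit(1);
            }
        }

        if (FD_ISSET(f_out, &w_fds))
            (void)write(f_out, buf, len);   /* not reached in the
                                             * failure case */
    }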
I think that the reason SIGPIPE is avoided is that nothing is actually
written to the (closed) socket because everything is being buffered.
And when io_flush() is called, it handles the read mask first and
never actually gets around to writing anything.
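That is consistent with plain socket semantics: SIGPIPE (or EPIPE)
only shows up once a write() actually lands on the closed socket. A
trivial standalone test, nothing rsync-specific, demonstrates the
point:

    /* Plain POSIX, nothing rsync-specific: a socket whose peer has
     * gone away is perfectly quiet until you write() to it; only then
     * do you get SIGPIPE (or EPIPE if SIGPIPE is ignored). */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];

        signal(SIGPIPE, SIG_IGN);   /* show EPIPE instead of dying */

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
            return 1;
        close(sv[1]);               /* the "server" goes away */

        sleep(1);                   /* no write: no signal, no error */

        if (write(sv[0], "x", 1) < 0 && errno == EPIPE)
            printf("EPIPE only once we actually wrote something\n");
        return 0;
    }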
This overall sequence of events is also supported by the snoop output.
(snoop is a Solaris utility for monitoring IP traffic, very similar
to tcpdump et al.) The snoop output shows
the send_file_list() data, followed by an io_timeout duration silence,
followed by an EOF indication (i.e. the server has closed the
connection). The netstat output indicates CLOSE_WAIT.
This sequence also seems to explain the apparently random nature of
the problem: why some don't have it at all (they probably have
--timeout=0, the default); why it seems worse with large file sets
(a bigger window of opportunity) or on heavily loaded filesystems (it
takes longer to compute the file list). It also explains why using
more -v options seems to trigger it: it takes progressively longer to
output all of those debugging messages (especially to a tty).
Also, I haven't looked at the rsh/ssh code paths but I assume that
they ultimately use do_recv() as well and will suffer from the same
problem.
On the other hand, it doesn't really seem to explain the "hangs" or
"stalls" that some seem to be reporting. That's still a bit of a
mystery to me. Nor does it explain the "transfer interrupted
(code=11)" in the server logs - where did the associated error message
go? Is io_multiplex_write() somehow eating it? Or is it a red
herring of some kind?
Anyway, I'll be testing this out with a test client/server pair using
a small fileset and a very short timeout as soon as I can get it set
up. (I don't have the disk or network resources to duplicate the 100G
problem locally and, regrettably, --timeout is the only option
unconditionally overridden at the server end, meaning that I can't just
force it from my end.)
I just wanted to get this posted so that I can start the day tomorrow
with the "overnight" reaction from those in other time zones. In
particular, is anyone who is using --timeout=0 (the default) on the
server nonetheless seeing unexpected EOF?
Regards,
Neil
--
Neil Schellenberger | Voice : (613) 599-2300 ext. 8445
CrossKeys Systems Corporation | Fax : (613) 599-2330
350 Terry Fox Drive | E-Mail: [EMAIL PROTECTED]
Kanata, Ont., Canada, K2K 2W5 | URL : http://www.crosskeys.com/
+ Greg Moore (1975-1999), Gentleman racer and great Canadian +