>>>>> "Dave" == Dave Dykstra <[EMAIL PROTECTED]> writes:
Dave> On Mon, Oct 16, 2000 at 08:30:14PM +0200, Janåke Rönnblom
Dave> wrote:
>> On 16 Oct 2000, at 10:32, Dave Dykstra wrote:
>> >
>> > Those of you who are getting a stuck rsync 2.4.6: are you
>> > running ssh that has not been recompiled with
>>
>> I can forget about running rsync 2.4.x, right?
Dave> [#ifdef HAVE_CYGWIN code showing use of pipes deleted]
Dave> I don't know. The characteristics of NT+cygwin may be very
Dave> different from Unixes. I don't even know why it makes a
Dave> difference, just that Andrew Tridgell said it was needed.
Dave> He had a big buffering hack in rsync 2.3.2 as a workaround
Dave> for SSH hangs and he took it out in 2.4.0.
Well, I'm somewhat relieved to know that I'm not the only one who is
still having weird timeout problems with 2.4.6. Since the issue
seemed to have died down after the 2.4.6 release, I assumed that the
problem was some boneheaded thing that I was doing. (Well, I suppose
that could still be the problem....)
I'm still seeing "unexpected EOF in read_timeout" on the client side.
Here's the basic config:
rsync 2.4.6 (plus a couple of local patches, prev. posted here by me)
gcc 2.95.2 with "-g -O", per default configure
GNU binutils (assembler/linker) 2.9.1
build system: SunOS 5.5.1 Generic_103640-29 sun4u sparc SUNW,Ultra-5_10
server: SunOS 5.6 sun4u sparc SUNW,Ultra-5_10 (unsure about generic)
client: SunOS 5.6 Generic_105181-10 sun4u sparc SUNW,Ultra-250
daemon mode server (--daemon)
using "secrets file" style auth
I have about 100G to sync between these systems over a relatively low
bandwidth link. This 100G represents about 1M files spread (unevenly)
over about a dozen or so modules. I was trying all modules nightly,
all in parallel (since the expected per-session data transfer is low
but the diff computation latency is expected to be high; lots of swap
available).
The symptoms: random timeouts (various modules, various files).
So the other day I tried running each module serially. Miraculously,
all but one module now works consistently, and that one module now
consistently fails.
Dave> Also, for those of you reporting hangs, here is what Andrew
Dave> said when he released 2.4.3:
Dave> Finally a plea. When reporting a freeze _please_ include
Dave> the following or we can't help:
Dave> - what is the state of the send/receive queues shown with
Dave> netstat on the two ends.
Dave> - what system call is each of the 3 processes stuck in. Use
Dave> truss on solaris or strace on Linux.
This is the main reason that I've had to hold off with a bug report:
the server machine is at a remote site, under the control of another
corporate entity, and is very difficult for me to get shell access to.
I keep "meaning" to make the necessary security arrangements to visit,
but I have many, many other fish to fry.
I've limited myself to spelunking through the source code trying to
see if I can figure out what's going on purely by inspection, but so
far I haven't really had the time to look at this properly.
I hope to get to this before Christmas, but don't hold your breath :-(
Here's what (I think) I've managed to glean so far:
1) Serious re-entrancy-related signal handling problem for SIGUSR1.
It is possible for A LOT of non-trivial behaviour to occur in the
context of the signal handler upon the receipt of this signal.
Specifically, exit_cleanup() can call various I/O functions
(e.g. snprintf(), syslog(), etc.), application functions
(write_unbuffered(), do_unlink()), and libc functions (free(), via
do_unlink()). Things are even worse with keep_partial enabled
(e.g. finish_transfer()).
Why is this a problem? Imagine what happens if the signal was
delivered, say, while the "main" program was inside of malloc().
Or already in the middle of write_unbuffered(). Ouch.
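To make the hazard concrete, here's a minimal stand-alone sketch (my
own illustration, not rsync code) of the pattern in question:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static char *partial;            /* state shared with the handler */

    static void cleanup_handler(int sig)
    {
        /* Both of these are async-signal-UNsafe: */
        fprintf(stderr, "interrupted; cleaning up\n"); /* stdio takes locks */
        free(partial);               /* re-enters the malloc arena */
        _exit(1);
    }

    int main(void)
    {
        signal(SIGUSR1, cleanup_handler);
        for (;;) {
            partial = malloc(4096);  /* signal may land mid-malloc ... */
            free(partial);           /* ... or mid-free: heap corruption */
        }
    }

If SIGUSR1 is delivered while main() happens to be inside malloc() or
free(), the handler's free() walks a half-updated heap, and the crash
typically shows up much later, at some unrelated allocation.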
<TEDIOUS class="lecture">
The problem is that on many (virtually all?) unix variants, libc
(and malloc/free in particular) is not re-entrant. For example, on
Solaris it is only legal to call functions marked Async-Signal-Safe
from signal handlers; very, very few functions are so marked (and
they tend to be trivial ones like getpid()).
Now one does often see programmers "getting away with it" because
crashes will be infrequent and apparently random. Also, most
signal handlers tend to be of the "shutdown" variety and,
therefore, fatal failures often go unnoticed. ("Hey it stopped.
What more do you want?")
My advice to engineers here, in my role as Senior Annoying Tiresome
Guy is to utterly avoid "classic" signal handlers if at all
possible (!). If non-trivial behaviour is required in response to
a signal, try to arrange the program architecture around a work
loop which (among its other duties) tests a flag (set by the
signal handler as its only action), as sketched below.
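A minimal sketch of what I mean (illustrative only; the names are
mine, not rsync's):

    #include <signal.h>
    #include <stdlib.h>

    static volatile sig_atomic_t got_sigusr1 = 0;

    static void note_signal(int sig)
    {
        got_sigusr1 = 1;             /* the handler's ONLY action */
    }

    int main(void)
    {
        signal(SIGUSR1, note_signal);
        for (;;) {
            if (got_sigusr1) {
                /* Safe: ordinary program context; call anything here. */
                exit(0);
            }
            /* ... the loop's usual work goes here, so the flag is
             * re-checked once per iteration rather than busy-waited. */
        }
    }

volatile sig_atomic_t is the one type the C standard guarantees can
be written from a handler and read from the main flow of control.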
With the advent of "official" Posix semantics for signal service in
threads, multi-threaded programs can get away with masking all
signals in all threads, dedicating the "main()" thread to serially
handle signals using sigpause() or somesuch, and doing the real
work in other threads (leaving the inter-thread communication as an
exercise for the student).
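Something along these lines (a sketch under POSIX threads; I've used
sigwait() rather than sigpause(), and everything other than the POSIX
calls themselves is illustrative):

    #include <pthread.h>
    #include <signal.h>
    #include <stdlib.h>

    static void *worker(void *arg)
    {
        /* Real work happens here; SIGUSR1 can never interrupt it,
         * because the signal mask is inherited from main(). */
        return NULL;
    }

    int main(void)
    {
        sigset_t set;
        int sig;
        pthread_t tid;

        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        pthread_sigmask(SIG_BLOCK, &set, NULL); /* before any pthread_create */
        pthread_create(&tid, NULL, worker, NULL);

        for (;;) {
            sigwait(&set, &sig);     /* blocks until SIGUSR1 arrives */
            if (sig == SIGUSR1)
                exit(0);             /* handled serially, in ordinary
                                      * thread context, not a handler */
        }
    }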
</TEDIOUS>
But I doubt that this is the actual timeout problem itself, and I
haven't time to figure out how to re-architect around it so I'll
just plunk it on the table here and leave it for the experts to
ruminate upon.
2) Tends to blow chunks under load.
Running all of the modules at once seems to be a big problem. I
can't make any reasonable statements about memory/swap use until I
can get physical access to the server machine, but I have no
particular reason from the syslog on the server to suspect
out-of-memory.
Nonetheless, the one module that still fails consistently is
probably the largest in total number of files (around 220k files).
It may also be the largest in terms of file space (don't have a
good estimate on hand).
3) Server actually appears to be exiting relatively "cleanly".
I always seem to see the following kind of neighbouring syslogs:
rsyncd[750]: wrote 432468 bytes read 1325 bytes total size 2021960482
rsyncd[750]: transfer interrupted (code 11) at main.c(272)
The fact that we see the "total size" message suggests to me that
we got at least as far as the log_exit() call at the start of
report(). The fact that I don't ever see the requested statistics
at the client end suggests that the second message is generated by
an io_error sometime before/during the stats transmission. The fact
that we made it as far as main:272 suggests a "clean" exit with an
io_error=1 (remapped to code=RERR_FILEIO by _exit_cleanup) during
stats transmission. Unfortunately, I don't see any evidence of the
rprintf() messages that seem to always be linked with setting
io_error=1. Mystifying.
Note that I do not yet have hard proof that this is what's
happening. Until I can get a truss, snoop, and pstack output from
the server running with -vvvvvvvv, I'm basically working without a
net. In fact, I can't even easily correlate the server and client
logs because the time bases are not necessarily in sync. Sigh.
4) Compiler optimiser bug, sparc architecture?
Some autoconf-based packages are currently being released with
auto-detection of gcc 2.95 and are forcing off optimisation. From
a quick investigation I have been unable to determine exactly what
bug they are trying to work around (or indeed whether this is
simply hysteria caused by the aliasing-related optimisations being
applied to code that was incorrect to start with).
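For what it's worth, the usual poster child for code that the new
aliasing optimisations are entitled to break looks something like
this (my example, not from rsync):

    #include <stdio.h>

    int main(void)
    {
        float f = 1.0f;
        /* Reading a float through an unsigned long lvalue violates
         * the ISO C aliasing rules, so -O2 (which turns on
         * -fstrict-aliasing in gcc 2.95) is free to reorder or
         * discard the accesses. */
        unsigned long bits = *(unsigned long *)&f;
        printf("0x%lx\n", bits);
        return 0;
    }

Code like that tended to "work" under egcs and can quietly stop
working under 2.95, which would explain blanket disabling of
optimisation without there being any actual compiler bug.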
That having been said, there were definitely serious optimiser bugs
in egcs (the precursor to 2.95), and according to the change
history these were still being fixed post 2.95.1.
Gotta run. My ride's waiting (@%$@! broken ankle). I welcome any and
all criticism.
Regards,
Neil
--
Neil Schellenberger | Voice : (613) 599-2300 ext. 8445
CrossKeys Systems Corporation | Fax : (613) 599-2330
350 Terry Fox Drive | E-Mail: [EMAIL PROTECTED]
Kanata, Ont., Canada, K2K 2W5 | URL : http://www.crosskeys.com/
+ Greg Moore (1975-1999), Gentleman racer and great Canadian +