>>>>> "Dave" == Dave Dykstra <[EMAIL PROTECTED]> writes:
Dave> On Mon, Oct 16, 2000 at 08:30:14PM +0200, Janåke Rönnblom
Dave> wrote:
>> On 16 Oct 2000, at 10:32, Dave Dykstra wrote:
>> >
>> > Those of you who are getting a stuck rsync 2.4.6: are you
>> > running ssh that has not been recompiled with
>>
>> I can forget about running rsync 2.4.x, right?
Dave> [#ifdef HAVE_CYGWIN code showing use of pipes deleted]
Dave> I don't know. The characteristics of NT+cygwin may be very
Dave> different from Unixes. I don't even know why it makes a
Dave> difference, just that Andrew Tridgell said it was needed.
Dave> He had a big buffering hack in rsync 2.3.2 as a workaround
Dave> for SSH hangs and he took it out in 2.4.0.
Well, I'm somewhat relieved to know that I'm not the only one who is
still having weird timeout problems with 2.4.6. Since the issue
seemed to have died down after the 2.4.6 release, I assumed that the
problem was some boneheaded thing that I was doing. (Well, I suppose
that could still be the problem....)
I'm still seeing "unexpected EOF in read_timeout" on the client side.
Here's the basic config:
rsync 2.4.6 (plus a couple of local patches, prev. posted here by me)
gcc 2.95.2 with "-g -O", per default configure
GNU binutils (assembler/linker) 2.9.1
build system: SunOS 5.5.1 Generic_103640-29 sun4u sparc SUNW,Ultra-5_10
server: SunOS 5.6 sun4u sparc SUNW,Ultra-5_10 (unsure about generic)
client: SunOS 5.6 Generic_105181-10 sun4u sparc SUNW,Ultra-250
daemon mode server (--daemon)
using "secrets file" style auth
I have about 100G to sync between these systems over a relatively low
bandwidth link. This 100G represents about 1M files spread (unevenly)
over about a dozen or so modules. I was trying all modules nightly,
all in parallel (since the expected per-session data transfer is low
but the diff computation latency is expected to be high; lots of swap
available).
The symptoms: random timeouts (various modules, various files).
So the other day I tried running each module serially. Miraculously,
all but one module now works consistently, and that one module now
consistently fails.
Dave> Also, for those of you reporting hangs, here is what Andrew
Dave> said when he released 2.4.3:
Dave> Finally a plea. When reporting a freeze _please_ include
Dave> the following or we can't help:
Dave> - what is the state of the send/receive queues shown with
Dave> netstat on the two ends.
Dave> - what system call is each of the 3 processes stuck in. Use
Dave> truss on solaris or strace on Linux.
This is the main reason that I've had to hold off with a bug report:
the server machine is at a remote site, under the control of another
corporate entity, and is very difficult for me to get shell access to.
I keep "meaning" to make the necessary security arrangements to visit,
but I have many, many other fish to fry.
I've limited myself to spelunking through the source code trying to
see if I can figure out what's going on purely by inspection, but so
far I haven't really had the time to look at this properly.
I hope to get to this before Christmas, but don't hold your breath :-(
Here's what (I think) I've managed to glean so far:
1) Serious re-entrancy-related signal handling problem for SIGUSR1.
It is possible for A LOT of non-trivial behaviour to occur in the
context of the signal handler upon the receipt of this signal.
Specifically, exit_cleanup() can call various I/O functions
(e.g. snprintf(), syslog(), etc.), application functions
(write_unbuffered(), do_unlink()), and libc functions (free(), via
do_unlink()). Things are even worse with keep_partial enabled
(e.g. finish_transfer()).
Why is this a problem? Imagine what happens if the signal was
delivered, say, while the "main" program was inside of malloc().
Or already in the middle of write_unbuffered(). Ouch.
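To make the hazard concrete, here's a minimal stand-alone sketch (my
own illustration, not rsync code) of the pattern in question:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static char *partial;            /* state shared with the handler */

    static void cleanup_handler(int sig)
    {
        /* Both of these are async-signal-UNsafe: */
        fprintf(stderr, "interrupted; cleaning up\n"); /* stdio takes locks */
        free(partial);               /* re-enters the malloc arena */
        _exit(1);
    }

    int main(void)
    {
        signal(SIGUSR1, cleanup_handler);
        for (;;) {
            partial = malloc(4096);  /* signal may land mid-malloc ... */
            free(partial);           /* ... or mid-free: heap corruption */
        }
    }

If SIGUSR1 is delivered while main() happens to be inside malloc() or
free(), the handler's free() walks a half-updated heap, and the crash
typically shows up much later, at some unrelated allocation.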
<TEDIOUS class="lecture">
The problem is that on many (virtually all?) unix variants, libc
(and malloc/free in particular) is not re-entrant. For example, on
Solaris it is only legal to call functions marked Async-Signal-Safe
from signal handlers; very, very few functions are so marked (and
they tend to be trivial ones like getpid()).
Now one does often see programmers "getting away with it" because
crashes will be infrequent and apparently random. Also, most
signal handlers tend to be of the "shutdown" variety and,
therefore, fatal failures often go unnoticed. ("Hey it stopped.
What more do you want?")
My advice to engineers here, in my role as Senior Annoying Tiresome
Guy is to utterly avoid "classic" signal handlers if at all
possible (!). If non-trivial behaviour is required in response to
a signal, try to arrange the program architecture around a work
loop which (among its other duties) tests a flag (set by the
signal handler as its only action), as sketched below.
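A minimal sketch of what I mean (illustrative only; the names are
mine, not rsync's):

    #include <signal.h>
    #include <stdlib.h>

    static volatile sig_atomic_t got_sigusr1 = 0;

    static void note_signal(int sig)
    {
        got_sigusr1 = 1;             /* the handler's ONLY action */
    }

    int main(void)
    {
        signal(SIGUSR1, note_signal);
        for (;;) {
            if (got_sigusr1) {
                /* Safe: ordinary program context; call anything here. */
                exit(0);
            }
            /* ... the loop's usual work goes here, so the flag is
             * re-checked once per iteration rather than busy-waited. */
        }
    }

volatile sig_atomic_t is the one type the C standard guarantees can
be written from a handler and read from the main flow of control.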
With the advent of "official" Posix semantics for signal service in
threads, multi-threaded programs can get away with masking all
signals in all threads, dedicating the "main()" thread to serially
handle signals using sigpause() or somesuch, and doing the real
work in other threads (leaving the inter-thread communication as an
exercise for the student).
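Something along these lines (a sketch under POSIX threads; I've used
sigwait() rather than sigpause(), and everything other than the POSIX
calls themselves is illustrative):

    #include <pthread.h>
    #include <signal.h>
    #include <stdlib.h>

    static void *worker(void *arg)
    {
        /* Real work happens here; SIGUSR1 can never interrupt it,
         * because the signal mask is inherited from main(). */
        return NULL;
    }

    int main(void)
    {
        sigset_t set;
        int sig;
        pthread_t tid;

        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        pthread_sigmask(SIG_BLOCK, &set, NULL); /* before any pthread_create */
        pthread_create(&tid, NULL, worker, NULL);

        for (;;) {
            sigwait(&set, &sig);     /* blocks until SIGUSR1 arrives */
            if (sig == SIGUSR1)
                exit(0);             /* handled serially, in ordinary
                                      * thread context, not a handler */
        }
    }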
</TEDIOUS>
But I doubt that this is the actual timeout problem itself, and I
haven't time to figure out how to re-architect around it so I'll
just plunk it on the table here and leave it for the experts to
ruminate upon.
2) Tends to blow chunks under load.
Running all of the modules at once seems to be a big problem. I
can't make any reasonable statements about memory/swap use until I
can get physical access to the server machine, but I have no
particular reason from the syslog on the server to suspect
out-of-memory.
Nonetheless, the one module that still fails consistently is
probably the largest in total number of files (around 220k files).
It may also be the largest in terms of file space (don't have a
good estimate on hand).
3) Server actually appears to be exiting relatively "cleanly".
I always seem to see the following kind of neighbouring syslogs:
rsyncd[750]: wrote 432468 bytes read 1325 bytes total size 2021960482
rsyncd[750]: transfer interrupted (code 11) at main.c(272)
The fact that we see the "total size" message suggests to me that
we got at least as far as the log_exit() call at the start of
report(). The fact that I don't ever see the requested statistics
at the client end suggests that the second message is generated by
an io_error sometime before/during the stats transmission. The fact
that we made it as far as main:272 suggests a "clean" exit with an
io_error=1 (remapped to code=RERR_FILEIO by _exit_cleanup) during
stats transmission. Unfortunately, I don't see any evidence of the
rprintf() messages that seem to always be linked with setting
io_error=1. Mystifying.
Note that I do not yet have hard proof that this is what's
happening. Until I can get a truss, snoop, and pstack output from
the server running with -vvvvvvvv, I'm basically working without a
net. In fact, I can't even easily correlate the server and client
logs because the time bases are not necessarily in sync. Sigh.
4) Compiler optimiser bug, sparc architecture?
Some autoconf-based packages are currently being released with
auto-detection of gcc 2.95 and are forcing off optimisation. From
a quick investigation I have been unable to determine exactly what
bug they are trying to work around (or indeed whether this is
simply hysteria caused by the aliasing-related optimisations being
applied to code that was incorrect to start with).
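For what it's worth, the usual poster child for code that the new
aliasing optimisations are entitled to break looks something like
this (my example, not from rsync):

    #include <stdio.h>

    int main(void)
    {
        float f = 1.0f;
        /* Reading a float through an unsigned long lvalue violates
         * the ISO C aliasing rules, so -O2 (which turns on
         * -fstrict-aliasing in gcc 2.95) is free to reorder or
         * discard the accesses. */
        unsigned long bits = *(unsigned long *)&f;
        printf("0x%lx\n", bits);
        return 0;
    }

Code like that tended to "work" under egcs and can quietly stop
working under 2.95, which would explain blanket disabling of
optimisation without there being any actual compiler bug.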
That having been said, there were definitely serious optimiser bugs
in egcs (the precursor to 2.95), and according to the change
history these were still being fixed post 2.95.1.
Gotta run. My ride's waiting (@%$@! broken ankle). I welcome any and
all criticism.
Regards,
Neil
--
Neil Schellenberger | Voice : (613) 599-2300 ext. 8445
CrossKeys Systems Corporation | Fax : (613) 599-2330
350 Terry Fox Drive | E-Mail: [EMAIL PROTECTED]
Kanata, Ont., Canada, K2K 2W5 | URL : http://www.crosskeys.com/
+ Greg Moore (1975-1999), Gentleman racer and great Canadian +