On Tue, Oct 24, 2000 at 04:57:05PM -0500, Dave Dykstra wrote:
> > On Fri, Oct 20, 2000 at 10:44:04AM +1000, Andrew Tridgell wrote:
> > ...
> > > Stephen tells me that a patch went into the 2.2.17 Linux kernel that
> > > was supposed to fix this particular problem. If you get a chance it
> > > would be worth trying that kernel (or a 2.2.18preX) to see if it
> > > solves your problem.
...
> I and a colleague have now done some sniffing and it does indeed look like
> a Linux bug, but not the one you describe.  In fact, we once saw just the
> opposite where the Solaris box sent data outside the window the Linux box
> was offering, and the Linux box ignored it.  We don't think that's necessarily
> an error; we figured the machine that sends the extra data may be in the
> process of sending an ack anyway and figures that, since it has data to
> send, it may as well send it in the hope that the window on the receiver
> will have opened up in the meantime.
> 
> The problem we see is that when the Linux machine's *receive* queue gets
> quite full and rsync isn't ready to receive it, the Linux machine seems
> unable to send data out of its *send* queue; the tcpdump (attached) shows
> that Solaris has told Linux that it's ready to receive more but Linux
> doesn't send it.  I figure that since the Solaris machine is the receiver
> and it has the two rsync processes, the rsync process that's generating the
> checksums for the receive side is blasting a lot of data at the single
> sending rsync process on the Linux machine which isn't ready to accept it
> yet because it is sending a lot of data in the other direction.  The rsync
> sender would eventually get to read the receive data if Linux would just
> transmit its send data.
> 
> Do you know who worked on the patch for 2.2.17?  I'd like to get this
> information to them.
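
For anyone who wants a starting point for a smaller test program, here
is a minimal sketch of the traffic pattern described above. It is only
an illustration under stated assumptions: the names (run_trial, drain,
CHUNK) are hypothetical, it runs over TCP on the loopback interface,
and the "sender" side simply writes all of its data before it reads
any, which is a simplification and not necessarily how rsync schedules
its I/O. On a correct TCP stack both directions should finish; to look
for the stall itself, the two sides would presumably have to run on
separate Linux 2.2.x and Solaris 7 hosts.

```python
import socket
import threading

CHUNK = 64 * 1024  # read size; an arbitrary choice for this sketch


def drain(sock, nbytes):
    """Read up to nbytes from sock (stopping early on EOF); return the count."""
    got = 0
    while got < nbytes:
        data = sock.recv(min(CHUNK, nbytes - got))
        if not data:
            break
        got += len(data)
    return got


def run_trial(to_receiver, to_sender, timeout=30.0):
    """Push data both ways at once; return (receiver_got, sender_got)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # loopback, any free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname(), timeout=timeout)
    receiver, _ = listener.accept()
    receiver.settimeout(timeout)
    listener.close()

    results = {}

    def sender_side():
        # Assumed behaviour of the single sending process: write all of
        # the file data first, and only then read what the peer sent.
        sender.sendall(b"d" * to_receiver)
        results["sender_got"] = drain(sender, to_sender)

    def receiver_writer():
        # Stands in for the checksum generator blasting data at the sender.
        receiver.sendall(b"c" * to_sender)

    threads = [
        threading.Thread(target=sender_side),
        threading.Thread(target=receiver_writer),
    ]
    for t in threads:
        t.start()
    # Meanwhile this thread plays the receiving process reading file data.
    results["receiver_got"] = drain(receiver, to_receiver)
    for t in threads:
        t.join()
    sender.close()
    receiver.close()
    return results["receiver_got"], results.get("sender_got", 0)


if __name__ == "__main__":
    print(run_trial(8 * 1024 * 1024, 8 * 1024 * 1024))
```

If a stall like the one in the tcpdump occurred, the socket timeout
would expire instead of the trial completing, so a hang shows up as an
error rather than a frozen process.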


Tridge, please put me in touch with this Stephen or whomever else you know
who may have worked on the Linux TCP fix that was supposed to help that
problem uncovered by rsync.  We have been unsuccessful in coming up with a
simple program to reproduce the failure, and the prospect of digging into
the Linux TCP code with no background and without assistance from somebody
who is knowledgeable about it is daunting to us.  I would think that somebody
who has already worked on a fix to Linux TCP related to rsync would be
motivated to help.




> We have reproduced this problem between Linux 2.2.17 & Solaris 7 on a LAN,
> in fact two machines on the same etherswitch.  We have also tried it between
> Linux 2.2.18preX and the same Solaris 7 machine, over a WAN.  We didn't see
> a problem between the two Linux machines.  The tcpdump shows the 2.2.17
> Linux machine called "static" which is analogous to "expbuild" in the
> above netstat.
> 
> For somebody else to reproduce this they should be able to first populate
> a directory on a Solaris machine with the data from
>     rsync://www.bell-labs.com/wwexptools/rsynctst/m17/
> and then offer the data that is now at
>     rsync://www.bell-labs.com/wwexptools/rsynctst/m18/
> from an rsync 2.4.6 daemon on a Linux machine and then initiate an rsync
> 2.4.6 client on the Solaris machine to pull it down on top of the m17
> data.  We're going to try to work on a smaller test program to reproduce
> this.


It turns out that the rsync server on the www.bell-labs.com machine that is
visible on the Internet wasn't working for a couple of days after I posted
this message (we have a separate machine on the inside, so I didn't notice).
Nobody complained, so apparently nobody else is trying to use this data to
reproduce the problem yet.  It's too bad we haven't come up with a simpler
way to reproduce the problem, but at least this way shows it quite
consistently.

- Dave Dykstra
