On Fri, 22 Feb 2002, Dave Dykstra wrote:

> [adding the mailing list to the Cc after David sent me trusses and other
> info about his hang.  I have trimmed down the tracing info to the relevant
> pieces.]
>
> On Tue, Feb 19, 2002 at 11:51:27PM -0500, David Birnbaum wrote:
> > I wasn't able to get a netstat this time.  However, I should be able to
> > grab it again if these truss's don't help.  I'm thinking I will run a
> > snoop the next time and capture the packet traffic to see what's happening
> > at the TCP level - at least we should be able to see who/what shut down
> > first, because from the truss, it looks like the kernel shut down the
> > whole thing at once.  It's tough to time the netstat because it takes a
> > while for it to die.
>
> The client is doing nothing complex, it's just merrilly blasting away down
> file descriptor 4 and never reading it.  It then hangs and the only reason
> it wakes up is because the connection has been torn down.
>
> The server process looks very odd.  It is doing a 196K read from a file, and
> it takes 9 minutes to return from that, and when it wakes up from that it
> finds that an EPIPE from the connection to the rsync client.  Why would a
> 196K read take 9 minutes?  It looks like it is going over NFS, so maybe the
> problem is with NFS.  What's the network between the NFS server and the
> rsync server?  Is it possible to run rsync on the NFS server itself, or is
> that a network appliance or something?  It could be that the TCP
> implementation on the client side wasn't willing to wait for a response for
> 9 minutes and that's why it shut things down with an EPIPE.
>
> At the same time, the server child process is doing a 262K read from another
> file over the same NFS connection and it takes the same 9 minutes and
> eventually suceeds.  It only dies because the other server process had died
> and it gets a signal.

What makes you think it's going over NFS?  Everything being backed up is
local, and the backup disk is local...as best as I can tell.  The lsof
stuff all points to local devices as well.

The network is a 100MB switched network, and while the backup server is
busy at night doing the backups, when I was running the truss neither
server was particularly busy.

The plot thickens...it's worked two days in a row now.  But, the next time
it fails I will get a snoop and see if TCP lends any clues.  Is it
possible that there's some sort of memory clearing, cleanup, or other
event going on that could cause the occassional lapse like this?  We
certainly haven't seen any other networking problems that would explain
this.

David.

-----

> > Content-Description: client lsof
> > COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
> ...
> > rsync   20039 root    4u  IPv4 0x3000dbd3810 0t64514407     TCP 
>corona.chelsea.net:51816->hite.chelsea.net:rsync (ESTABLISHED)
> > rsync   20039 root    5r  VREG        32,359   52273152  243313 
>/export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
>
> > Content-Description: client truss
> ...
> > 3504.6000   write(4, " w s   9 8 ;   V Z _ I E".., 1021)    = 1021
> > 3504.6003   poll(0xFFBE5CA0, 1, 60000)                      = 1
> > 3504.6005   write(4, "03FCFFFF", 4)                         = 4
> > 3504.6010   poll(0xFFBE5CA0, 1, 60000)                      = 1
> > 3504.6012   write(4, "02FCFFFF", 4)                         = 4
> > 3504.6016   poll(0xFFBE5CA0, 1, 60000)                      = 1
> > 3504.6019   write(4, "01FCFFFF", 4)                         = 4
> > 3504.6034   poll(0xFFBE5CA0, 1, 60000)                      = 1
> > 3504.6036   write(4, "E306\0\0", 4)                         = 4
> > 3504.6041   poll(0xFFBE5D18, 1, 60000)                      = 1
> > 3504.6046   write(4, " s t 1 0 3 . t n t 4 . r".., 1763)    = 1763
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3564.6045   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3624.6045   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3684.6045   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3744.6046   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3804.6046   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3864.6044   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3924.6041   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 3984.6041   poll(0xFFBE5CA0, 1, 60000)                      = 0
> > poll(0xFFBE5CA0, 1, 60000)  (sleeping...)
> > 4037.3040   poll(0xFFBE5CA0, 1, 60000)                      = 1
> > 4037.3046   write(4, "FEFBFFFF", 4)                         Err#32 EPIPE
> > 4037.3055       Received signal #13, SIGPIPE [caught]
> > 4037.3064   sigaction(SIGUSR1, 0xFFBE5838, 0xFFBE58B8)      = 0
> > 4037.3071   sigaction(SIGUSR2, 0xFFBE5838, 0xFFBE58B8)      = 0
> > 4037.3074   write(2, " r s y n c   e r r o r :".., 66)      = 66
> > 4037.3084   mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, 
>MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF310000
> > 4037.3088   munmap(0xFF310000, 8192)                        = 0
> > 4037.3091   llseek(0, 0, SEEK_CUR)                          = 583129
> > 4037.3094   _exit(20)
>
>
> > Content-Description: server rsync lsof
> > COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
> ...
> > rsync   27948 root    5u  IPv4 0x30000a59710 0t64477429     TCP 
>hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> > rsync   27948 root    6uw VREG         85,30          0  541128 /export 
>(/dev/md/dsk/d30)
> > rsync   27948 root    7r  VREG        32,144   19062784 5347011 /export/backup -- 
>export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.idx
>
> > Content-Description: server truss
> ...
> > 3497.9255   
>lstat64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat",
> 0xFFBEE370) = 0
> > 3497.9262   
>open64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat",
> O_RDONLY) = 7
> > 3497.9869   read(7, "\002FFB7\0\0\0\0\002E0EF".., 196536)   = 196536
> > poll(0xFFBEC6E0, 2, 60000)  (sleeping...)
> > 4037.2043   poll(0xFFBEC6E0, 2, 60000)                      = 0
> > 4037.2049   poll(0xFFBEC6E0, 2, 60000)                      = 1
> > 4037.2052   write(5, "FC0F\0078E hB40698 .84EF".., 4096)    Err#32 EPIPE
> > 4037.2816       Received signal #13, SIGPIPE [caught]
> > 4037.2843   sigaction(SIGUSR1, 0xFFBEC2B0, 0xFFBEC330)      = 0
> > 4037.2848   sigaction(SIGUSR2, 0xFFBEC2B0, 0xFFBEC330)      = 0
> > 4037.2851   getpid()                                        = 27948 [27130]
> > 4037.3504   kill(28192, SIGUSR1)                            = 0
> > 4037.3508   getpid()                                        = 27948 [27130]
> > 4037.3514   poll(0xFFBEABB0, 2, 60000)                      = 1
> > 4037.3518   write(5, " B\0\0\b r s y n c   e r".., 70)      Err#32 EPIPE
> > 4037.3523       Received signal #13, SIGPIPE [default]
> >     *** process killed ***
>
>
> > Content-Description: server child lsof
> > COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
> ...
> > rsync   28192 root    5u  IPv4 0x30000a59710 0t64477429     TCP 
>hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> > rsync   28192 root    6u  VREG         85,30          0  541128 /export 
>(/dev/md/dsk/d30)
> > rsync   28192 root    7r  VREG        32,144   51363840 5347010 /export/backup -- 
>export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
> > rsync   28192 root    8u  unix        105,36        0t0  306297 
>/devices/pseudo/tl@0:ticots->(socketpair: 0x18d400000000) (0x300016fd448)
> > rsync   28192 root    9u  VREG        32,144   37569724 5347009 /export/backup -- 
>export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/.view.dat.Q9aae3
> > rsync   28192 root   10u  unix        105,39        0t0  306297 
>/devices/pseudo/tl@0:ticots->(socketpair: 0x18d600000000) (0x300016fc368)
>
> > Content-Description: server child truss
> ...
> > 3504.2667   poll(0xFFBECF20, 1, 60000)                      = 1
> > 3504.2682   read(5, "D2\t\0\0", 4)                          = 4
> > 3504.2699   poll(0xFFBECF98, 1, 60000)                      = 1
> > 3504.2714   read(5, " I E   6 . 0 ;   W i n d".., 2514)     = 2514
> > 3504.2738   write(9, " I E   6 . 0 ;   W i n d".., 1024)    = 1024
> > 3504.2755   write(9, " . 0 ;   W i n d o w s  ".., 1024)    = 1024
> > 3504.2771   write(9, " w e r P C )\0 M9BCE9A 1".., 466)     = 466
> > 3504.2787   poll(0xFFBECF20, 1, 60000)                      = 1
> > 3504.2800   read(5, "E4E6FFFF", 4)                          = 4
> > 3504.2822   llseek(7, 5898240, SEEK_SET)                    = 5898240
> > 3517.5204   read(7, "\0 Y  \0\0\0\0\0\0 E16C5".., 262144)   = 262144
> > 4037.2227   write(9, " d o w s   9 5 )\0 T\v C".., 928)     = 928
> > 4037.2233   poll(0xFFBECF20, 1, 60000)                      = 1
> > 4037.2238   read(5, "E3E6FFFF", 4)                          = 4
> > 4037.2244   write(9, " 4 . 0   ( c o m p a t i".., 928)     = 928
> > 4037.2258   poll(0xFFBECF20, 1, 60000)                      = 1
> > 4037.2262   read(5, "E2E6FFFF", 4)                          = 4
> > 4037.2269   write(9, " c o m : M o z i l l a /".., 928)     = 928
> > 4037.2274   poll(0xFFBECF20, 1, 60000)                      = 1
> > 4037.2278   read(5, "0E1A\0\0", 4)                          = 4
> > 4037.2283   poll(0xFFBECF98, 1, 60000)                      = 1
> > 4037.2288   read(5, " M o z i l l a / 4 . 0  ".., 6670)     = 6670
> > 4037.2304   write(9, " M o z i l l a / 4 . 0  ".., 1024)    = 1024
> > 4037.3011   write(9, " . c o m : M o z i l l a".., 1024)    = 1024
> > 4037.3017   write(9, " 2 4 7 - 2 9 - 0 . c l i".., 1024)    = 1024
> > 4037.3022   write(9, " 1 ;   Q 3 1 2 4 6 1 )\0".., 1024)    = 1024
> > 4037.3028   write(9, " M S I E C r a w l e r )".., 1024)    = 1024
> > 4037.3079   llseek(9, 1, SEEK_CUR)                          = 585779
> > 4037.3085   write(9, " WB7 lA0 1 2 - 2 5 5 - 1".., 1023)    = 1023
> > 4037.3125   write(9, " 0 7 2 1 2 0 0 0 h . 0 8".., 526)     = 526
> > 4037.3636   poll(0xFFBECF20, 1, 60000)                      = 1
> > 4037.3642   read(5, " %DFFFFF", 4)                          = 4
> > 4037.3650   llseek(7, 7733248, SEEK_SET)                    = 7733248
> > 4037.6213   read(7, "\01A @\0\01A  \0\0 6109E".., 262144)   = 262144
> > 4037.6229       Received signal #16, SIGUSR1 [caught]
> >       siginfo: SIGUSR1 pid=27948 uid=0
> > 4037.6234   sigaction(SIGUSR1, 0xFFBED188, 0xFFBED208)      = 0
> > 4037.6238   sigaction(SIGUSR2, 0xFFBED188, 0xFFBED208)      = 0
> > 4037.6913   
>unlink("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/oral-forum-www.thebody.com/.visitorid.idx.Lkbae3")
> = 0
> > 4037.6921   getpid()                                        = 28192 [1]
> > 4037.6957   write(10, " A\0\0\b r s y n c   e r".., 69)     Err#32 EPIPE
> > 4037.6977       Received signal #13, SIGPIPE [caught]
> > 4037.6982   sigaction(SIGUSR1, 0xFFBEC8B0, 0xFFBEC930)      = 0
> > 4037.7016   sigaction(SIGUSR2, 0xFFBEC8B0, 0xFFBEC930)      = 0
> > 4037.7021   write(10, " A\0\0\b r s y n c   e r".., 69)     Err#32 EPIPE
> > 4037.7026       Received signal #13, SIGPIPE [default]
> >     *** process killed ***
>
>


-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html

Reply via email to