Hi,

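We started using rsync over ssh to sync data between remote sites, and we are seeing many stuck rsync processes.

Usually the hang is in the ssh stage: ssh is blocked in a select() call that never returns. Its fd #4 is long gone, but select(4, ...) only watches descriptors 0 to 3, because the first argument is nfds (the highest descriptor plus one). ssh is actually waiting for fd 3 to become writable. The parent rsync is likewise waiting for its fd 4 to become writable.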
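
When reading strace output, keep in mind that the first argument of select() is nfds, not a descriptor. The descriptors being waited on are the ones in the bracketed read/write sets. Below is a minimal Python sketch that splits the two select() lines from the example into their parts. decode_select and parse_fd_set are hypothetical helper names, and the regex assumes strace's usual "[fd fd ...]" / "NULL" formatting for fd sets.

```python
import re

# Matches the start of a select() call as strace prints it, for example
#   select(4, [], [3], NULL, NULL <unfinished ...>
# Groups: nfds, readfds, writefds; each set is "NULL" or "[fd fd ...]".
SELECT_RE = re.compile(r"select\((\d+),\s*(NULL|\[[\d ]*\]),\s*(NULL|\[[\d ]*\])")


def parse_fd_set(text):
    """Turn strace's "[3 5]", "[]" or "NULL" notation into a set of ints."""
    if text == "NULL":
        return set()
    return {int(fd) for fd in text.strip("[]").split()}


def decode_select(line):
    """Return (nfds, readfds, writefds) parsed from a strace select() line."""
    m = SELECT_RE.search(line)
    if m is None:
        raise ValueError(f"not a select() line: {line!r}")
    nfds = int(m.group(1))
    readfds = parse_fd_set(m.group(2))
    writefds = parse_fd_set(m.group(3))
    # nfds is one more than the highest descriptor checked, never a
    # descriptor itself, so every watched fd must be below it.
    too_big = sorted(fd for fd in readfds | writefds if fd >= nfds)
    if too_big:
        raise ValueError(f"fds {too_big} are not below nfds={nfds}")
    return nfds, readfds, writefds


if __name__ == "__main__":
    # The select() lines from the rsync (28641) and ssh (28642) traces.
    for line in ("select(5, NULL, [4], NULL, {48, 20000} <unfinished ...>",
                 "select(4, [], [3], NULL, NULL <unfinished ...>"):
        nfds, rfds, wfds = decode_select(line)
        print(f"nfds={nfds} read={sorted(rfds)} write={sorted(wfds)}")
```

Run as a script, it prints "nfds=5 read=[] write=[4]" for rsync and "nfds=4 read=[] write=[3]" for ssh. In both processes the blocked call is therefore a write-wait on a descriptor that is still open.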

 

Here is an example:

[EMAIL PROTECTED]:/root# ps -efwww |grep ekrimer
ekrimer  28619  4979  0 Jan28 ?        00:00:00 /var/netstar//lib/build_0134_19/nbjobleader.out /arch/projects/gesher/gesher_high /a/nfs/iil/proj/mpgarch/arch_vpool_1/ekrimer/ambig/task_nhm /netbatch /a/nfs/iil/proj/mpgarch/arch_vpool_1/ekrimer/ambig/task_nhm/##post_exec_1.vpool_idc.7496709 /netbatch/##post_exec_1.vpool_idc.7496709 /a/nfs/iil/proj/mpgarch/arch_vpool_1/ekrimer/ambig/task_nhm/##post_exec_1.vpool_idc.7496709 /netbatch/##post_exec_1.vpool_idc.7496709 0 batch ptsl2171 BATCH 1138446228 1138472243 1.vpool_idc.7496709 19  ,cputime soft = unlimited,cputime hard = unlimited,filesize soft = unlimited,filesize hard = unlimited,datasize soft = unlimited,datasize hard = unlimited,stacksize soft = 8192,stacksize hard = unlimited,coredumpsize soft = 0,coredumpsize hard = unlimited,openfiles soft = 1024,openfiles hard = 8192,descriptors soft = 1024,descriptors hard = 8192,addressspace soft = unlimited,addressspace hard = unlimited,memorylocked soft = unlimited,memorylocked hard = unlimited,maxproc soft = 16384,maxproc hard = 16384,memoryuse soft = unlimited,memoryuse hard = unlimited null false false false 5 0 /nfs/site/proj/mpgarch/perf/tools/scripts/bin/arch_post.csh /netbatch/ekrimer/task_nhm_296/runs /nfs/site/proj/mpgarch/arch_vpool_1/ekrimer/ambig/results
ekrimer  28620 28619  0 Jan28 ?        00:00:00 /bin/csh -f /nfs/site/proj/mpgarch/perf/tools/scripts/bin/arch_post.csh /netbatch/ekrimer/task_nhm_296/runs /nfs/site/proj/mpgarch/arch_vpool_1/ekrimer/ambig/results
ekrimer  28641 28620  0 Jan28 ?        00:00:00 /usr/intel/bin/rsync -e ssh -azx --rsync-path=/usr/intel/bin/rsync /netbatch/ekrimer/task_nhm_296/runs rsync-mpgarch.iil.intel.com:/nfs/site/proj/mpgarch/arch_vpool_1/ekrimer/ambig/results
ekrimer  28642 28641  0 Jan28 ?        00:00:00 ssh rsync-mpgarch.iil.intel.com /usr/intel/bin/rsync --server -logDtprxz . /nfs/site/proj/mpgarch/arch_vpool_1/ekrimer/ambig/results
root      7647  7606  0 20:22 pts/0    00:00:00 grep ekrimer

[EMAIL PROTECTED]:/root# strace -p 28620
[EMAIL PROTECTED]:/root# strace -p 28641
select(5, NULL, [4], NULL, {48, 20000} <unfinished ...>
[EMAIL PROTECTED]:/root# ls -l /proc/28641/fd/5
lrwx------    1 ekrimer  arch           64 Jan 30 20:23 /proc/28641/fd/5 -> socket:[92942621]
[EMAIL PROTECTED]:/root# strace -p 28642
select(4, [], [3], NULL, NULL <unfinished ...>
[EMAIL PROTECTED]:/root# ls -l /proc/28642/fd/4
ls: /proc/28642/fd/4: No such file or directory
[EMAIL PROTECTED]:/root# ls -l /proc/*/fd/* | grep 'socket:\[92942621\]'
ls: /proc/8035/fd/255: No such file or directory
ls: /proc/8035/fd/3: No such file or directory
ls: /proc/self/fd/255: No such file or directory
ls: /proc/self/fd/3: No such file or directory
lrwx------    1 ekrimer  arch           64 Jan 30 20:25 /proc/28641/fd/5 -> socket:[92942621]
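
The `ls -l /proc/*/fd/*` search in the transcript also reports errors for descriptors that vanish while the glob is being expanded, such as /proc/self and pid 8035. Below is a minimal Python sketch of the same search that skips those races. find_socket_holders is a hypothetical helper, and it assumes the Linux /proc layout, where a socket descriptor's link reads socket:[inode]. Run it as root, as in the transcript, so that other users' processes are visible.

```python
import os


def find_socket_holders(inode, proc="/proc"):
    """Return sorted (pid, fd) pairs whose descriptor links to socket:[inode].

    Processes and descriptors can disappear between listing a directory and
    reading a link (the scanning process itself, short-lived shells), so
    those races are skipped instead of being reported as errors.
    """
    target = f"socket:[{inode}]"
    holders = []
    for entry in os.listdir(proc):
        if not entry.isdigit():  # skip "self", "net", "sys", ...
            continue
        fd_dir = os.path.join(proc, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited, or we may not look at it
            continue
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:  # descriptor closed while we were scanning
                continue
            if link == target:
                holders.append((int(entry), int(fd)))
    return sorted(holders)
```

Called as find_socket_holders(92942621) on the affected host, it returns the (pid, fd) pairs that the grep above was looking for. Scanning only numeric /proc entries is what avoids the /proc/self noise.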

 

Does anybody have an idea what causes these hangs?

Thanks,

Oren Mark
Intel - Israel Engineering Computing
Unix Server Platforms
[EMAIL PROTECTED]
(+) 972-4-865-5987
iNET: 465-5987

-- 
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
