Hi all,

The issue was observed in cygwin versions 1.7.30-1, 2.0.2-1 and 2.0.4-1 on 
Win7. Upon receiving 'HUGEFILE' form a debian system (scp 
user@server:logDownload_m/TEST.txt .), scp in cygwin hangs from time to time... 

Based on the observed occurrence behaviour of the issue and my experience in 
hunting ghosts, I - based on the new aspect described below - without much 
doubt assume a race condition between the cygwin read()/write() function 
implementations. I started investigating the cygwin-newlib implementation and 
gdb-ing it but I'm not yet that much into the details of 
read()/write()-implementation to identify possible race conditions.
So any contribution / co-working offers / investigation proposals are highly 
appreciated as I can only progress slowly.


As a starting point I'd have, based on: "scp and it's child process ssh share a 
pipe to receive the data of the downloaded file..." the following Q:
1) Is a child forked into a new, separate Win-process or is the fork simulated 
in cygwin, e.g. by threads?
2) Is the pipe implementation in cygwin in such a setup using Win-methods to 
forward the data or is the pipe-functionality simulated in cygwin?
3) Can the action of providing data for reading from a pipe by the consumer 
process (i.e. indicating data availability and the actual copying of the data 
in a read()-call) be interrupted/impacted by a write()-call from a 
pipe-data-producing process?


This continues https://sourceware.org/ml/cygwin/2015-06/msg00008.html but I 
wanted to start a new thread as the issue seems to be of a more generic nature 
and inside the cygwin core, i.e. independent of scp/ssh.
Although the issue might therefore be reproducible in a more simple way, I rely 
on the scp/ssh-based test environment described in the old thread, as I can 
easily reproduce it by those means, an intermediate test in a different env 
showed occurrence rates as low as < 0.03%.
So, in case you want to reproduce the setup I use and describe in parts below, 
please refer to my descriptions given in the older ML thread.


What I found out now by additionally instrumenting the openssh-file 
"atomicio.c" in "/usr/src/openssh-6.8p1-1.src/openssh-6.8p1" as follows (all 
lines containing "TK:" were added - not quite sophisticated code, sorry for 
that):

---8<---
size_t
atomicio6(ssize_t (*f) (int, void *, size_t), int fd, void *_s, size_t n,
    int (*cb)(void *, size_t), void *cb_arg)
{
        char *s = _s;
        size_t pos = 0;
        ssize_t res;
        struct pollfd pfd;

#ifndef BROKEN_READ_COMPARISON
        pfd.fd = fd;
        pfd.events = f == read ? POLLIN : POLLOUT;
#endif

        int Q1 = n/4; //TK:
        int Q2 = Q1 + Q1; //TK:
        int Q3 = Q2 + Q1; //TK:
        int Q4 = n; //TK:

if (pfd.events == POLLIN && 16 < Q1) //TK: if read
{ //TK:
        s[0] = 0; //TK:
        s[Q1-16] = 0; //TK:
        s[Q2-16] = 0; //TK:
        s[Q3-16] = 0; //TK:
        s[Q4-16] = 0; //TK:
} //TK:

        while (n > pos) {
                res = (f) (fd, s + pos, n - pos);

if (pfd.events == POLLIN && 16 < Q1)       //TK: if read
logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, res=%05d, errno=%03d,        
    s[0]=%.16s, s[Q1-16]=%.32s, s[Q2-16]=%.32s, s[Q3-16]=%.32s, 
s[Q4-16]=%.16s", pfd.fd, pfd.events, res, errno, s, s+Q1-16, s+Q2-16, s+Q3-16, 
s+Q4-16);
                switch (res) {
                case -1:
                        if (errno == EINTR)
                                continue;
                        if (errno == EAGAIN || errno == EWOULDBLOCK) {
#ifndef BROKEN_READ_COMPARISON
if (pfd.events == POLLIN && 16 < Q1)       //TK: if read
logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, entering poll ...", pfd.fd, 
pfd.events);
                                (void)poll(&pfd, 1, -1);
if (pfd.events == POLLIN && 16 < Q1)       //TK: if read
logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, poll returned, 
pollfd.revents=%d", pfd.fd, pfd.events, pfd.revents);
#endif
                                continue;
                        }
                        return 0;
                case 0:
                        errno = EPIPE;
logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, returning, pos=%d", pfd.fd, 
pfd.events, pos);
                        return pos;
                default:
                        pos += (size_t)res;
                        if (cb != NULL && cb(cb_arg, (size_t)res) == -1) {
                                errno = EINTR;
                                return pos;
                        }
                }
        }
        return pos;
}
--->8---


... and by again running my test until scp hangs:
$ while true; do DATE=`date +%Y%m%d_%H%M%S`; echo -n "started: $DATE"; 
/usr/local/bin/scp -v -C -o BatchMode=yes 
user@10.IP2.IP3.IP4:"logDownload_m/TEST.txt" . > scp_$DATE.txt 2>&1; echo " - 
finished: `date +%Y%m%d_%H%M%S`"; rm -v `ls -1 scp_2015* | head -n -3`; sleep 
5; done


... showed the 16k block "B014569" missing in the received file ('.' in output 
lines by me to preserve alignment):
$ sdiff -stw40 <(cat TEST.txt | sed -re "s/(B0.....-N001024-)/\1\n/g" | cut -d- 
-f-2) <(seq -w 1 1 94796 | sed -re "s/.*/B0&-N000001/g")
.                   >  B014569-N000001
.B094793-N000001    <


... Searching that block number in the produced scp/ssh-trace (the code for the 
other log-printouts is given in the referenced old thread) shows that the data 
of the 16k block "B014569" is correctly placed into the read-buffer which is 
handed over to the "read()"-call (line 50787 below) but a wrong number of read 
bytes is returned by "read()": "res=-0001". Directly before (lines 50784 - 
50786), this 16k data block was received by ssh via the socket and written to 
the pipe towards the scp parent process.
Based on that wrong return value, atomicio6() is not able to correctly update 
the pointer of its read-buffer and the amount of data to read in the loop. In 
the subsequent "read()"-call, the already received 16k block "B014569" in the 
read-buffer is overwritten with the next 16k block "B014570" (line 50793).
This causes the appearance of missing data blocks in the received file as 
described in the old ML thread referenced above.

$ cat -n scp_20150701_123824.txt
 . . .
 50769  TK: scp.c: sink: amt=65536, j=65536, i=238551040, size=1553121293, 
count=65536, buf=B014561-N000001-
 50770  TK: atomicio.c: atomicio6: fd=8, event=1, res=16384, errno=011,         
   s[0]=B014565-N000001-, s[Q1-16]=B014565-N001024-B014562-N000001-, s[Q2-16]=, 
s[Q3-16]=, s[Q4-16]=
 50771  TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011,         
   s[0]=B014565-N000001-, s[Q1-16]=B014565-N001024-B014562-N000001-, s[Q2-16]=, 
s[Q3-16]=, s[Q4-16]=
 50772  TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ...
 50773  TK: dispatch.c: ssh_dispatch_run: type=94
 50774  TK: channels.c: channel_input_data: data_len=16384, buf=B014566-N000001-
 50775  TK: dispatch.c: ssh_dispatch_run: type=94
 50776  TK: channels.c: channel_input_data: data_len=16384, buf=B014567-N000001-
 50777  TK: channels.c: channel_handle_wfd(2): write: len=32768, 
buf=B014566-N000001-
 50778  TK: dispatch.c: ssh_dispatch_run: type=94
 50779  TK: channels.c: channel_input_data: data_len=16384, buf=B014568-N000001-
 50780  TK: channels.c: channel_handle_wfd(2): write: len=16384, 
buf=B014568-N000001-
 50781  TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, 
pollfd.revents=1
 50782  TK: atomicio.c: atomicio6: fd=8, event=1, res=49152, errno=011,         
   s[0]=B014565-N000001-, s[Q1-16]=B014565-N001024-B014566-N000001-, 
s[Q2-16]=B014566-N001024-B014567-N000001-, 
s[Q3-16]=B014567-N001024-B014568-N000001-, s[Q4-16]=B014568-N001024-
 50783  TK: scp.c: sink: amt=65536, j=65536, i=238616576, size=1553121293, 
count=65536, buf=B014565-N000001-
 50784  TK: dispatch.c: ssh_dispatch_run: type=94
 50785  TK: channels.c: channel_input_data: data_len=16384, buf=B014569-N000001-
 50786  TK: channels.c: channel_handle_wfd(2): write: len=16384, 
buf=B014569-N000001-
 50787  TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011,         
   s[0]=B014569-N000001-, s[Q1-16]=B014569-N001024-B014566-N000001-, s[Q2-16]=, 
s[Q3-16]=, s[Q4-16]=
 50788  TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ...
 50789  TK: dispatch.c: ssh_dispatch_run: type=94
 50790  TK: channels.c: channel_input_data: data_len=16384, buf=B014570-N000001-
 50791  TK: channels.c: channel_handle_wfd(2): write: len=16384, 
buf=B014570-N000001-
 50792  TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, 
pollfd.revents=1
 50793  TK: atomicio.c: atomicio6: fd=8, event=1, res=16384, errno=011,         
   s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014566-N000001-, s[Q2-16]=, 
s[Q3-16]=, s[Q4-16]=
 50794  TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011,         
   s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014566-N000001-, s[Q2-16]=, 
s[Q3-16]=, s[Q4-16]=
 50795  TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ...
 50796  TK: dispatch.c: ssh_dispatch_run: type=94
 50797  TK: channels.c: channel_input_data: data_len=16384, buf=B014571-N000001-
 50798  TK: channels.c: channel_handle_wfd(2): write: len=16384, 
buf=B014571-N000001-
 50799  TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, 
pollfd.revents=1
 50800  TK: atomicio.c: atomicio6: fd=8, event=1, res=16384, errno=011,         
   s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014571-N000001-, 
s[Q2-16]=B014571-N001024-B014567-N000001-, s[Q3-16]=, s[Q4-16]=
 50801  TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011,         
   s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014571-N000001-, 
s[Q2-16]=B014571-N001024-B014567-N000001-, s[Q3-16]=, s[Q4-16]=
 50802  TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ...
 50803  TK: dispatch.c: ssh_dispatch_run: type=94
 50804  TK: channels.c: channel_input_data: data_len=16384, buf=B014572-N000001-
 50805  TK: dispatch.c: ssh_dispatch_run: type=94
 50806  TK: channels.c: channel_input_data: data_len=16384, buf=B014573-N000001-
 50807  TK: channels.c: channel_handle_wfd(2): write: len=32768, 
buf=B014572-N000001-
 50808  TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, 
pollfd.revents=1
 50809  TK: atomicio.c: atomicio6: fd=8, event=1, res=32768, errno=011,         
   s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014571-N000001-, 
s[Q2-16]=B014571-N001024-B014572-N000001-, 
s[Q3-16]=B014572-N001024-B014573-N000001-, s[Q4-16]=B014573-N001024-
 50810  TK: scp.c: sink: amt=65536, j=65536, i=238682112, size=1553121293, 
count=65536, buf=B014570-N000001-
 . . .

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

Reply via email to