Hi all, The issue was observed in cygwin versions 1.7.30-1, 2.0.2-1 and 2.0.4-1 on Win7. Upon receiving 'HUGEFILE' form a debian system (scp user@server:logDownload_m/TEST.txt .), scp in cygwin hangs from time to time...
Based on the observed occurrence behaviour of the issue and my experience in hunting ghosts, I - based on the new aspect described below - without much doubt assume a race condition between the cygwin read()/write() function implementations. I started investigating the cygwin-newlib implementation and gdb-ing it but I'm not yet that much into the details of read()/write()-implementation to identify possible race conditions. So any contribution / co-working offers / investigation proposals are highly appreciated as I can only progress slowly. As a starting point I'd have, based on: "scp and it's child process ssh share a pipe to receive the data of the downloaded file..." the following Q: 1) Is a child forked into a new, separate Win-process or is the fork simulated in cygwin, e.g. by threads? 2) Is the pipe implementation in cygwin in such a setup using Win-methods to forward the data or is the pipe-functionality simulated in cygwin? 3) Can the action of providing data for reading from a pipe by the consumer process (i.e. indicating data availability and the actual copying of the data in a read()-call) be interrupted/impacted by a write()-call from a pipe-data-producing process? This continues https://sourceware.org/ml/cygwin/2015-06/msg00008.html but I wanted to start a new thread as the issue seems to be of a more generic nature and inside the cygwin core, i.e. independent of scp/ssh. Although the issue might therefore be reproducible in a more simple way, I rely on the scp/ssh-based test environment described in the old thread, as I can easily reproduce it by those means, an intermediate test in a different env showed occurrence rates as low as < 0.03%. So, in case you want to reproduce the setup I use and describe in parts below, please refer to my descriptions given in the older ML thread. What I found out now by additionally instrumenting the openssh-file "atomicio.c" in "/usr/src/openssh-6.8p1-1.src/openssh-6.8p1" as follows (all lines containing "TK:" were added - not quite sophisticated code, sorry for that): ---8<--- size_t atomicio6(ssize_t (*f) (int, void *, size_t), int fd, void *_s, size_t n, int (*cb)(void *, size_t), void *cb_arg) { char *s = _s; size_t pos = 0; ssize_t res; struct pollfd pfd; #ifndef BROKEN_READ_COMPARISON pfd.fd = fd; pfd.events = f == read ? POLLIN : POLLOUT; #endif int Q1 = n/4; //TK: int Q2 = Q1 + Q1; //TK: int Q3 = Q2 + Q1; //TK: int Q4 = n; //TK: if (pfd.events == POLLIN && 16 < Q1) //TK: if read { //TK: s[0] = 0; //TK: s[Q1-16] = 0; //TK: s[Q2-16] = 0; //TK: s[Q3-16] = 0; //TK: s[Q4-16] = 0; //TK: } //TK: while (n > pos) { res = (f) (fd, s + pos, n - pos); if (pfd.events == POLLIN && 16 < Q1) //TK: if read logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, res=%05d, errno=%03d, s[0]=%.16s, s[Q1-16]=%.32s, s[Q2-16]=%.32s, s[Q3-16]=%.32s, s[Q4-16]=%.16s", pfd.fd, pfd.events, res, errno, s, s+Q1-16, s+Q2-16, s+Q3-16, s+Q4-16); switch (res) { case -1: if (errno == EINTR) continue; if (errno == EAGAIN || errno == EWOULDBLOCK) { #ifndef BROKEN_READ_COMPARISON if (pfd.events == POLLIN && 16 < Q1) //TK: if read logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, entering poll ...", pfd.fd, pfd.events); (void)poll(&pfd, 1, -1); if (pfd.events == POLLIN && 16 < Q1) //TK: if read logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, poll returned, pollfd.revents=%d", pfd.fd, pfd.events, pfd.revents); #endif continue; } return 0; case 0: errno = EPIPE; logit("TK: atomicio.c: atomicio6: fd=%d, event=%d, returning, pos=%d", pfd.fd, pfd.events, pos); return pos; default: pos += (size_t)res; if (cb != NULL && cb(cb_arg, (size_t)res) == -1) { errno = EINTR; return pos; } } } return pos; } --->8--- ... and by again running my test until scp hangs: $ while true; do DATE=`date +%Y%m%d_%H%M%S`; echo -n "started: $DATE"; /usr/local/bin/scp -v -C -o BatchMode=yes user@10.IP2.IP3.IP4:"logDownload_m/TEST.txt" . > scp_$DATE.txt 2>&1; echo " - finished: `date +%Y%m%d_%H%M%S`"; rm -v `ls -1 scp_2015* | head -n -3`; sleep 5; done ... showed the 16k block "B014569" missing in the received file ('.' in output lines by me to preserve alignment): $ sdiff -stw40 <(cat TEST.txt | sed -re "s/(B0.....-N001024-)/\1\n/g" | cut -d- -f-2) <(seq -w 1 1 94796 | sed -re "s/.*/B0&-N000001/g") . > B014569-N000001 .B094793-N000001 < ... Searching that block number in the produced scp/ssh-trace (the code for the other log-printouts is given in the referenced old thread) shows that the data of the 16k block "B014569" is correctly placed into the read-buffer which is handed over to the "read()"-call (line 50787 below) but a wrong number of read bytes is returned by "read()": "res=-0001". Directly before (lines 50784 - 50786), this 16k data block was received by ssh via the socket and written to the pipe towards the scp parent process. Based on that wrong return value, atomicio6() is not able to correctly update the pointer of its read-buffer and the amount of data to read in the loop. In the subsequent "read()"-call, the already received 16k block "B014569" in the read-buffer is overwritten with the next 16k block "B014570" (line 50793). This causes the appearance of missing data blocks in the received file as described in the old ML thread referenced above. $ cat -n scp_20150701_123824.txt . . . 50769 TK: scp.c: sink: amt=65536, j=65536, i=238551040, size=1553121293, count=65536, buf=B014561-N000001- 50770 TK: atomicio.c: atomicio6: fd=8, event=1, res=16384, errno=011, s[0]=B014565-N000001-, s[Q1-16]=B014565-N001024-B014562-N000001-, s[Q2-16]=, s[Q3-16]=, s[Q4-16]= 50771 TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011, s[0]=B014565-N000001-, s[Q1-16]=B014565-N001024-B014562-N000001-, s[Q2-16]=, s[Q3-16]=, s[Q4-16]= 50772 TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ... 50773 TK: dispatch.c: ssh_dispatch_run: type=94 50774 TK: channels.c: channel_input_data: data_len=16384, buf=B014566-N000001- 50775 TK: dispatch.c: ssh_dispatch_run: type=94 50776 TK: channels.c: channel_input_data: data_len=16384, buf=B014567-N000001- 50777 TK: channels.c: channel_handle_wfd(2): write: len=32768, buf=B014566-N000001- 50778 TK: dispatch.c: ssh_dispatch_run: type=94 50779 TK: channels.c: channel_input_data: data_len=16384, buf=B014568-N000001- 50780 TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B014568-N000001- 50781 TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, pollfd.revents=1 50782 TK: atomicio.c: atomicio6: fd=8, event=1, res=49152, errno=011, s[0]=B014565-N000001-, s[Q1-16]=B014565-N001024-B014566-N000001-, s[Q2-16]=B014566-N001024-B014567-N000001-, s[Q3-16]=B014567-N001024-B014568-N000001-, s[Q4-16]=B014568-N001024- 50783 TK: scp.c: sink: amt=65536, j=65536, i=238616576, size=1553121293, count=65536, buf=B014565-N000001- 50784 TK: dispatch.c: ssh_dispatch_run: type=94 50785 TK: channels.c: channel_input_data: data_len=16384, buf=B014569-N000001- 50786 TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B014569-N000001- 50787 TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011, s[0]=B014569-N000001-, s[Q1-16]=B014569-N001024-B014566-N000001-, s[Q2-16]=, s[Q3-16]=, s[Q4-16]= 50788 TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ... 50789 TK: dispatch.c: ssh_dispatch_run: type=94 50790 TK: channels.c: channel_input_data: data_len=16384, buf=B014570-N000001- 50791 TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B014570-N000001- 50792 TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, pollfd.revents=1 50793 TK: atomicio.c: atomicio6: fd=8, event=1, res=16384, errno=011, s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014566-N000001-, s[Q2-16]=, s[Q3-16]=, s[Q4-16]= 50794 TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011, s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014566-N000001-, s[Q2-16]=, s[Q3-16]=, s[Q4-16]= 50795 TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ... 50796 TK: dispatch.c: ssh_dispatch_run: type=94 50797 TK: channels.c: channel_input_data: data_len=16384, buf=B014571-N000001- 50798 TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B014571-N000001- 50799 TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, pollfd.revents=1 50800 TK: atomicio.c: atomicio6: fd=8, event=1, res=16384, errno=011, s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014571-N000001-, s[Q2-16]=B014571-N001024-B014567-N000001-, s[Q3-16]=, s[Q4-16]= 50801 TK: atomicio.c: atomicio6: fd=8, event=1, res=-0001, errno=011, s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014571-N000001-, s[Q2-16]=B014571-N001024-B014567-N000001-, s[Q3-16]=, s[Q4-16]= 50802 TK: atomicio.c: atomicio6: fd=8, event=1, entering poll ... 50803 TK: dispatch.c: ssh_dispatch_run: type=94 50804 TK: channels.c: channel_input_data: data_len=16384, buf=B014572-N000001- 50805 TK: dispatch.c: ssh_dispatch_run: type=94 50806 TK: channels.c: channel_input_data: data_len=16384, buf=B014573-N000001- 50807 TK: channels.c: channel_handle_wfd(2): write: len=32768, buf=B014572-N000001- 50808 TK: atomicio.c: atomicio6: fd=8, event=1, poll returned, pollfd.revents=1 50809 TK: atomicio.c: atomicio6: fd=8, event=1, res=32768, errno=011, s[0]=B014570-N000001-, s[Q1-16]=B014570-N001024-B014571-N000001-, s[Q2-16]=B014571-N001024-B014572-N000001-, s[Q3-16]=B014572-N001024-B014573-N000001-, s[Q4-16]=B014573-N001024- 50810 TK: scp.c: sink: amt=65536, j=65536, i=238682112, size=1553121293, count=65536, buf=B014570-N000001- . . . -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple