Thanks for giving this some thought.

> I don't think this can directly be the culprit, because that ssh's stdout
> will be hooked to a pipe talking to Git, not to the original stdout of
> "git fetch". It should not have even received a descriptor that is a copy
> of the original stdout (nor stdin), since those would have been closed as
> part of the fork+exec.
> 
> The child ssh _does_ have access to the original stderr, which could plausibly
> be a dup of the original stdout. But your strace shows ssh setting the flag
> only for stdin/stdout.

I wondered about that too.  I also wondered why we only have this problem
when doing builds with Jenkins.  The same error has never happened when doing
builds manually as far as I know.  However, stracing the build while it is
running under Jenkins is difficult, so my strace output is from a manual run.
It turns out that ssh only sets non-blocking mode on a descriptor if that
descriptor does not refer to a TTY.  The code in function ssh_session2_open()
looks like:

        if (stdin_null_flag) {
                in = open(_PATH_DEVNULL, O_RDONLY);
        } else {
                in = dup(STDIN_FILENO);
        }
        out = dup(STDOUT_FILENO);
        err = dup(STDERR_FILENO);

        /* enable nonblocking unless tty */
        if (!isatty(in))
                set_nonblock(in);
        if (!isatty(out))
                set_nonblock(out);
        if (!isatty(err))
                set_nonblock(err);

When I collected that strace output, I had stdout redirected to a pipe to my
workaround program, but I did not redirect stderr.  So ssh made stdout
non-blocking, but since stderr was still connected to my terminal, it didn't
touch that.  But when this build is run under Jenkins, both stdout and stderr
are connected to a pipe that Jenkins creates to collect output from the build.
I assume that when git runs ssh, it redirects only ssh's stdout to its own
pipe and leaves ssh's stderr alone.  So I think ssh will be messing with both
pipes when this build is run under Jenkins.
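
A minimal sketch of why a flag set by ssh could matter to make, assuming the
usual POSIX behaviour that O_NONBLOCK lives in the open file description,
which is shared by descriptors created with dup() and inherited across fork().
This is not code from ssh or git, and the function names are made up for
illustration.  A child sets the flag on a dup of a pipe's write end, and the
parent then checks its own descriptor:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if fd has O_NONBLOCK set, 0 if not, -1 on error. */
int is_nonblocking(int fd)
{
        int flags = fcntl(fd, F_GETFL);

        if (flags == -1)
                return -1;
        return (flags & O_NONBLOCK) ? 1 : 0;
}

/*
 * Hypothetical demo: a child process dup()s the write end of a pipe and
 * sets O_NONBLOCK on the copy, much as the ssh code does.  Returns
 * the state of the PARENT's descriptor afterwards (1 = non-blocking),
 * or -1 on error.
 */
int nonblock_leaks_from_child(void)
{
        int fds[2];
        int status, after;
        pid_t pid;

        if (pipe(fds) == -1)
                return -1;
        assert(is_nonblocking(fds[1]) == 0);  /* a fresh pipe is blocking */

        pid = fork();
        if (pid == -1) {
                close(fds[0]);
                close(fds[1]);
                return -1;
        }
        if (pid == 0) {
                /* Child: same steps as dup() + set_nonblock(). */
                int copy = dup(fds[1]);
                int flags = (copy == -1) ? -1 : fcntl(copy, F_GETFL);

                if (flags == -1 ||
                    fcntl(copy, F_SETFL, flags | O_NONBLOCK) == -1)
                        _exit(1);
                _exit(0);
        }

        if (waitpid(pid, &status, 0) == -1)
                status = -1;
        /* The child has exited, but the flag stays on the shared pipe. */
        after = is_nonblocking(fds[1]);
        close(fds[0]);
        close(fds[1]);
        if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
                return -1;
        return after;
}
```

If the assumption holds, the parent's descriptor reports non-blocking even
though the parent never touched its own flags, which would be the position
make is in when it shares the Jenkins pipe with ssh.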

Now that I have a fairly good understanding of what's happening, I think I can
work around this occasional error by redirecting git's stderr to a file or
something like that.  But it's taken us a long time to figure this out, so I
wonder if a more permanent fix shouldn't be implemented, so others don't run
into the same problem.  A Google search for "make: write error" indicates that
we're not the first to have this problem with parallel builds, although in the
other cases I've found, a specific version of the Linux kernel was being
blamed.  Maybe that was a different problem.
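
The redirection workaround could be sketched roughly as follows.  This is only
an illustration: run_with_stderr_to_file() is a hypothetical helper, not
something that exists in git or Jenkins, and a plain shell redirection would
do the same job.  The idea is that the command gets a private file as stderr,
so anything ssh does to that descriptor cannot reach the pipe that make
writes to:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] (looked up in PATH) with its stderr
 * appended to the file at `path`.  Returns the command's exit status,
 * or -1 if it could not be run or did not exit normally.
 */
int run_with_stderr_to_file(const char *path, char *const argv[])
{
        int status;
        pid_t pid;

        assert(path && argv && argv[0]);

        pid = fork();
        if (pid == -1)
                return -1;
        if (pid == 0) {
                /* Child: point stderr at the file, then exec the command. */
                int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

                if (fd == -1 || dup2(fd, STDERR_FILENO) == -1)
                        _exit(127);
                if (fd != STDERR_FILENO)
                        close(fd);
                execvp(argv[0], argv);
                _exit(127);  /* only reached if exec failed */
        }

        if (waitpid(pid, &status, 0) == -1)
                return -1;
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With argv holding the git command line, a call such as
run_with_stderr_to_file("git-stderr.log", argv) (the file name is made up)
would keep ssh away from the shared pipe, at the cost of moving git's error
messages out of the build log.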

I guess git could work around this by redirecting stderr, but the real problem
is probably with ssh, although it's not clear to me what it should do
differently.  It does seem somehow backwards to me that it only makes a
descriptor non-blocking if it doesn't refer to a TTY, but it does the same
thing in at least three different places, so I guess that's not a mistake.
