Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
[1]. Since then OSX build jobs fail rather frequently because of a
SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
corrupt idx' in 't5570-git-daemon.sh' [2]. I think this is a symptom of
a real bug in Git affecting other platforms as well, but these tests
are too lax to catch it.
What it boils down to is this sequence:
- The test first prepares a repository containing a corrupt pack,
ready to be served via 'git daemon'.
- Then the test runs 'test_must_fail git fetch ....', which connects
to 'git daemon', which forks 'git upload-pack', which then
advertises refs (only HEAD) and capabilities. So far so good.
- 'git fetch' eventually calls fetch-pack.c:find_common(). The
first half of this function assembles a request consisting of a
want and a flush pkt-line, and sends it via a send_request() call.
At this point the scheduling becomes important: let's suppose that
fetch is slow and upload-pack is fast.
- 'git upload-pack' receives the request, parses the want line,
notices the corrupt pack, responds with an 'ERR upload-pack: not
our ref' pkt-line, and die()s right away.
- 'git fetch' finally approaches the end of the function, where it
attempts to send a done pkt-line via another send_request() call
through the now closing TCP socket.
- What happens now seems to depend on the platform:
- On Linux, both on my machine and on Travis CI, it shows textbook
example behaviour: write() returns with error and sets errno to
ECONNRESET. Since it happens in write_or_die(), 'git fetch'
die()s with 'fatal: write error: Connection reset by peer', and
doesn't show the error sent by 'git upload-pack'; how could it,
it doesn't even get as far as receiving upload-pack's ERR
pkt-line.
The test only checks that 'git fetch' fails, but it doesn't
check whether it failed with the right error message, so the
test still succeeds. Had it checked the error message as well,
we would most likely have noticed this issue already, as it doesn't
happen all that rarely.
- On the new OSX images with XCode 9.4 on Travis CI the write()
triggers SIGPIPE right away, and 'test_must_fail' notices it and
fails the test. I couldn't see any sign of an ECONNRESET or any
other error that we could act upon to avoid the SIGPIPE.
- On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
ECONNRESET, but sending the request actually succeeds even
though there is no process on the other end of the socket
anymore. 'git fetch' then simply continues execution, reads and
parses the ERR pkt-line, and then die()s with 'fatal: remote
error: upload-pack: not our ref'. So, on the face of it, it
shows the desired behaviour, but I have no idea how that write()
could succeed instead of returning an error.
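
To make the sequence above easier to follow, here is a minimal Python sketch of the same exchange. It is not Git's code and it does not reproduce the platform differences: it assumes a local Unix socket pair instead of the TCP connection to 'git daemon', runs both sides in one process so the "fast upload-pack, slow fetch" ordering is forced, and uses an all-zero object id as a placeholder. The helper names (pkt_line, read_pkt_line, simulate_race) are hypothetical. It only illustrates that the ERR pkt-line can already be waiting in the receive buffer at the moment the late write of 'done' fails.

```python
import errno
import socket

FLUSH = b"0000"  # flush-pkt: a bare length field of zero, no payload


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line: 4 hex digits of total length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


def recv_exact(sock, n):
    """Read exactly n bytes, or fewer if the peer has closed the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf


def read_pkt_line(sock):
    """Return the next payload, or None for a flush-pkt or end of stream."""
    header = recv_exact(sock, 4)
    if len(header) < 4 or header == FLUSH:
        return None
    return recv_exact(sock, int(header, 16) - 4)


def simulate_race():
    """Replay the 'slow fetch, fast upload-pack' ordering on a socket pair."""
    fetch, upload_pack = socket.socketpair()

    # fetch: first half of the request, a want line (placeholder id) and a flush.
    fetch.sendall(pkt_line(b"want " + b"0" * 40 + b"\n") + FLUSH)

    # upload-pack: read the request, answer with ERR, and exit right away.
    read_pkt_line(upload_pack)  # the want line
    read_pkt_line(upload_pack)  # the flush-pkt
    upload_pack.sendall(pkt_line(b"ERR upload-pack: not our ref"))
    upload_pack.close()

    # fetch: only now gets around to sending 'done'. Python ignores SIGPIPE
    # by default, so a failed write surfaces as an exception, not a signal.
    try:
        fetch.sendall(pkt_line(b"done\n"))
        write_errno = None
    except OSError as e:
        write_errno = e.errno

    # Whatever happened to the write, try to read the pending reply.
    reply = read_pkt_line(fetch)
    fetch.close()
    return write_errno, reply


if __name__ == "__main__":
    write_errno, reply = simulate_race()
    name = errno.errorcode.get(write_errno, "no error")
    print("write of 'done':", name)
    print("pending reply:", reply)
```

In this simplified setup the reader side ignores the outcome of the write and still looks at the receive buffer, which is the part that write_or_die() never gets to. Which errno the write reports on a Unix socket pair need not match what a TCP socket reports on any of the platforms discussed above.
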
I don't know what happens on a real Mac as I don't have access to one;
I figured out all the above by enabling packet tracing, adding a
couple of well placed tracing printf() and sleep() calls, running a
bunch of builds on Travis CI, and looking through their logs. But
without access to a debugger and netstat and what not I can't really
go any further. So I would now happily pass the baton to those who
have a Mac and know a thing or two about its porting issues to first
check whether OSX on a real Mac shows the same behaviour as it does in
Travis CI's virtualized(?) environment. And then they can pass the
baton to those who know all the intricacies of the pack protocol and
its implementation to decide what to do with this issue.
For a mostly reliable reproduction recipe you might want to fetch this
branch:
https://github.com/szeder/git t5570-git-daemon-sigpipe
and then run 'make && cd t && ./t5570-git-daemon.sh -v -x'
Have fun! ;)
[1] https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce
[2] On git.git's master:
https://travis-ci.org/git/git/jobs/411517552#L2717