I've encountered an ssh2 channel protocol issue when a ppc64 slave communicates with an x64 master.
Most operations, such as sending build logs, work fine. But when the time comes to upload artifacts at the end of the build, the build stalls indefinitely at:
If I get stack dumps of slave and master using jstack, I see the master waiting to read from the slave:
"Channel reader thread: Fedora16-ppc64-Power7-osuosl-karman" prio=10 tid=0x00000000038c2800 nid=0x6de7 in Object.wait() [0x00007f825ef8b000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
at java.lang.Object.wait(Object.java:502)
at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
- locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
- locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
and the slave is waiting for data from the master:
"Channel reader thread: channel" prio=10 tid=0x00000fff940fedd0 nid=0x558e runnable [0x00000fff6dc6d000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:236)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
- locked <0x00000fff78ba9f98> (a java.io.BufferedInputStream)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
Of course, I can't take both dumps at exactly the same moment (and given network latency and buffering, that might not be meaningful anyway), but repeated runs never show any other state for either thread.
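To make the "repeated runs" comparison less manual, a small helper can pull the state of each thread out of saved jstack output. This is only a sketch: the class name `DumpStates` is made up for illustration, and it assumes the dump format shown above (a quoted thread name on the header line, followed by a `java.lang.Thread.State:` line).

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jenkins or jstack.
public class DumpStates {
    // A thread header line starts with the quoted thread name.
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\".*");
    // The state line follows the header, possibly indented.
    private static final Pattern STATE =
            Pattern.compile("^\\s*java\\.lang\\.Thread\\.State: (\\S+).*");

    /** Maps each thread name found in the dump text to its reported state. */
    public static Map<String, String> parse(String dump) {
        Map<String, String> states = new LinkedHashMap<>();
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                current = header.group(1);
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (current != null && state.matches()) {
                states.put(current, state.group(1));
                current = null;
            }
        }
        return states;
    }

    /** Prints the state of every channel reader thread in each dump file given. */
    public static void main(String[] args) throws Exception {
        for (String file : args) {
            String dump = new String(Files.readAllBytes(Paths.get(file)));
            for (Map.Entry<String, String> e : parse(dump).entrySet()) {
                if (e.getKey().startsWith("Channel reader thread")) {
                    System.out.println(file + ": " + e.getKey() + " -> " + e.getValue());
                }
            }
        }
    }
}
```

Running it over a series of dumps saved from the master and the slave would show at a glance whether either reader thread ever leaves its WAITING or RUNNABLE state.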
tshark shows that there's some SSH chatter going on:
0.000000 SLAVE -> MASTER SSH 126 Encrypted response packet len=60
0.176121 MASTER -> SLAVE SSH 94 Encrypted request packet len=28
0.176151 SLAVE -> MASTER TCP 66 ssh > 37501 [ACK] Seq=61 Ack=29 Win=707 Len=0 TSval=4141397874 TSecr=2808266826
but it may well be low-level SSH keepalives or similar, as it occurs at precise 5-second intervals with nothing much else happening. There are three master->slave SSH connections, so there's no guarantee that this is even the one associated with the stuck channel.
My first thought is endianness, since ppc64 is big-endian and x64 is little-endian.
I don't really know how to begin debugging this issue, though.
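One way to start on the endianness theory is to see what a byte-order mismatch would look like for a length header. The sketch below is purely illustrative: the 2-byte length header and the `EndianCheck` class are assumptions for the demo, not the actual hudson.remoting or trilead wire format. Note that `DataOutputStream` and a default `ByteBuffer` are always big-endian in Java regardless of platform, so a mismatch would only be expected where code uses `ByteOrder.nativeOrder()` or goes through native code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class; the 2-byte header is an assumed format.
public class EndianCheck {
    /** Encodes a 16-bit length using an explicit byte order. */
    static byte[] encode(int length, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putShort((short) length).array();
    }

    /** Decodes an unsigned 16-bit length using an explicit byte order. */
    static int decode(byte[] header, ByteOrder order) {
        return ByteBuffer.wrap(header).order(order).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        // Reports BIG_ENDIAN or LITTLE_ENDIAN for the machine running the JVM.
        System.out.println("native order: " + ByteOrder.nativeOrder());

        byte[] header = encode(60, ByteOrder.BIG_ENDIAN);
        System.out.println("same order:  " + decode(header, ByteOrder.BIG_ENDIAN));
        System.out.println("mixed order: " + decode(header, ByteOrder.LITTLE_ENDIAN));
    }
}
```

If something like this were happening, the reader would expect far more bytes than the writer ever sends (15360 instead of 60 in this example) and would block waiting for the rest, which would at least be consistent with both sides sitting in a read.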