Latest findings: some good news, and some very bad news.

Good news: I was wrong, I did not switch the system back to Java 8 correctly.
The problem is on the bookie side and occurs only if the bookie is on Java 9.

Bad news: I have a fix, but not the root cause. The fix is to use unpooled ByteBufs in serializeProtobuf:

    private static ByteBuf serializeProtobuf(MessageLite msg, ByteBufAllocator allocator) {
        int size = msg.getSerializedSize();
        ByteBuf buf = Unpooled.buffer(size, size);
        ...

I will keep tracking down the cause. I think it is on the read path, but I am not sure.

On the client side we have a flag to disable pooled ByteBufs on the channel allocator. The most trivial fix at the moment is to do the same on the bookie side, as a hotfix for branch 4.6.

Before jumping to this extreme hotfix I will dig into the issue. Now that I know the problem occurs ONLY on Java 9 and only on the bookie, it will be simpler to find a reproducer.

The point remains that there is no failure in the other systems I have, nor in the test cases. Honestly, I have no Java 9 bookies in production, only Java 8 bookies. Maybe this is why no one has ever reported this problem from production.

Enrico

2018-03-14 17:27 GMT+01:00 Ivan Kelly <iv...@apache.org>:

> >> > @Ivan
> >> > I wonder if some tests on Jepsen with bookie restarts may find this kind of
> >> > issues, given that it is not a network/SO problem
> >> If jepsen can catch then normal integration test can.
>
> I attempted a repro for this using the integration test stuff.
> Running for 2-3 hours in a loop, no bug hit. Perhaps I'm not doing
> exactly what you are doing.
>
> https://github.com/ivankelly/bookkeeper/blob/enrico-bug/tests/integration/enrico-bug/src/test/java/org/apache/bookkeeper/tests/integration/TestEnricoBug.java
>
> -Ivan
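
Since the failure shows up only when the bookie runs on Java 9, it may help anyone attempting a reproducer to confirm which Java version the bookie process is really running on. The following is only a sketch of such a check: the class and method names (JavaRuntimeCheck, majorVersion, currentMajorVersion) are made up for illustration and are not part of BookKeeper. It relies only on the standard java.specification.version system property.

```java
public final class JavaRuntimeCheck {

    private JavaRuntimeCheck() {
    }

    /**
     * Parses a java.specification.version value into a major version number.
     * Java 8 and earlier report "1.x" (for example "1.8"); Java 9 and later
     * report the major number directly (for example "9" or "11").
     */
    public static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        if (parts.length > 1 && parts[0].equals("1")) {
            // Legacy "1.x" scheme: the major version is the second component.
            return Integer.parseInt(parts[1]);
        }
        // Java 9+ scheme: the first component is the major version.
        return Integer.parseInt(parts[0]);
    }

    /** Major Java version of the JVM this code is running in. */
    public static int currentMajorVersion() {
        return majorVersion(System.getProperty("java.specification.version"));
    }

    public static void main(String[] args) {
        // java.home shows which installation was actually picked up.
        System.out.println("Running on Java " + currentMajorVersion()
                + " (java.home=" + System.getProperty("java.home") + ")");
    }
}
```

Running the main method with the same java binary and environment used to start the bookie would show whether a switch back to Java 8 really took effect.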