Re: Help with bad errors on 4.6.1

2018-03-27 Thread Enrico Olivelli
End of this story. With this patch the problem no longer occurs: https://github.com/apache/bookkeeper/pull/1293 The patch does not address the problem directly; the root cause is still unknown, which is very bad. But with that change no error is reported anymore, so actually it is enough to g

Re: Help with bad errors on 4.6.1

2018-03-16 Thread Enrico Olivelli
resending (GMAIL webmail messed up prev email) 2018-03-16 10:34 GMT+01:00 Sijie Guo : > On Fri, Mar 16, 2018 at 2:26 AM, Enrico Olivelli > wrote: > > > Thank you Sijie, > > I have already applied a similar patch to my local code based on 4.6.1 > but > > the problem remains. > > > > What do you

Re: Help with bad errors on 4.6.1

2018-03-16 Thread Enrico Olivelli
2018-03-16 10:34 GMT+01:00 Sijie Guo : > On Fri, Mar 16, 2018 at 2:26 AM, Enrico Olivelli > wrote: > > > Thank you Sijie, > > I have already applied a similar patch to my local code based on 4.6.1 > but > > the problem remains. > > > > What do you mean "the problem" here? > missing buf.release()

Re: Help with bad errors on 4.6.1

2018-03-16 Thread Sijie Guo
On Fri, Mar 16, 2018 at 2:26 AM, Enrico Olivelli wrote: > Thank you Sijie, > I have already applied a similar patch to my local code based on 4.6.1 but > the problem remains. > What do you mean "the problem" here? Do you mean the corruption problem or the leaking problem? The change I pointed o

Re: Help with bad errors on 4.6.1

2018-03-16 Thread Enrico Olivelli
Thank you Sijie, I have already applied a similar patch to my local code based on 4.6.1 but the problem remains. I am looking into Netty allocateUninitializedArray, which is used for Pooled Heap Buffers. You all know the BK code better than I do, is there any point in which we assume that the buffer

Re: Help with bad errors on 4.6.1

2018-03-16 Thread Ivan Kelly
> With "paranoid" log in Netty I found this that is very interesting, but it > happens even on Java 8. I don't think leaks are the problem here though. This seems to be more like a double-free issue. -Ivan
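To make the "double-free" suspicion concrete, here is a minimal sketch of a reference-counted buffer that rejects a second release. The class `RefCountedSketch` is hypothetical and written only for illustration; it is not Netty's ByteBuf, although Netty's reference-counted objects follow the same general idea of failing once the count has already reached zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy reference-counted buffer (hypothetical, not Netty's implementation). */
final class RefCountedSketch {
    // A freshly allocated buffer starts with one owner.
    private final AtomicInteger refCnt = new AtomicInteger(1);

    /** Drops one reference; returns true when the last reference is gone. */
    boolean release() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                // Releasing an already freed buffer is the "double free".
                throw new IllegalStateException("double free: refCnt is already 0");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                return current == 1;
            }
        }
    }

    public static void main(String[] args) {
        RefCountedSketch buf = new RefCountedSketch();
        buf.release(); // legitimate release by the owner
        try {
            buf.release(); // a second code path releases the same buffer
        } catch (IllegalStateException e) {
            System.out.println("second release rejected: " + e.getMessage());
        }
    }
}
```

A leak is the opposite failure (a release that never happens), which is why leak detection alone would not necessarily point at this kind of bug.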

Re: Help with bad errors on 4.6.1

2018-03-16 Thread Sijie Guo
On Fri, Mar 16, 2018 at 1:24 AM, Enrico Olivelli wrote: > 2018-03-15 12:02 GMT+01:00 Enrico Olivelli : > > > > > > > 2018-03-15 11:13 GMT+01:00 Ivan Kelly : > > > >> > What is the difference in Channel#write/ByteBuf pooling in Java 9? > >> Sounds like it could be an issue in netty itself.

Re: Help with bad errors on 4.6.1

2018-03-16 Thread Enrico Olivelli
2018-03-15 12:02 GMT+01:00 Enrico Olivelli : > > > 2018-03-15 11:13 GMT+01:00 Ivan Kelly : > >> > What is the difference in Channel#write/ByteBuf pooling in Java 9? >> Sounds like it could be an issue in netty itself. Java 9 removed a >> bunch of stuff around Unsafe, which I'm pretty sure net

Re: Help with bad errors on 4.6.1

2018-03-15 Thread Enrico Olivelli
2018-03-15 11:13 GMT+01:00 Ivan Kelly : > > What is the difference in Channel#write/ByteBuf pooling in Java 9? > Sounds like it could be an issue in netty itself. Java 9 removed a > bunch of stuff around Unsafe, which I'm pretty sure netty was using > for ByteBuf. Have you tried setting the p

Re: Help with bad errors on 4.6.1

2018-03-15 Thread Ivan Kelly
> What is the difference in Channel#write/ByteBuf pooling in Java 9? Sounds like it could be an issue in netty itself. Java 9 removed a bunch of stuff around Unsafe, which I'm pretty sure netty was using for ByteBuf. Have you tried setting the pool debugging to paranoid? -Dio.netty.leakDetect

Re: Help with bad errors on 4.6.1

2018-03-15 Thread Enrico Olivelli
Very latest news: I have narrowed the problem to ResponseEnDecoderV3#encode; when I use UnpooledByteBufAllocator.DEFAULT instead of the allocator from the channel, the error disappears. So the problem is in the encoding of the responses, using Java 9 and Pooled Byte Bufs. This is compatible with the e
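The thread never identifies the root cause, so the following is only a sketch of one general way in which switching from pooled to unpooled buffers can hide a buffer-ownership bug. `PooledReuseSketch` and `ToyPool` are hypothetical names invented for this example and do not reflect Netty's or BookKeeper's code: a pooled allocator hands a released array to the next caller, so a buffer released too early gets overwritten, while a fresh allocation per request leaves the old bytes intact.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy comparison of pooled and unpooled allocation (hypothetical sketch). */
final class PooledReuseSketch {

    /** Toy pool: a released array is handed out again, contents untouched. */
    static final class ToyPool {
        private final Deque<byte[]> free = new ArrayDeque<>();

        byte[] allocate(int size) {
            byte[] cached = free.poll();
            return (cached != null && cached.length == size) ? cached : new byte[size];
        }

        void release(byte[] buf) {
            free.push(buf);
        }
    }

    private static void write(byte[] buf, String text) {
        byte[] src = text.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, buf, 0, Math.min(src.length, buf.length));
    }

    /** Returns what a delayed reader sees in the first response buffer. */
    static String demo(boolean pooled) {
        ToyPool pool = new ToyPool();
        byte[] first = pooled ? pool.allocate(8) : new byte[8];
        write(first, "RESP-001");
        // Assumed bug: the buffer is released before the pending write consumed it.
        if (pooled) {
            pool.release(first);
        }
        // A second response is encoded while the first is still pending.
        byte[] second = pooled ? pool.allocate(8) : new byte[8];
        write(second, "RESP-002");
        // The pending write finally reads the first buffer.
        return new String(first, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("pooled:   " + demo(true));
        System.out.println("unpooled: " + demo(false));
    }
}
```

In the pooled run the delayed reader sees the bytes of the second response, which would look like a corrupted frame on the wire; in the unpooled run the premature release has no visible effect, so the symptom disappears even though the ownership bug would still be there.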

Re: Help with bad errors on 4.6.1

2018-03-14 Thread Enrico Olivelli
Latest findings, some good news, and some very bad. Good news: I was wrong, I did not switch the system back to Java 8 correctly. The problem is on the Bookie side and occurs only if the bookie is on Java 9. Bad news: I have a fix. The fix is to use Unpooled ByteBufs in serializeProtobuf: private stat

Re: Help with bad errors on 4.6.1

2018-03-14 Thread Ivan Kelly
>> > @Ivan >> > I wonder if some tests on Jepsen with bookie restarts may find this kind >> of >> > issues, given that it is not a network/OS problem >> If jepsen can catch then normal integration test can. I attempted a repro for this using the integration test stuff. Running for 2-3 hours in a l

Re: Help with bad errors on 4.6.1

2018-03-13 Thread Enrico Olivelli
2018-03-13 17:19 GMT+01:00 Ivan Kelly : > > @Ivan > > I wonder if some tests on Jepsen with bookie restarts may find this kind > of > > issues, given that it is not a network/OS problem > If jepsen can catch then normal integration test can. The readers in > question, are they tailing with long po

Re: Help with bad errors on 4.6.1

2018-03-13 Thread Ivan Kelly
> @Ivan > I wonder if some tests on Jepsen with bookie restarts may find this kind of > issues, given that it is not a network/OS problem If jepsen can catch then normal integration test can. The readers in question, are they tailing with long poll, or just calling readLastAddConfirmed in a loop? W

Re: Help with bad errors on 4.6.1

2018-03-13 Thread Enrico Olivelli
Findings of today: A - the system fails even with BK 4.6.0 B - we have moved all the clients and the bookies to different machines (keeping the same ZK cluster), same problem C - I have copies of the application which are running on other similar machines (on the same Blade/VMWare system) D - I hav

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Enrico Olivelli
On Mon, 12 Mar 2018, 20:40 Ivan Kelly wrote: > > It is interesting that the problem is on 'readers' and it seems that the > > PCBC seems corrupted and even writes (if the broker is promoted to > > 'leader') are able to go on after the reads broke the client. > Are writes coming from the same

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Ivan Kelly
> It is interesting that the problem is on 'readers' and it seems that the > PCBC seems corrupted and even writes (if the broker is promoted to > 'leader') are able to go on after the reads broke the client. Are writes coming from the same clients? Or clients in the same process? -Ivan

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Enrico Olivelli
On Mon, 12 Mar 2018, 19:37 Sijie Guo wrote: > Thanks Enrico! > > On Mon, Mar 12, 2018 at 4:21 AM, Enrico Olivelli > wrote: > > > Summary of my findings: > > > > The problem is about clients which get messed up and are not able to read > > and write to bookies after rolling restarts of an app

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Sijie Guo
Thanks Enrico! On Mon, Mar 12, 2018 at 4:21 AM, Enrico Olivelli wrote: > Summary of my findings: > > The problem is about clients which get messed up and are not able to read > and write to bookies after rolling restarts of an application, > the problem appears only on a cluster of 6 machines (r

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Ivan Kelly
> - when I "restart" bookies I issue a kill -9 (I think this could be the > reason why I can't reproduce the issue on testcases) With a clean shutdown of bookies we close the channels, and it should do the tcp shutdown handshake. -9 will kill the process before it gets to do any of that, but the ke

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Enrico Olivelli
Summary of my findings: The problem is about clients which get messed up and are not able to read and write to bookies after rolling restarts of an application, the problem appears only on a cluster of 6 machines (reduced to 3 in order to narrow down the search) of my colleagues which are performi

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Enrico Olivelli
I will send a report soon. With the new debug output I have some findings; I am looking into problems during restarts of bookies. Maybe there is some problem in error handling in PCBC. Thank you Enrico 2018-03-12 10:58 GMT+01:00 Ivan Kelly : > Enrico, could you summarize what the state of things is now? Wha

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Ivan Kelly
Enrico, could you summarize what the state of things is now? What are you running, what problems are you seeing and how are the problems manifesting themselves. Regards, Ivan On Mon, Mar 12, 2018 at 10:15 AM, Enrico Olivelli wrote: > Applied Sijie's fixes and added some debug: > > Problem is tri

Re: Help with bad errors on 4.6.1

2018-03-12 Thread Enrico Olivelli
Applied Sijie's fixes and added some debug: Problem is triggered when you restart a bookie (I have a cluster of 3 bookies, WQ = 2 and AQ = 2) Below a new error on client side ("tailing" reader) Enrico this is a new error on client side: 18-03-12-09-11-45Unexpected exception caught by bookie
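For readers unfamiliar with the WQ/AQ setup, the sketch below shows why restarting one bookie out of three touches most entries. It assumes an ensemble size of 3 (the message only states 3 bookies, WQ = 2 and AQ = 2) and simple round-robin striping of entries across the ensemble; the class `WriteSetSketch` and the index of the restarted bookie are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy write-set calculation under assumed round-robin striping. */
final class WriteSetSketch {

    /** Indexes of the bookies that store the given entry. */
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            bookies.add((int) ((entryId + i) % ensembleSize));
        }
        return bookies;
    }

    public static void main(String[] args) {
        int ensembleSize = 3; // assumption: all 3 bookies form the ensemble
        int writeQuorum = 2;  // WQ from the message
        int ackQuorum = 2;    // AQ from the message
        int restarted = 1;    // hypothetical index of the restarted bookie

        for (long entry = 0; entry < 3; entry++) {
            List<Integer> set = writeSet(entry, ensembleSize, writeQuorum);
            int reachable = set.contains(restarted) ? writeQuorum - 1 : writeQuorum;
            System.out.println("entry " + entry + " -> bookies " + set
                    + ", enough acks without the restarted bookie: "
                    + (reachable >= ackQuorum));
        }
    }
}
```

Under these assumptions two out of every three entries include the restarted bookie in their write set, and with AQ equal to WQ none of those can be acknowledged until that bookie returns or is replaced, so a restart exercises the client's error-handling path heavily.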

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Sijie Guo
Enrico, I would suggest you apply my fixes and then debug from there. That way, you will have a better sense of where the first corruption comes from. Sijie On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli wrote: > On Fri, 9 Mar 2018, 19:30 Enrico Olivelli wrote: > > > Thank you Ivan! > > I

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
On Fri, 9 Mar 2018, 19:30 Enrico Olivelli wrote: > Thank you Ivan! > I hope I did not mess up the dump and added ZK ports. We are not using > standard ports, and on those 3 machines there is also the 3-node ZK > ensemble which supports BK and all the other parts of the application > > S

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
Thank you Ivan! I hope I did not mess up the dump and added ZK ports. We are not using standard ports, and on those 3 machines there is also the 3-node ZK ensemble which supports BK and all the other parts of the application. So one explanation would be that something is connecting to the boo

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli wrote: > Bookies > 10.168.10.117:1822 -> bad bookie with 4.1.21 > 10.168.10.116:1822 -> bookie with 4.1.12 > 10.168.10.118:1281 -> bookie with 4.1.12 > > 10.168.10.117 client machine on which I have 4.1.21 client (different > process than the bookie

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
Bookies 10.168.10.117:1822 -> bad bookie with 4.1.21 10.168.10.116:1822 -> bookie with 4.1.12 10.168.10.118:1281 -> bookie with 4.1.12 10.168.10.117 client machine on which I have 4.1.21 client (different process than the bookie one) Thanks Enrico 2018-03-09 15:16 GMT+01:00 Ivan Kelly : > On

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
Also, do you have the logs of the error occurring on the server side? -Ivan On Fri, Mar 9, 2018 at 3:16 PM, Ivan Kelly wrote: > On Fri, Mar 9, 2018 at 3:13 PM, Enrico Olivelli wrote: >> New dump, >> sequence (simpler) >> >> 1) system is running, reader is reading without errors with netty 4.1.2

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
On Fri, Mar 9, 2018 at 3:13 PM, Enrico Olivelli wrote: > New dump, > sequence (simpler) > > 1) system is running, reader is reading without errors with netty 4.1.21 > 2) 3 bookies, one is with 4.1.21 and the other ones with 4.1.12 > 3) kill one bookie with 4.1.12, the reader starts reading from th

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
New dump, sequence (simpler) 1) system is running, reader is reading without errors with netty 4.1.21 2) 3 bookies, one is with 4.1.21 and the other ones with 4.1.12 3) kill one bookie with 4.1.12, the reader starts reading from the bookie with 4.1.21 4) client messes up, unrecoverably Enrico 2

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
I've asked enrico to run again, as this dump doesn't span the time when the issue started occurring. What I'm looking for is to be able to inspect the first packet which triggers the version downgrade of the decoders. On Fri, Mar 9, 2018 at 3:04 PM, Enrico Olivelli wrote: > This is the dump > >

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
On Fri, 9 Mar 2018, 14:12 Ivan Kelly wrote: > > Any suggestion on the tcpdump config? (command line example) > > sudo tcpdump -s 200 -w blah.pcap 'tcp port 3181' > > Where are you going to change the netty? client or server or both? > Both, as the application is packaged as a single bundle.

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
> Any suggestion on the tcpdump config ? (command line example) sudo tcpdump -s 200 -w blah.pcap 'tcp port 3181' Where are you going to change the netty? client or server or both? -Ivan

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
2018-03-09 13:48 GMT+01:00 Ivan Kelly : > Great analysis Sijie. > > Enrico, are these high traffic machines? Would it be feasible to put > tcpdump running? You could even truncate each message to 100 bytes or > so, to avoid storing payloads. It'd be very useful to see what the > corrupt traffic ac

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
Great analysis Sijie. Enrico, are these high traffic machines? Would it be feasible to put tcpdump running? You could even truncate each message to 100 bytes or so, to avoid storing payloads. It'd be very useful to see what the corrupt traffic actually looks like. -Ivan On Fri, Mar 9, 2018 at 10

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
Reverted to Netty 4.1.12. The system is "more" stable, but after "some" restarts we still have errors on the client side on tailing readers; rebooting the JVM temporarily "resolved" the problem. I have no more errors on the Bookie side. My idea: - client is reading from 2 bookies, there is some bug in this a

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
2018-03-09 8:59 GMT+01:00 Sijie Guo : > Sent out a PR for the issues that I observed: > > https://github.com/apache/bookkeeper/pull/1240 > Other findings: - my problem is not related to jdk9, it happens with jdk8 too - the "tailing reader" is able to make progress and follow the WAL, so not all

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Sijie Guo
Sent out a PR for the issues that I observed: https://github.com/apache/bookkeeper/pull/1240 On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo wrote: > So the problem here is: > > - a corrupted request failed the V3 request decoder, so bookie switched to > use v2 request decoder. Once the switch happe
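The failure mode described in the quoted analysis (one corrupted request makes the bookie fall back from the V3 decoder to the V2 decoder) can be sketched as a sticky, per-connection flag. Everything in this example is hypothetical: the class `StickyDowngradeSketch`, the "V3:" prefix used as a stand-in for a frame that parses as V3, and the string results. It only illustrates why a single bad frame can make every later, perfectly valid frame on the same connection get misread.

```java
/** Toy per-connection decoder with a sticky protocol downgrade (hypothetical). */
final class StickyDowngradeSketch {
    // Per-connection state: starts optimistic and is never reset.
    private boolean usingV3 = true;

    /**
     * Hypothetical framing: a frame starting with "V3:" parses as V3,
     * anything else makes the V3 parse fail.
     */
    String decode(String frame) {
        if (usingV3) {
            if (frame.startsWith("V3:")) {
                return "v3 ok: " + frame.substring(3);
            }
            // One unparsable frame flips the connection to V2 for good.
            usingV3 = false;
        }
        return "v2 guess: " + frame;
    }

    public static void main(String[] args) {
        StickyDowngradeSketch decoder = new StickyDowngradeSketch();
        System.out.println(decoder.decode("V3:read"));   // parsed as V3
        System.out.println(decoder.decode("garbage"));   // corrupted frame, downgrade
        System.out.println(decoder.decode("V3:read"));   // valid V3, now misread as V2
    }
}
```

This is also why capturing the first frame that triggers the downgrade matters more than the later errors: once the flag has flipped, the subsequent failures are only a consequence.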

Re: Help with bad errors on 4.6.1

2018-03-08 Thread Enrico Olivelli
(switch to dev@) @Sijie very good explanation. I am back to work and we have found errors even on a client reader which is performing tailing reads -03-09-08-34-19io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: Received unknown response : op code = 9 io.netty.handle