Re: Help with bad errors on 4.6.1

2018-03-09 Thread Sijie Guo
Enrico, I would suggest you apply my fixes and then debug from there. That way, you will have a better sense of where the first corruption comes from. Sijie On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli wrote: > On Fri, Mar 9, 2018, 19:30 Enrico Olivelli wrote: > > > Thank you Ivan! > > I

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
On Fri, Mar 9, 2018, 19:30 Enrico Olivelli wrote: > Thank you Ivan! > I hope I did not mess up the dump and added ZK ports. We are not using > standard ports, and on those 3 machines there is also the 3-node ZK > ensemble which supports BK and all the other parts of the application > > S

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
Thank you Ivan! I hope I did not mess up the dump and added ZK ports. We are not using standard ports, and on those 3 machines there is also the 3-node ZK ensemble which supports BK and all the other parts of the application. So one explanation would be that something is connecting to the boo

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli wrote: > Bookies > 10.168.10.117:1822 -> bad bookie with 4.1.21 > 10.168.10.116:1822 -> bookie with 4.1.12 > 10.168.10.118:1281 -> bookie with 4.1.12 > > 10.168.10.117 client machine on which I have 4.1.21 client (different > process than the bookie

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
Bookies 10.168.10.117:1822 -> bad bookie with 4.1.21 10.168.10.116:1822 -> bookie with 4.1.12 10.168.10.118:1281 -> bookie with 4.1.12 10.168.10.117 client machine on which I have 4.1.21 client (different process than the bookie one) Thanks Enrico 2018-03-09 15:16 GMT+01:00 Ivan Kelly : > On

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
Also, do you have the logs of the error occurring on the server side? -Ivan On Fri, Mar 9, 2018 at 3:16 PM, Ivan Kelly wrote: > On Fri, Mar 9, 2018 at 3:13 PM, Enrico Olivelli wrote: >> New dump, >> sequence (simpler) >> >> 1) system is running, reader is reading without errors with netty 4.1.2

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
On Fri, Mar 9, 2018 at 3:13 PM, Enrico Olivelli wrote: > New dump, > sequence (simpler) > > 1) system is running, reader is reading without errors with netty 4.1.21 > 2) 3 bookies, one is with 4.1.21 and the other ones with 4.1.12 > 3) kill one bookie with 4.1.12, the reader starts reading from th

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
New dump, sequence (simpler) 1) system is running, reader is reading without errors with netty 4.1.21 2) 3 bookies, one is with 4.1.21 and the other ones with 4.1.12 3) kill one bookie with 4.1.12, the reader starts reading from the bookie with 4.1.21 4) client messes up, unrecoverably Enrico 2

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
I've asked Enrico to run again, as this dump doesn't span the time when the issue started occurring. What I'm looking for is to be able to inspect the first packet which triggers the version downgrade of the decoders. On Fri, Mar 9, 2018 at 3:04 PM, Enrico Olivelli wrote: > This is the dump > >

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
On Fri, Mar 9, 2018, 14:12 Ivan Kelly wrote: > > Any suggestion on the tcpdump config ? (command line example) > > sudo tcpdump -s 200 -w blah.pcap 'tcp port 3181' > > Where are you going to change the netty? client or server or both? > Both, as the application is packaged as a single bundle.

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
> Any suggestion on the tcpdump config ? (command line example) sudo tcpdump -s 200 -w blah.pcap 'tcp port 3181' Where are you going to change the netty? client or server or both? -Ivan
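A small sketch of how the capture command above could be assembled for more than one port, since the thread mentions non-standard ports. The helper name `build_tcpdump_cmd` and its defaults are hypothetical and not part of tcpdump or BookKeeper; only the `-s` (snapshot length) and `-w` (output file) flags and the `tcp port N` filter come from the command in the message. The script only builds and prints the command, it does not run it.

```python
import shlex


def build_tcpdump_cmd(ports, snaplen=200, outfile="blah.pcap"):
    """Return the argv list for a capture limited to the given TCP ports.

    Hypothetical helper for illustration: it only builds the command.
    """
    if not ports:
        raise ValueError("at least one port is required")
    # pcap filter expression: 'tcp port A or tcp port B ...'
    pcap_filter = " or ".join(f"tcp port {int(p)}" for p in ports)
    # -s truncates each captured packet to snaplen bytes, -w writes a pcap file
    return ["sudo", "tcpdump", "-s", str(snaplen), "-w", outfile, pcap_filter]


if __name__ == "__main__":
    # Reproduces the command quoted in the message
    print(shlex.join(build_tcpdump_cmd([3181])))
```

For a bookie listening on another port, pass that port instead, for example `build_tcpdump_cmd([1822])`.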

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
2018-03-09 13:48 GMT+01:00 Ivan Kelly : > Great analysis Sijie. > > Enrico, are these high traffic machines? Would it be feasible to keep > tcpdump running? You could even truncate each message to 100 bytes or > so, to avoid storing payloads. It'd be very useful to see what the > corrupt traffic ac

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Ivan Kelly
Great analysis Sijie. Enrico, are these high traffic machines? Would it be feasible to keep tcpdump running? You could even truncate each message to 100 bytes or so, to avoid storing payloads. It'd be very useful to see what the corrupt traffic actually looks like. -Ivan On Fri, Mar 9, 2018 at 10

Re: Replication Worker and targetBookie.

2018-03-09 Thread Ivan Kelly
> The "predicate" approach is problematic, it can potentially cause some > ledgers to never be replicated. Ideally, this is something that should be done > by the auditor, because the auditor > knows the ledgers, the alive bookies and the network topology; the auditor > should be able to compute a replication plan

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
Reverted to Netty 4.1.12. System is "more" stable but after "some" restarts we still have errors on the client side on tailing readers; rebooting the JVM "resolved" the problem temporarily. I have no more errors on the Bookie side. My idea: - client is reading from 2 bookies, there is some bug in this a

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Enrico Olivelli
2018-03-09 8:59 GMT+01:00 Sijie Guo : > Sent out a PR for the issues that I observed: > > https://github.com/apache/bookkeeper/pull/1240 > Other findings: - my problem is not related to jdk9, it happens with jdk8 too - the "tailing reader" is able to make progress and follow the WAL, so not all

Re: Replication Worker and targetBookie.

2018-03-09 Thread Sijie Guo
On Thu, Mar 8, 2018 at 11:46 AM, Venkateswara Rao Jujjuri wrote: > On Thu, Mar 8, 2018 at 11:33 AM, Sijie Guo wrote: > > > On Thu, Mar 8, 2018 at 8:07 AM, Venkateswara Rao Jujjuri < > > jujj...@gmail.com> > > wrote: > > > > > On Thu, Mar 8, 2018 at 2:38 AM, Ivan Kelly wrote: > > > > > > > > Giv

Re: Help with bad errors on 4.6.1

2018-03-09 Thread Sijie Guo
Sent out a PR for the issues that I observed: https://github.com/apache/bookkeeper/pull/1240 On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo wrote: > So the problem here is: > > - a corrupted request failed the V3 request decoder, so bookie switched to > use v2 request decoder. Once the switch happe
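The preview above is cut off, so the following is only a toy model of the failure mode it starts to describe: a decoder that falls back from "V3" to "V2" on the first frame it cannot parse. It assumes the fallback is permanent for the connection, which the truncated message does not confirm. The class, the `V3:` prefix, and the frame contents are invented for illustration and are not BookKeeper's wire format or code.

```python
class StickyDecoder:
    """Toy model: try the 'v3' parser first; after the first failure,
    use the 'v2' parser for every later frame (assumed sticky)."""

    def __init__(self):
        self.use_v2 = False

    @staticmethod
    def _decode_v3(frame: bytes) -> str:
        # Hypothetical v3 framing: a literal 'V3:' prefix
        if not frame.startswith(b"V3:"):
            raise ValueError("not a v3 frame")
        return "v3 " + frame[3:].decode("ascii")

    @staticmethod
    def _decode_v2(frame: bytes) -> str:
        # Hypothetical v2 parser: accepts any bytes, so it never fails
        return "v2 " + frame.decode("ascii", errors="replace")

    def decode(self, frame: bytes) -> str:
        if not self.use_v2:
            try:
                return self._decode_v3(frame)
            except ValueError:
                # One bad frame downgrades the decoder for good
                self.use_v2 = True
        return self._decode_v2(frame)


if __name__ == "__main__":
    d = StickyDecoder()
    print(d.decode(b"V3:read"))    # parsed by the v3 path
    print(d.decode(b"\xffjunk"))   # corrupt frame triggers the downgrade
    print(d.decode(b"V3:read"))    # a valid v3 frame is now misread as v2
```

In this model a single corrupt frame is enough to make every later, well-formed V3 frame go through the wrong parser, which is why capturing the first bad packet matters.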