End of this story
With this patch the problem does not occur anymore
https://github.com/apache/bookkeeper/pull/1293
The patch does not directly address the problem; the root cause is still
unknown, which is very bad. But with that change no error is reported
anymore, so actually it is enough to g
resending (GMAIL webmail messed up prev email)
2018-03-16 10:34 GMT+01:00 Sijie Guo :
> On Fri, Mar 16, 2018 at 2:26 AM, Enrico Olivelli
> wrote:
>
> > Thank you Sijie,
> > I have already applied a similar patch to my local code based on 4.6.1
> but
> > the problem remains.
> >
>
> What do you mean "the problem" here?
2018-03-16 10:34 GMT+01:00 Sijie Guo :
> On Fri, Mar 16, 2018 at 2:26 AM, Enrico Olivelli
> wrote:
>
> > Thank you Sijie,
> > I have already applied a similar patch to my local code based on 4.6.1
> but
> > the problem remains.
> >
>
> What do you mean "the problem" here?
>
missing buf.release()
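For context: Netty ByteBufs are reference counted, and a pooled buffer that
is never release()d leaks. A minimal sketch of the pattern a missing
buf.release() breaks (handler names are illustrative, not BookKeeper code):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    public class ReleaseSketch {
        // Every pooled ByteBuf must be released exactly once: never
        // (a leak) and twice (a double free) are both bugs.
        static void process(ByteBuf buf) {
            try {
                byte[] payload = new byte[buf.readableBytes()];
                buf.readBytes(payload);
                // ... handle payload ...
            } finally {
                buf.release(); // refCnt 1 -> 0, buffer returns to the pool
            }
        }

        public static void main(String[] args) {
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(16);
            buf.writeLong(42L);
            process(buf);
        }
    }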
On Fri, Mar 16, 2018 at 2:26 AM, Enrico Olivelli
wrote:
> Thank you Sijie,
> I have already applied a similar patch to my local code based on 4.6.1 but
> the problem remains.
>
What do you mean "the problem" here?
Do you mean the corruption problem or the leaking problem? The change I
pointed o
Thank you Sijie,
I have already applied a similar patch to my local code based on 4.6.1 but
the problem remains.
I am looking into Netty allocateUninitializedArray, which is used for pooled
heap buffers.
You all are more aware of BK code than me, is there any point in which we
assume that the buffer
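For context, a sketch of the Netty call Enrico mentions; PlatformDependent
is an internal Netty API, and the Java 9 behavior in the comments is my
reading of it, not a claim about BookKeeper's usage:

    import io.netty.util.internal.PlatformDependent;

    public class UninitArraySketch {
        public static void main(String[] args) {
            // On Java 9+, for large sizes, this can return a byte[] whose
            // contents are NOT zero-filled (via jdk.internal.misc.Unsafe);
            // pooled heap buffers sit on top of such arrays. On Java 8 it
            // falls back to a plain, zero-filled new byte[size].
            byte[] arr = PlatformDependent.allocateUninitializedArray(8192);
            System.out.println("allocated " + arr.length + " bytes");
        }
    }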
> With "paranoid" log in Netty I found this, which is very interesting, but it
> happens even on Java 8.
I don't think leaks are the problem here though. This seems to be more
like a double-free issue.
-Ivan
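To illustrate the distinction Ivan is drawing, a minimal sketch of a double
free on a ByteBuf: the second release() throws, and with pooled buffers a
stale reference may also see memory already recycled into another buffer:

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;
    import io.netty.util.IllegalReferenceCountException;

    public class DoubleFreeSketch {
        public static void main(String[] args) {
            ByteBuf buf = Unpooled.buffer(16);
            buf.release();          // refCnt 1 -> 0: buffer is freed
            try {
                buf.release();      // second free: refCnt is already 0
            } catch (IllegalReferenceCountException e) {
                System.out.println("double free: " + e.getMessage());
            }
        }
    }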
On Fri, Mar 16, 2018 at 1:24 AM, Enrico Olivelli
wrote:
> 2018-03-15 12:02 GMT+01:00 Enrico Olivelli :
>
> >
> >
> > 2018-03-15 11:13 GMT+01:00 Ivan Kelly :
> >
> >> > What is the difference in Channel#write/ByteBuf pooling in Java 9?
> >> Sounds like it could be an issue in netty itself.
2018-03-15 12:02 GMT+01:00 Enrico Olivelli :
>
>
> 2018-03-15 11:13 GMT+01:00 Ivan Kelly :
>
>> > What is the difference in Channel#write/ByteBuf pooling in Java 9?
>> Sounds like it could be an issue in netty itself. Java 9 removed a
>> bunch of stuff around Unsafe, which I'm pretty sure netty was using
>> for ByteBuf.
2018-03-15 11:13 GMT+01:00 Ivan Kelly :
> > What is the difference in Channel#write/ByteBuf pooling in Java 9?
> Sounds like it could be an issue in netty itself. Java 9 removed a
> bunch of stuff around Unsafe, which I'm pretty sure netty was using
> for ByteBuf. Have you tried setting the pool debugging to paranoid?
> What is the difference in Channel#write/ByteBuf pooling in Java 9?
Sounds like it could be an issue in netty itself. Java 9 removed a
bunch of stuff around Unsafe, which I'm pretty sure netty was using
for ByteBuf. Have you tried setting the pool debugging to paranoid?
-Dio.netty.leakDetect
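The truncated flag above presumably refers to Netty's leak-detection level,
which can be set as a JVM property or programmatically; a sketch:

    import io.netty.util.ResourceLeakDetector;

    public class LeakDetectSketch {
        public static void main(String[] args) {
            // Equivalent to -Dio.netty.leakDetection.level=paranoid:
            // every allocation is tracked, and a report with the
            // allocation stack is logged when a ByteBuf is GC'd
            // without having been released.
            ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
            System.out.println(ResourceLeakDetector.getLevel());
        }
    }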
Very latest news:
I have narrowed the problem to ResponseEnDecoderV3#encode: using
UnpooledByteBufAllocator.DEFAULT instead of the allocator from the channel,
the error disappears.
So the problem is about the encoding of the responses, using Java 9 and
pooled ByteBufs.
This is compatible with the e
Latest findings, some good news, and some very bad.
Good news:
I was wrong, I did not switch the system back to Java 8 correctly.
The problem is on the bookie side and occurs only if the bookie is on Java 9.
Bad news:
I have a fix. The fix is to use unpooled ByteBufs in serializeProtobuf:
private stat
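A sketch of the shape of such a change (the actual BookKeeper method and its
signature may differ; this is not the applied patch): serialize the protobuf
response into an unpooled heap buffer instead of one from the channel's
pooled allocator:

    import com.google.protobuf.MessageLite;
    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.ByteBufOutputStream;
    import io.netty.buffer.UnpooledByteBufAllocator;

    public class SerializeSketch {
        // Workaround: take the buffer from the unpooled allocator rather
        // than from ctx.alloc()/channel.alloc(), so the encoded response
        // never touches the pooled arenas.
        static ByteBuf serializeProtobuf(MessageLite msg) throws java.io.IOException {
            int size = msg.getSerializedSize();
            ByteBuf buf = UnpooledByteBufAllocator.DEFAULT.heapBuffer(size, size);
            try (ByteBufOutputStream out = new ByteBufOutputStream(buf)) {
                msg.writeTo(out);
            }
            return buf;
        }
    }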
>> > @Ivan
>> > I wonder if some tests on Jepsen with bookie restarts may find this kind
>> of
>> > issues, given that it is not a network/OS problem
>> If jepsen can catch it then a normal integration test can.
I attempted a repro for this using the integration test stuff.
Running for 2-3 hours in a l
2018-03-13 17:19 GMT+01:00 Ivan Kelly :
> > @Ivan
> > I wonder if some tests on Jepsen with bookie restarts may find this kind
> of
> > issues, given that it is not a network/OS problem
> If jepsen can catch it then a normal integration test can. The readers in
> question, are they tailing with long po
> @Ivan
> I wonder if some tests on Jepsen with bookie restarts may find this kind of
> issues, given that it is not a network/OS problem
If jepsen can catch it then a normal integration test can. The readers in
question, are they tailing with long poll, or just calling
readLastAddConfirmed in a loop? W
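For reference, the plain (non long-poll) tailing pattern Ivan mentions looks
roughly like this with the BookKeeper client API (ZK address, ledger id and
password are placeholders):

    import java.util.Enumeration;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.LedgerEntry;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class TailingReaderSketch {
        public static void main(String[] args) throws Exception {
            BookKeeper bk = new BookKeeper("zk1:2181"); // placeholder ZK
            LedgerHandle lh = bk.openLedgerNoRecovery(
                    42L, BookKeeper.DigestType.CRC32, "pwd".getBytes());
            long nextEntry = 0;
            while (true) { // tail forever; a real reader would have a stop condition
                // Poll the LAC: a reader may only read up to LastAddConfirmed.
                long lac = lh.readLastAddConfirmed();
                if (lac >= nextEntry) {
                    Enumeration<LedgerEntry> entries = lh.readEntries(nextEntry, lac);
                    while (entries.hasMoreElements()) {
                        LedgerEntry e = entries.nextElement();
                        // ... consume e.getEntry() ...
                    }
                    nextEntry = lac + 1;
                } else {
                    Thread.sleep(100); // naive backoff
                }
            }
        }
    }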
Findings of today:
A - the system fails even with BK 4.6.0
B - we have moved all the clients and the bookies to different machines
(keeping the same ZK cluster), same problem
C - I have copies of the application which are running on other similar
machines (on the same Blade/VMWare system)
D - I hav
On Mon 12 Mar 2018, 20:40 Ivan Kelly wrote:
> > It is interesting that the problem is on 'readers': the PCBC seems
> > corrupted, and even writes (if the broker is promoted to 'leader') are
> > able to go on after the reads broke the client.
> Are writes coming from the same
> It is interesting that the problem is on 'readers': the PCBC seems
> corrupted, and even writes (if the broker is promoted to 'leader') are
> able to go on after the reads broke the client.
Are writes coming from the same clients? Or clients in the same process?
-Ivan
On Mon 12 Mar 2018, 19:37 Sijie Guo wrote:
> Thanks Enrico!
>
> On Mon, Mar 12, 2018 at 4:21 AM, Enrico Olivelli
> wrote:
>
> > Summary of my findings:
> >
> > The problem is about clients which get messed up and are not able to read
> > and write to bookies after rolling restarts of an app
Thanks Enrico!
On Mon, Mar 12, 2018 at 4:21 AM, Enrico Olivelli
wrote:
> Summary of my findings:
>
> The problem is about clients which get messed up and are not able to read
> and write to bookies after rolling restarts of an application,
> the problem appears only on a cluster of 6 machines (r
> - when I "restart" bookies I issue a kill -9 (I think this could be the
> reason why I can't reproduce the issue on testcases)
With a clean shutdown of bookies we close the channels, and it should
do the TCP shutdown handshake. kill -9 will kill the process before it gets
to do any of that, but the ke
Summary of my findings:
The problem is about clients which get messed up and are not able to read
and write to bookies after rolling restarts of an application,
the problem appears only on a cluster of 6 machines (reduced to 3 in order
to narrow down the search) of my colleagues which are performi
I will send a report soon.
With the new debug I have some findings; I am looking into problems during
restarts of bookies. Maybe there is some problem in the error handling in PCBC.
Thank you
Enrico
2018-03-12 10:58 GMT+01:00 Ivan Kelly :
> Enrico, could you summarize what the state of things is now? Wha
Enrico, could you summarize what the state of things is now? What are
you running, what problems are you seeing and how are the problems
manifesting themselves.
Regards,
Ivan
On Mon, Mar 12, 2018 at 10:15 AM, Enrico Olivelli wrote:
> Applied Sijie's fixes and added some debug:
>
> Problem is tri
Applied Sijie's fixes and added some debug:
Problem is triggered when you restart a bookie (I have a cluster of 3
bookies, WQ = 2 and AQ = 2)
Below is a new error on the client side ("tailing" reader)
Enrico
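For readers following the thread, an ensemble of 3 bookies with WQ = 2 and
AQ = 2 corresponds to ledger creation parameters like these (ZK address,
digest and password are placeholders):

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class QuorumSketch {
        public static void main(String[] args) throws Exception {
            BookKeeper bk = new BookKeeper("zk1:2181"); // placeholder ZK
            // Ensemble 3, write quorum 2, ack quorum 2: each entry goes to
            // 2 of the 3 bookies and is acked only when both respond, so
            // killing one bookie forces the client down the failover path
            // these restarts exercise.
            LedgerHandle lh = bk.createLedger(
                    3, 2, 2, BookKeeper.DigestType.CRC32, "pwd".getBytes());
            lh.addEntry("hello".getBytes());
            lh.close();
            bk.close();
        }
    }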
this is a new error on client side:
18-03-12-09-11-45 Unexpected exception caught by bookie
Enrico,
I would suggest you apply my fixes and then debug from there. In this
way, you will have a better sense of where the first corruption is from.
Sijie
On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli wrote:
> On Fri 9 Mar 2018, 19:30 Enrico Olivelli wrote:
>
> > Thank you Ivan!
> > I
On Fri 9 Mar 2018, 19:30 Enrico Olivelli wrote:
> Thank you Ivan!
> I hope I did not mess up the dump; I added the ZK ports. We are not using
> standard ports, and on those 3 machines there is also the 3-node ZK
> ensemble which supports BK and all the other parts of the application
>
> S
Thank you Ivan!
I hope I did not mess up the dump; I added the ZK ports. We are not using
standard ports, and on those 3 machines there is also the 3-node ZK
ensemble which supports BK and all the other parts of the application
So one explanation would be that something is connecting to the boo
On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli wrote:
> Bookies
> 10.168.10.117:1822 -> bad bookie with 4.1.21
> 10.168.10.116:1822 -> bookie with 4.1.12
> 10.168.10.118:1281 -> bookie with 4.1.12
>
> 10.168.10.117 client machine on which I have 4.1.21 client (different
> process than the bookie
Bookies
10.168.10.117:1822 -> bad bookie with 4.1.21
10.168.10.116:1822 -> bookie with 4.1.12
10.168.10.118:1281 -> bookie with 4.1.12
10.168.10.117 client machine on which I have 4.1.21 client (different
process than the bookie one)
Thanks
Enrico
2018-03-09 15:16 GMT+01:00 Ivan Kelly :
> On
Also, do you have the logs of the error occurring on the server side?
-Ivan
On Fri, Mar 9, 2018 at 3:16 PM, Ivan Kelly wrote:
> On Fri, Mar 9, 2018 at 3:13 PM, Enrico Olivelli wrote:
>> New dump,
>> sequence (simpler)
>>
>> 1) system is running, reader is reading without errors with netty 4.1.21
On Fri, Mar 9, 2018 at 3:13 PM, Enrico Olivelli wrote:
> New dump,
> sequence (simpler)
>
> 1) system is running, reader is reading without errors with netty 4.1.21
> 2) 3 bookies, one is with 4.1.21 and the other ones with 4.1.12
> 3) kill one bookie with 4.1.12, the reader starts reading from th
New dump,
sequence (simpler)
1) system is running, reader is reading without errors with netty 4.1.21
2) 3 bookies, one is with 4.1.21 and the other ones with 4.1.12
3) kill one bookie with 4.1.12, the reader starts reading from the bookie
with 4.1.21
4) client messes up, unrecoverably
Enrico
2
I've asked Enrico to run again, as this dump doesn't span the time
when the issue started occurring.
What I'm looking for is to be able to inspect the first packet which
triggers the version downgrade of the decoders.
On Fri, Mar 9, 2018 at 3:04 PM, Enrico Olivelli wrote:
> This is the dump
>
>
On Fri 9 Mar 2018, 14:12 Ivan Kelly wrote:
> > Any suggestion on the tcpdump config ? (command line example)
>
> sudo tcpdump -s 200 -w blah.pcap 'tcp port 3181'
>
> Where are you going to change the netty? client or server or both?
>
Both, as the application is packaged as a single bundle.
> Any suggestion on the tcpdump config ? (command line example)
sudo tcpdump -s 200 -w blah.pcap 'tcp port 3181'
Where are you going to change the netty? client or server or both?
-Ivan
2018-03-09 13:48 GMT+01:00 Ivan Kelly :
> Great analysis Sijie.
>
> Enrico, are these high traffic machines? Would it be feasible to put
> tcpdump running? You could even truncate each message to 100 bytes or
> so, to avoid storing payloads. It'd be very useful to see what the
> corrupt traffic ac
Great analysis Sijie.
Enrico, are these high traffic machines? Would it be feasible to put
tcpdump running? You could even truncate each message to 100 bytes or
so, to avoid storing payloads. It'd be very useful to see what the
corrupt traffic actually looks like.
-Ivan
On Fri, Mar 9, 2018 at 10
Reverted to Netty 4.1.12. The system is "more" stable, but after "some"
restarts we still have errors on the client side on tailing readers;
rebooting the JVM temporarily "resolved" the problem.
I have no more errors on the Bookie side
My idea:
- client is reading from 2 bookies, there is some bug in this a
2018-03-09 8:59 GMT+01:00 Sijie Guo :
> Sent out a PR for the issues that I observed:
>
> https://github.com/apache/bookkeeper/pull/1240
>
Other findings:
- my problem is not related to jdk9, it happens with jdk8 too
- the "tailing reader" is able to make progress and follow the WAL, so not
all
Sent out a PR for the issues that I observed:
https://github.com/apache/bookkeeper/pull/1240
On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo wrote:
> So the problem here is:
>
> - a corrupted request failed the V3 request decoder, so the bookie switched
> to use the v2 request decoder. Once the switch happe
(switch to dev@)
@Sijie very good explanation.
I am back to work and we have found errors even on a client reader which is
performing tailing reads
18-03-09-08-34-19 io.netty.handler.codec.DecoderException:
java.lang.IllegalStateException: Received unknown response : op code = 9
io.netty.handle
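A sketch of the downgrade behavior Sijie describes, with illustrative names
rather than the actual BookKeeper decoder classes: one corrupted frame flips
the connection from the V3 decoder to v2 and never switches back, so every
later, well-formed V3 frame is misparsed:

    import io.netty.buffer.ByteBuf;

    public class FallbackDecoderSketch {
        interface Dec { Object decode(ByteBuf frame) throws Exception; }

        private final Dec v3;
        private final Dec v2;
        private boolean usingV3 = true; // sticky: never restored

        FallbackDecoderSketch(Dec v3, Dec v2) { this.v3 = v3; this.v2 = v2; }

        Object decode(ByteBuf frame) throws Exception {
            if (usingV3) {
                frame.markReaderIndex();
                try {
                    return v3.decode(frame);
                } catch (Exception e) {
                    // One bad frame permanently downgrades the connection;
                    // this first packet is what Ivan wants in the capture.
                    frame.resetReaderIndex();
                    usingV3 = false;
                }
            }
            return v2.decode(frame);
        }
    }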