2020-05-15 18:53:02 UTC - Sijie Guo: @Addison Higham - the PCBC pool in 
bookkeeper client is keyed by bookie id. so in k8s, it is the hostname + port. 
The current implementation relies on netty (which relies on the OS) telling the client that
a connection is broken. When the client detects the connection is broken, it
re-establishes the connection to the bookie id (hostname + port).

----
2020-05-15 21:15:14 UTC - Addison Higham: yeah so it seems like where things go 
wrong is that because the IP changes and it is hostname + port, it can take 
quite a while for OS to detect those dead connections.  Specifically, it looks 
like because netty uses OS level tcp keep alive, it keeps the connection open 
for up to 11 minutes (75 seconds * 9 failed probes which is the OS default) 
before it marks the socket as dead. While we could tune that... it does seem 
less than ideal.
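The back-of-the-envelope math on those defaults (values assumed to be the standard Linux `tcp_keepalive_intvl` / `tcp_keepalive_probes` sysctls):

```python
# Assumed Linux defaults: tcp_keepalive_intvl=75, tcp_keepalive_probes=9
intvl_s, probes = 75, 9
dead_after_s = intvl_s * probes
# Time from first unanswered probe until the socket is declared dead
print(dead_after_s, dead_after_s / 60)  # 675 seconds, i.e. 11.25 minutes
```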

It seems like there are two main options to make the BK client a bit smarter:
1. when we detect a removed bookie OR a modification to an existing bookie, we
close the netty connection pool and force it to re-establish connections. This
could additionally be gated on checking whether the resolved IP actually
changed.
2. add failure logic that closes and re-establishes connections after a certain
number of retries.
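A rough sketch of what option 1 could look like (all names here are hypothetical illustrations, not the actual BK client API):

```python
import socket

# Hypothetical per-bookie connection pool keyed by bookie id ("hostname:port").
pool = {}          # bookie_id -> connection object
resolved_ips = {}  # bookie_id -> last resolved IP

def on_bookie_changed(bookie_id):
    """Called when the registration watch reports a removed/modified bookie.
    Drops pooled connections, gated on the resolved IP actually changing."""
    host, _, _port = bookie_id.partition(":")
    try:
        ip = socket.gethostbyname(host)
    except OSError:
        ip = None  # bookie removed / name no longer resolves
    if resolved_ips.get(bookie_id) == ip:
        return  # same address as before: keep existing connections
    resolved_ips[bookie_id] = ip
    conn = pool.pop(bookie_id, None)
    if conn is not None:
        conn.close()  # next use re-establishes to the new address
```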


Given those two options, I am sort of leaning toward 1, but would be curious
about your perspective @Sijie Guo

Additionally, one question is why a bookkeeper doesn't gracefully close its
client connections when shutting down. That would make this a lot better for
scheduled maintenance etc., as the client would be forced to re-establish
connections immediately. But in the case of unexpected failure, that is still a
long time to leave netty hanging.

I am wondering if we are missing some detail here, like perhaps netty does its
own DNS caching? There are certainly still some things I don't quite understand there
----
2020-05-15 21:30:23 UTC - Addison Higham: oh, a 3rd interesting option: the BookKeeper
client periodically does a DNS resolution and, if it detects an IP change, it
re-establishes the connections. Though I wonder if netty could do that...
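For reference, the re-resolution check behind that 3rd option could be as simple as (hypothetical sketch, not actual client code):

```python
import socket

last_ip = {}  # hostname -> last resolved IP

def ip_changed(hostname):
    """Re-resolve hostname; return True iff the address differs from last time."""
    ip = socket.gethostbyname(hostname)
    changed = hostname in last_ip and last_ip[hostname] != ip
    last_ip[hostname] = ip
    return changed
```

a periodic task would call this per bookie and close/re-establish the pool entry whenever it returns True.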
----
2020-05-15 21:32:05 UTC - Matteo Merli: >  Though I wonder if netty could do 
that...
You can do that by having app-level probes in the connections. We do that from
client to broker to detect stale connections
----
2020-05-15 21:42:38 UTC - Addison Higham: oh nifty, got a pointer to where that 
happens in client code?
----
2020-05-15 21:42:53 UTC - Matteo Merli: Sure, just one sec
----
2020-05-15 21:43:01 UTC - Addison Higham: :bow:
----
2020-05-15 21:44:30 UTC - Addison Higham: what I think we will do to validate 
this before going down a path like that is to set the OS tcp keepalive 
parameters to be much more aggressive, maybe like 3 probes with 20 second 
interval. If that behaves better, then I think it makes sense to get BK client 
to just be more aggressive in closing sockets in the event of timeouts/probes 
whatever
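for the record, the per-socket equivalent of that tuning looks like this in Python (the `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` constants are Linux-specific, hence the guards; the 20 s / 3 probes values are just the ones proposed above):

```python
import socket

def enable_aggressive_keepalive(sock, idle=20, interval=20, probes=3):
    """Turn on TCP keepalive with aggressive settings (20 s probes, 3 tries)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides are Linux-specific, so guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # Worst case to declare a dead peer: ~idle + interval * probes seconds
```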
----
2020-05-15 21:45:03 UTC - Matteo Merli: 
<https://github.com/apache/pulsar/blob/d9c0a1054310e9428007d016895d4174b0d20f89/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/PulsarHandler.java#L75>
----
2020-05-15 21:45:59 UTC - Matteo Merli: basically, it works both ways, client 
and broker exchange ping requests and pong responses. If there's not response 
in 1min, the connection is stale
----
2020-05-15 21:46:06 UTC - Matteo Merli: and we forcifully close it
----
2020-05-15 21:49:34 UTC - Matteo Merli: TCP keepalive is a bit unfriendly since 
it depends on where you get to deploy, etc.. and, at least in case of clients, 
of things that are out of our control. We were trying to use that, back in the 
day, before implementing it directly, where we have all the control.
----
2020-05-15 21:50:25 UTC - Matteo Merli: I believe there are also limits on the 
keepalive settings. (like in: cannot be set too low)
----