Hi,
I don't have enough context to know if replacing ZkClient is important
right now. However, I did take a look at the code to see how extensively
ZkClient gets used and I agree with Gwen that replacing it is a bigger
task and will need further testing too, to ensure the replacement doesn't
have issues of its own that affect us.
IMO, using the newer version of ZkClient which the ZkClient team are
willing to release might be a good idea for resolving the immediate
issues at hand. So I think running certain tests, using the current
dev/snapshot version of ZkClient, to verify that the JIRAs that we
expect to be resolved are indeed resolved and then asking the ZkClient
team to do a release might be something that we should do. If this
sounds good and if someone can point me to the exact JIRAs that need to
be verified, then I can look into this. Let me know.
-Jaikiran
On Thursday 05 February 2015 02:51 AM, Gwen Shapira wrote:
Hi,
KAFKA-1155 is likely Zookeeper and not the specific client.
I believe the rest are already fixed in ZKClient and it's a matter of asking
them to release, rebase our code and make sure the issues are resolved (or
that we use the features ZKClient added to resolve them).
I'm a fan of Curator, but it's not exactly a drop-in replacement for
ZKClient (the APIs are slightly different, if we even decide to just use
the APIs and not the recipes). I suspect that replacing ZKClient with
Curator is a large project. Perhaps too large to resolve 3 issues that are
already resolved in ZKClient.
What are the benefits you guys see in the replacement?
Gwen
On Tue, Feb 3, 2015 at 10:42 PM, Guozhang Wang <wangg...@gmail.com> wrote:
Now may be a good time.
We could verify whether Curator has fixed the known issues we have seen so far.
An incomplete list:
KAFKA-1082 <https://issues.apache.org/jira/browse/KAFKA-1082>
KAFKA-1155 <https://issues.apache.org/jira/browse/KAFKA-1155>
KAFKA-1907 <https://issues.apache.org/jira/browse/KAFKA-1907>
KAFKA-992 <https://issues.apache.org/jira/browse/KAFKA-992>
Guozhang
On Tue, Feb 3, 2015 at 10:21 PM, Ashish Singh <asi...@cloudera.com> wrote:
+1 on using curator.
On Tue, Feb 3, 2015 at 10:09 PM, Manikumar Reddy <ku...@nmsworks.co.in> wrote:
I think we should consider moving to Apache Curator (KAFKA-873).
Curator is now more mature and an Apache top-level project.
On Wed, Feb 4, 2015 at 11:29 AM, Harsha <ka...@harsha.io> wrote:
Any reason not to go with Apache Curator (http://curator.apache.org/)?
-Harsha
On Tue, Feb 3, 2015, at 09:55 PM, Guozhang Wang wrote:
I am also +1 on Neha's suggestion that "At some point, if we find ourselves
fiddling too much with ZkClient, it wouldn't hurt to write our own little
zookeeper client wrapper," since we have accumulated a bunch of issues with
ZkClient that take a long time to be resolved, if ever, so we have ended up
with some hacky ways of handling ZkClient errors.
Guozhang
On Tue, Feb 3, 2015 at 7:47 PM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:
Yes, that's the plan :)
-Jaikiran
On Wednesday 04 February 2015 12:33 AM, Gwen Shapira wrote:
So I think the current plan is:
1. Add timeout in zkclient
2. Ask zkclient to release new version (we need it for a few other things too)
3. Rebase on new zkclient
4. Fix this jira and the few others that were waiting for the new zkclient
Does that make sense?
Gwen
On Mon, Feb 2, 2015 at 8:33 PM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:
I just heard back from Stefan, who manages the ZkClient repo, and he seems to
be open to having these changes be part of the ZkClient project. I'll be
creating a pull request for that project to have it reviewed and merged.
Although I haven't heard of exact release plans, Stefan's reply did indicate
that the project could be released after this change is merged.
-Jaikiran
On Tuesday 03 February 2015 09:03 AM, Jaikiran Pai wrote:
Thanks for pointing to that repo!
I just had a look at it and it appears that the project isn't very active:
the latest contribution is from Gwen and that was around 3 months back. I
haven't found release plans for that project or a place to ask about them
(filing an issue doesn't seem the right way to ask this question). So I'll
get in touch with the repo owner and see what his plans for the project are.
-Jaikiran
On Monday 02 February 2015 11:33 PM, Gwen Shapira wrote:
I did!
Thanks for clarifying :)
The client that is part of Zookeeper itself actually does support timeouts.
On Mon, Feb 2, 2015 at 9:54 AM, Guozhang Wang <wangg...@gmail.com> wrote:
Hi Jaikiran,
I think Gwen was talking about contributing to the ZkClient project:
https://github.com/sgroschupf/zkclient
Guozhang
On Sun, Feb 1, 2015 at 5:30 AM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:
Hi Gwen,
Yes, the KafkaZkClient is a wrapper around ZkClient and not a complete
replacement.
As for contributing to Zookeeper, yes, that is indeed on my mind, but I
haven't yet had a chance to really look deeper into Zookeeper or get in
touch with their dev team to try and explain this potential improvement to
them. I have no objection to contributing this or something similar to
Zookeeper directly. I think I should be able to bring this up in the
Zookeeper dev forum sometime in the next few weekends.
-Jaikiran
On Sunday 01 February 2015 11:40 AM, Gwen Shapira wrote:
It looks like the new KafkaZkClient is a wrapper around ZkClient, but
not a replacement. Did I get it right?
I think a wrapper for ZkClient can be useful - for example KAFKA-1664
can also use one.
However, I'm wondering why not contribute the fix directly to the ZKClient
project and ask for a release that contains the fix?
This will benefit other users of the project who may also need a
timeout (that's pretty basic...)
As an alternative, if we don't want to collaborate with ZKClient for
some reason, forking the project into Kafka will probably give us more
control than wrappers and without much downside.
Just a thought.
Gwen
On Sat, Jan 31, 2015 at 6:32 AM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:
Neha, Ewen (and others), my initial attempt to solve this is uploaded here:
https://reviews.apache.org/r/30477/. It solves the shutdown problem, and now
the server shuts down even when Zookeeper has gone down before the Kafka
server.
I went with the approach of introducing a custom (enhanced) ZkClient which
for now allows timeouts to be optionally specified for certain operations.
I intentionally haven't forced the use of this new KafkaZkClient all over
the code and instead for now have just used it in the KafkaServer.
Does this patch look like something worth using?
-Jaikiran
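For readers following the thread, the idea of optionally bounding the wait for a ZooKeeper connection can be sketched roughly as below. This is a minimal, self-contained illustration and not the code from the review request: `ConnectionGate`, `setConnected` and `awaitConnected` are hypothetical names, and no ZkClient classes are used.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the connection-state tracking inside a ZooKeeper client wrapper.
public class ConnectionGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private boolean connected = false;

    /** Would be called by the connection event handler when the state changes. */
    public void setConnected(boolean value) {
        lock.lock();
        try {
            connected = value;
            stateChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Waits until connected or until the timeout elapses.
     * Returns true if connected, false if the caller should give up.
     */
    public boolean awaitConnected(long timeout, TimeUnit unit) throws InterruptedException {
        long remainingNanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (!connected) {
                if (remainingNanos <= 0L) {
                    return false; // deadline passed, hand control back to the caller
                }
                // awaitNanos returns an estimate of the time still left to wait
                remainingNanos = stateChanged.awaitNanos(remainingNanos);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectionGate gate = new ConnectionGate();
        // No connection ever arrives: control returns after roughly 100 ms.
        System.out.println(gate.awaitConnected(100, TimeUnit.MILLISECONDS)); // prints false
        gate.setConnected(true);
        System.out.println(gate.awaitConnected(100, TimeUnit.MILLISECONDS)); // prints true
    }
}
```

A shutdown path built on something like this could skip the controlled-shutdown read when `awaitConnected` returns false instead of parking indefinitely.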
On Thursday 29 January 2015 10:41 PM, Neha Narkhede wrote:
Ewen is right. ZkClient APIs are blocking and the right fix for this seems
to be patching ZkClient. At some point, if we find ourselves fiddling too
much with ZkClient, it wouldn't hurt to write our own little zookeeper
client wrapper.
On Thu, Jan 29, 2015 at 12:57 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
Looks like a bug to me -- the underlying ZK library wraps a lot of blocking
method implementations with waitUntilConnected() calls without any
timeouts. Ideally we could just add a version of ZkUtils.getController()
with a timeout, but I don't see an easy way to accomplish that with
ZkClient.
There's at least one other call to ZkUtils besides the one in the
stacktrace you gave that would cause the same issue, possibly more that
aren't directly called in that method. One ugly solution would be to use an
extra thread during shutdown to trigger timeouts, but I'd imagine we
probably have other threads that could end up blocking in similar ways.
I filed https://issues.apache.org/jira/browse/KAFKA-1907 to track the
issue.
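The "extra thread" workaround mentioned above could look roughly like the following sketch. It is illustrative only: `BoundedCall` and `callWithTimeout` are hypothetical names, the blocking read is simulated with a latch, and nothing here is taken from Kafka or ZkClient.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCall {
    /**
     * Runs a blocking call on a helper daemon thread and gives up after timeoutMs,
     * returning the fallback value when the call does not finish in time.
     */
    public static <T> T callWithTimeout(Callable<T> blockingCall, long timeoutMs, T fallback)
            throws InterruptedException, ExecutionException {
        ExecutorService helper = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call-helper");
            t.setDaemon(true); // must not keep the JVM alive if the call never returns
            return t;
        });
        Future<T> result = helper.submit(blockingCall);
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the helper thread
            return fallback;
        } finally {
            helper.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a read that blocks forever because the server is gone.
        Callable<String> stuckRead = () -> {
            new CountDownLatch(1).await();
            return "never reached";
        };
        String timedOut = callWithTimeout(stuckRead, 200, "gave up");
        String quick = callWithTimeout(() -> "42", 1000, "gave up");
        System.out.println(timedOut); // prints gave up
        System.out.println(quick);    // prints 42
    }
}
```

Cancelling only helps if the blocked call responds to interruption; otherwise the helper thread stays parked, which is why it is created as a daemon thread here. As the message notes, this bounds one call site and does nothing for other threads that block in the same way.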
On Mon, Jan 26, 2015 at 6:35 AM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:
The main culprit is this thread which goes into "forever retry connection
to a closed zookeeper" when I shut down Kafka (via Ctrl + C) after
zookeeper has already been shut down. I have attached the complete thread
dump, but I don't know if it will be delivered to the mailing list.
"Thread-2" prio=10 tid=0xb3305000 nid=0x4758 waiting on condition [0x6ad69000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x70a93368> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:267)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2130)
    at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
    at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
    at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
    at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766)
    at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
    at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:456)
    at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
    at kafka.server.KafkaServer.kafka$server$KafkaServer$$controlledShutdown(KafkaServer.scala:194)
    at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$sp(KafkaServer.scala:269)
    at kafka.utils.Utils$.swallow(Utils.scala:172)
    at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
    at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
    at kafka.utils.Logging$class.swallow(Logging.scala:94)
    at kafka.utils.Utils$.swallow(Utils.scala:45)
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
    at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
    at kafka.Kafka$$anon$1.run(Kafka.scala:42)
-Jaikiran
On Monday 26 January 2015 05:46 AM, Neha Narkhede wrote:
For a clean shutdown, the broker tries to talk to the controller and also
issues reads to zookeeper. Possibly that is where it tries to reconnect to
zk. It will help to look at the thread dump.
Thanks
Neha
On Fri, Jan 23, 2015 at 8:53 PM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:
I was just playing around with RC2 of 0.8.2 and noticed that if I shut
down zookeeper first I can't shut down the Kafka server at all, since it
goes into a never-ending attempt to reconnect to zookeeper. I had to kill
the Kafka process to stop it. I tried it against trunk too and I see the
same issue there. Should I file a JIRA for this and see if I can come up
with a patch?
FWIW, here are the unending (and IMO too frequent) attempts to reconnect.
I also have a thread dump which shows that the other thread, which is
trying to complete a controlled shutdown of Kafka, is blocked forever
waiting for zookeeper to be up. I can attach it to the JIRA.
[2015-01-24 10:15:46,278] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2015-01-24 10:15:47,437] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-01-24 10:15:47,438] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2015-01-24 10:15:49,056] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-01-24 10:15:49,057] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2015-01-24 10:15:50,801] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-01-24 10:15:50,802] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
-Jaikiran