2020-02-19 19:50:33 UTC - Lars Norén: @Lars Norén has joined the channel
----
2020-02-19 20:18:04 UTC - Devin G. Bost: I experienced a test failure in this 
test:
`testBrokerSelectionForAntiAffinityGroup(org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest)`

Will someone please explain what is meant by a “Namespace AntiAffinity Group” 
and what the responsibility of the load manager is? This information will help 
me debug this test failure.
(I haven’t been able to reach the dev who wrote the test.)
----
2020-02-19 20:19:25 UTC - Devin G. Bost: I created an issue here for the test 
failure: <https://github.com/apache/pulsar/issues/6368>
----
2020-02-19 23:40:57 UTC - Addison Higham: hey, so speak of the devil, I also 
just had a non-responsive topic. Producers were failing to create, and trying to 
make any API call (like the stats call) resulted in a 500, even when talking 
directly to the broker that had ownership of the topic. I wasn't able to see any 
exceptions in the logs, but I did capture a heap dump as well as a snapshot of 
everything in ZK, so *hopefully* I should be able to find something there
----
2020-02-19 23:41:51 UTC - Devin G. Bost: Oh, great! (I mean, “I’m sorry that 
happened to you…“) But, I’m glad that you ran into this same issue.
Will you please put the details into the Github issue?
----
2020-02-19 23:41:59 UTC - Devin G. Bost: That will help us keep all the 
information in one place for collaboration.
----
2020-02-19 23:42:23 UTC - Addison Higham: once I get done analyzing the heap 
dump I will post what I find
----
2020-02-19 23:42:23 UTC - Devin G. Bost: I bet ZK will have some interesting 
information.
----
2020-02-19 23:43:43 UTC - Devin G. Bost: I really appreciate your help on the 
issue.
If you see anything odd, please let me know.  :slightly_smiling_face:
----
2020-02-20 02:54:18 UTC - Joe Francis: 
<https://github.com/apache/pulsar/issues/840>
----
2020-02-20 03:15:35 UTC - Devin G. Bost: Thanks!
----
2020-02-20 05:42:27 UTC - Ravi Shah: @Sijie Guo I am using apache-pulsar-2.4.2.
----
2020-02-20 05:42:45 UTC - Ravi Shah: I have consumers which are consuming the 
messages at the same time
----
2020-02-20 05:43:13 UTC - Ravi Shah: but I guess the producer is producing 
faster than the consumer rate, am I correct
----
2020-02-20 05:43:14 UTC - Ravi Shah: ?
----
2020-02-20 05:45:01 UTC - Devin G. Bost: > I have consumers which are 
consuming the messages at the same time
Could you please provide more detail about what you mean? Is each consumer 
getting the same message? Are some consumers getting different messages than 
other consumers?
----
2020-02-20 05:45:30 UTC - Devin G. Bost: > but I guess the producer is producing 
faster than the consumer rate, am I correct
Can you provide the results of checking topic stats?
----
2020-02-20 06:15:54 UTC - Ravi Shah: @Sijie Guo is there any way to increase 
backlog quota?
----
2020-02-20 06:16:39 UTC - Devin G. Bost: Yes, it's in the docs
----
2020-02-20 06:29:36 UTC - Addison Higham: so, I wasn't able to find much in the 
heap dump I made yet, but some code analysis makes me think it is *likely* in 
the `org.apache.pulsar.broker.service.BrokerService#getTopic` call.
Here is why I think that:
• both creating a producer and calling `pulsar-admin topics stats` on the 
affected topic stalled out
• looking at the code, both paths call the authService and then call the 
`getTopic` method on the `BrokerService`
• however, looking at the logs, I can see that when we create a producer we get 
through the auth checks. As far as I can tell, that leaves 
`BrokerService#getTopic` as the only common code path
• looking at that code, I don't see anything obvious... but the most suspicious 
part is perhaps the calls involving `pendingTopicLoadQueue`; it seems like 
there could be issues with acquiring the semaphore
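
To make the concern in the last bullet concrete, here is a minimal sketch of a
semaphore-gated loader with a pending queue. It is not Pulsar's actual code:
the class `GatedTopicLoader`, its `getTopic` signature, and the `loadDone`
future are hypothetical names chosen for illustration. It only shows how a load
that never completes can hold a permit and stall every later request once the
permits run out.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; NOT the BrokerService implementation.
class GatedTopicLoader {
    private final Semaphore permits;
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();

    GatedTopicLoader(int maxConcurrentLoads) {
        this.permits = new Semaphore(maxConcurrentLoads);
    }

    // Starts the load if a permit is free; otherwise parks it in the pending queue.
    // `loadDone` stands in for whatever asynchronous work the real load performs.
    synchronized CompletableFuture<String> getTopic(String topic, CompletableFuture<Void> loadDone) {
        CompletableFuture<String> result = new CompletableFuture<>();
        Runnable start = () -> loadDone.whenComplete((ignored, err) -> {
            if (err != null) {
                result.completeExceptionally(err);
            } else {
                result.complete(topic);
            }
            release();
        });
        if (permits.tryAcquire()) {
            start.run();
        } else {
            pending.add(start);
        }
        return result;
    }

    // Hands the permit to the next queued load, or returns it to the semaphore.
    private synchronized void release() {
        Runnable next = pending.poll();
        if (next != null) {
            next.run();
        } else {
            permits.release();
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        GatedTopicLoader loader = new GatedTopicLoader(1);
        CompletableFuture<Void> stuckLoad = new CompletableFuture<>();
        loader.getTopic("topic-a", stuckLoad);
        CompletableFuture<String> second =
                loader.getTopic("topic-b", CompletableFuture.<Void>completedFuture(null));
        System.out.println("second done while first is stuck? " + second.isDone());
        stuckLoad.complete(null);
        System.out.println("second done after first finishes? " + second.isDone());
    }
}
```

With a single permit, the second request cannot finish until the first load
completes, which is the kind of stall that would look like a frozen topic.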
----
2020-02-20 06:45:58 UTC - Addison Higham: ah that semaphore has a pretty high 
count, so likely not
----
2020-02-20 06:46:24 UTC - Devin G. Bost: > ah that semaphore has a pretty 
high count, so likely not
Could you please explain?  :slightly_smiling_face:
----
2020-02-20 06:46:54 UTC - Devin G. Bost: Oh, you’re just saying that the 
semaphore is getting acquired just fine.
----
2020-02-20 06:47:32 UTC - Addison Higham: 
`org.apache.pulsar.broker.ServiceConfiguration#maxConcurrentTopicLoadRequest` 
has a default of 5000, so unless you are loading more than 5000 topics at once, 
likely not an issue you would run into
----
2020-02-20 06:48:44 UTC - Devin G. Bost: Interesting.
----
2020-02-20 06:49:11 UTC - Devin G. Bost: Are you still thinking 
`BrokerService#getTopic` is suspicious, or are you looking elsewhere at this 
point?
----
2020-02-20 06:50:18 UTC - Addison Higham: still seems likely, but that call 
goes pretty deep, so it could be many things there. Realistically I probably 
need to catch it in the act again; I was dumb and didn't grab a thread dump
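
For reference, a thread dump of a running JVM is usually taken with
`jstack <pid>`. The sketch below shows the same idea from inside a JVM using
the standard `java.lang.management` API. The class name `ThreadDumper` is
hypothetical, and running it inside the broker process (for example from a
diagnostic hook) is an assumption, not something Pulsar provides.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: dumps the threads of the JVM it runs in.
class ThreadDumper {
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only request lock details when the JVM supports reporting them.
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean synchronizers = mx.isSynchronizerUsageSupported();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(monitors, synchronizers)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            if (info.getLockName() != null) {
                // The lock or condition this thread is blocked or waiting on.
                sb.append("\t- waiting on ").append(info.getLockName()).append('\n');
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        // findDeadlockedThreads returns null when no deadlock is detected.
        long[] deadlocked = synchronizers ? mx.findDeadlockedThreads()
                                          : mx.findMonitorDeadlockedThreads();
        sb.append("deadlocked threads: ")
          .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

In a dump like this, threads parked in the same call path (for example inside
a topic load) and any reported deadlock are the first things to look for.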
----
2020-02-20 06:53:14 UTC - Devin G. Bost: Whenever we’ve had the freezing topic 
issue occur in prod, it’s usually generated enough mayhem that it’s been hard 
for us to capture good logs and other details around the event because usually 
everyone is too focused on just trying to get things back to a healthy state.
----
