2020-02-19 19:50:33 UTC - Lars Norén: @Lars Norén has joined the channel ---- 2020-02-19 20:18:04 UTC - Devin G. Bost: I experienced a test failure in this test: `testBrokerSelectionForAntiAffinityGroup(org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest)`
Will someone please explain what is meant by a “Namespace AntiAffinity Group” and what the responsibility of the load manager is? This information will help me debug this test failure. (I haven’t been able to reach the dev who wrote the test.) ---- 2020-02-19 20:19:25 UTC - Devin G. Bost: I created an issue here for the test failure: <https://github.com/apache/pulsar/issues/6368> ---- 2020-02-19 23:40:57 UTC - Addison Higham: hey so speak of the devil, I also just had a non-responsive topic, producers were failing to create and trying to make any API call (like the stats call) resulted in a 500, even when talking directly to the broker who had ownership of the topic. Wasn't able to see any exceptions in the logs, but I did capture a heap dump as well as a snapshot of everything in ZK, so *hopefully* I should be able to find something there ---- 2020-02-19 23:41:51 UTC - Devin G. Bost: Oh, great! (I mean, “I’m sorry that happened to you…“) But, I’m glad that you ran into this same issue. Will you please put the details into the Github issue? ---- 2020-02-19 23:41:59 UTC - Devin G. Bost: That will help us keep all the information in one place for collaboration. ---- 2020-02-19 23:42:23 UTC - Addison Higham: once I get done analyzing the heap dump I will post what I find ---- 2020-02-19 23:42:23 UTC - Devin G. Bost: I bet ZK will have some interesting information. ---- 2020-02-19 23:43:43 UTC - Devin G. Bost: I really appreciate your help on the issue. If you see anything odd, please let me know. :slightly_smiling_face: ---- 2020-02-20 02:54:18 UTC - Joe Francis: <https://github.com/apache/pulsar/issues/840> ---- 2020-02-20 03:15:35 UTC - Devin G. Bost: Thanks! ---- 2020-02-20 05:42:27 UTC - Ravi Shah: @Sijie Guo I am using apache-pulsar-2.4.2. 
---- 2020-02-20 05:42:45 UTC - Ravi Shah: I have consumers which are consuming the messages at the same time ---- 2020-02-20 05:43:13 UTC - Ravi Shah: but i guess producer is producing faster than the consumer rate, am i correct ---- 2020-02-20 05:43:14 UTC - Ravi Shah: ? ---- 2020-02-20 05:45:01 UTC - Devin G. Bost: > I have consumers which are consuming the messages at the same time Could you please provide more detail about what you mean? Is each consumer getting the same message? Are some consumers getting different messages than other consumers? ---- 2020-02-20 05:45:30 UTC - Devin G. Bost: > but i guess producer is producing faster than the consumer rate, am i correct Can you provide the results of checking topic stats? ---- 2020-02-20 06:15:54 UTC - Ravi Shah: @Sijie Guo is there any way to increase backlog quota? ---- 2020-02-20 06:16:39 UTC - Devin G. Bost: Yes, it's in the docs ---- 2020-02-20 06:29:36 UTC - Addison Higham: so, wasn't able to find much in the heap dump I made yet, but, some code analysis makes me think it is *likely* in the `org.apache.pulsar.broker.service.BrokerService#getTopic` call. Here is why I think that: • both creating a producer and calling `pulsar-admin topics stats` on the affected topic stalled out • looking at the code, both methods show that they call the authService and then call the `getTopic` method on the `BrokerService` • however, looking at the logs, I can see when we create a producer we get through the auth checks, as far as I can tell, that just leaves `BrokerService#getTopic` as the only common code path • a look at that code, I don't see anything obvious... but the most suspicious part is perhaps the calls in `pendingTopicLoadQueue`, it seems like there could be issues with acquiring the semaphore ---- 2020-02-20 06:45:58 UTC - Addison Higham: ah that semaphore has a pretty high count, so likely not ---- 2020-02-20 06:46:24 UTC - Devin G. 
Bost: > ah that semaphore has a pretty high count, so likely not Could you please explain? :slightly_smiling_face: ---- 2020-02-20 06:46:54 UTC - Devin G. Bost: Oh, you’re just saying that the semaphore is getting acquired just fine. ---- 2020-02-20 06:47:32 UTC - Addison Higham: `org.apache.pulsar.broker.ServiceConfiguration#maxConcurrentTopicLoadRequest` has a default of 5000, so unless you are loading more than 5000 topics at once, likely not an issue you would run into ---- 2020-02-20 06:48:44 UTC - Devin G. Bost: Interesting. ---- 2020-02-20 06:49:11 UTC - Devin G. Bost: Are you still thinking `BrokerService#getTopic` is suspicious, or are you looking elsewhere at this point? ---- 2020-02-20 06:50:18 UTC - Addison Higham: still seems likely, but that call goes pretty deep, so could be many things there, realistically I probably need to catch it in the act again, I was dumb and didn't grab a thread dump ---- 2020-02-20 06:53:14 UTC - Devin G. Bost: Whenever we’ve had the freezing topic issue occur in prod, it’s usually generated enough mayhem that it’s been hard for us to capture good logs and other details around the event because usually everyone is too focused on just trying to get things back to a healthy state. ----
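For readers following the semaphore discussion above, here is a minimal sketch of the general pattern being described: a counting semaphore caps concurrent topic loads, and requests that cannot get a permit wait in a pending queue. This is not Pulsar's `BrokerService` code. The class `TopicLoadGate` and its methods are hypothetical names made up for illustration; only the idea of a permit limit (such as the default of 5000 mentioned for `maxConcurrentTopicLoadRequest`) and a pending queue comes from the conversation.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical illustration of semaphore-gated loading; not Pulsar's implementation. */
public class TopicLoadGate {
    private final Semaphore permits;
    private final Queue<Runnable> pending = new ArrayDeque<>();

    public TopicLoadGate(int maxConcurrentLoads) {
        this.permits = new Semaphore(maxConcurrentLoads);
    }

    /** Starts the load immediately if a permit is free; otherwise queues it. */
    public CompletableFuture<String> load(Supplier<CompletableFuture<String>> loader) {
        CompletableFuture<String> result = new CompletableFuture<>();
        Runnable start = () -> {
            CompletableFuture<String> inner;
            try {
                inner = loader.get();
            } catch (RuntimeException e) {
                // Turn a synchronous failure into a failed future so the permit is still returned.
                inner = new CompletableFuture<>();
                inner.completeExceptionally(e);
            }
            inner.whenComplete((value, error) -> {
                if (error != null) {
                    result.completeExceptionally(error);
                } else {
                    result.complete(value);
                }
                release();
            });
        };
        synchronized (this) {
            if (!permits.tryAcquire()) {
                pending.add(start); // no permit free: wait in the queue
                return result;
            }
        }
        start.run();
        return result;
    }

    /** Hands the permit to the next queued load, or returns it to the semaphore. */
    private void release() {
        Runnable next;
        synchronized (this) {
            next = pending.poll();
            if (next == null) {
                permits.release();
                return;
            }
        }
        next.run();
    }

    public synchronized int pendingCount() {
        return pending.size();
    }

    public int availablePermits() {
        return permits.availablePermits();
    }
}
```

In this sketch a load only stalls behind the gate when every permit is held by a load that has not completed. With a limit in the thousands, that requires thousands of simultaneous in-flight loads, which matches the reasoning in the thread that permit exhaustion is an unlikely cause.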