2020-02-24 17:55:54 UTC - Ruwen: @Ruwen has joined the channel ----
2020-02-24 20:34:34 UTC - Greg Gallagher: @Greg Gallagher has joined the channel ----
2020-02-24 20:50:52 UTC - Addison Higham: okay, so I am going to open up some bug reports, but just as an FYI, even with 2.5.0 I am still seeing a number of issues when the following happens:
1. an unload occurs
2. producers get disconnected
3. they try to re-connect (via the proxy)
4. the bundle gets scheduled back onto the same broker
One issue we know for sure:
• producers can *sometimes* fail to create a producer, hitting the "producer already exists" exception. Looking at a heap dump, it appears we just end up in `ServerCnx` with a producer future that never resolves. This very much looks like a race condition in which a producer re-connects before the topic is fully ready again, resulting in a timeout on the initial reconnect.
Another issue we think is related:
• we have had a few instances where we end up with a "frozen" topic. We haven't totally confirmed it, but I think it may also be related to unloads that land back on the same broker. ----
2020-02-24 20:52:56 UTC - Devin G. Bost: @Addison Higham The "frozen" topic is one that I've been concerned about for a while. ----
2020-02-24 20:53:55 UTC - Addison Higham: if you have enough data from past incidents to see whether an "Unload" happened right before the topic froze, that would be super useful ----
2020-02-24 20:54:30 UTC - Devin G. Bost: We rolled back to 2.4.0 to stop the issue… The team has been afraid to upgrade because of it. ----
2020-02-24 20:54:46 UTC - Devin G. Bost: We need to finish creating an environment that we can use to test a rollout without anyone else depending on it. ----
2020-02-24 20:54:49 UTC - Addison Higham: should still be roughly the same ----
2020-02-24 20:55:09 UTC - Devin G. Bost: For some reason, the freezing-topic issue was not as bad in 2.4.0. We don't know why. ----
2020-02-24 20:55:14 UTC - Devin G. Bost: (at least compared to 2.4.2) ----
2020-02-24 20:55:18 UTC - Devin G. Bost: We haven't tried 2.5.0 ----
2020-02-24 20:55:36 UTC - Devin G. Bost: One other thing that helped was putting our ZK cluster on very fast storage. ----
2020-02-24 20:55:53 UTC - Devin G. Bost: That alone, however, wasn't enough to create a stable system on 2.4.2. ----
2020-02-24 20:57:44 UTC - Addison Higham: yeah, I'm not sure exactly what details you'd need, but it would help to go through your logs and check whether the freeze occurs close to when an unload happens ----
2020-02-24 20:58:05 UTC - Addison Higham: which should still be present in 2.4.x and 2.5.x ----
2020-02-24 20:58:35 UTC - Devin G. Bost:
> it appears we just end up in `ServerCnx` with a producer future that never resolves.
I wonder if that could be tied to the same underlying issue that I'm seeing while trying to fix the intermittent issue with <https://github.com/apache/pulsar/issues/6332>. There's something odd going on between `ServerCnx` and `ClientCnx`. There's some documentation in that issue that could perhaps be useful. ----
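(For reference: the four-step sequence Addison describes above can be exercised by hand against a local test cluster. The sketch below is illustrative only; the service URLs and topic name are placeholders rather than values from this thread, and it assumes a single-broker setup so the bundle comes back to the same broker after the unload.)
```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

// Illustrative repro sketch for the unload -> disconnect -> reconnect sequence.
public class UnloadReconnectRepro {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://public/default/unload-repro"; // placeholder topic

        try (PulsarClient client = PulsarClient.builder()
                     .serviceUrl("pulsar://localhost:6650") // or the proxy URL, to match step 3
                     .build();
             PulsarAdmin admin = PulsarAdmin.builder()
                     .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                     .build();
             Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(topic)
                     .create()) {

            // Keep traffic flowing so the producer is active when the unload hits.
            for (int i = 0; i < 1000; i++) {
                producer.sendAsync("msg-" + i);
                if (i == 500) {
                    // Step 1: unload the topic's bundle. The broker drops the producer (step 2),
                    // the client reconnects (step 3), and with a single broker the bundle
                    // lands right back on the same one (step 4).
                    admin.topics().unload(topic);
                }
            }
            producer.flush();
        }
    }
}
```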
2020-02-24 21:03:22 UTC - Devin G. Bost: @Addison Higham I spoke too soon. I was told that my team has a new test environment running 2.5.0, and they haven't seen the freezing-topic issue there yet. ----
2020-02-24 21:03:45 UTC - Devin G. Bost: I wonder if there's a particular configuration that makes the freezing-topic issue worse. ----
2020-02-24 21:04:56 UTC - Addison Higham: is your load very heavy on a handful of topics, or is it pretty even across many topics? AFAICT, we see these issues primarily when load isn't well balanced, which causes unloads on brokers ----
2020-02-24 21:05:27 UTC - Devin G. Bost: Yes, our load is very imbalanced. We have one topic that gets about 20k/second. Most of the others get far less than that. ----
2020-02-24 21:05:59 UTC - Devin G. Bost: What is the mechanism of broker unloads? I don't know much about that. ----
2020-02-24 21:32:31 UTC - Addison Higham: topics are assigned via a hash ring: topics are grouped into N "bundles" per namespace, and bundles are what get mapped to a broker. Pulsar tracks how busy each broker is relative to the other brokers and to the bundles it is currently serving. Based on CPU usage, brokers try to do "load shedding" by "unloading" a bundle, which triggers Pulsar to re-evaluate which broker should be serving that bundle. In uneven load cases, if one bundle was consuming an entire broker, then once that bundle is unloaded the now-idle broker looks like the best candidate for it again, and the bundle gets handed right back to the same broker ----
2020-02-24 21:32:38 UTC - Addison Higham: see <http://pulsar.apache.org/docs/en/administration-load-balance/> ----
2020-02-24 21:35:49 UTC - Addison Higham: you want to design your topics (with partitioned topics, etc.) so that load is more evenly distributed across the brokers. In our case, this occurs when we are doing backfills of data. In steady state we don't see issues because unloads are rare and our load is fairly well distributed, but backfills place an imbalanced load on the cluster and then we see issues (which so far appear to be related to unloads) ----
2020-02-24 22:45:53 UTC - Sijie Guo:
> producers can *sometimes* fail to create a producer, hitting the "producer already exists" exception
Are you using 2.5.0 clients? I think @Penghui Li fixed this issue before. A producer can be left on the server side in many cases (e.g. if createProducer timed out and the application attempts to create the producer again) ----
2020-02-24 22:57:36 UTC - Addison Higham: 2.5.0 server and, I *believe*, 2.5.0 clients, but they are running inside a Pulsar Function. What I am seeing is that the `producerFuture` in `ServerCnx` never resolves to a value and just sits in the map of producerId -> `producerFuture` forever. Since `producerId` is provided by the client, all that is checked is whether that `producerId` already maps to an existing `producerFuture` that isn't done (in either a success or an error state). Because the `producerFuture` never resolves, we have to completely disconnect (as that kills the `ServerCnx`) or restart the broker ----
2020-02-24 23:19:44 UTC - Addison Higham: oh BTW, this can also happen when the topic goes broker A -> unload to broker B -> unload back to broker A; we can still see the issue. But yes, more log diving reveals that the general pattern on the broker is that when we see the error message `Namespace bundle for topic (<persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.planner_overrides>) not served by this instance. Please redo the lookup. Request is denied: namespace=canvas-cdc/prod-iad`, the producerFuture isn't getting that exception, so it just sticks around unresolved ----
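(For reference: a minimal sketch of the bookkeeping Addison describes, i.e. a producerId -> future map where a pending, never-completed future blocks every retry with the same id. The class and method names below are illustrative, not the actual `ServerCnx` code.)
```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: the broker-side connection keeps producerId -> pending future,
// and a new producer command with the same id is rejected while the old future
// is still unresolved.
class ProducerRegistrySketch {
    private final ConcurrentHashMap<Long, CompletableFuture<String>> producers = new ConcurrentHashMap<>();

    CompletableFuture<String> handleProducerCommand(long producerId, String producerName) {
        CompletableFuture<String> fresh = new CompletableFuture<>();
        CompletableFuture<String> existing = producers.putIfAbsent(producerId, fresh);
        if (existing != null) {
            if (!existing.isDone()) {
                // The "producer already exists" path: if the earlier future never
                // completes (e.g. a topic unload races with the reconnect), every
                // retry with this id on this connection is rejected until the
                // connection itself is torn down.
                fresh.completeExceptionally(new IllegalStateException(
                        "Producer with id " + producerId + " is still pending on this connection"));
                return fresh;
            }
            // A completed future (success or failure) can be replaced by a new attempt.
            producers.put(producerId, fresh);
        }
        // ... topic lookup / producer creation would complete `fresh` here.
        fresh.complete(producerName);
        return fresh;
    }
}
```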
2020-02-24 23:20:25 UTC - Devin G. Bost: Interesting. I don't think I've seen that error message before… ----
2020-02-25 04:58:46 UTC - Roman Popenov: I was able to build the project with the target set to Java 11. I am working through the tests and fixing the ones that fail, but there are two tests that puzzle me: `org.apache.pulsar.broker.admin.AdminApiSchemaValidationEnforced#testEnableSchemaValidationEnforcedHasSchemaMismatch` and `org.apache.pulsar.broker.admin.AdminApiTest2#brokerNamespaceIsolationPolicies` ----
2020-02-25 04:59:47 UTC - Roman Popenov: `org.apache.pulsar.broker.admin.AdminApiTest2#brokerNamespaceIsolationPolicies` is failing with this error:
```
23:56:44.374 [pulsar-web-5299-16:org.eclipse.jetty.server.Slf4jRequestLogWriter@62] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Feb/2020:23:56:44 -0500] "POST /admin/v2/clusters/use/namespaceIsolationPolicies/policy-1 HTTP/1.1" 412 40 "-" "Pulsar-Java-v2.5.0" 3
org.apache.pulsar.client.admin.PulsarAdminException$PreconditionFailedException: Cluster use does not exist.
    at org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:220)
    at org.apache.pulsar.client.admin.internal.ClustersImpl.setNamespaceIsolationPolicy(ClustersImpl.java:177)
    at org.apache.pulsar.client.admin.internal.ClustersImpl.createNamespaceIsolationPolicy(ClustersImpl.java:156)
    at org.apache.pulsar.broker.admin.AdminApiTest2.brokerNamespaceIsolationPolicies(AdminApiTest2.java:885)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:134)
    at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:597)
    at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:173)
    at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
    at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:816)
    at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:146)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
    at org.testng.TestRunner.privateRun(TestRunner.java:766)
    at org.testng.TestRunner.run(TestRunner.java:587)
    at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
    at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
    at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
    at org.testng.SuiteRunner.run(SuiteRunner.java:286)
    at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
    at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
    at org.testng.TestNG.runSuitesSequentially(TestNG.java:1187)
    at org.testng.TestNG.runSuitesLocally(TestNG.java:1109)
    at org.testng.TestNG.runSuites(TestNG.java:1039)
    at org.testng.TestNG.run(TestNG.java:1007)
    at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
    at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:110)
Caused by: javax.ws.rs.ClientErrorException: HTTP 412 Precondition Failed
    at org.glassfish.jersey.client.JerseyInvocation.createExceptionForFamily(JerseyInvocation.java:1122)
    at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1105)
    at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:883)
    at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:767)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:229)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414)
    at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:765)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:456)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.post(JerseyInvocation.java:357)
    at org.apache.pulsar.client.admin.internal.ClustersImpl.setNamespaceIsolationPolicy(ClustersImpl.java:174)
    ... 29 more
```
----
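(For reference: the HTTP 412 above is `ClustersImpl.setNamespaceIsolationPolicy` reporting that the `use` cluster is not registered, so the isolation-policy call is running before, or without, that cluster being created. A hedged sketch of that precondition with the 2.5.0-era Java admin client follows; the admin URL is a placeholder, and it assumes `ClusterData` still exposes a plain constructor as it did in 2.5.0.)
```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.ClusterData;

// Sketch: make sure the cluster referenced by an isolation policy exists before
// creating the policy, which is the precondition behind the 412
// "Cluster use does not exist" failure above.
public class EnsureClusterExists {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build()) {
            if (!admin.clusters().getClusters().contains("use")) {
                admin.clusters().createCluster("use", new ClusterData("http://localhost:8080"));
            }
            // ... only then set the namespace isolation policy for cluster "use".
        }
    }
}
```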
2020-02-25 05:03:46 UTC - Roman Popenov: `org.apache.pulsar.broker.admin.AdminApiSchemaValidationEnforced#testEnableSchemaValidationEnforcedHasSchemaMismatch` is failing with this error ----
2020-02-25 05:18:50 UTC - Roman Popenov: Help and guidance would be appreciated. ----
2020-02-25 05:19:01 UTC - Roman Popenov: I am running the tests against 2.5.0 ----
2020-02-25 05:56:18 UTC - Sijie Guo: @Roman Popenov - maybe try to run those tests individually and see what the errors are. ----
2020-02-25 07:58:58 UTC - Addison Higham: :phew: okay, finally got at least a subset of my research into a GH issue: <https://github.com/apache/pulsar/issues/6416> tada: Devin G. Bost +1: Penghui Li ----
2020-02-25 08:04:50 UTC - Sijie Guo: /cc @Penghui Li ----
2020-02-25 08:06:28 UTC - Penghui Li: Ok, I will take a look ----