2020-02-24 17:55:54 UTC - Ruwen: @Ruwen has joined the channel ----
2020-02-24 20:34:34 UTC - Greg Gallagher: @Greg Gallagher has joined the channel ----
2020-02-24 20:50:52 UTC - Addison Higham: okay, so I am going to open up some bug reports, but just as an FYI, even with 2.5.0 I am still seeing a number of issues when the following happens:
1. an unload occurs
2. producers get disconnected
3. they try to re-connect (via the proxy)
4. the bundle gets scheduled back onto the same broker
One issue we know for sure:
• producers can *sometimes* fail to create a producer, hitting the "producer already exists" exception. Looking at a heap dump, it appears we just end up in `ServerCnx` with a producer future that never resolves. This very much looks like a race condition in which a producer re-connects before the topic is fully ready again, resulting in a timeout on the initial reconnect.
Another issue we think is related:
• we have had a few instances where we end up with a "frozen" topic. We haven't totally confirmed it, but I think it may also be related to unloads that land back on the same broker. ----
2020-02-24 20:52:56 UTC - Devin G. Bost: @Addison Higham The "frozen" topic is one that I've been concerned about for a while. ----
2020-02-24 20:53:55 UTC - Addison Higham: if you have enough data from past incidents to see whether an "Unload" happened right before the topic froze, that would be super useful ----
2020-02-24 20:54:30 UTC - Devin G. Bost: We rolled back to 2.4.0 to stop the issue… The team has been afraid to upgrade because of it. ----
2020-02-24 20:54:46 UTC - Devin G. Bost: We need to finish creating an environment that we can use to test a rollout without anyone else depending on it. ----
2020-02-24 20:54:49 UTC - Addison Higham: should still be roughly the same ----
2020-02-24 20:55:09 UTC - Devin G. Bost: For some reason, the freezing-topic issue was not as bad in 2.4.0. We don't know why. ----
2020-02-24 20:55:14 UTC - Devin G. Bost: (at least compared to 2.4.2) ----
2020-02-24 20:55:18 UTC - Devin G. Bost: We haven't tried 2.5.0 ----
2020-02-24 20:55:36 UTC - Devin G. Bost: One other thing that helped was putting our ZK cluster on very fast storage. ----
2020-02-24 20:55:53 UTC - Devin G. Bost: That alone, however, wasn't enough to create a stable system on 2.4.2. ----
2020-02-24 20:57:44 UTC - Addison Higham: yeah, I'm not sure exactly what details you'd need, but it would help to go through your logs and check whether the freeze occurs close to when an unload happens ----
2020-02-24 20:58:05 UTC - Addison Higham: which should still be present in 2.4.x and 2.5.x ----
2020-02-24 20:58:35 UTC - Devin G. Bost:
> it appears we just end up in `ServerCnx` with a producer future that never resolves.
I wonder if that could be tied to the same underlying issue that I'm seeing while trying to fix the intermittent issue with <https://github.com/apache/pulsar/issues/6332>. There's something odd going on between `ServerCnx` and `ClientCnx`. There's some documentation in that issue that could perhaps be useful. ----
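(For reference: the four-step sequence Addison describes above can be exercised by hand against a local test cluster. The sketch below is illustrative only; the service URLs and topic name are placeholders rather than values from this thread, and it assumes a single-broker setup so the bundle comes back to the same broker after the unload.)
```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

// Illustrative repro sketch for the unload -> disconnect -> reconnect sequence.
public class UnloadReconnectRepro {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://public/default/unload-repro"; // placeholder topic

        try (PulsarClient client = PulsarClient.builder()
                     .serviceUrl("pulsar://localhost:6650") // or the proxy URL, to match step 3
                     .build();
             PulsarAdmin admin = PulsarAdmin.builder()
                     .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                     .build();
             Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(topic)
                     .create()) {

            // Keep traffic flowing so the producer is active when the unload hits.
            for (int i = 0; i < 1000; i++) {
                producer.sendAsync("msg-" + i);
                if (i == 500) {
                    // Step 1: unload the topic's bundle. The broker drops the producer (step 2),
                    // the client reconnects (step 3), and with a single broker the bundle
                    // lands right back on the same one (step 4).
                    admin.topics().unload(topic);
                }
            }
            producer.flush();
        }
    }
}
```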
2020-02-24 21:03:22 UTC - Devin G. Bost: @Addison Higham I spoke too soon. I was told that my team has a new test environment running 2.5.0, and they haven't seen the freezing-topic issue there yet. ----
2020-02-24 21:03:45 UTC - Devin G. Bost: I wonder if there's a particular configuration that makes the freezing-topic issue worse. ----
2020-02-24 21:04:56 UTC - Addison Higham: is your load very heavy on a handful of topics, or is it pretty even across many topics? AFAICT, we see these issues primarily when load isn't well balanced, which causes unloads on brokers ----
2020-02-24 21:05:27 UTC - Devin G. Bost: Yes, our load is very imbalanced. We have one topic that gets about 20k/second. Most of the others get far less than that. ----
2020-02-24 21:05:59 UTC - Devin G. Bost: What is the mechanism of broker unloads? I don't know much about that. ----
2020-02-24 21:32:31 UTC - Addison Higham: topics are assigned via a hash ring: topics are grouped into N "bundles" per namespace, and bundles are what get mapped to a broker. Pulsar tracks how busy each broker is relative to the other brokers and to the bundles it is currently serving. Based on CPU usage, brokers try to do "load shedding" by "unloading" a bundle, which triggers Pulsar to re-evaluate which broker should be serving that bundle. In uneven load cases, if one bundle was consuming an entire broker, then once that bundle is unloaded the now-idle broker looks like the best candidate for it again, and the bundle gets handed right back to the same broker ----
2020-02-24 21:32:38 UTC - Addison Higham: see <http://pulsar.apache.org/docs/en/administration-load-balance/> ----
2020-02-24 21:35:49 UTC - Addison Higham: you want to design your topics (with partitioned topics, etc.) so that load is more evenly distributed across the brokers. In our case, this occurs when we are doing backfills of data. In steady state we don't see issues because unloads are rare and our load is fairly well distributed, but backfills place an imbalanced load on the cluster and then we see issues (which so far appear to be related to unloads) ----
2020-02-24 22:45:53 UTC - Sijie Guo:
> producers can *sometimes* fail to create a producer, hitting the "producer already exists" exception
Are you using 2.5.0 clients? I think @Penghui Li fixed this issue before. A producer can be left on the server side in many cases (e.g. if createProducer timed out and the application attempts to create the producer again) ----
2020-02-24 22:57:36 UTC - Addison Higham: 2.5.0 server and, I *believe*, 2.5.0 clients, but they are running inside a Pulsar Function. What I am seeing is that the `producerFuture` in `ServerCnx` never resolves to a value and just sits in the map of producerId -> `producerFuture` forever. Since `producerId` is provided by the client, all that is checked is whether that `producerId` already maps to an existing `producerFuture` that isn't done (in either a success or an error state). Because the `producerFuture` never resolves, we have to completely disconnect (as that kills the `ServerCnx`) or restart the broker ----
2020-02-24 23:19:44 UTC - Addison Higham: oh BTW, this can also happen when the topic goes broker A -> unload to broker B -> unload back to broker A; we can still see the issue. But yes, more log diving reveals that the general pattern on the broker is that when we see the error message `Namespace bundle for topic (<persistent://canvas-cdc/prod-iad/cluster71-canvas.cluster67_shard_1072.planner_overrides>) not served by this instance. Please redo the lookup. Request is denied: namespace=canvas-cdc/prod-iad`, the producerFuture isn't getting that exception, so it just sticks around unresolved ----
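(For reference: a minimal sketch of the bookkeeping Addison describes, i.e. a producerId -> future map where a pending, never-completed future blocks every retry with the same id. The class and method names below are illustrative, not the actual `ServerCnx` code.)
```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: the broker-side connection keeps producerId -> pending future,
// and a new producer command with the same id is rejected while the old future
// is still unresolved.
class ProducerRegistrySketch {
    private final ConcurrentHashMap<Long, CompletableFuture<String>> producers = new ConcurrentHashMap<>();

    CompletableFuture<String> handleProducerCommand(long producerId, String producerName) {
        CompletableFuture<String> fresh = new CompletableFuture<>();
        CompletableFuture<String> existing = producers.putIfAbsent(producerId, fresh);
        if (existing != null) {
            if (!existing.isDone()) {
                // The "producer already exists" path: if the earlier future never
                // completes (e.g. a topic unload races with the reconnect), every
                // retry with this id on this connection is rejected until the
                // connection itself is torn down.
                fresh.completeExceptionally(new IllegalStateException(
                        "Producer with id " + producerId + " is still pending on this connection"));
                return fresh;
            }
            // A completed future (success or failure) can be replaced by a new attempt.
            producers.put(producerId, fresh);
        }
        // ... topic lookup / producer creation would complete `fresh` here.
        fresh.complete(producerName);
        return fresh;
    }
}
```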
2020-02-24 23:20:25 UTC - Devin G. Bost: Interesting. I don't think I've seen that error message before… ----
2020-02-25 04:58:46 UTC - Roman Popenov: I was able to build the project with the target set to Java 11. I am working through the tests and fixing the ones that fail, but there are two tests that puzzle me: `org.apache.pulsar.broker.admin.AdminApiSchemaValidationEnforced#testEnableSchemaValidationEnforcedHasSchemaMismatch` and `org.apache.pulsar.broker.admin.AdminApiTest2#brokerNamespaceIsolationPolicies` ----
2020-02-25 04:59:47 UTC - Roman Popenov: `org.apache.pulsar.broker.admin.AdminApiTest2#brokerNamespaceIsolationPolicies` is failing with this error:
```
23:56:44.374 [pulsar-web-5299-16:org.eclipse.jetty.server.Slf4jRequestLogWriter@62] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Feb/2020:23:56:44 -0500] "POST /admin/v2/clusters/use/namespaceIsolationPolicies/policy-1 HTTP/1.1" 412 40 "-" "Pulsar-Java-v2.5.0" 3
org.apache.pulsar.client.admin.PulsarAdminException$PreconditionFailedException: Cluster use does not exist.
    at org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:220)
    at org.apache.pulsar.client.admin.internal.ClustersImpl.setNamespaceIsolationPolicy(ClustersImpl.java:177)
    at org.apache.pulsar.client.admin.internal.ClustersImpl.createNamespaceIsolationPolicy(ClustersImpl.java:156)
    at org.apache.pulsar.broker.admin.AdminApiTest2.brokerNamespaceIsolationPolicies(AdminApiTest2.java:885)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:134)
    at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:597)
    at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:173)
    at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
    at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:816)
    at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:146)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
    at org.testng.TestRunner.privateRun(TestRunner.java:766)
    at org.testng.TestRunner.run(TestRunner.java:587)
    at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
    at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
    at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
    at org.testng.SuiteRunner.run(SuiteRunner.java:286)
    at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
    at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
    at org.testng.TestNG.runSuitesSequentially(TestNG.java:1187)
    at org.testng.TestNG.runSuitesLocally(TestNG.java:1109)
    at org.testng.TestNG.runSuites(TestNG.java:1039)
    at org.testng.TestNG.run(TestNG.java:1007)
    at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
    at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:110)
Caused by: javax.ws.rs.ClientErrorException: HTTP 412 Precondition Failed
    at org.glassfish.jersey.client.JerseyInvocation.createExceptionForFamily(JerseyInvocation.java:1122)
    at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1105)
    at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:883)
    at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:767)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:229)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414)
    at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:765)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:456)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.post(JerseyInvocation.java:357)
    at org.apache.pulsar.client.admin.internal.ClustersImpl.setNamespaceIsolationPolicy(ClustersImpl.java:174)
    ... 29 more
```
----
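(For reference: the HTTP 412 above is `ClustersImpl.setNamespaceIsolationPolicy` reporting that the `use` cluster is not registered, so the isolation-policy call is running before, or without, that cluster being created. A hedged sketch of that precondition with the 2.5.0-era Java admin client follows; the admin URL is a placeholder, and it assumes `ClusterData` still exposes a plain constructor as it did in 2.5.0.)
```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.ClusterData;

// Sketch: make sure the cluster referenced by an isolation policy exists before
// creating the policy, which is the precondition behind the 412
// "Cluster use does not exist" failure above.
public class EnsureClusterExists {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build()) {
            if (!admin.clusters().getClusters().contains("use")) {
                admin.clusters().createCluster("use", new ClusterData("http://localhost:8080"));
            }
            // ... only then set the namespace isolation policy for cluster "use".
        }
    }
}
```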
2020-02-25 05:03:46 UTC - Roman Popenov: `org.apache.pulsar.broker.admin.AdminApiSchemaValidationEnforced#testEnableSchemaValidationEnforcedHasSchemaMismatch` is failing with this error ----
2020-02-25 05:18:50 UTC - Roman Popenov: Help and guidance would be appreciated. ----
2020-02-25 05:19:01 UTC - Roman Popenov: I am running the tests against 2.5.0 ----
2020-02-25 05:56:18 UTC - Sijie Guo: @Roman Popenov - maybe try to run those tests individually and see what the errors are. ----
2020-02-25 07:58:58 UTC - Addison Higham: :phew: okay, finally got at least a subset of my research into a GH issue: <https://github.com/apache/pulsar/issues/6416> tada: Devin G. Bost +1: Penghui Li ----
2020-02-25 08:04:50 UTC - Sijie Guo: /cc @Penghui Li ----
2020-02-25 08:06:28 UTC - Penghui Li: Ok, I will take a look ----