2020-04-16 15:45:50 UTC - matt_innerspace.io: any chance we can get another docker image for `2.5.1 rc4`? Or instructions on how to do it? cc: @Addison Higham? ---- 2020-04-16 15:50:22 UTC - Addison Higham: building the docker images is *fairly* straightforward `mvn package -Pdocker -DskipTests` but it works best in a new clone of the repo, their tends to be a few droppings left behind that aren't properly cleaned up (the c++/python client build leaves around some files on disk that need to be cleaned up) ---- 2020-04-16 15:50:44 UTC - Addison Higham: that produces images with the `apachepulsar/pulsar-all` name, I just re-tag them and push ---- 2020-04-16 15:50:51 UTC - Addison Higham: but, I am kicking off a build right now :slightly_smiling_face: ---- 2020-04-16 15:51:00 UTC - Addison Higham: will push and let you know when it is done +1 : matt_innerspace.io ---- 2020-04-16 15:51:49 UTC - matt_innerspace.io: do you do the build in the `apachepulsar/pulsar-build` container? ---- 2020-04-16 16:22:12 UTC - Addison Higham: I don't run anything in there directly, the mvn process does use that container though, but I don't believe any java compilation happens there ---- 2020-04-16 17:53:03 UTC - Sijie Guo: @matt_innerspace.io you can try this `docker pull streamnative/pulsar-all:v2.5.1-candidate-4` +1 : matt_innerspace.io, Konstantinos Papalias ---- 2020-04-16 18:07:22 UTC - Addison Higham: woohoo! I was having issues with a build on mine that I got distracted on (though I think I did just figure it out) ---- 2020-04-16 22:47:39 UTC - Addison Higham: FYI, deployed rc4 to both beta and prod clusters, deployed to prod to see if it fixed the issue we are having with producers that we don't have a reliable repro for. So far so good. ---- 2020-04-17 00:12:22 UTC - Addison Higham: well... the good news, we don't see any regressions, but the patches I submitted that I *hoped* would fix my producers not getting cleaned up (resulting in a ton of "producer already exists errors") didn't work (surprisingly...) but it did give some more concrete data, thread here ---- 2020-04-17 00:13:06 UTC - Addison Higham: So, I added code to ensure that futures returned out of the broker service timeout, now I get this new error: ``` 23:24:11.077 [pulsar-io-24-3] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.9.127.254:55768] Failed to create topic <persistent://notification-service-prod/bridge/message-notification-center> pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Future didn't finish within deadline pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at org.apache.pulsar.broker.service.BrokerService.lambda$futureWithDeadline$18(BrokerService.java:740) ~[org.apache.pulsar-pulsar-broker-2.5.1.jar:2.5.1] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.45.Final.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) [io.netty-netty-common-4.1.45.Final.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty-netty-common-4.1.45.Final.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty-netty-common-4.1.45.Final.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) [io.netty-netty-transport-native-epoll-4.1.45.Final-linux-x86_64.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty-netty-common-4.1.45.Final.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.45.Final.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.45.Final.jar:4.1.45.Final] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: Caused by: java.util.concurrent.TimeoutException: Future didn't finish within deadline pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at org.apache.pulsar.broker.service.BrokerService.futureWithDeadline(BrokerService.java:747) ~[org.apache.pulsar-pulsar-broker-2.5.1.jar:2.5.1] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at org.apache.pulsar.broker.service.BrokerService.loadOrCreatePersistentTopic(BrokerService.java:839) ~[org.apache.pulsar-pulsar-broker-2.5.1.jar:2.5.1] pulsar/pulsar-prod-broker-7899c598c8-ld7jx[broker]: at org.apache.pulsar.broker.service.BrokerService.lambda$getTopic$13(BrokerService.java:678) ~[org.apache.pulsar-pulsar-broker-2.5.1.jar:2.5.1]``` ---- 2020-04-17 00:13:55 UTC - Addison Higham: this confirms that ```BrokerService.loadOrCreatePersistentTopic``` ---- 2020-04-17 00:14:01 UTC - Addison Higham: is just stalling out for unknown reasons ---- 2020-04-17 00:15:14 UTC - Addison Higham: the extra mystery, in `ServerCnx.java:1091` which ultimately calls into `loadOrCreatePersistentTopic`, we have the following code that should ensure producers are cleaned up: ``` Throwable cause = exception.getCause(); if (!(cause instanceof ServiceUnitNotReadyException)) { // Do not print stack traces for expected exceptions log.error("[{}] Failed to create topic {}", remoteAddress, topicName, exception); }
// If client timed out, the future would have been completed // by subsequent close. Send error back to // client, only if not completed already. if (producerFuture.completeExceptionally(exception)) { ctx.writeAndFlush(Commands.newError(requestId, BrokerServiceException.getClientErrorCode(cause), cause.getMessage())); } producers.remove(producerId, producerFuture); return null;``` ---- 2020-04-17 04:56:24 UTC - Addison Higham: okay, so I think that this change <https://github.com/apache/pulsar/pull/6489> has introduce a minor regression for loading topics with many replicated clusters. The timeout is simply too low as it takes a while to load topics that need to load replication clusters. I have raised the timeout here: <https://github.com/apache/pulsar/pull/6750> which I think should be sufficient and reasonable, but long term, it seems like the brokerServer could use some refactoring such that it is more steps which each individually have a timeout. After some analysis, I do think this is helping in all other cases where see issues with producers (but need to observe a bit longer to be doubly sure) ---- 2020-04-17 06:47:29 UTC - Sijie Guo: @tuteng @Penghui Li ^^ ----