Slack digest for #dev - 2020-04-24

Apache Pulsar Slack Fri, 24 Apr 2020 02:11:43 -0700

2020-04-23 14:43:20 UTC - Jianfeng Qiao: Does anyone know why op.addOpCount is 
assigned twice, in both create() and initiate() of OpAddEntry.java?
----
2020-04-23 15:22:03 UTC - Addison Higham: is there a doc on how to retrigger 
github tests
----
2020-04-23 15:23:29 UTC - Addison Higham: is it just `/pulsarbot 
run-failure-checks`?
----
2020-04-23 15:34:17 UTC - Yuvaraj Loganathan: Yes
----
2020-04-23 16:01:47 UTC - Patrik Kleindl: @Patrik Kleindl has joined the channel
----
2020-04-23 16:35:17 UTC - Sijie Guo: 
<https://github.com/apache/pulsar-test-infra/blob/master/pulsarbot/README.md>
----
2020-04-23 16:43:37 UTC - Addison Higham: one of the biggest remaining issues I 
now see when running in k8s:
- a bookie is lost/replaced (I mostly see this when trying to do bookie 
maintenance/restarts)
- bookie rescheduled
- broker never gets notified of new bookie
- broker doesn't have enough healthy bookies to form a new ensemble


I added <https://github.com/apache/pulsar/pull/6800> and plan on setting the 
interval to something like every 60 seconds, which doesn't seem extreme. 
Obviously the real question, though, is what is going wrong: is the ZK watch 
not firing, or is it being dropped somewhere? ZK watches seem critical to the 
health of pulsar, so my biggest concern is that if this is something systemic, 
I may have larger issues.
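
An illustrative sketch, not taken from the thread or from the linked PR: a 
periodic re-read of the bookie list can act as a safety net when a watch 
notification is missed. Every name here (`PeriodicBookieRefresher`, 
`bookieLister`, `refreshOnce`) is hypothetical, and the supplier stands in for 
whatever call actually lists registered bookies.

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: none of these names come from Pulsar or BookKeeper.
final class PeriodicBookieRefresher implements AutoCloseable {
    // Assumed to return the currently registered bookie addresses.
    private final Supplier<Set<String>> bookieLister;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile Set<String> knownBookies = Set.of();

    PeriodicBookieRefresher(Supplier<Set<String>> bookieLister) {
        this.bookieLister = bookieLister;
    }

    // Re-reads the full list; returns true if it differs from the cached view.
    boolean refreshOnce() {
        Set<String> latest = Set.copyOf(bookieLister.get());
        boolean changed = !latest.equals(knownBookies);
        knownBookies = latest;
        return changed;
    }

    // Polls on a fixed delay, independent of any watch-based notification.
    void start(long intervalSeconds) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                refreshOnce();
            } catch (RuntimeException e) {
                // Swallow so that one failed read does not cancel the schedule.
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    Set<String> knownBookies() {
        return knownBookies;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With `start(60)` the cached view would lag a missed notification by at most 
roughly one interval, which is the trade-off behind picking a value like 60 
seconds.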
----
2020-04-23 16:47:56 UTC - Addison Higham: @Sijie Guo ^^  I watched your pulsar 
on k8s youtube video so I wonder if you have some context here, have you 
observed this problem?
----
2020-04-23 16:51:23 UTC - Chris Bartholomew: @Addison Higham I have definitely 
seen this, but have not figured out exactly why. Restarting the broker clears 
the problem.
----
2020-04-23 16:53:58 UTC - Addison Higham: yeah, I think I know now how to work 
around it also in an automated fashion, but it does raise for me some concerns 
that it could be a more systematic issue
----
2020-04-23 16:54:25 UTC - Chris Bartholomew: I run broker health checks as 
liveness probes to work around this
----
2020-04-23 16:56:33 UTC - Addison Higham: yeah, that is what I am doing now as 
well, but I still don't like how much downtime that can be, so I'm going to see 
if I can use that getBookieInfo change to drop the time down more
----
2020-04-23 16:57:22 UTC - Addison Higham: (my biggest annoyance with broker 
checks as liveness probes is it makes the logs so dang noisy... but that is a 
whole other problem, need to get pulsar doing more structured logging)
----
2020-04-23 17:19:45 UTC - Sijie Guo: @Addison Higham:

Did you see any errors in the broker log? One of the possibilities is from this 
issue - <https://github.com/apache/bookkeeper/pull/2301>

Since bookkeeper is deployed using a statefulset, the pod DNS is only 
resolvable when the pod is ready (if you have a readiness probe for the bookie 
pod, it will wait until the readiness check succeeds). So in some cases, you 
will see an NPE when resolving the network address, which causes bookies not to 
be added to the network topology.

> my biggest annoyance with broker checks as liveness probes
I usually don’t recommend using the broker health check as a liveness probe, as 
it can potentially bring down the whole cluster. The liveness of a broker 
shouldn’t depend on the health of an entire bookkeeper cluster. That being 
said, the broker should stay up even if the bookkeeper cluster is not writable.
----
2020-04-23 17:25:19 UTC - Addison Higham: aha, yeah so I did add a readiness 
probe as well, I will go dig in and see if I see that error. As far as broker 
with liveness probe, I also thought it might not be ideal, but until I can have 
the brokers better self heal, at least it gets me back to being able to serve 
writes.
----
2020-04-23 17:31:21 UTC - Addison Higham: @Sijie Guo have you tried using 
`publishNotReadyAddresses` on the service to fix that issue?
----
2020-04-23 18:51:48 UTC - Addison Higham: huh, so `publishNotReadyAddresses` 
fixed that exception, but it didn't fix the issue: because of a cached DNS 
entry, it kept trying to hit the old IP
----
2020-04-23 19:28:03 UTC - matt_innerspace.io: 2.5.1 possible bug?  Seems there 
was a change in the 
`<https://pulsar.apache.org/admin/v3/functions/{tenant}/{namespace}/{functionName}>` 
endpoint, where the method signature changed, specifically the 
`functionConfig` parameter, which switched from a String (in 2.4.0) to an 
object (2.5.1) in `FunctionsApiV3Resource.java` as shown below in 2.5.1:
```
    @POST
    @Path("/{tenant}/{namespace}/{functionName}")
    @Consumes(MediaType.MULTIPART_FORM_DATA)
    public void registerFunction(final @PathParam("tenant") String tenant,
                                 final @PathParam("namespace") String namespace,
                                 final @PathParam("functionName") String functionName,
                                 final @FormDataParam("data") InputStream uploadedInputStream,
                                 final @FormDataParam("data") FormDataContentDisposition fileDetail,
                                 final @FormDataParam("url") String functionPkgUrl,
                                 final @FormDataParam("functionConfig") FunctionConfig functionConfig) {
```
POSTs that worked before now return `400, Bad Request, b'{"reason":"Function 
config is not provided"}'`

The result of this is it's seemingly impossible to register a function via the 
REST API (from python).  Perhaps I'm missing something?
----
2020-04-23 19:43:40 UTC - matt_innerspace.io: logged here - 
<https://github.com/apache/pulsar/issues/6809>
----
2020-04-23 20:11:16 UTC - Devin G. Bost: What’s the most graceful way to handle 
error messages from a sink?
Wanting opinions.
----
2020-04-23 20:13:21 UTC - Sijie Guo: `publishNotReadyAddresses` is one 
solution. You can also tune the Java DNS cache TTL.
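
An illustrative sketch, not from the thread: the JVM's cache lifetime for 
successful DNS lookups is governed by the `networkaddress.cache.ttl` security 
property, which can be set programmatically or in the `java.security` file. 
The class name and the 30-second value are assumptions chosen for the example, 
and the property generally needs to be set early in startup, before the JVM 
performs its first lookups.

```java
import java.security.Security;

// Hypothetical helper class; only the Security property name is a real JVM setting.
final class DnsCacheTtl {
    // Sets the JVM-wide TTL (in seconds) for successful DNS lookups
    // and returns the value now stored in the security properties.
    static String setPositiveTtl(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void main(String[] args) {
        // 30 seconds is an assumed value, not a recommendation from the thread.
        System.out.println(setPositiveTtl(30));
    }
}
```

A shorter TTL means a rescheduled pod's new IP is picked up sooner, at the cost 
of more frequent DNS queries.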
----
2020-04-23 22:10:08 UTC - Sijie Guo: it is `string` in v2 admin.
----
2020-04-23 22:10:39 UTC - Sijie Guo: it is changed to FunctionConfig in v3 
endpoint
----
2020-04-23 22:10:55 UTC - Sijie Guo: v2 or v3 is not related to pulsar versions.
----
2020-04-23 22:11:05 UTC - Sijie Guo: it is the version for http endpoint
----
2020-04-23 22:26:51 UTC - Sijie Guo: I replied here 
<https://github.com/apache/pulsar/issues/6809#issuecomment-618703430>
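
The thread itself does not show the fix. As a sketch under a stated assumption, 
a multipart reader that binds a form part to an object such as `FunctionConfig` 
usually needs that part to declare a JSON content type, so a client would send 
`functionConfig` as an `application/json` part rather than a plain string 
field. The class name, boundary, and sample JSON below are hypothetical, and 
the body covers only the `functionConfig` part (a real request would also carry 
a `data` or `url` part).

```java
// Hypothetical helper; it only builds the request body text and sends nothing.
final class FunctionConfigMultipart {
    // Builds a multipart/form-data body with a single "functionConfig" part
    // that is explicitly typed as JSON (assumed requirement, see lead-in).
    static String build(String boundary, String functionConfigJson) {
        String crlf = "\r\n";
        return "--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"functionConfig\"" + crlf
                + "Content-Type: application/json" + crlf
                + crlf
                + functionConfigJson + crlf
                + "--" + boundary + "--" + crlf;
    }

    public static void main(String[] args) {
        // Sample values are placeholders, not taken from the thread.
        String json = "{\"tenant\":\"public\",\"namespace\":\"default\",\"name\":\"example\"}";
        System.out.print(build("example-boundary", json));
    }
}
```

The request's own `Content-Type` header would then be 
`multipart/form-data; boundary=example-boundary`, matching the boundary used in 
the body.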
----
