2020-06-26 21:40:23 UTC - Devin G. Bost: Has anyone else run into issues with 
function load balancing with 2.5.2?
----
2020-06-26 21:41:52 UTC - Jerry Peng: Can you provide some more detail on the 
issue?
----
2020-06-26 21:44:33 UTC - Devin G. Bost: We’re deploying a new cluster, and all 
of the functions are running on a single broker.
----
2020-06-26 21:48:14 UTC - Jerry Peng: That could be caused by many situations.  
Were all the function workers fully up when you submitted your function?  I 
would also check if there were subsequent worker failures that may have caused 
functions to get re-scheduled.

BTW, I am also working on a mechanism to rebalance functions.
----
2020-06-26 21:56:21 UTC - Devin G. Bost: In this cluster, the brokers are the 
function workers.

We tried restarting the node, and it pushed all the functions to a different 
node.

We also noticed another issue. On one of the nodes, all of the functions’ 
healthchecks are failing.
----
2020-06-26 21:58:23 UTC - Devin G. Bost: The healthchecks are only failing on 
that particular instance.

I’m checking if I can find any differences in the configurations between this 
instance and the others.
----
2020-06-26 22:03:23 UTC - Devin G. Bost: That’s awesome that you’re working on 
function rebalancing though.
----
2020-06-26 22:05:40 UTC - Devin G. Bost: Here’s an example of one of those 
healthcheck failures:
```2020-06-26T21:47:00,373 [function-timer-thread-60-1] ERROR 
org.apache.pulsar.functions.runtime.process.ProcessRuntime - Health check 
failed for pla-record-sink-0```
----
2020-06-26 22:07:40 UTC - Devin G. Bost: I wonder if they’re separate issues.
----
2020-06-26 22:25:33 UTC - Devin G. Bost: We’re also getting a healthcheck 
failure on all the functions on a broker with 2.5.2. We spun up a new broker 
and shut off the other one, and the problem just seems to move to another 
broker.
----
2020-06-26 22:38:47 UTC - Devin G. Bost: If we shut off all the problematic 
brokers, it seems we can get a healthy cluster, but now we’re down to just 3 
brokers on that cluster…
----
2020-06-26 22:39:59 UTC - Devin G. Bost: Has anyone changed the healthcheck 
code recently? Just wondering where to look to investigate this…
----
2020-06-26 22:41:56 UTC - Jerry Peng: @Devin G. Bost 
<https://github.com/apache/pulsar/blob/master/pulsar-functions/runtime/src/main/java/org/apache/pulsar/functions/runtime/process/ProcessRuntime.java#L170>
----
2020-06-26 22:42:11 UTC - Jerry Peng: That logic hasn't changed for a while
----
2020-06-26 22:42:24 UTC - Devin G. Bost: That’s what’s so weird about this 
issue…
----
2020-06-26 22:43:03 UTC - Jerry Peng: Though that logic is only responsible for 
restarting dead processes and doesn't control the scheduling of functions to 
workers
+1 : Devin G. Bost
----
2020-06-26 22:43:37 UTC - Devin G. Bost: Have we upgraded any of the gRPC 
versions?
----
2020-06-26 22:44:15 UTC - Devin G. Bost: I’m somewhat familiar with the 
heartbeat logic because I contributed that feature for Go functions for 2.6.0
----
