2020-06-26 16:24:10 UTC - Alan Kittel: @Alan Kittel has joined the channel ---- 2020-06-26 21:40:23 UTC - Devin G. Bost: Has anyone else run into issues with function load balancing with 2.5.2? ---- 2020-06-26 21:41:52 UTC - Jerry Peng: Can you provide some more detail on the issue? ---- 2020-06-26 21:44:33 UTC - Devin G. Bost: We’re deploying a new cluster, and all of the functions are running on a single broker. ---- 2020-06-26 21:48:14 UTC - Jerry Peng: That could be caused by many situations. Were all the function workers fully up when you submitted your function? I would also check if there were subsequent worker failures that may have caused functions to get re-scheduled.
BTW, I am also working on a mechanism to rebalance functions. ---- 2020-06-26 21:56:21 UTC - Devin G. Bost: In this cluster, the brokers are the function workers. We tried restarting the node, and it pushed all the functions to a different node. We also noticed another issue. One of the nodes is having all of the functions’ healthchecks fail. ---- 2020-06-26 21:58:23 UTC - Devin G. Bost: The healthchecks are only failing on that particular instance. I’m checking if I can find any differences in the configurations between this instance and the others. ---- 2020-06-26 22:03:23 UTC - Devin G. Bost: That’s awesome that you’re working on function rebalancing though. ---- 2020-06-26 22:05:40 UTC - Devin G. Bost: Here’s an example of one of those healthcheck failures: ```2020-06-26T21:47:00,373 [function-timer-thread-60-1] ERROR org.apache.pulsar.functions.runtime.process.ProcessRuntime - Health check failed for pla-record-sink-0``` ---- 2020-06-26 22:07:40 UTC - Devin G. Bost: I wonder if they’re separate issues. ---- 2020-06-26 22:25:33 UTC - Devin G. Bost: We’re also getting a healthcheck failure on all the functions on a broker with 2.5.2. We spun up a new broker and shut off the other one, and the problem just seems to move to another broker. ---- 2020-06-26 22:38:47 UTC - Devin G. Bost: If we shut off all the problematic brokers, it seems we can get a healthy cluster, but now we’re down to just 3 brokers on that cluster… ---- 2020-06-26 22:39:59 UTC - Devin G. Bost: Has anyone changed the healthcheck code recently? Just wondering where to look to investigate this… ---- 2020-06-26 22:41:56 UTC - Jerry Peng: @Devin G. Bost <https://github.com/apache/pulsar/blob/master/pulsar-functions/runtime/src/main/java/org/apache/pulsar/functions/runtime/process/ProcessRuntime.java#L170> ---- 2020-06-26 22:42:11 UTC - Jerry Peng: That logic hasn't changed for a while ---- 2020-06-26 22:42:24 UTC - Devin G. 
Bost: That’s what’s so weird about this issue… ---- 2020-06-26 22:43:03 UTC - Jerry Peng: Though that logic is only responsible for restarting dead processes and doesn't control the scheduling of functions to workers +1 : Devin G. Bost ---- 2020-06-26 22:43:37 UTC - Devin G. Bost: Have we upgraded any of the gRPC versions? ---- 2020-06-26 22:44:15 UTC - Devin G. Bost: I’m somewhat familiar with the heartbeat logic because I contributed that feature for Go functions for 2.6.0 ----