It occurred to me that we don't yet have a great solution for replicating function workloads when running Pulsar with geo-replication.
I see two concerns, described below: failover and varying function behavior.

1. Failover

There are challenges with running functions in multiple clusters concurrently. If a function consumes from one replicated topic and produces to another, running a copy of it in each cluster would duplicate messages at best and, at worst, could cause major problems such as duplicate database transactions or race conditions. (Consider the case of processing a customer purchase transaction.) Of course, it could be argued that downstream services must be idempotent when consuming from Pulsar (especially from a geo-replicated topic), but duplication would still multiply the load on those external systems, which would be a problem during peak or bursty traffic.

Ideally, there would be a way to start Pulsar functions in a failover environment once a failover condition occurs. For example, suppose a Pulsar cluster in us-west fails. The Pulsar functions that had been replicated from us-west to us-east in an idle (stopped) state would then be triggered to start in us-east. This capability would ensure that existing function workloads continue to operate despite a failure in one cluster.

In this case, we'd want functions to have the same behavior in each cluster. For example, suppose a function depends on a web service that uses a single global endpoint (such as one behind a global load balancer). If the function were running concurrently in both clusters, it could cause problems. However, we wouldn't need the function instances to have different configurations in the different clusters, as long as they could be started once Pulsar failover was triggered.

2. Varying function behavior

In other cases, replicating functions across environments would require varying their configurations. For example, suppose we have two environments, us-east and us-west.
Each of us-east and us-west has its own Pulsar cluster and its own Apache Ignite cluster. When replicating functions from us-west to us-east, we don't want the functions in both environments to point to the Ignite cluster in us-west; we want the functions running in us-east to point to the Ignite cluster in us-east. This kind of configuration would enable true replication of function workloads alongside other geo-replicated services. For example, perhaps we want users accessing a website in the us-east region to hit only the Ignite cache in us-east, served by the Pulsar cluster in us-east. We'd need a way to configure that for a geo-replication-enabled function mesh.

I'd like to start a discussion to hear thoughts on how the function mesh could work with geo-replication, or whether we'd still need to keep these features completely separate. It seems that there are challenges with operating Pulsar with either approach. Also, if anyone has thoughts on best practices for managing function workloads in a geo-replicated environment (such as for failover), that would help the discussion as well.

Thanks,
Devin G. Bost
Cell: (503) 473-1773