Considerations for Function Mesh PIP-66 with Geo-Replication

Devin Bost Wed, 05 Aug 2020 17:04:39 -0700

It occurred to me that we don't yet have a great solution for replicating
function workloads when running Pulsar with geo-replication.


I describe two concerns below: failover and varying function behavior.

1. Failover. There are challenges with running functions in multiple
clusters concurrently. If a function consumes from a replicated topic and
produces to another replicated topic, running that function in every
cluster would duplicate messages at best, and in the worst case, could
cause major problems, such as duplicating transactions to a database or
creating race conditions. (Consider the case of processing a customer
purchase transaction.) Of course, it could be
argued that downstream services must be idempotent if consuming from Pulsar
(especially if on a geo-replicated topic), but it could still multiply the
load on those external systems, which would be a problem during peak or
bursty traffic. Ideally, there would be a way to start Pulsar functions in
a failover environment once a failover condition occurred. For example,
let's say a Pulsar cluster in us-west failed. Then, the Pulsar functions
that were replicated from us-west to us-east in an idle (stopped) mode
would be triggered to start in us-east. This capability would ensure that
existing function workloads would continue to operate despite a failure in
one cluster.
In this scenario, we'd want functions to behave the same way in each
cluster. For example, let's say a function depends on a web service that
uses a single global endpoint (such as one behind a global load balancer).
If the function were running concurrently in both clusters, it could cause
problems. However, the function instances wouldn't need different
configurations in the different clusters as long as they could be started
once Pulsar failover was triggered.
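
To make the idle-standby idea concrete, here is a rough sketch of the
control loop I have in mind. It is not a Pulsar API; every name in it
(StandbyFunction, FailoverController, primary_is_healthy, the threshold of
three failed checks) is hypothetical, and stopping the standby functions
again once the primary recovers is my own assumption about failback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StandbyFunction:
    """A function replicated to the standby cluster in a stopped state."""
    name: str
    running: bool = False


@dataclass
class FailoverController:
    # Hypothetical names throughout; nothing here is a real Pulsar API.
    primary_is_healthy: Callable[[], bool]  # e.g. a health probe of us-west
    functions: List[StandbyFunction]        # idle copies living in us-east
    failures_before_failover: int = 3       # assumed threshold to avoid flapping
    _consecutive_failures: int = field(default=0, init=False)

    def tick(self) -> None:
        """Run one health-check cycle against the primary cluster."""
        if self.primary_is_healthy():
            self._consecutive_failures = 0
            # Assumption: on recovery, the standby copies go back to idle
            # so the function never runs in both clusters at once.
            self._set_running(False)
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failures_before_failover:
            # Failover condition met: start the idle functions.
            self._set_running(True)

    def _set_running(self, running: bool) -> None:
        for fn in self.functions:
            fn.running = running
```

In a real deployment, _set_running would have to call whatever mechanism
starts and stops functions in the standby cluster, and the health probe
would need to distinguish a cluster failure from a network partition.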

2. Varying function behavior. In other cases, replicating functions across
environments would necessitate varying their configurations. For example,
let's say we have two environments, us-east and us-west, each with its own
Pulsar cluster and its own separate Apache Ignite cluster. When
replicating functions from us-west to us-east, we don't want the functions
in both environments to point to the Ignite cluster in us-west; we
want the functions running in us-east to point to the Ignite cluster in
us-east. This kind of configuration would enable true replication of
function workloads with other geo-replicated services. For example, perhaps
we want users accessing a website in the us-east region to hit only the
Ignite cache in us-east, alongside the Pulsar cluster in us-east. We'd
need a way to configure that for a geo-replication-enabled function mesh.
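
One way to picture this is a shared function definition plus per-cluster
overrides that get merged when the function is deployed to a given
cluster. The sketch below is only an illustration: the config keys, the
function name, and the Ignite hostnames are all made up, and I'm not
claiming the function mesh supports anything like this today.

```python
from typing import Any, Dict, Mapping

# Shared definition replicated to every cluster (hypothetical keys).
BASE_CONFIG: Dict[str, Any] = {
    "name": "enrich-sessions",
    "inputs": ["persistent://public/default/sessions"],
    "parallelism": 2,
}

# Per-cluster overrides; the hostnames are placeholders.
CLUSTER_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "us-west": {"ignite_address": "ignite.us-west.example.internal:10800"},
    "us-east": {"ignite_address": "ignite.us-east.example.internal:10800"},
}


def resolve_function_config(
    base: Mapping[str, Any],
    overrides: Mapping[str, Mapping[str, Any]],
    cluster: str,
) -> Dict[str, Any]:
    """Merge the shared config with the overrides for one cluster.

    Raises KeyError for an unknown cluster rather than silently falling
    back to another region's settings.
    """
    if cluster not in overrides:
        raise KeyError(f"no overrides defined for cluster {cluster!r}")
    merged = dict(base)
    merged.update(overrides[cluster])
    return merged
```

With this shape, resolve_function_config(BASE_CONFIG, CLUSTER_OVERRIDES,
"us-east") yields the shared settings with the us-east Ignite address,
so the function in each region talks to its local Ignite cluster.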

I'd like to start a discussion to hear thoughts on how the function mesh
could work with geo-replication or if we'd still need to keep these
features completely separate. It seems that there are challenges with
operating Pulsar with either approach. Also, if anyone has any thoughts on
best practices for managing function workloads in a geo-replicated
environment (such as for failover), that would be helpful to facilitate the
discussion as well.

Thanks,

Devin G. Bost
Cell: (503) 473-1773
