Hello,
With Pulsar 2.8.0 we have the Exclusive Producer, which allows you to use
Pulsar as a consistent write-ahead-log for replicated state machines.

More than once I have needed to build replicated state storage on top
of Pulsar, and I would like to share some thoughts.

We can provide a simple built-in mechanism to share "state" across
several instances of an application without adding a database or other
components to the architecture, for example:
- metadata
- dynamic configuration
- task assignments
- key-value database

In general we can provide an API to handle a shared distributed Java
Object: each client can access the Object and mutate its state, while
the API ensures consistency.

I have drafted a small API to build such an abstraction:

public interface PulsarDatabase<V, O> {

    /**
     * Read from the current state.
     * @param reader a function that accesses current state and returns a value
     * @param latest ensure that the value is the latest
     * @return a handle to the result of the operation
     */
    <K> CompletableFuture<K> read(Function<V, K> reader, boolean latest);

    /**
     * Execute a mutation on the state.
     * The operationsGenerator generates a list of mutations to be
     * written to the log, the operationApplier function
     * is executed to mutate the state after each successful write
     * to the log. Finally the reader function can read from
     * the current state before releasing the write lock.
     * @param operationsGenerator generates a list of mutations
     * @param operationApplier apply each mutation to the current state
     * @param reader read from the state while inside the write lock
     * @param <K> the returned data type
     * @param <O> the operation type
     * @return a handle to the completion of the operation
     */
    <K> CompletableFuture<K> write(Function<V, List<O>> operationsGenerator,
                                   BiConsumer<V, O> operationApplier,
                                   Function<V, K> reader);
}

Using this simple abstraction it is easy to build, for instance, a
distributed Java "Map" like this one:
https://github.com/eolivelli/pulsar-db/blob/main/src/main/java/org/apache/pulsar/db/PulsarMap.java
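
To give an idea of the shape of such a Map without opening the repository,
here is a minimal single-process sketch. It is not the code from the link
above. It assumes that the operationApplier described in the Javadoc is
passed to write() as a BiConsumer<V, O>, and InMemoryDatabase, MapOp and
SharedMapSketch are hypothetical names: an ArrayList stands in for the
Pulsar topic, so none of the distributed edge cases are covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Same shape as the drafted interface; the BiConsumer type is an assumption.
interface PulsarDatabase<V, O> {
    <K> CompletableFuture<K> read(Function<V, K> reader, boolean latest);

    <K> CompletableFuture<K> write(Function<V, List<O>> operationsGenerator,
                                   BiConsumer<V, O> operationApplier,
                                   Function<V, K> reader);
}

// Hypothetical stand-in: a local list plays the role of the log (the topic).
final class InMemoryDatabase<V, O> implements PulsarDatabase<V, O> {
    private final V state;
    private final List<O> log = new ArrayList<>();

    InMemoryDatabase(V initialState) {
        this.state = initialState;
    }

    @Override
    public synchronized <K> CompletableFuture<K> read(Function<V, K> reader, boolean latest) {
        // There is only one local copy, so it is always the latest; the flag is ignored.
        return CompletableFuture.completedFuture(reader.apply(state));
    }

    @Override
    public synchronized <K> CompletableFuture<K> write(Function<V, List<O>> operationsGenerator,
                                                       BiConsumer<V, O> operationApplier,
                                                       Function<V, K> reader) {
        for (O op : operationsGenerator.apply(state)) {
            log.add(op);                        // stands in for a successful write to the log
            operationApplier.accept(state, op); // the state changes only after the append
        }
        // The reader runs while the monitor (our "write lock") is still held.
        return CompletableFuture.completedFuture(reader.apply(state));
    }

    synchronized int logSize() {
        return log.size();
    }
}

// One logged mutation; a null value means "remove the key".
final class MapOp {
    final String key;
    final String value;

    MapOp(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

final class SharedMapSketch {
    private final PulsarDatabase<Map<String, String>, MapOp> db;

    SharedMapSketch(PulsarDatabase<Map<String, String>, MapOp> db) {
        this.db = db;
    }

    private static void apply(Map<String, String> state, MapOp op) {
        if (op.value == null) {
            state.remove(op.key);
        } else {
            state.put(op.key, op.value);
        }
    }

    CompletableFuture<String> get(String key) {
        return db.read(state -> state.get(key), true);
    }

    CompletableFuture<Void> put(String key, String value) {
        return db.write(state -> Collections.singletonList(new MapOp(key, value)),
                SharedMapSketch::apply, state -> null);
    }

    CompletableFuture<Void> remove(String key) {
        return db.write(state -> Collections.singletonList(new MapOp(key, null)),
                SharedMapSketch::apply, state -> null);
    }

    // Logs nothing if the key exists; completes with the value stored afterwards.
    CompletableFuture<String> putIfAbsent(String key, String value) {
        return db.write(
                state -> state.containsKey(key)
                        ? Collections.<MapOp>emptyList()
                        : Collections.singletonList(new MapOp(key, value)),
                SharedMapSketch::apply,
                state -> state.get(key));
    }
}
```

The point of putIfAbsent is that the generator looks at the current state
before deciding what to log, so a conditional update needs no extra
machinery. A real implementation would have to make that check hold against
writes coming from other clients, which is where the Exclusive Producer
comes in.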


I believe that we should add this feature to the Pulsar Client API.
We could start by adding it to the pulsar-adapters module, as it can
be loosely coupled with the core Pulsar Client.

Building distributed data structures on top of that API is simple,
but the underlying implementation of the core API is not straightforward,
because there are many edge cases to deal with.

If we provide some recipes that are available out of the box, we will
unleash the secret power of the Exclusive Producer and we will allow more
applications to migrate to Pulsar or to choose Pulsar as their storage
backbone.

You can find the code here: https://github.com/eolivelli/pulsar-db
It is only a proof of concept, but it is already usable.

If there is interest in this, I will be happy to draft a PIP
and to send the implementation to the pulsar-adapters repository.

Best regards

Enrico
