Re: Enhancing the Pulsar Client with some a Shared State API

Enrico Olivelli Mon, 29 Nov 2021 08:48:45 -0800

I am going to put up a PIP this week or the week after

Enrico


Il giorno lun 15 nov 2021 alle ore 13:00 Enrico Olivelli <
eolive...@gmail.com> ha scritto:

> Jack,
>
> Il giorno lun 15 nov 2021 alle ore 10:37 Jack Vanlightly
> <jvanligh...@splunk.com.invalid> ha scritto:
>
>> Hi Enrico,
>>
>> This is interesting. I'm thinking of the KTables part of Kafka Streams and
>> also Raft state machines.
>>
>> You could build something equivalent to a Raft state machine on top of
>> Pulsar where WaitForExclusive acts as leader election and the topic as the
>> log.
>
>
> Yes, that's the idea.
>
>
>> I would be interested in a PIP, that also includes considerations for
>> things like:
>>
>>    - how applications can persist an applied index to avoid having to
>>    rebuild state from scratch
>>    - how to manage topic size such that it isn't easy for users to end up
>>    with inconsistent views of the state. Things like checkpointing, may be
>>    having more control over topic data retention.
>>
>> Also how does this compare to PIP-104 Table View? (
>>
>> http://mail-archives.apache.org/mod_mbox/pulsar-dev/202110.mbox/%3cCA+JmKXbGf3CMy5dxypX=so-gyxrdh0pmttkhy3wnnm4azgw...@mail.gmail.com%3e
>> )
>>
>
> This is partially related but the intent is different.
>
> I don't want to create a "database" using Pulsar (like ksqlDB is for
> Kafka), that would be interesting but that's not the scope of this feature.
>
> I saw few times that you need some shared state among the instances of
> your application, and unfortunately you end up in adding some additional
> component, like a small external SQL DB (like PostGRE)
> or something like HerdDB (that is still a SQL DB but can store data on
> BookKeeper and also can run inside the same JVM of the applications).
>
> We already have something like this to manage Pulsar Functions
> assignments, or in Kafka Connect Adapter to emulate consumer group
> management.
> I came to implement this in a bunch of other projects.
>
> We need in Pulsar something that is very lightweight but that allows you
> to synchronize some state among your clients.
>
> A similar tool is present in Pravega.io  with the State Synchronizers
> abstraction
> https://pravega.io/docs/v0.7.1/state-synchronizer-design/
>
> This is why I don't think that initially we need some mechanism to store
> the data locally.
> But that's something that we can provide, one step at a time.
>
>
>
>>
>> Finally, have we looked at what the market is asking for? What kind of
>> product strategy do we have regarding Pulsar?
>
>
> AFAIK The is no "strategy", the project is growing thanks to the
> contributions of users
>
>
>> This kind of thing could end
>> up being highly valuable but also a big investment and it would be good to
>> know how the community is steering Pulsar to make it the most relevant it
>> can be.
>>
>
> I don't want to add a "Pulsar backed Database", in order to do so it would
> be better to spin off a new project (or subproject)
>
>
>
> Enrico
>
>
>>
>> Jack
>>
>>
>> On Wed, Nov 10, 2021 at 5:18 PM Enrico Olivelli <eolive...@gmail.com>
>> wrote:
>>
>> > [ External sender. Exercise caution. ]
>> >
>> > Hello,
>> > With Pulsar 2.8.0 we have the Exclusive Producer, which allows you to
>> use
>> > Pulsar as a consistent write-ahead-log for replicated state machines.
>> >
>> > It already happened to me a couple of times to need to build some
>> > replicated state storage on top of Pulsar and I would like to share some
>> > thoughts.
>> >
>> > We can provide some simple built-in mechanism to share some "state"
>> across
>> > several instances of an application without adding some Database or
>> other
>> > components to the architecture:
>> > - metadata
>> > - dynamic configuration
>> > - task assignments
>> > - key-value database
>> >
>> > In general we can provide an API to handle a shared distributed Java
>> > Object: each client can access the Object and mutate the State,
>> > ensuring consistency.
>> >
>> > I have drafted a small API to build such an abstraction:
>> >
>> > public interface PulsarDatabase<V, O> {
>> >
>> >     /**
>> >      * Read from the current state.
>> >      * @param reader a function that accesses current state and returns
>> a
>> > value
>> >      * @param latest ensure that the value is the latest
>> >      * @return an handle to the result of the operation
>> >      */
>> >     <K> CompletableFuture<K> read(Function<V, K> reader, boolean
>> latest);
>> >
>> >     /*
>> >      * Execute a mutation on the state.
>> >      * The operationsGenerator generates a list of mutations to be
>> >      * written to the log, the operationApplier function
>> >      * is executed to mutate the state after each successful write
>> >      * to the log. Finally the reader function can read from
>> >      * the current status before releasing the write lock.
>> >      * @param operationsGenerator generates a list of mutations
>> >      * @param operationApplier apply each mutation to the current state
>> >      * @param reader read from the status while inside the write lock
>> >      * @param <K> the returned data type
>> >      * @param <O> the operation type
>> >      * @return a handle to the completion of the operation
>> >      */
>> >     <K> CompletableFuture<K> write(Function<V, List<O>>
>> > operationsGenerator,
>> >                                      Function<V, K> reader);
>> > }
>> >
>> > Using this simple abstraction it is easy to build for instance a
>> > distributed Java "Map" like this
>> >
>> >
>> https://github.com/eolivelli/pulsar-db/blob/main/src/main/java/org/apache/pulsar/db/PulsarMap.java
>> >
>> >
>> > I believe that we should add this feature to the Pulsar Client API,
>> > maybe we can start by adding this in the pulsar-adapters module as it
>> can
>> > be loosely coupled with the core Pulsar Client
>> >
>> > Building distributed data structures on top of that API is simple,
>> > but the underlying implementation of the core APi is not
>> straightforward,
>> > because there are many
>> > edge cases to deal with.
>> >
>> > If we provide some recipes that are available out-of-the-box we will
>> > unleash the secret power
>> > of Exclusive producer and we will allow more applications to migrate to
>> > Pulsar or to choose Pulsar as storage backbone.
>> >
>> > You can find the code here https://github.com/eolivelli/pulsar-db, it
>> is
>> > only a proof-of-concept, but it is already usable.
>> >
>> > If there is an interest in this I will be happy to draft a PIP
>> > and also to send the implementation to the pulsar-adapters repository.
>> >
>> > Best regards
>> >
>> > Enrico
>> >
>>
>

Re: Enhancing the Pulsar Client with some a Shared State API

Reply via email to