Jack, Il giorno lun 15 nov 2021 alle ore 10:37 Jack Vanlightly <jvanligh...@splunk.com.invalid> ha scritto:
> Hi Enrico, > > This is interesting. I'm thinking of the KTables part of Kafka Streams and > also Raft state machines. > > You could build something equivalent to a Raft state machine on top of > Pulsar where WaitForExclusive acts as leader election and the topic as the > log. Yes, that's the idea. > I would be interested in a PIP, that also includes considerations for > things like: > > - how applications can persist an applied index to avoid having to > rebuild state from scratch > - how to manage topic size such that it isn't easy for users to end up > with inconsistent views of the state. Things like checkpointing, may be > having more control over topic data retention. > > Also how does this compare to PIP-104 Table View? ( > > http://mail-archives.apache.org/mod_mbox/pulsar-dev/202110.mbox/%3cCA+JmKXbGf3CMy5dxypX=so-gyxrdh0pmttkhy3wnnm4azgw...@mail.gmail.com%3e > ) > This is partially related but the intent is different. I don't want to create a "database" using Pulsar (like ksqlDB is for Kafka), that would be interesting but that's not the scope of this feature. I saw few times that you need some shared state among the instances of your application, and unfortunately you end up in adding some additional component, like a small external SQL DB (like PostGRE) or something like HerdDB (that is still a SQL DB but can store data on BookKeeper and also can run inside the same JVM of the applications). We already have something like this to manage Pulsar Functions assignments, or in Kafka Connect Adapter to emulate consumer group management. I came to implement this in a bunch of other projects. We need in Pulsar something that is very lightweight but that allows you to synchronize some state among your clients. A similar tool is present in Pravega.io with the State Synchronizers abstraction https://pravega.io/docs/v0.7.1/state-synchronizer-design/ This is why I don't think that initially we need some mechanism to store the data locally. But that's something that we can provide, one step at a time. > > Finally, have we looked at what the market is asking for? What kind of > product strategy do we have regarding Pulsar? AFAIK The is no "strategy", the project is growing thanks to the contributions of users > This kind of thing could end > up being highly valuable but also a big investment and it would be good to > know how the community is steering Pulsar to make it the most relevant it > can be. > I don't want to add a "Pulsar backed Database", in order to do so it would be better to spin off a new project (or subproject) Enrico > > Jack > > > On Wed, Nov 10, 2021 at 5:18 PM Enrico Olivelli <eolive...@gmail.com> > wrote: > > > [ External sender. Exercise caution. ] > > > > Hello, > > With Pulsar 2.8.0 we have the Exclusive Producer, which allows you to use > > Pulsar as a consistent write-ahead-log for replicated state machines. > > > > It already happened to me a couple of times to need to build some > > replicated state storage on top of Pulsar and I would like to share some > > thoughts. > > > > We can provide some simple built-in mechanism to share some "state" > across > > several instances of an application without adding some Database or other > > components to the architecture: > > - metadata > > - dynamic configuration > > - task assignments > > - key-value database > > > > In general we can provide an API to handle a shared distributed Java > > Object: each client can access the Object and mutate the State, > > ensuring consistency. > > > > I have drafted a small API to build such an abstraction: > > > > public interface PulsarDatabase<V, O> { > > > > /** > > * Read from the current state. > > * @param reader a function that accesses current state and returns a > > value > > * @param latest ensure that the value is the latest > > * @return an handle to the result of the operation > > */ > > <K> CompletableFuture<K> read(Function<V, K> reader, boolean latest); > > > > /* > > * Execute a mutation on the state. > > * The operationsGenerator generates a list of mutations to be > > * written to the log, the operationApplier function > > * is executed to mutate the state after each successful write > > * to the log. Finally the reader function can read from > > * the current status before releasing the write lock. > > * @param operationsGenerator generates a list of mutations > > * @param operationApplier apply each mutation to the current state > > * @param reader read from the status while inside the write lock > > * @param <K> the returned data type > > * @param <O> the operation type > > * @return a handle to the completion of the operation > > */ > > <K> CompletableFuture<K> write(Function<V, List<O>> > > operationsGenerator, > > Function<V, K> reader); > > } > > > > Using this simple abstraction it is easy to build for instance a > > distributed Java "Map" like this > > > > > https://github.com/eolivelli/pulsar-db/blob/main/src/main/java/org/apache/pulsar/db/PulsarMap.java > > > > > > I believe that we should add this feature to the Pulsar Client API, > > maybe we can start by adding this in the pulsar-adapters module as it can > > be loosely coupled with the core Pulsar Client > > > > Building distributed data structures on top of that API is simple, > > but the underlying implementation of the core APi is not straightforward, > > because there are many > > edge cases to deal with. > > > > If we provide some recipes that are available out-of-the-box we will > > unleash the secret power > > of Exclusive producer and we will allow more applications to migrate to > > Pulsar or to choose Pulsar as storage backbone. > > > > You can find the code here https://github.com/eolivelli/pulsar-db, it is > > only a proof-of-concept, but it is already usable. > > > > If there is an interest in this I will be happy to draft a PIP > > and also to send the implementation to the pulsar-adapters repository. > > > > Best regards > > > > Enrico > > >