I am going to put up a PIP this week or the week after Enrico
Il giorno lun 15 nov 2021 alle ore 13:00 Enrico Olivelli < eolive...@gmail.com> ha scritto: > Jack, > > Il giorno lun 15 nov 2021 alle ore 10:37 Jack Vanlightly > <jvanligh...@splunk.com.invalid> ha scritto: > >> Hi Enrico, >> >> This is interesting. I'm thinking of the KTables part of Kafka Streams and >> also Raft state machines. >> >> You could build something equivalent to a Raft state machine on top of >> Pulsar where WaitForExclusive acts as leader election and the topic as the >> log. > > > Yes, that's the idea. > > >> I would be interested in a PIP, that also includes considerations for >> things like: >> >> - how applications can persist an applied index to avoid having to >> rebuild state from scratch >> - how to manage topic size such that it isn't easy for users to end up >> with inconsistent views of the state. Things like checkpointing, may be >> having more control over topic data retention. >> >> Also how does this compare to PIP-104 Table View? ( >> >> http://mail-archives.apache.org/mod_mbox/pulsar-dev/202110.mbox/%3cCA+JmKXbGf3CMy5dxypX=so-gyxrdh0pmttkhy3wnnm4azgw...@mail.gmail.com%3e >> ) >> > > This is partially related but the intent is different. > > I don't want to create a "database" using Pulsar (like ksqlDB is for > Kafka), that would be interesting but that's not the scope of this feature. > > I saw few times that you need some shared state among the instances of > your application, and unfortunately you end up in adding some additional > component, like a small external SQL DB (like PostGRE) > or something like HerdDB (that is still a SQL DB but can store data on > BookKeeper and also can run inside the same JVM of the applications). > > We already have something like this to manage Pulsar Functions > assignments, or in Kafka Connect Adapter to emulate consumer group > management. > I came to implement this in a bunch of other projects. > > We need in Pulsar something that is very lightweight but that allows you > to synchronize some state among your clients. > > A similar tool is present in Pravega.io with the State Synchronizers > abstraction > https://pravega.io/docs/v0.7.1/state-synchronizer-design/ > > This is why I don't think that initially we need some mechanism to store > the data locally. > But that's something that we can provide, one step at a time. > > > >> >> Finally, have we looked at what the market is asking for? What kind of >> product strategy do we have regarding Pulsar? > > > AFAIK The is no "strategy", the project is growing thanks to the > contributions of users > > >> This kind of thing could end >> up being highly valuable but also a big investment and it would be good to >> know how the community is steering Pulsar to make it the most relevant it >> can be. >> > > I don't want to add a "Pulsar backed Database", in order to do so it would > be better to spin off a new project (or subproject) > > > > Enrico > > >> >> Jack >> >> >> On Wed, Nov 10, 2021 at 5:18 PM Enrico Olivelli <eolive...@gmail.com> >> wrote: >> >> > [ External sender. Exercise caution. ] >> > >> > Hello, >> > With Pulsar 2.8.0 we have the Exclusive Producer, which allows you to >> use >> > Pulsar as a consistent write-ahead-log for replicated state machines. >> > >> > It already happened to me a couple of times to need to build some >> > replicated state storage on top of Pulsar and I would like to share some >> > thoughts. >> > >> > We can provide some simple built-in mechanism to share some "state" >> across >> > several instances of an application without adding some Database or >> other >> > components to the architecture: >> > - metadata >> > - dynamic configuration >> > - task assignments >> > - key-value database >> > >> > In general we can provide an API to handle a shared distributed Java >> > Object: each client can access the Object and mutate the State, >> > ensuring consistency. >> > >> > I have drafted a small API to build such an abstraction: >> > >> > public interface PulsarDatabase<V, O> { >> > >> > /** >> > * Read from the current state. >> > * @param reader a function that accesses current state and returns >> a >> > value >> > * @param latest ensure that the value is the latest >> > * @return an handle to the result of the operation >> > */ >> > <K> CompletableFuture<K> read(Function<V, K> reader, boolean >> latest); >> > >> > /* >> > * Execute a mutation on the state. >> > * The operationsGenerator generates a list of mutations to be >> > * written to the log, the operationApplier function >> > * is executed to mutate the state after each successful write >> > * to the log. Finally the reader function can read from >> > * the current status before releasing the write lock. >> > * @param operationsGenerator generates a list of mutations >> > * @param operationApplier apply each mutation to the current state >> > * @param reader read from the status while inside the write lock >> > * @param <K> the returned data type >> > * @param <O> the operation type >> > * @return a handle to the completion of the operation >> > */ >> > <K> CompletableFuture<K> write(Function<V, List<O>> >> > operationsGenerator, >> > Function<V, K> reader); >> > } >> > >> > Using this simple abstraction it is easy to build for instance a >> > distributed Java "Map" like this >> > >> > >> https://github.com/eolivelli/pulsar-db/blob/main/src/main/java/org/apache/pulsar/db/PulsarMap.java >> > >> > >> > I believe that we should add this feature to the Pulsar Client API, >> > maybe we can start by adding this in the pulsar-adapters module as it >> can >> > be loosely coupled with the core Pulsar Client >> > >> > Building distributed data structures on top of that API is simple, >> > but the underlying implementation of the core APi is not >> straightforward, >> > because there are many >> > edge cases to deal with. >> > >> > If we provide some recipes that are available out-of-the-box we will >> > unleash the secret power >> > of Exclusive producer and we will allow more applications to migrate to >> > Pulsar or to choose Pulsar as storage backbone. >> > >> > You can find the code here https://github.com/eolivelli/pulsar-db, it >> is >> > only a proof-of-concept, but it is already usable. >> > >> > If there is an interest in this I will be happy to draft a PIP >> > and also to send the implementation to the pulsar-adapters repository. >> > >> > Best regards >> > >> > Enrico >> > >> >