Hi. We have made some progress on the topic.
The JRaft fork has been merged into Ignite 3 master and is now integrated
with the other components that are ready. The first-iteration design of the
transactional protocol is published on master [1].

[1] https://github.com/apache/ignite-3/tree/main/modules/transactions

Sat, 20 Mar 2021 at 21:00, Alexei Scherbakov <alexey.scherbak...@gmail.com>:

> Folks,
>
> I want to share some information about progress in implementing the Raft
> protocol in Ignite 3, which is a prerequisite for metastorage.
>
> The implementation will consist of client and server modules. The client
> is responsible for interoperability between a Raft server node and any
> other remote/local Java process.
>
> I have recently finished a Raft client API. The public API part is
> available here [1] for review. The entry point is the RaftGroupService
> interface. The service implementation has not been finished yet and can be
> skipped for now.
>
> As for the server part, we are currently investigating two options. The
> first is the etcd [2] implementation ported to Java. The drawback here is
> the amount of work required to make it work. The second option is to adopt
> the jraft [3] implementation. It is a full-featured implementation already
> written in Java, but the code is not quite clean in my opinion and will
> require some refactoring.
>
> The next step is to make the Raft client work with the server
> implementations. At least one is required for the next alpha. It is planned
> to have the same client for both server implementations. As soon as both
> are ready, we will compare them by running consistency tests and
> benchmarks and drop the worse one. I will give the next update when we
> have a working client and at least one server implementation ready.
>
> [1] https://github.com/apache/ignite-3/pull/59/files
> [2] https://github.com/etcd-io/etcd/tree/master/raft
> [3] https://github.com/sofastack/sofa-jraft
>
> Fri, 27 Nov 2020 at 20:26, Alexey Goncharuk <alexey.goncha...@gmail.com>:
>
>> Folks, thanks to everyone who joined the call. Summary:
>>
>>    - We agree that it may be beneficial to separate the metastorage and
>>    group membership services; however, the abstractions should be clean
>>    enough so that we could implement group membership via metastorage
>>    - Production cluster setup will involve an administrator 'init' command
>>    that will initialize the metastorage Raft group. Once the metastorage
>>    is initialized, all nodes may be restarted arbitrarily
>>    - An HA cluster must contain at least 3 nodes. A 2-node cluster will
>>    stop making progress when one of the nodes fails (due to metastorage
>>    requirements)
>>    - We will provide a 'developer' cluster mode which will allow a 1-node
>>    setup and auto-initialization without the 'init' command
>>    - We are targeting centralized affinity calculation that will be stored
>>    in the metastorage. Metastorage downtime does not necessarily mean
>>    cluster unavailability (subject to the partition replication protocol
>>    choice). It would be good to hide the partition object as much as
>>    possible so that we could support range partitioning in the future
>>
>> To discuss at the next meeting (do not hesitate to send questions here
>> before the meeting):
>>
>>    - Raft implementation details (API model, porting, etc.)
>>    - Transactions interaction with the replication protocol
>>    - Weaker consistency options
>>
>> Please add more if I forgot something and let's choose a time for the next
>> meeting.
>>
>> --AG
>>
>> Thu, 26 Nov 2020 at 16:12, Kseniya Romanova <romanova.ks....@gmail.com>:
>>
>> > Done
>> >
>> > Thu, 26 Nov 2020 at 13:18, Ivan Daschinsky <ivanda...@gmail.com>:
>> >
>> > > Alexey, is it possible to have the call at 16:00 MSK?
>> > >
>> > > Thu, 26 Nov 2020 at 12:30, Alexey Goncharuk <alexey.goncha...@gmail.com>:
>> > >
>> > > > Hi Ivan,
>> > > >
>> > > > Unfortunately, the earliest window available for us is 12:00 MSK
>> > > > (1 hour slot), or after 14:30 MSK. Let me know what time works best
>> > > > for you.
>> > > >
>> > > > Wed, 25 Nov 2020 at 21:38, Ivan Daschinsky <ivanda...@gmail.com>:
>> > > >
>> > > > > Alexey, I kindly ask you to move the meeting a little earlier,
>> > > > > ideally to the morning.
>> > > > >
>> > > > > Wed, 25 Nov 2020 at 20:10, Alexey Goncharuk <alexey.goncha...@gmail.com>:
>> > > > >
>> > > > > > Folks, let's have the call on Friday, Nov 27th at 18:00 MSK? We
>> > > > > > can use the following waiting room link:
>> > > > > > https://zoom.us/j/99450012496?pwd=RWZmOGhCNWlRK0ZpamdOOTZsYTJ0dz09
>> > > > > >
>> > > > > > Let me know if this time works for everybody.
>> > > > > >
>> > > > > > Wed, 25 Nov 2020 at 16:42, Alexey Goncharuk <alexey.goncha...@gmail.com>:
>> > > > > >
>> > > > > > > Folks,
>> > > > > > >
>> > > > > > > I've made some edits in IEP-61 [1] regarding the group
>> > > > > > > membership service and transaction protocol interaction with
>> > > > > > > the replication infrastructure, please take a look before our
>> > > > > > > Friday call.
>> > > > > > >
>> > > > > > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure
>> > > > > > >
>> > > > > > > Mon, 23 Nov 2020 at 13:28, Alexey Goncharuk <alexey.goncha...@gmail.com>:
>> > > > > > >
>> > > > > > >> Thanks, Ivan,
>> > > > > > >>
>> > > > > > >> Another protocol for group membership worth checking out is
>> > > > > > >> RAPID [1] (a recent one). Not sure though if there are any
>> > > > > > >> available implementations for it already.
>> > > > > > >>
>> > > > > > >> [1] https://www.usenix.org/system/files/conference/atc18/atc18-suresh.pdf
>> > > > > > >>
>> > > > > > >> Mon, 23 Nov 2020 at 10:46, Ivan Daschinsky <ivanda...@gmail.com>:
>> > > > > > >>
>> > > > > > >>> Also, here is some interesting reading about gossip, SWIM etc.
>> > > > > > >>>
>> > > > > > >>> 1 -- http://www.cs.cornell.edu/Info/Projects/Spinglass/public_pdfs/SWIM.pdf
>> > > > > > >>> 2 -- http://www.antonkharenko.com/2015/09/swim-distributed-group-membership.html
>> > > > > > >>> 3 -- https://github.com/hashicorp/memberlist (Foundation
>> > > > > > >>> library of hashicorp serf)
>> > > > > > >>> 4 -- https://github.com/scalecube/scalecube-cluster -- (Java
>> > > > > > >>> implementation of SWIM)
>> > > > > > >>>
>> > > > > > >>> Thu, 19 Nov 2020 at 16:35, Ivan Daschinsky <ivanda...@gmail.com>:
>> > > > > > >>>
>> > > > > > >>> > >> Friday, Nov 27th work for you? If ok, let's have an open
>> > > > > > >>> > >> call then.
>> > > > > > >>> > Yes, great
>> > > > > > >>> > >> As for the protocol port - we will not be dealing with the
>> > > > > > >>> > >> concurrency...
>> > > > > > >>> > >> Judging by the Rust port, it seems fairly straightforward.
>> > > > > > >>> > Yes, they chose to split transport and logic. But the original
>> > > > > > >>> > Go package from etcd (see raft/node.go) contains some
>> > > > > > >>> > heartbeat mechanism etc.
>> > > > > > >>> > I agree with you, this does not seem to be a huge deal to port.
>> > > > > > >>> >
>> > > > > > >>> > Thu, 19 Nov 2020 at 16:13, Alexey Goncharuk <alexey.goncha...@gmail.com>:
>> > > > > > >>> >
>> > > > > > >>> >> Ivan,
>> > > > > > >>> >>
>> > > > > > >>> >> Agree, let's have a call to discuss the IEP. I have some more
>> > > > > > >>> >> thoughts regarding how the replication infrastructure works
>> > > > > > >>> >> with atomic/transactional caches, will put this info into the
>> > > > > > >>> >> IEP. Does next Friday, Nov 27th work for you? If ok, let's
>> > > > > > >>> >> have an open call then.
>> > > > > > >>> >>
>> > > > > > >>> >> As for the protocol port - we will not be dealing with the
>> > > > > > >>> >> concurrency model if we choose this way, this is what I like
>> > > > > > >>> >> about their code structure. Essentially, the raft module is a
>> > > > > > >>> >> single-threaded automaton which has callbacks to process a
>> > > > > > >>> >> message and to process a tick (timeout), and which produces
>> > > > > > >>> >> messages that should be sent and log entries that should be
>> > > > > > >>> >> persisted. Judging by the Rust port, it seems fairly
>> > > > > > >>> >> straightforward. Will be happy to discuss this and other
>> > > > > > >>> >> alternatives on the call as well.
>> > > > > > >>> >>
>> > > > > > >>> >> Thu, 19 Nov 2020 at 14:41, Ivan Daschinsky <ivanda...@gmail.com>:
>> > > > > > >>> >>
>> > > > > > >>> >> > > Any existing library that can be used to avoid
>> > > > > > >>> >> > > re-implementing the protocol ourselves? Perhaps, porting
>> > > > > > >>> >> > > the existing implementation to Java
>> > > > > > >>> >> > Personally, I like this idea. Go libraries (either the raft
>> > > > > > >>> >> > module of etcd or serf by Hashicorp) are famous for clean
>> > > > > > >>> >> > code, good design, stability, and modest size.
>> > > > > > >>> >> > But, on the other hand, Go has a different concurrency
>> > > > > > >>> >> > model, and porting will probably not be so straightforward.
>> > > > > > >>> >> >
>> > > > > > >>> >> > Thu, 19 Nov 2020 at 13:48, Ivan Daschinsky <ivanda...@gmail.com>:
>> > > > > > >>> >> >
>> > > > > > >>> >> > > I'd suggest discussing this IEP and the technical details
>> > > > > > >>> >> > > in an open ZOOM meeting.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Thu, 19 Nov 2020 at 13:47, Ivan Daschinsky <ivanda...@gmail.com>:
>> > > > > > >>> >> > >
>> > > > > > >>> >> > >> ---------- Forwarded message ---------
>> > > > > > >>> >> > >> From: Ivan Daschinsky <ivanda...@gmail.com>
>> > > > > > >>> >> > >> Date: Thu, 19 Nov 2020 at 13:02
>> > > > > > >>> >> > >> Subject: Re: IEP-61 Technical discussion
>> > > > > > >>> >> > >> To: Alexey Goncharuk <alexey.goncha...@gmail.com>
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> Alexey, let's raise another question. Specifically, how
>> > > > > > >>> >> > >> nodes initially find each other (discovery) and how they
>> > > > > > >>> >> > >> detect failures.
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> I suppose that a gossip protocol is an ideal candidate.
>> > > > > > >>> >> > >> For example, consul [1] uses this approach, using the
>> > > > > > >>> >> > >> serf [2] library to discover members of the cluster.
>> > > > > > >>> >> > >> Then consul forms a raft ensemble (server nodes), and
>> > > > > > >>> >> > >> clients use the raft ensemble only as a lock service.
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> PacificA suggests an internal heartbeat mechanism for
>> > > > > > >>> >> > >> failure detection within a replicated group, but it says
>> > > > > > >>> >> > >> nothing about initial discovery of nodes.
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> WDYT?
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> [1] -- https://www.consul.io/docs/architecture/gossip
>> > > > > > >>> >> > >> [2] -- https://www.serf.io/
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> Thu, 19 Nov 2020 at 12:46, Alexey Goncharuk <alexey.goncha...@gmail.com>:
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >>> Following up the Ignite 3.0 scope/development approach
>> > > > > > >>> >> > >>> threads, this is a separate thread to discuss technical
>> > > > > > >>> >> > >>> aspects of the IEP.
>> > > > > > >>> >> > >>>
>> > > > > > >>> >> > >>> Let's reiterate one more time on the questions raised by
>> > > > > > >>> >> > >>> Ivan and also see if there are any other thoughts on the
>> > > > > > >>> >> > >>> IEP:
>> > > > > > >>> >> > >>>
>> > > > > > >>> >> > >>>    - *Whether to deploy metastorage on a separate subset
>> > > > > > >>> >> > >>>    of the nodes or allow Ignite to choose these nodes
>> > > > > > >>> >> > >>>    automatically.* I think it is feasible to maintain
>> > > > > > >>> >> > >>>    both modes: by default, Ignite will choose metastorage
>> > > > > > >>> >> > >>>    nodes automatically, which essentially will provide
>> > > > > > >>> >> > >>>    the same seamless user experience as TCP discovery SPI
>> > > > > > >>> >> > >>>    - no separate roles, simplistic deployment. For
>> > > > > > >>> >> > >>>    deployments where people want to have more
>> > > > > > >>> >> > >>>    fine-grained control over the nodes' assignments, we
>> > > > > > >>> >> > >>>    will provide a runtime configuration which will allow
>> > > > > > >>> >> > >>>    pinning the metastorage group to certain nodes, thus
>> > > > > > >>> >> > >>>    eliminating the latency concerns.
>> > > > > > >>> >> > >>>    - *Whether there are any TLA+ specs for the PacificA
>> > > > > > >>> >> > >>>    protocol.* Not to my knowledge, but it is known to be
>> > > > > > >>> >> > >>>    used in production by Microsoft and other projects,
>> > > > > > >>> >> > >>>    e.g. [1]
>> > > > > > >>> >> > >>>
>> > > > > > >>> >> > >>> I would like to collect general feedback on the IEP, as
>> > > > > > >>> >> > >>> well as feedback on specific parts of it, such as:
>> > > > > > >>> >> > >>>
>> > > > > > >>> >> > >>>    - Metastorage API
>> > > > > > >>> >> > >>>    - Any existing library that can be used to avoid
>> > > > > > >>> >> > >>>    re-implementing the protocol ourselves? Perhaps,
>> > > > > > >>> >> > >>>    porting the existing implementation to Java (the way
>> > > > > > >>> >> > >>>    TiKV did with etcd-raft [2] [3]? This is a very neat
>> > > > > > >>> >> > >>>    way btw in my opinion because I like the finite
>> > > > > > >>> >> > >>>    automaton-like approach of the replication module,
>> > > > > > >>> >> > >>>    and, additionally, we could sync bug fixes and
>> > > > > > >>> >> > >>>    improvements from the upstream project)
>> > > > > > >>> >> > >>>
>> > > > > > >>> >> > >>> Thanks,
>> > > > > > >>> >> > >>> --AG
>> > > > > > >>> >> > >>>
>> > > > > > >>> >> > >>> [1] https://cwiki.apache.org/confluence/display/INCUBATOR/PegasusProposal
>> > > > > > >>> >> > >>> [2] https://github.com/etcd-io/etcd/tree/master/raft
>> > > > > > >>> >> > >>> [3] https://github.com/tikv/raft-rs
>> > > > > > >>> >> > >>>
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> --
>> > > > > > >>> >> > >> Sincerely yours, Ivan Daschinskiy
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >> --
>> > > > > > >>> >> > >> Sincerely yours, Ivan Daschinskiy
>> > > > > > >>> >> > >>
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > --
>> > > > > > >>> >> > > Sincerely yours, Ivan Daschinskiy
>> > > > > > >>> >> > >
>> > > > > > >>> >> >
>> > > > > > >>> >> > --
>> > > > > > >>> >> > Sincerely yours, Ivan Daschinskiy
>> > > > > > >>> >> >
>> > > > > > >>> >>
>> > > > > > >>> >
>> > > > > > >>> > --
>> > > > > > >>> > Sincerely yours, Ivan Daschinskiy
>> > > > > > >>> >
>> > > > > > >>>
>> > > > > > >>> --
>> > > > > > >>> Sincerely yours, Ivan Daschinskiy
>> > > > > > >>>
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > > > --
>> > > > > Sincerely yours, Ivan Daschinskiy
>> > > > >
>> > > >
>> > >
>> > > --
>> > > Sincerely yours, Ivan Daschinskiy
>> > >
>> >
>
> --
>
> Best regards,
> Alexei Scherbakov
>

--
Best regards,
Alexei Scherbakov
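
P.S. On the three-node minimum from the call summary: a rough sketch of the arithmetic behind it, assuming the metastorage Raft group uses standard majority quorums. The class and method names are hypothetical and are not part of the Ignite 3 or JRaft API.

```java
/** Majority-quorum arithmetic for a Raft-style group (illustrative helper only, not Ignite API). */
public final class QuorumMath {
    private QuorumMath() {
    }

    /** Smallest number of voters that forms a strict majority of the group. */
    public static int majority(int groupSize) {
        if (groupSize < 1) {
            throw new IllegalArgumentException("groupSize must be >= 1");
        }
        return groupSize / 2 + 1;
    }

    /** Number of voters that may fail while a majority is still reachable. */
    public static int toleratedFailures(int groupSize) {
        return groupSize - majority(groupSize);
    }

    public static void main(String[] args) {
        // Print the quorum size and failure tolerance for small groups.
        for (int n = 1; n <= 5; n++) {
            System.out.printf("nodes=%d majority=%d toleratedFailures=%d%n",
                    n, majority(n), toleratedFailures(n));
        }
    }
}
```

Under that assumption a 2-node group needs both nodes for a majority, so it tolerates no failures, while a 3-node group keeps a majority of 2 after losing one node.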