Folks,

I want to share a progress update on implementing the Raft protocol in Ignite 3, which is a prerequisite for metastorage.

The implementation will consist of client and server modules. The client is responsible for interoperability between a Raft server node and any other remote or local Java process.

I have recently finished the Raft client API. The public part of the API is available for review here [1]. The entry point is the RaftGroupService interface. The service implementation is not finished yet and can be skipped for now.

As for the server part, we are currently investigating two options. The first is the etcd [2] implementation ported to Java. The drawback here is the amount of work required to get it working. The second option is to adopt the jraft [3] implementation. It is a full-featured implementation already written in Java, but in my opinion the code is not quite clean and will require some refactoring.

The next step is to get the Raft client working with the server implementations. At least one is required for the next alpha. The plan is to use the same client for both server implementations. As soon as both are ready, we will compare them by running consistency tests and benchmarks and drop the worse one.

I will send the next update when we have a working client and at least one server implementation ready.

[1] https://github.com/apache/ignite-3/pull/59/files
[2] https://github.com/etcd-io/etcd/tree/master/raft
[3] https://github.com/sofastack/sofa-jraft

Fri, Nov 27, 2020 at 20:26, Alexey Goncharuk <alexey.goncha...@gmail.com>:

> Folks, thanks to everyone who joined the call. Summary:
>
> - We agree that it may be beneficial to separate metastorage and group membership services, however, the abstractions should be clean enough so that we could implement group membership via metastorage
> - Production cluster setup will involve an administrator 'init' command that will initialize the metastorage raft group. Once the metastorage is initialized, all nodes may be restarted arbitrarily
> - HA cluster must contain at least 3 nodes.
>   2-node cluster will stop progress when one of the nodes fails (due to metastorage requirements)
> - We will provide a 'developer' cluster mode which will allow a 1-node setup and auto-initialization without the 'init' command
> - We are targeting centralized affinity calculation that will be stored to the metastorage. Metastorage downtime does not necessarily mean cluster availability (subject to the partition replication protocol choice). It would be good to maximally hide the partition object so that we could support range partitioning in the future
>
> To discuss at the next meeting (do not hesitate to send questions here before the meeting):
>
> - Raft implementation details (API model, porting, etc)
> - Transactions interaction with replication protocol
> - Weaker consistency options
>
> Please add more if I forgot something and let's choose a time for the next meeting.
>
> --AG
>
> Thu, Nov 26, 2020 at 16:12, Kseniya Romanova <romanova.ks....@gmail.com>:
>
> > Done
> >
> > Thu, Nov 26, 2020 at 13:18, Ivan Daschinsky <ivanda...@gmail.com>:
> >
> > > Alexey, is it possible to manage call at 16:00 MSK?
> > >
> > > Thu, Nov 26, 2020 at 12:30, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> > >
> > > > Hi Ivan,
> > > >
> > > > Unfortunately, the earliest window available for us is 12:00 MSK (1 hour slot), or after 14:30 MSK. Let me know what time works best for you.
> > > >
> > > > Wed, Nov 25, 2020 at 21:38, Ivan Daschinsky <ivanda...@gmail.com>:
> > > >
> > > > > Alexey, I kindly ask you to move the meeting a little bit earlier, ideal variant -- in the morning.
> > > > >
> > > > > Wed, Nov 25, 2020 at 20:10, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> > > > >
> > > > > > Folks, let's have the call on Friday, Nov 27th at 18:00 MSK?
> > > > > > We can use the following waiting room link:
> > > > > > https://zoom.us/j/99450012496?pwd=RWZmOGhCNWlRK0ZpamdOOTZsYTJ0dz09
> > > > > >
> > > > > > Let me know if this time works for everybody.
> > > > > >
> > > > > > Wed, Nov 25, 2020 at 16:42, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> > > > > >
> > > > > > > Folks,
> > > > > > >
> > > > > > > I've made some edits in IEP-61 [1] regarding the group membership service and transaction protocol interaction with the replication infrastructure, please take a look before our Friday call.
> > > > > > >
> > > > > > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure
> > > > > > >
> > > > > > > Mon, Nov 23, 2020 at 13:28, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> > > > > > >
> > > > > > >> Thanks, Ivan,
> > > > > > >>
> > > > > > >> Another protocol for group membership worth checking out is RAPID [1] (a recent one). Not sure though if there are any available implementations for it already.
> > > > > > >>
> > > > > > >> [1] https://www.usenix.org/system/files/conference/atc18/atc18-suresh.pdf
> > > > > > >>
> > > > > > >> Mon, Nov 23, 2020 at 10:46, Ivan Daschinsky <ivanda...@gmail.com>:
> > > > > > >>
> > > > > > >>> Also, here is some interesting reading about gossip, SWIM etc.
> > > > > > >>>
> > > > > > >>> 1 -- http://www.cs.cornell.edu/Info/Projects/Spinglass/public_pdfs/SWIM.pdf
> > > > > > >>> 2 -- http://www.antonkharenko.com/2015/09/swim-distributed-group-membership.html
> > > > > > >>> 3 -- https://github.com/hashicorp/memberlist (Foundation library of hashicorp serf)
> > > > > > >>> 4 -- https://github.com/scalecube/scalecube-cluster -- (Java implementation of SWIM)
> > > > > > >>>
> > > > > > >>> Thu, Nov 19, 2020 at 16:35, Ivan Daschinsky <ivanda...@gmail.com>:
> > > > > > >>>
> > > > > > >>> > >> Friday, Nov 27th work for you? If ok, let's have an open call then.
> > > > > > >>> > Yes, great
> > > > > > >>> > >> As for the protocol port - we will not be dealing with the concurrency...
> > > > > > >>> > >>Judging by the Rust port, it seems fairly straightforward.
> > > > > > >>> > Yes, they chose split transport and logic. But original Go package from etcd (see raft/node.go) contains some heartbeats mechanism etc.
> > > > > > >>> > I agree with you, this seems not to be a huge deal to port.
> > > > > > >>> >
> > > > > > >>> > Thu, Nov 19, 2020 at 16:13, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> > > > > > >>> >
> > > > > > >>> >> Ivan,
> > > > > > >>> >>
> > > > > > >>> >> Agree, let's have a call to discuss the IEP. I have some more thoughts regarding how the replication infrastructure works with atomic/transactional caches, will put this info to the IEP. Does next Friday, Nov 27th work for you? If ok, let's have an open call then.
> > > > > > >>> >>
> > > > > > >>> >> As for the protocol port - we will not be dealing with the concurrency model if we choose this way, this is what I like about their code structure. Essentially, the raft module is a single-threaded automata which has a callback to process a message, process a tick (timeout) and produces messages that should be sent and log entries that should be persisted. Judging by the Rust port, it seems fairly straightforward. Will be happy to discuss this and other alternatives on the call as well.
> > > > > > >>> >>
> > > > > > >>> >> Thu, Nov 19, 2020 at 14:41, Ivan Daschinsky <ivanda...@gmail.com>:
> > > > > > >>> >>
> > > > > > >>> >> > > Any existing library that can be used to avoid re-implementing the protocol ourselves? Perhaps, porting the existing implementation to Java
> > > > > > >>> >> > Personally, I like this idea. Go libraries (either raft module of etcd or serf by Hashicorp) are famous for clean code, good design, stability, not enormous size.
> > > > > > >>> >> > But, on other side, Go has different model for concurrency and porting probably will not be so straightforward.
> > > > > > >>> >> >
> > > > > > >>> >> > Thu, Nov 19, 2020 at 13:48, Ivan Daschinsky <ivanda...@gmail.com>:
> > > > > > >>> >> >
> > > > > > >>> >> > > I'd suggest to discuss this IEP and technical details in open ZOOM meeting.
> > > > > > >>> >> > >
> > > > > > >>> >> > > Thu, Nov 19, 2020 at 13:47, Ivan Daschinsky <ivanda...@gmail.com>:
> > > > > > >>> >> > >
> > > > > > >>> >> > >> ---------- Forwarded message ---------
> > > > > > >>> >> > >> From: Ivan Daschinsky <ivanda...@gmail.com>
> > > > > > >>> >> > >> Date: Thu, Nov 19, 2020 at 13:02
> > > > > > >>> >> > >> Subject: Re: IEP-61 Technical discussion
> > > > > > >>> >> > >> To: Alexey Goncharuk <alexey.goncha...@gmail.com>
> > > > > > >>> >> > >>
> > > > > > >>> >> > >> Alexey, let's arise another question. Specifically, how nodes initially find each other (discovery) and how they detect failures.
> > > > > > >>> >> > >>
> > > > > > >>> >> > >> I suppose, that gossip protocol is an ideal candidate. For example, consul [1] uses this approach, using serf [2] library to discover members of cluster.
> > > > > > >>> >> > >> Then consul forms raft ensemble (server nodes) and client use raft ensemble only as lock service.
> > > > > > >>> >> > >>
> > > > > > >>> >> > >> PacificA suggests internal heartbeats mechanism for failure detection of replicated group, but it says nothing about initial discovery of nodes.
> > > > > > >>> >> > >>
> > > > > > >>> >> > >> WDYT?
> > > > > > >>> >> > >>
> > > > > > >>> >> > >> [1] -- https://www.consul.io/docs/architecture/gossip
> > > > > > >>> >> > >> [2] -- https://www.serf.io/
> > > > > > >>> >> > >>
> > > > > > >>> >> > >> Thu, Nov 19, 2020 at 12:46, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> > > > > > >>> >> > >>
> > > > > > >>> >> > >>> Following up the Ignite 3.0 scope/development approach threads, this is a separate thread to discuss technical aspects of the IEP.
> > > > > > >>> >> > >>>
> > > > > > >>> >> > >>> Let's reiterate one more time on the questions raised by Ivan and also see if there are any other thoughts on the IEP:
> > > > > > >>> >> > >>>
> > > > > > >>> >> > >>>    - *Whether to deploy metastorage on a separate subset of the nodes or allow Ignite to choose these nodes automatically.* I think it is feasible to maintain both modes: by default, Ignite will choose metastorage nodes automatically which essentially will provide the same seamless user experience as TCP discovery SPI - no separate roles, simplistic deployment.
> > > > > > >>> >> > >>>      For deployments where people want to have more fine-grained control over the nodes' assignments, we will provide a runtime configuration which will allow pinning metastorage group to certain nodes, thus eliminating the latency concerns.
> > > > > > >>> >> > >>>    - *Whether there are any TLA+ specs for the PacificA protocol.* Not to my knowledge, but it is known to be used in production by Microsoft and other projects, e.g. [1]
> > > > > > >>> >> > >>>
> > > > > > >>> >> > >>> I would like to collect general feedback on the IEP, as well as feedback on specific parts of it, such as:
> > > > > > >>> >> > >>>
> > > > > > >>> >> > >>>    - Metastorage API
> > > > > > >>> >> > >>>    - Any existing library that can be used to avoid re-implementing the protocol ourselves? Perhaps, porting the existing implementation to Java (the way TiKV did with etcd-raft [2] [3]? This is a very neat way btw in my opinion because I like the finite automata-like approach of the replication module, and, additionally, we could sync bug fixes and improvements from the upstream project)
> > > > > > >>> >> > >>>
> > > > > > >>> >> > >>> Thanks,
> > > > > > >>> >> > >>> --AG
> > > > > > >>> >> > >>>
> > > > > > >>> >> > >>> [1] https://cwiki.apache.org/confluence/display/INCUBATOR/PegasusProposal
> > > > > > >>> >> > >>> [2] https://github.com/etcd-io/etcd/tree/master/raft
> > > > > > >>> >> > >>> [3] https://github.com/tikv/raft-rs
> > > > > > >>> >> > >>
> > > > > > >>> >> > >> --
> > > > > > >>> >> > >> Sincerely yours, Ivan Daschinskiy
> > > > > > >>> >> > >
> > > > > > >>> >> > > --
> > > > > > >>> >> > > Sincerely yours, Ivan Daschinskiy
> > > > > > >>> >> >
> > > > > > >>> >> > --
> > > > > > >>> >> > Sincerely yours, Ivan Daschinskiy
> > > > > > >>> >
> > > > > > >>> > --
> > > > > > >>> > Sincerely yours, Ivan Daschinskiy
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Sincerely yours, Ivan Daschinskiy
> > > > >
> > > > > --
> > > > > Sincerely yours, Ivan Daschinskiy
> > >
> > > --
> > > Sincerely yours, Ivan Daschinskiy

--
Best regards,
Alexei Scherbakov
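P.S. For readers less familiar with the etcd-style structure mentioned in the quoted thread (a single-threaded automaton that is fed messages and ticks, and hands back messages to send and state to persist), here is a minimal Java sketch of that shape. It is only an illustration under stated assumptions: TickDrivenNode, Message, Ready and the string message types are hypothetical names invented for this example. They are not taken from etcd-raft, raft-rs, jraft or the Ignite 3 pull request, and the logic is heavily simplified (no vote counting, no log replication, no randomized timeouts).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; all names are illustrative and not from any real Raft library. */
final class TickDrivenNode {
    enum Role { FOLLOWER, CANDIDATE }

    /** A message exchanged between nodes; only the fields this sketch needs. */
    static final class Message {
        final String type;
        final int from;
        final int to;
        final long term;

        Message(String type, int from, int to, long term) {
            this.type = type;
            this.from = from;
            this.to = to;
            this.term = term;
        }
    }

    /** Work handed back to the caller: records to persist and messages to send. */
    static final class Ready {
        final List<Message> toSend;
        final List<String> toPersist;

        Ready(List<Message> toSend, List<String> toPersist) {
            this.toSend = toSend;
            this.toPersist = toPersist;
        }
    }

    private final int id;
    private final List<Integer> peers;
    private final int electionTimeoutTicks;

    private Role role = Role.FOLLOWER;
    private long term;
    private int elapsedTicks;
    private List<Message> outbox = new ArrayList<>();
    private List<String> unpersisted = new ArrayList<>();

    TickDrivenNode(int id, List<Integer> peers, int electionTimeoutTicks) {
        this.id = id;
        this.peers = peers;
        this.electionTimeoutTicks = electionTimeoutTicks;
    }

    Role role() { return role; }

    long term() { return term; }

    /** Logical clock: the caller invokes this periodically; no timers or threads inside. */
    void tick() {
        if (++elapsedTicks < electionTimeoutTicks) {
            return;
        }
        // Election timeout: start a new term and ask every peer for a vote.
        elapsedTicks = 0;
        role = Role.CANDIDATE;
        term++;
        unpersisted.add("term=" + term + ",votedFor=" + id);
        for (int peer : peers) {
            outbox.add(new Message("RequestVote", id, peer, term));
        }
    }

    /** Processes one inbound message; this sketch only understands heartbeats. */
    void step(Message m) {
        if (!"Heartbeat".equals(m.type) || m.term < term) {
            return; // unknown or stale message: ignored in this sketch
        }
        if (m.term > term) {
            term = m.term;
            unpersisted.add("term=" + term);
        }
        role = Role.FOLLOWER;
        elapsedTicks = 0;
        outbox.add(new Message("HeartbeatAck", id, m.from, term));
    }

    /** Drains pending output; the caller does the actual I/O. */
    Ready ready() {
        Ready r = new Ready(outbox, unpersisted);
        outbox = new ArrayList<>();
        unpersisted = new ArrayList<>();
        return r;
    }
}
```

In this shape the caller owns all threads, timers, networking and storage: it calls tick() from its own timer, step() for each received message, and then drains ready() to persist and send. The automaton itself never blocks or spawns threads, which is presumably what keeps the concurrency model out of the port.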