Ilya, I would suggest that you discuss everything here on the dev@ list, including suggestions on how to split the work.
D.

On Tue, Sep 18, 2018 at 6:34 PM Ilya Lantukh <ilant...@gridgain.com> wrote:

> Thanks for the feedback!
>
> I agree that we should start with the simplest optimizations, but it
> seems that decoupling affinity/topology versions is necessary before we
> can make any progress here, and this is a rather complex change all over
> the code.
>
> If anyone wants to help, please contact me privately and we will discuss
> how this work can be split.
>
> Denis Magda, do you think we should create an IEP for these
> optimizations?
>
> On Tue, Sep 18, 2018 at 5:59 PM, Maxim Muzafarov <maxmu...@gmail.com>
> wrote:
>
> > Ilya,
> >
> > > 3. Start node in baseline: both affinity and topology versions
> > > should be incremented, but it might be possible to optimize PME for
> > > such a case and avoid a cluster-wide freeze. Partition assignments
> > > for such a node are already calculated, so we can simply put them
> > > all into MOVING state. However, it might take significant effort to
> > > avoid race conditions and redesign our architecture.
> >
> > As you mentioned, all assignments are already calculated. So, as
> > another suggestion, we could introduce a new "intermediate" state for
> > such joining nodes. While in this state, a node recovers all data from
> > its local storage, preloads all missed partition data from the cluster
> > (probably as of some point in time), and creates and preloads any
> > in-memory and persistent caches it has missed. Only after this
> > recovery does the node fire the discovery join event, and the
> > affinity/topology version may be incremented. I think this approach
> > can help reduce the subsequent rebalance time. WDYT?
> >
> > On Tue, 18 Sep 2018 at 16:31 Alexey Goncharuk
> > <alexey.goncha...@gmail.com> wrote:
> >
> > > Ilya,
> > >
> > > This is a great idea, but before we can ultimately decouple the
> > > affinity version from the topology version, we need to fix a few
> > > things with baseline topology first.
> > > Currently, in-memory caches do not use the baseline topology. We
> > > are going to fix this as a part of IEP-4 Phase II (baseline
> > > auto-adjust). Once that is fixed, we can safely assume that an
> > > out-of-baseline node does not affect affinity distribution.
> > >
> > > I agree with Dmitriy that we should start with the simpler
> > > optimizations first.
> > >
> > > Thu, Sep 13, 2018 at 15:58, Ilya Lantukh <ilant...@gridgain.com>:
> > >
> > > > Igniters,
> > > >
> > > > As most of you know, Ignite has a concept of
> > > > AffinityTopologyVersion, which is associated with the set of nodes
> > > > currently present in the topology and with the global cluster
> > > > state (active/inactive, baseline topology, started caches).
> > > > Modifying any of these involves a process called Partition Map
> > > > Exchange (PME) and results in a new AffinityTopologyVersion. At
> > > > that moment, all new cache and compute grid operations are
> > > > globally "frozen". This might lead to unpredictable cache
> > > > downtimes.
> > > >
> > > > However, our recent changes (especially the introduction of
> > > > Baseline Topology) caused me to rethink this concept. There are
> > > > currently many cases where we trigger PME even though it isn't
> > > > necessary. For example, adding or removing a client node, or a
> > > > server node that is not in the BLT, should never cause partition
> > > > map modifications. Those events modify the *topology*, but
> > > > *affinity* is unaffected. On the other hand, there are events that
> > > > affect only *affinity* - the most straightforward example is the
> > > > CacheAffinityChange event, which is triggered after a rebalance
> > > > finishes to assign new primary/backup nodes. So the term
> > > > *AffinityTopologyVersion* now looks weird - it tries to "merge"
> > > > two entities that aren't always related.
> > > > To me it makes sense to introduce a separate *AffinityVersion*
> > > > and *TopologyVersion*, review all events that currently modify
> > > > AffinityTopologyVersion, and split them into 3 categories: those
> > > > that modify only the AffinityVersion, those that modify only the
> > > > TopologyVersion, and those that modify both. This will allow us to
> > > > process such events using different mechanics and avoid redundant
> > > > steps, and also to reconsider the mapping of operations - some
> > > > will be mapped to topology, others to affinity.
> > > >
> > > > Here is my view of how the different event types could
> > > > theoretically be optimized:
> > > > 1. Client node start/stop: as stated above, no PME is needed;
> > > > ticket https://issues.apache.org/jira/browse/IGNITE-9558 is
> > > > already in progress.
> > > > 2. Server node start/stop not from the baseline: should be similar
> > > > to the previous case, since nodes outside of the baseline cannot
> > > > be partition owners.
> > > > 3. Start of a node in the baseline: both affinity and topology
> > > > versions should be incremented, but it might be possible to
> > > > optimize PME for this case and avoid a cluster-wide freeze.
> > > > Partition assignments for such a node are already calculated, so
> > > > we can simply put them all into MOVING state. However, it might
> > > > take significant effort to avoid race conditions and redesign our
> > > > architecture.
> > > > 4. Cache start/stop: starting or stopping one cache doesn't modify
> > > > partition maps for other caches. It should be possible to change
> > > > this procedure to skip PME and perform all necessary actions
> > > > (compute affinity, start/stop cache contexts on each node) in the
> > > > background, but this looks like a very complex modification too.
> > > > 5. Rebalance finish: it seems possible to design a "lightweight"
> > > > PME for this case as well.
> > > > If there were no node failures (and if there were, PME should be
> > > > triggered and the rebalance should be cancelled anyway), all
> > > > partition states are already known to the coordinator.
> > > > Furthermore, no new MOVING or OWNING node is introduced for any
> > > > partition, so all previous mappings should still be valid.
> > > >
> > > > For the latter, more complex cases it might be necessary to
> > > > introduce an "is compatible" relationship between affinity
> > > > versions. An operation needs to be remapped only if the new
> > > > version isn't compatible with the previous one.
> > > >
> > > > Please share your thoughts.
> > > >
> > > > --
> > > > Best regards,
> > > > Ilya
> > > >
> >
> > --
> > Maxim Muzafarov
>
> --
> Best regards,
> Ilya
>
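For readers following the thread: the split Ilya proposes, together with the "is compatible" relation, could look roughly like the sketch below. This is a hypothetical illustration, not the actual Ignite API - the class names, the major/minor scheme, and `isCompatibleWith` are all invented here to make the idea concrete.

```java
// Hypothetical sketch, NOT the actual Ignite API. Illustrates splitting
// AffinityTopologyVersion into separate topology and affinity versions,
// with an "is compatible" relation so that operations mapped to an older
// affinity version are remapped only when the new version is incompatible.
public class VersionSketch {

    /** Bumped on every discovery event (node join/leave), client or server. */
    record TopologyVersion(long ver) {}

    /**
     * Major part changes only when partition ownership can actually move
     * (e.g. a baseline change); minor part covers "lightweight" bumps such
     * as rebalance finish, where all previous mappings stay valid.
     */
    record AffinityVersion(long major, long minor) {
        /** An operation mapped to {@code prev} needs remapping only if this is false. */
        boolean isCompatibleWith(AffinityVersion prev) {
            return this.major == prev.major;
        }
    }

    public static void main(String[] args) {
        AffinityVersion beforeRebalance = new AffinityVersion(5, 0);
        AffinityVersion afterRebalance  = new AffinityVersion(5, 1); // CacheAffinityChange: minor bump
        AffinityVersion afterBltChange  = new AffinityVersion(6, 0); // baseline change: major bump

        System.out.println(afterRebalance.isCompatibleWith(beforeRebalance)); // true
        System.out.println(afterBltChange.isCompatibleWith(afterRebalance));  // false
    }
}
```

Under this scheme, a client join would bump only the TopologyVersion, a rebalance finish only the minor affinity component, and operations in flight would survive any compatible bump without remapping.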