Ilya, I would suggest that you discuss everything here on the dev@ list, including suggestions on how to split the work.
D.

On Tue, Sep 18, 2018 at 6:34 PM Ilya Lantukh <ilant...@gridgain.com> wrote:

> Thanks for the feedback!
>
> I agree that we should start with the simplest optimizations, but it
> seems that decoupling affinity/topology versions is necessary before we
> can make any progress here, and this is a rather complex change all over
> the code.
>
> If anyone wants to help, please contact me privately and we will discuss
> how this work can be split.
>
> Denis Magda, do you think we should create an IEP for these
> optimizations?
>
> On Tue, Sep 18, 2018 at 5:59 PM, Maxim Muzafarov <maxmu...@gmail.com>
> wrote:
>
> > Ilya,
> >
> > > 3. Start node in baseline: both affinity and topology versions
> > > should be incremented, but it might be possible to optimize PME for
> > > such a case and avoid a cluster-wide freeze. Partition assignments
> > > for such a node are already calculated, so we can simply put them
> > > all into MOVING state. However, it might take significant effort to
> > > avoid race conditions and redesign our architecture.
> >
> > As you mentioned, all assignments are already calculated. So, as
> > another suggestion, we could introduce a new "intermediate" state for
> > such joining nodes. While in this state, a node recovers all data from
> > its local storage, preloads all missed partition data from the cluster
> > (probably as of some point in time), and creates and preloads any
> > in-memory and persistent caches it has missed. Only after this
> > recovery does the node fire the discovery join event, and the
> > affinity/topology version may be incremented. I think this approach
> > can help reduce the subsequent rebalance time. WDYT?
> >
> > On Tue, 18 Sep 2018 at 16:31 Alexey Goncharuk
> > <alexey.goncha...@gmail.com> wrote:
> >
> > > Ilya,
> > >
> > > This is a great idea, but before we can ultimately decouple the
> > > affinity version from the topology version, we need to fix a few
> > > things with baseline topology first.
> > > Currently, in-memory caches do not use the baseline topology. We
> > > are going to fix this as a part of IEP-4 Phase II (baseline
> > > auto-adjust). Once that is fixed, we can safely assume that an
> > > out-of-baseline node does not affect affinity distribution.
> > >
> > > I agree with Dmitriy that we should start with the simpler
> > > optimizations first.
> > >
> > > Thu, Sep 13, 2018 at 15:58, Ilya Lantukh <ilant...@gridgain.com>:
> > >
> > > > Igniters,
> > > >
> > > > As most of you know, Ignite has a concept of
> > > > AffinityTopologyVersion, which is associated with the set of nodes
> > > > currently present in the topology and with the global cluster
> > > > state (active/inactive, baseline topology, started caches).
> > > > Modifying any of these involves a process called Partition Map
> > > > Exchange (PME) and results in a new AffinityTopologyVersion. At
> > > > that moment, all new cache and compute grid operations are
> > > > globally "frozen". This might lead to unpredictable cache
> > > > downtimes.
> > > >
> > > > However, our recent changes (especially the introduction of
> > > > Baseline Topology) caused me to rethink this concept. There are
> > > > currently many cases where we trigger PME even though it isn't
> > > > necessary. For example, adding or removing a client node, or a
> > > > server node that is not in the BLT, should never cause partition
> > > > map modifications. Those events modify the *topology*, but
> > > > *affinity* is unaffected. On the other hand, there are events that
> > > > affect only *affinity* - the most straightforward example is the
> > > > CacheAffinityChange event, which is triggered after a rebalance
> > > > finishes to assign new primary/backup nodes. So the term
> > > > *AffinityTopologyVersion* now looks weird - it tries to "merge"
> > > > two entities that aren't always related.
> > > > To me it makes sense to introduce a separate *AffinityVersion*
> > > > and *TopologyVersion*, review all events that currently modify
> > > > AffinityTopologyVersion, and split them into 3 categories: those
> > > > that modify only the AffinityVersion, those that modify only the
> > > > TopologyVersion, and those that modify both. This will allow us to
> > > > process such events using different mechanics and avoid redundant
> > > > steps, and also to reconsider the mapping of operations - some
> > > > will be mapped to topology, others to affinity.
> > > >
> > > > Here is my view of how the different event types could
> > > > theoretically be optimized:
> > > > 1. Client node start/stop: as stated above, no PME is needed;
> > > > ticket https://issues.apache.org/jira/browse/IGNITE-9558 is
> > > > already in progress.
> > > > 2. Server node start/stop not from the baseline: should be similar
> > > > to the previous case, since nodes outside of the baseline cannot
> > > > be partition owners.
> > > > 3. Start of a node in the baseline: both affinity and topology
> > > > versions should be incremented, but it might be possible to
> > > > optimize PME for this case and avoid a cluster-wide freeze.
> > > > Partition assignments for such a node are already calculated, so
> > > > we can simply put them all into MOVING state. However, it might
> > > > take significant effort to avoid race conditions and redesign our
> > > > architecture.
> > > > 4. Cache start/stop: starting or stopping one cache doesn't modify
> > > > partition maps for other caches. It should be possible to change
> > > > this procedure to skip PME and perform all necessary actions
> > > > (compute affinity, start/stop cache contexts on each node) in the
> > > > background, but this looks like a very complex modification too.
> > > > 5. Rebalance finish: it seems possible to design a "lightweight"
> > > > PME for this case as well.
> > > > If there were no node failures (and if there were, PME should be
> > > > triggered and the rebalance should be cancelled anyway), all
> > > > partition states are already known to the coordinator.
> > > > Furthermore, no new MOVING or OWNING node is introduced for any
> > > > partition, so all previous mappings should still be valid.
> > > >
> > > > For the latter, more complex cases it might be necessary to
> > > > introduce an "is compatible" relationship between affinity
> > > > versions. An operation needs to be remapped only if the new
> > > > version isn't compatible with the previous one.
> > > >
> > > > Please share your thoughts.
> > > >
> > > > --
> > > > Best regards,
> > > > Ilya
> > > >
> >
> > --
> > Maxim Muzafarov
>
> --
> Best regards,
> Ilya
>
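For readers following the thread: the split Ilya proposes, together with the "is compatible" relation, could look roughly like the sketch below. This is a hypothetical illustration, not the actual Ignite API - the class names, the major/minor scheme, and `isCompatibleWith` are all invented here to make the idea concrete.

```java
// Hypothetical sketch, NOT the actual Ignite API. Illustrates splitting
// AffinityTopologyVersion into separate topology and affinity versions,
// with an "is compatible" relation so that operations mapped to an older
// affinity version are remapped only when the new version is incompatible.
public class VersionSketch {

    /** Bumped on every discovery event (node join/leave), client or server. */
    record TopologyVersion(long ver) {}

    /**
     * Major part changes only when partition ownership can actually move
     * (e.g. a baseline change); minor part covers "lightweight" bumps such
     * as rebalance finish, where all previous mappings stay valid.
     */
    record AffinityVersion(long major, long minor) {
        /** An operation mapped to {@code prev} needs remapping only if this is false. */
        boolean isCompatibleWith(AffinityVersion prev) {
            return this.major == prev.major;
        }
    }

    public static void main(String[] args) {
        AffinityVersion beforeRebalance = new AffinityVersion(5, 0);
        AffinityVersion afterRebalance  = new AffinityVersion(5, 1); // CacheAffinityChange: minor bump
        AffinityVersion afterBltChange  = new AffinityVersion(6, 0); // baseline change: major bump

        System.out.println(afterRebalance.isCompatibleWith(beforeRebalance)); // true
        System.out.println(afterBltChange.isCompatibleWith(afterRebalance));  // false
    }
}
```

Under this scheme, a client join would bump only the TopologyVersion, a rebalance finish only the minor affinity component, and operations in flight would survive any compatible bump without remapping.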