Denis,

I've attached an example of how to manage the baseline automatically (it's
named BaselineWatcher). It's just a concept and doesn't cover all possible
cases, but it might be a good start.
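
In case the attachment gets stripped on the list, here is a minimal sketch
of the same idea inline. It follows the steps from my earlier email quoted
below; the one-minute period and the timeout parameter are illustrative,
and cluster activation checks and error handling are omitted:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.cluster.BaselineNode;
import org.apache.ignite.cluster.ClusterNode;

public class BaselineWatcher {
    /** When each offline baseline node was first seen missing. */
    private final Map<Object, Long> downSince = new ConcurrentHashMap<>();

    public void start(Ignite ignite, long timeoutMs) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            // 1. Consistent IDs of the currently alive server nodes.
            Set<Object> alive = new HashSet<>();
            for (ClusterNode n : ignite.cluster().forServers().nodes())
                alive.add(n.consistentId());

            // 2. Current baseline (null if the cluster was never activated).
            Collection<BaselineNode> baseline = ignite.cluster().currentBaselineTopology();
            if (baseline == null)
                return;

            Collection<BaselineNode> newBaseline = new ArrayList<>();
            boolean changed = false;
            long now = System.currentTimeMillis();

            // 3-4. Keep alive nodes; drop nodes offline longer than the timeout.
            for (BaselineNode n : baseline) {
                if (alive.contains(n.consistentId())) {
                    downSince.remove(n.consistentId());
                    newBaseline.add(n);
                }
                else if (now - downSince.computeIfAbsent(n.consistentId(), id -> now) >= timeoutMs)
                    changed = true;
                else
                    newBaseline.add(n);
            }

            // 5. Setting the new baseline triggers the rebalancing.
            if (changed)
                ignite.cluster().setBaselineTopology(newBaseline);
        }, 0, 1, TimeUnit.MINUTES);
    }
}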

2018-04-13 2:14 GMT+03:00 Denis Magda <dma...@apache.org>:

> Pavel, thanks for the suggestions. They would definitely work out. I would
> document the one with the event subscription:
> https://issues.apache.org/jira/browse/IGNITE-8241
>
> Could you help prepare a sample code snippet with such a listener to be
> added to the doc? I know that there are some caveats related to the way
> such an event has to be processed.
>
> Ivan, I truly like your idea. Alex G., what are your thoughts on this?
>
> --
> Denis
>
> On Thu, Apr 12, 2018 at 2:22 PM, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>
> > Guys,
> >
> > I've also heard complaints about the absence of an option to
> > automatically change the baseline topology. They absolutely make sense.
> > What Pavel suggested will work as a workaround. I think in future
> > releases we should give the user an option to enable similar behavior
> > via IgniteConfiguration.
> > It may be called a "Baseline Topology change policy". I see it as a
> > rule-based language that allows specifying the conditions of a BLT
> > change using several parameters: a timeout and the minimum allowed
> > number of partition copies left (maybe this option should also be
> > provided at the per-cache-group level). The policy could also specify
> > conditions for including new nodes in the BLT if they are present,
> > including node attribute filters and so on.
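> >
> > Purely as an illustration (none of these classes or setters exist
> > today; the names are invented for this sketch), such a policy could
> > look like:
> >
> >     IgniteConfiguration cfg = new IgniteConfiguration();
> >
> >     // Hypothetical policy: drop an offline node from the BLT after one
> >     // minute, but only while at least one copy of every partition remains.
> >     cfg.setBaselineTopologyChangePolicy(
> >         new BaselineTopologyChangePolicy()
> >             .setNodeRemoveTimeout(60_000)
> >             .setMinPartitionCopies(1));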
> >
> > What do you think?
> >
> > Best Regards,
> > Ivan Rakov
> >
> >
> > On 12.04.2018 19:41, Pavel Kovalenko wrote:
> >
> >> Denis,
> >>
> >> It's just one of the ways to implement it. We can also subscribe to
> >> node join / fail events to properly track a node's downtime.
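> >>
> >> For example, a rough sketch (assuming the node events below are enabled
> >> via IgniteConfiguration.setIncludeEventTypes(...), and that "ignite" is
> >> a started Ignite instance): the listener only records timestamps, and a
> >> separate thread such as the watcher loop applies the baseline change,
> >> since blocking operations must not be performed from the discovery
> >> callback:
> >>
> >>     // Classes used: org.apache.ignite.events.DiscoveryEvent,
> >>     // org.apache.ignite.events.EventType, java.util.concurrent.ConcurrentHashMap.
> >>     Map<Object, Long> downSince = new ConcurrentHashMap<>();
> >>
> >>     ignite.events().localListen(evt -> {
> >>         Object id = ((DiscoveryEvent)evt).eventNode().consistentId();
> >>
> >>         if (evt.type() == EventType.EVT_NODE_JOINED)
> >>             downSince.remove(id); // The node is back, reset its downtime.
> >>         else
> >>             downSince.put(id, System.currentTimeMillis()); // Left or failed.
> >>
> >>         return true; // Keep the listener subscribed.
> >>     }, EventType.EVT_NODE_FAILED, EventType.EVT_NODE_LEFT,
> >>         EventType.EVT_NODE_JOINED);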
> >>
> >> 2018-04-12 19:38 GMT+03:00 Pavel Kovalenko <jokse...@gmail.com>:
> >>
> >>> Denis,
> >>>
> >>> Using our API we can implement this task as follows. Every minute:
> >>> 1) Get the consistent IDs of all alive server nodes =>
> >>> ignite().context().discovery().aliveServerNodes() => map them to
> >>> consistent IDs.
> >>> 2) Get the current baseline topology =>
> >>> ignite().cluster().currentBaselineTopology()
> >>> 3) For each node that is in the baseline but not among the alive
> >>> server nodes, check the timeout for this node.
> >>> 4) If the timeout is reached, remove the node from the baseline.
> >>> 5) If the baseline has changed, set the new baseline =>
> >>> ignite().cluster().setBaselineTopology(...)
> >>>
> >>>
> >>> 2018-04-12 2:18 GMT+03:00 Denis Magda <dma...@apache.org>:
> >>>
> >>>> Pavel, Val,
> >>>>
> >>>> So, it means that the rebalancing will be initiated only after an
> >>>> administrator removes the failed node from the topology, right?
> >>>>
> >>>> Next, imagine that you are the IT administrator who has to automate
> >>>> triggering the rebalancing if a node fails and is not recovered within
> >>>> 1 minute. What would you do, and what does Ignite provide to fulfill
> >>>> the task?
> >>>>
> >>>> --
> >>>> Denis
> >>>>
> >>>> On Wed, Apr 11, 2018 at 1:01 PM, Pavel Kovalenko <jokse...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Denis,
> >>>>>
> >>>>> In the case of an incomplete baseline topology, IgniteCache.rebalance()
> >>>>> will do nothing, because this event doesn't trigger a partition map
> >>>>> exchange or an affinity change, so the states of the existing
> >>>>> partitions are kept.
> >>>>>
> >>>>> 2018-04-11 22:27 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
> >>>>>
> >>>>>> Denis,
> >>>>>>
> >>>>>> In my understanding, in this case you should remove the node from
> >>>>>> the BLT and that will trigger the rebalancing, no?
> >>>>>>
> >>>>>> -Val
> >>>>>>
> >>>>>> On Wed, Apr 11, 2018 at 12:23 PM, Denis Magda <dma...@gridgain.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Igniters,
> >>>>>>>
> >>>>>>> As we know, the rebalancing doesn't happen if one of the nodes goes
> >>>>>>> down, thus shrinking the baseline topology. This complies with our
> >>>>>>> assumption that the node should be recovered soon, so there is no
> >>>>>>> need to waste the cluster's CPU/memory/networking resources on
> >>>>>>> shifting the data around.
> >>>>>>>
> >>>>>>> However, there are always edge cases. I was reasonably asked how to
> >>>>>>> trigger the rebalancing within the baseline topology manually or on
> >>>>>>> a timeout if:
> >>>>>>>
> >>>>>>>     - it's not expected that the failed node will be resurrected
> >>>>>>>       any time soon, and
> >>>>>>>     - it's not likely that the node will be replaced by another one.
> >>>>>>>
> >>>>>>> The question: if I call IgniteCache.rebalance() or configure
> >>>>>>> CacheConfiguration.rebalanceTimeout, will the rebalancing be fired
> >>>>>>> within the baseline topology?
> >>>>>>>
> >>>>>>> --
> >>>>>>> Denis
> >>>>>>>
> >>>>>>>
> >>>
> >
>
