Re: Discovery-based services deployment guarantees question

Vyacheslav Daradur Fri, 15 May 2020 12:41:25 -0700

Mikhail, I merged your changes.
Thanks for your contribution!

On Tue, May 12, 2020 at 8:01 PM Vyacheslav Daradur <daradu...@gmail.com>
wrote:


> Hi Mikhail, the proposed changes make sense to me.
> I left some comments on the PR.
> Thank you!
>
> On Wed, May 6, 2020 at 2:28 PM Mikhail Petrov <pmgheap....@gmail.com>
> wrote:
>
>> Hello, Igniters.
>>
>> I am working on IGNITE-12894 [1]. It seems to have a root cause
>> similar to the problem described in this thread.
>>
>> To solve these problems, I propose changing the behavior of
>> IgniteServiceProcessor#serviceTopology when the timeout argument is 0.
>> At the moment, in this case IgniteServiceProcessor#serviceTopology
>> returns the topology immediately, regardless of whether it has been
>> initialized. I propose to wait for the service topology to be
>> initialized if the requested service is already registered on the local
>> node but the full message has not yet been received from the coordinator.
>>
>> So the final behavior of IgniteServices#serviceProxy() will be:
>> 1. If a timeout is specified, it waits for the topology for up to the
>> specified timeout, even if the requested service has not been registered
>> yet, as in the current implementation.
>>
>> 2. If no timeout is specified, it fails immediately if the service has
>> not been registered; otherwise it waits for topology initialization
>> (the full message from the coordinator) if needed.
>>
>> Here is the PR with the implementation of the described proposal - [2].
>>
>> WDYT?
>>
>> [1] - https://issues.apache.org/jira/browse/IGNITE-12894
>> [2] - https://github.com/apache/ignite/pull/7771
>>
>> On 30.12.2019 13:03, Alexey Goncharuk wrote:
>> > Agree, sounds like a plan, thanks for taking over!
>> >
>> > On Mon, Dec 30, 2019 at 13:00, Vyacheslav Daradur <daradu...@gmail.com> wrote:
>> >
>> >> Alexey,
>> >>
>> >> I would not make it default in the current implementation.
>> >>
>> >> Waiting for proxies on nodes that are not the deployment initiator
>> >> should be improved; additional checks are required:
>> >> 1) We should not wait if the requested service has not been submitted
>> >> for deployment (when there is no info about such a service)
>> >> 2) If the service deployment failed, getting the proxy should fail or
>> >> be interrupted as well (do not wait for the whole timeout)
>> >>
>> >> Let's schedule this improvement for the next release; I'll try to find
>> >> time to implement it.
>> >>
>> >> What do you think?
>> >>
>> >> On Mon, Dec 30, 2019 at 12:05 PM Alexey Goncharuk
>> >> <alexey.goncha...@gmail.com> wrote:
>> >>> Vyacheslav, thanks for the explanation, makes sense to me.
>> >>>
>> >>> I was thinking though, should we make the behavior with the timeout
>> >>> the default for all proxies?
>> >>>
>> >>> Just my opinion: I think it would be hard for a user to control which
>> >>> node deploys the service, especially if multiple nodes deploy it
>> >>> concurrently. Most likely users will end up always calling the second
>> >>> variant of the proxy (with the timeout), so perhaps make it the default?
>> >>>
>> >>> On Sun, Dec 29, 2019 at 21:05, Vyacheslav Daradur <daradu...@gmail.com> wrote:
>> >>>
>> >>>> Alexey,
>> >>>>
>> >>>> I've prepared a PR [1] to show our proxy invocation guarantees and
>> >>>> to avoid misunderstanding.
>> >>>>
>> >>>> Please let me know if you think that we should improve our
>> >>>> guarantees in some cases.
>> >>>>
>> >>>> [1] https://github.com/apache/ignite/pull/7213
>> >>>>
>> >>>> On Tue, Dec 24, 2019 at 7:27 PM Vyacheslav Daradur <daradu...@gmail.com>
>> >>>> wrote:
>> >>>>>> even the local deployment looks broken: if a compute job
>> >>>>>> is sent to a remote node after the service deployment
>> >>>>> This is a different case and is covered by retries:
>> >>>>> * If you deploy a service from node A to node B, then take a proxy
>> >>>>> from node A (the deployment initiator), it should NOT fail even if
>> >>>>> node B has not yet received the message that deployment finished
>> >>>>> successfully, because of proxy invocation retries.
>> >>>>>
>> >>>>> Looks like it's better to describe all these cases on the wiki.
>> >>>>>
>> >>>>>> Should we schedule this ticket for the further work on Services
>> >>>>>> IEP?
>> >>>>> If it is a frequent use case, we definitely should implement it.
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Dec 24, 2019 at 6:55 PM Alexey Goncharuk
>> >>>>> <alexey.goncha...@gmail.com> wrote:
>> >>>>>> Ok, got it.
>> >>>>>>
>> >>>>>> I agree that this is consistent with the old behavior, but this is
>> >>>>>> the kind of error we wanted to get rid of when we started the IEP.
>> >>>>>> From the user perspective, even the local deployment looks broken:
>> >>>>>> if a compute job is sent to a remote node after the service
>> >>>>>> deployment, the job execution may fail due to this error.
>> >>>>>>
>> >>>>>> Should we schedule this ticket for the further work on Services
>> >>>>>> IEP?
>> >>>>>> On Tue, Dec 24, 2019 at 18:49, Vyacheslav Daradur <daradu...@gmail.com> wrote:
>> >>>>>>> Not sure that "user fallback" is the right definition; it is not
>> >>>>>>> new behaviour in comparison with the legacy implementation.
>> >>>>>>>
>> >>>>>>> Our synchronous deployment guarantees that the deployment
>> >>>>>>> initiator can start working with the service immediately after
>> >>>>>>> deployment has finished successfully.
>> >>>>>>> For a node that is not the deployment initiator, we can't provide
>> >>>>>>> such guarantees now, because the deployment result is unknown
>> >>>>>>> there and the deployment may fail.
>> >>>>>>>
>> >>>>>>> In this case, a reasonable timeout might be an acceptable
>> >>>>>>> solution.
>> >>>>>>> We can improve the guarantees in future releases, but there is an
>> >>>>>>> open question:
>> >>>>>>> - how long should taking a proxy wait? Deployment of a "heavy"
>> >>>>>>> service may take a while
>> >>>>>>>
>> >>>>>>> On Tue, Dec 24, 2019 at 6:19 PM Alexey Goncharuk
>> >>>>>>> <alexey.goncha...@gmail.com> wrote:
>> >>>>>>>> What should be the user fallback in this case? Retry
>> >>>>>>>> infinitely? Is there a way to wait for the proper deployment?
>> >>>>>>>>
>> >>>>>>>> On Tue, Dec 24, 2019 at 12:41, Vyacheslav Daradur <daradu...@gmail.com> wrote:
>> >>>>>>>>> I’ll take a look at the end of the week.
>> >>>>>>>>>
>> >>>>>>>>> There is one more use case:
>> >>>>>>>>> * if you initiate deployment from node A but get a proxy on
>> >>>>>>>>> node B (which isn't the deployment initiator) to call the
>> >>>>>>>>> service on node A, it may fail with "service not found". This
>> >>>>>>>>> is expected behaviour because we didn't provide such
>> >>>>>>>>> guarantees.
>> >>>>>>>>>
>> >>>>>>>>> The API for getting a proxy with a timeout should be used in
>> >>>>>>>>> this case:
>> >>>>>>>>> T serviceProxy(String name, Class<? super T> svcItf, boolean sticky, long timeout)
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Dec 24, 2019 at 12:11, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
>> >>>>>>>>>> Well, this is exactly the case. The service is deployed from
>> >>>>>>>>>> node A, the proxy is created on node B, and a "service not
>> >>>>>>>>>> found" exception gets thrown to the user anyway. Perhaps the
>> >>>>>>>>>> retry happens too fast?
>> >>>>>>>>>>
>> >>>>>>>>>> Created a ticket [1].
>> >>>>>>>>>>
>> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12490
>> >>>>>>>>>>
>> >>>>>>>>>> On Mon, Dec 23, 2019 at 22:08, Vyacheslav Daradur <daradu...@gmail.com> wrote:
>> >>>>>>>>>>> Hi, Alexey
>> >>>>>>>>>>>
>> >>>>>>>>>>> Please attach a reproducer to the ticket.
>> >>>>>>>>>>>
>> >>>>>>>>>>> As far as I remember, we have the following behaviour for
>> >>>>>>>>>>> the proxies:
>> >>>>>>>>>>> Let's assume you have deployed a service from node A, then:
>> >>>>>>>>>>> * if you invoke the service locally from node A, the service
>> >>>>>>>>>>> is guaranteed to be deployed and ready to work
>> >>>>>>>>>>> * if you take a proxy from node A to remote node B right
>> >>>>>>>>>>> after deploy, there might be a race between disco-spi (the
>> >>>>>>>>>>> message which releases the deployed service) and comm-spi
>> >>>>>>>>>>> (the remote call works via Compute over comm-spi), but it
>> >>>>>>>>>>> shouldn't affect end users because the failed request will
>> >>>>>>>>>>> be retried in this case
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Mon, Dec 23, 2019 at 6:55 PM Alexey Goncharuk
>> >>>>>>>>>>> <alexey.goncha...@gmail.com> wrote:
>> >>>>>>>>>>>> Nikolay,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Yes, I've rechecked, the new service processor is being
>> >>>>>>>>>>>> used. I'll file a bug shortly.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Mon, Dec 23, 2019 at 17:33, Николай Ижиков <nizhi...@apache.org> wrote:
>> >>>>>>>>>>>>> Alexey, are you sure you are testing the new service
>> >>>>>>>>>>>>> framework?
>> >>>>>>>>>>>>> If yes, you definitely should file a bug.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Dec 23, 2019, at 17:02, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
>> >>>>>>>>>>>>>> Igniters,
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I have a question based on debugging one of my recent tests.
>> >>>>>>>>>>>>>> The test is related to Ignite services. I noticed that
>> >>>>>>>>>>>>>> sometimes a proxy invocation of a newly deployed service
>> >>>>>>>>>>>>>> fails because the service cannot be found. I managed to
>> >>>>>>>>>>>>>> reduce the test to a simple "start two nodes, deploy a
>> >>>>>>>>>>>>>> service, create a proxy, invoke the proxy" scenario. The
>> >>>>>>>>>>>>>> proxy invocation fails in about 80% of runs.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> As far as I remember, the new discovery-based service
>> >>>>>>>>>>>>>> deployment was supposed to be synchronous, so not only
>> >>>>>>>>>>>>>> non-proxy service instances should work, but the proxies as
>> >>>>>>>>>>>>>> well. Was my understanding correct? Should I file a bug for
>> >>>>>>>>>>>>>> the observed behavior?
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --AG
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Best Regards, Vyacheslav D.
>> >>>>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Best Regards, Vyacheslav D.
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Best Regards, Vyacheslav D.
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Best Regards, Vyacheslav D.
>> >>>>
>> >>
>> >>
>> >> --
>> >> Best Regards, Vyacheslav D.
>> >>
>>
>
>
> --
> Best Regards,
> Vyacheslav D.
>


-- 
Best Regards,
Vyacheslav D.
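
The lookup rules from Mikhail's proposal quoted above (a positive timeout
always waits; a zero timeout fails at once for an unregistered service and
otherwise waits for the coordinator's full message) can be modelled roughly
as follows. This is a minimal standalone sketch, not Ignite code: the class
ServiceTopologySketch, its methods, the use of node-name strings for the
topology, and the choice of IllegalStateException for both failure cases are
all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model of the proposed lookup rules; none of this is Ignite code. */
final class ServiceTopologySketch {
    /** Services the local node already knows about (deployment request seen). */
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    /** Completed once the coordinator's full message delivers the topology. */
    private final Map<String, CompletableFuture<List<String>>> topologies =
        new ConcurrentHashMap<>();

    private CompletableFuture<List<String>> topologyFuture(String name) {
        return topologies.computeIfAbsent(name, k -> new CompletableFuture<>());
    }

    /** Assumed hook: the service became known locally, topology still pending. */
    void onServiceRegistered(String name) {
        registered.add(name);
    }

    /** Assumed hook: the full message arrived and the topology is initialized. */
    void onFullMessage(String name, List<String> nodeIds) {
        registered.add(name);
        topologyFuture(name).complete(new ArrayList<>(nodeIds));
    }

    /**
     * timeoutMs > 0: wait up to the timeout, even for a service not registered yet.
     * timeoutMs <= 0: fail immediately if not registered, else wait for the topology.
     */
    List<String> serviceTopology(String name, long timeoutMs) {
        if (timeoutMs <= 0 && !registered.contains(name))
            throw new IllegalStateException("Service not found: " + name);

        CompletableFuture<List<String>> fut = topologyFuture(name);
        try {
            return timeoutMs > 0 ? fut.get(timeoutMs, TimeUnit.MILLISECONDS) : fut.get();
        } catch (TimeoutException e) {
            throw new IllegalStateException("Timed out waiting for topology of " + name, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for " + name, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

In this model a caller on a non-initiator node that passes no timeout gets
"not found" only when the service is genuinely unknown locally; once the
service is registered, the call blocks until the topology is available.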
