Re: Discovery-based services deployment guarantees question

Mikhail Petrov Wed, 06 May 2020 04:28:36 -0700

Hello, Igniters.

I am working on IGNITE-12894 - [1]. It seems to have a root cause similar to the problem described in this thread.

To solve these problems, I propose to change the behavior of IgniteServiceProcessor#serviceTopology when the timeout argument is 0. At the moment, in this case IgniteServiceProcessor#serviceTopology returns the topology immediately, regardless of whether it has been initialized. I propose to wait for the service topology to be initialized if the requested service is already registered on the local node but the full message has not yet been received from the coordinator.


So the final behavior of IgniteServices#serviceProxy() will be:

1. If the timeout is specified, it waits for the topology for the specified timeout, even if the requested service has not been registered yet. This matches the current implementation.

2. If the timeout is not specified: if the service is not registered, it fails immediately; otherwise it waits for the topology initialization (the full message from the coordinator) if needed.
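
To make the two rules concrete, here is a minimal stand-alone sketch of the decision logic. It is a toy model, not the code from the PR: the class `ServiceTopologyModel`, the hooks `onRegistered` and `onFullMessage`, the `String` node ids and the exception types are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of the proposed lookup rules. This is NOT Ignite code. */
final class ServiceTopologyModel {
    /** Services already registered on the local node (hypothetical bookkeeping). */
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    /** Per-service future, completed when the coordinator's full message arrives. */
    private final Map<String, CompletableFuture<Map<String, Integer>>> topologies =
        new ConcurrentHashMap<>();

    /** Hypothetical hook: the service became known on the local node. */
    void onRegistered(String name) {
        registered.add(name);
    }

    /** Hypothetical hook: the full message with the topology was received. */
    void onFullMessage(String name, Map<String, Integer> topology) {
        registered.add(name);
        futureFor(name).complete(topology);
    }

    /** Returns node id -> instance count, following rules 1 and 2 above. */
    Map<String, Integer> serviceTopology(String name, long timeoutMs) {
        try {
            if (timeoutMs == 0) {
                // Rule 2: an unknown service fails immediately...
                if (!registered.contains(name))
                    throw new NoSuchElementException("Service not found: " + name);

                // ...while a registered one waits for topology initialization.
                return futureFor(name).get();
            }

            // Rule 1: wait up to the timeout even if the service is not registered yet.
            return futureFor(name).get(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e) {
            throw new IllegalStateException("Timed out waiting for topology: " + name, e);
        }
        catch (ExecutionException e) {
            throw new IllegalStateException("Topology initialization failed: " + name, e.getCause());
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for: " + name, e);
        }
    }

    private CompletableFuture<Map<String, Integer>> futureFor(String name) {
        return topologies.computeIfAbsent(name, k -> new CompletableFuture<>());
    }
}
```

In this model a lookup with timeout 0 can still block, but only for a service the local node already knows about, which is the case where the full message is expected to arrive.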


Here is a PR with the implementation of the described proposal - [2].

WDYT?

[1] - https://issues.apache.org/jira/browse/IGNITE-12894
[2] - https://github.com/apache/ignite/pull/7771

On 30.12.2019 13:03, Alexey Goncharuk wrote:

Agree, sounds like a plan, thanks for taking over!

On Mon, Dec 30, 2019 at 1:00 PM Vyacheslav Daradur <[email protected]> wrote:

Alexey,

I would not make it default in the current implementation.

Waiting for proxies on non-deployment-initiator nodes should be
improved; additional checks are required:
1) We should not wait if the requested service has not been submitted
for deployment (when there is no info about such a service)
2) If service deployment failed, getting the proxy should fail or be
interrupted as well (do not wait for the whole available timeout)
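
A rough sketch of how these two checks could look, using a plain `CompletableFuture` per submitted service. Everything here (the class `ProxyWaitModel`, its hooks and the exception types) is hypothetical and only illustrates the idea; it is not taken from the Ignite code base.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of the two extra checks. This is NOT Ignite code. */
final class ProxyWaitModel {
    /** One future per service that has been submitted for deployment. */
    private final Map<String, CompletableFuture<Void>> deployments = new ConcurrentHashMap<>();

    /** Hypothetical hook: a deployment request for the service was seen. */
    void onSubmitted(String name) {
        deployments.computeIfAbsent(name, k -> new CompletableFuture<>());
    }

    /** Hypothetical hook: the deployment finished successfully. */
    void onDeployed(String name) {
        deployments.computeIfAbsent(name, k -> new CompletableFuture<>()).complete(null);
    }

    /** Hypothetical hook: the deployment failed. */
    void onFailed(String name, Throwable cause) {
        deployments.computeIfAbsent(name, k -> new CompletableFuture<>()).completeExceptionally(cause);
    }

    /** Waits for the service to be deployed, but never longer than necessary. */
    void awaitDeployed(String name, long timeoutMs) {
        CompletableFuture<Void> fut = deployments.get(name);

        // Check 1: nothing is known about the service, so do not wait at all.
        if (fut == null)
            throw new NoSuchElementException("Service was never submitted: " + name);

        try {
            fut.get(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (ExecutionException e) {
            // Check 2: a failed deployment ends the wait right away.
            throw new IllegalStateException("Deployment failed: " + name, e.getCause());
        }
        catch (TimeoutException e) {
            throw new IllegalStateException("Timed out waiting for: " + name, e);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for: " + name, e);
        }
    }
}
```

The point of the future is that a failure completes it exceptionally, so a waiter is released as soon as the result is known instead of sleeping through the rest of the timeout.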

Let's schedule this improvement for the next release; I'll try to find
time to implement it.

What do you think?

On Mon, Dec 30, 2019 at 12:05 PM Alexey Goncharuk
<[email protected]> wrote:

Vyacheslav, thanks for the explanation, makes sense to me.

I was thinking though, should we make the behavior with the timeout
default for all proxies?

Just my opinion - I think for a user it would be hard to control which
node deploys the service, especially if multiple nodes deploy it
concurrently. Most likely users will end up always calling the second
option of the proxy (with the timeout), so, perhaps, make it default?

On Sun, Dec 29, 2019 at 9:05 PM Vyacheslav Daradur <[email protected]> wrote:

Alexey,

I've prepared PR [1] to show our proxy invocation guarantees and to
avoid misunderstanding.

Please let me know if you think that we should improve our guarantees
in some cases.

[1] https://github.com/apache/ignite/pull/7213

On Tue, Dec 24, 2019 at 7:27 PM Vyacheslav Daradur <[email protected]> wrote:

even the local deployment looks broken: if a compute job
is sent to a remote node after the service deployment

This is a different case and covered by retries:
* If you deploy a service from node A to node B and then take a proxy
from node A (the deployment initiator), it should NOT fail even if node B
has not yet received the message that deployment finished successfully,
because of proxy invocation retries.

It looks like it's better to describe all these cases on the wiki.

Should we schedule this ticket for the further work on Services IEP?

If it is a frequent use-case we definitely should implement it.


On Tue, Dec 24, 2019 at 6:55 PM Alexey Goncharuk
<[email protected]> wrote:

Ok, got it.

I agree that this is consistent with the old behavior, but this is the
kind of errors we wanted to get rid of when we started the IEP. From the
user perspective, even the local deployment looks broken: if a compute
job is sent to a remote node after the service deployment, the job
execution may fail due to this error.

Should we schedule this ticket for the further work on Services IEP?

On Tue, Dec 24, 2019 at 6:49 PM Vyacheslav Daradur <[email protected]> wrote:

Not sure that "user fallback" is the right definition; it is not new
behaviour in comparison with the legacy implementation.

Our synchronous deployment guarantees that the deployment initiator can
start working with the service immediately after deployment has
finished successfully.
For a node that is not the deployment initiator we can't provide such
guarantees now, because the deployment result is unknown there and the
deployment may still fail.

In this case, a reasonable timeout might be an acceptable solution.

We can improve the guarantees in future releases, but there is an open
question:
- how long should taking a proxy wait? Deployment of a "heavy" service
may take a while

On Tue, Dec 24, 2019 at 6:19 PM Alexey Goncharuk
<[email protected]> wrote:

What should be the user fallback in this case? Retry infinitely? Is
there a way to wait for the proper deployment?
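
For illustration, a user-side fallback that retries with a deadline rather than infinitely could look roughly like the sketch below. The helper `BoundedRetry.retryUntil` is hypothetical and is not part of the Ignite API; it retries on any `RuntimeException`, whereas a real caller would probably catch only the exception that signals "service not found".

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical user-side helper: retry a call until it succeeds or a deadline passes. */
final class BoundedRetry {
    private BoundedRetry() {
        // Static helper, no instances.
    }

    /**
     * Invokes {@code call} at least once. On failure it pauses for {@code pauseMs}
     * and tries again, until {@code maxWaitMs} has elapsed; then the last failure
     * is rethrown to the caller.
     */
    static <T> T retryUntil(Supplier<T> call, long maxWaitMs, long pauseMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);

        while (true) {
            try {
                return call.get();
            }
            catch (RuntimeException e) {
                // Give up once the deadline has passed (overflow-safe comparison).
                if (System.nanoTime() - deadline >= 0)
                    throw e;
            }

            try {
                Thread.sleep(pauseMs);
            }
            catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while retrying", ie);
            }
        }
    }
}
```

A caller would wrap the proxy invocation in the supplier, for example `BoundedRetry.retryUntil(() -> proxy.doWork(), 5_000, 100)`, where `proxy.doWork()` stands for any service method.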

On Tue, Dec 24, 2019 at 12:41 PM Vyacheslav Daradur <[email protected]> wrote:

I’ll take a look at the end of the week.

There is one more use-case:
* if you initiate deployment from node A but get a proxy on node B
(which isn’t the deployment initiator) to call the service on node A,
it may fail with "service not found"; this is expected behaviour
because we didn't provide such guarantees.

The API for getting a proxy with a timeout should be used in this case:
T serviceProxy(String name, Class<? super T> svcItf, boolean sticky, long timeout)


On Tue, Dec 24, 2019 at 12:11 PM Alexey Goncharuk <[email protected]> wrote:

Well, this is exactly the case. The service is deployed from node A,
the proxy is created on node B, and "service not found" exception gets
thrown to a user anyway. Perhaps, the retry happens too fast?

Created a ticket [1].

[1] https://issues.apache.org/jira/browse/IGNITE-12490

On Mon, Dec 23, 2019 at 10:08 PM Vyacheslav Daradur <[email protected]> wrote:

Hi, Alexey

Please attach a reproducer to the ticket.

As far as I remember we have the following behaviour for the proxies:

Let's assume you have deployed a service from node A, then:
* if you invoke the service locally from node A, the service is
guaranteed to be deployed and ready to work
* if you take a proxy from node A to remote node B right after deploy,
there might be a race between disco-spi (a message which releases the
deployed service) and comm-spi (the remote call works via Compute over
comm-spi), but it shouldn't affect end-users because the failed request
will be retried in this case




On Mon, Dec 23, 2019 at 6:55 PM Alexey Goncharuk
<[email protected]> wrote:

Nikolay,

Yes, I've rechecked, the new service processor is being used. I'll
file a bug shortly.

On Mon, Dec 23, 2019 at 5:33 PM Nikolay Izhikov <[email protected]> wrote:

Alexey, are you sure you are testing the new service framework?

If yes, you definitely should file a bug.

On 23 Dec 2019, at 17:02, Alexey Goncharuk <[email protected]> wrote:

Igniters,

I have a question based on debugging one of my recent tests.

The test is related to Ignite services. I noticed that sometimes proxy
invocation of a newly deployed service fails because the service cannot
be found. I managed to reduce the test to a simple "start two nodes,
deploy service, create a proxy, invoke the proxy" scenario. The proxy
invocation fails in about 80% of runs.

As far as I remember, the new discovery-based service deployment was
supposed to be synchronous, so not only non-proxy service instances
should work, but the proxies as well. Was my understanding correct?
Should I file a bug for the observed behavior?

--AG



--
Best Regards, Vyacheslav D.



