Igniters,

I'd like to start a discussion on Ignite service grid redesign.
We have a number of problems in our current architecture that have to be
addressed.

Here are the most severe ones:

One of them is the lack of a guarantee that a service is successfully
deployed and ready for work by the time the IgniteServices.deploy*() methods
return.
Furthermore, if an exception is thrown from Service.init(), the deploying
side cannot receive it, or even tell that the service is in an unusable
state.
So you may end up in a situation where you deployed a service without
receiving any errors, then called one of its methods and hung indefinitely
on that invocation.
JIRA ticket: https://issues.apache.org/jira/browse/IGNITE-3392
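
To illustrate the first problem, here is a minimal sketch of the scenario
(the service class and names are made up for the example; the
IgniteServices calls are the public API):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    public class BrokenInitExample {
        /** Hypothetical service whose initialization always fails. */
        public static class MyService implements Service, Runnable {
            @Override public void init(ServiceContext ctx) throws Exception {
                // This exception never reaches the deploying side.
                throw new Exception("Failed to initialize");
            }
            @Override public void execute(ServiceContext ctx) { /* no-op */ }
            @Override public void cancel(ServiceContext ctx) { /* no-op */ }
            @Override public void run() { /* service API method */ }
        }

        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Returns without any error, even though init() threw.
                ignite.services().deployClusterSingleton("my-service", new MyService());

                // May hang indefinitely: the service was never actually deployed.
                ignite.services().serviceProxy("my-service", Runnable.class, false).run();
            }
        }
    }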

Another problem is that service deployment can hang on an unstable topology.
This issue is caused by missed updates in the continuous query listeners on
the internal cache.
It is hard to reproduce, but it happens from time to time. We shouldn't
allow deployment methods to hang without reporting anything.
JIRA ticket: https://issues.apache.org/jira/browse/IGNITE-6259

I think we should change the deployment procedure to make it more reliable.
Moving from operating over the internal replicated service cache to sending
custom discovery events seems like a good idea.
Service deployment may trigger a discovery event that makes the chosen
nodes deploy the service, and the same event will notify the other nodes
about the deployed service instances.
This will eliminate the need for distributed transactions on the internal
replicated system cache and make the service deployment protocol more
transparent.
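
Roughly, the initial discovery message could carry something like the
following (all names and fields here are hypothetical, just to show what
information needs to travel with the event):

    import java.io.Serializable;
    import java.util.Collection;
    import java.util.UUID;

    /** Hypothetical payload of the initial deployment discovery message. */
    public class ServiceDeploymentRequest implements Serializable {
        private final UUID requestId;                 // correlates all messages of one deployment
        private final String serviceName;             // service being deployed
        private final byte[] serviceConfiguration;    // marshalled ServiceConfiguration
        private final Collection<UUID> assignedNodes; // nodes chosen to host the instances

        public ServiceDeploymentRequest(UUID requestId, String serviceName,
            byte[] serviceConfiguration, Collection<UUID> assignedNodes) {
            this.requestId = requestId;
            this.serviceName = serviceName;
            this.serviceConfiguration = serviceConfiguration;
            this.assignedNodes = assignedNodes;
        }
    }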

There are a few points that should be taken into account, though.

First of all, we can't wait for services to be deployed and initialised in
the discovery thread.
So we need to make the notification about the service deployment result
asynchronous, presumably over the communication protocol.
I can think of a procedure similar to the current exchange protocol:
service deployment is initiated with an initial discovery message, followed
by asynchronous notifications from the hosting servers over communication.
Finally, one more discovery message notifies all nodes about the service
deployment result and the locations of the deployed service instances. In
this scheme the coordinator is responsible for collecting the deployment
results.
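
To make the proposed flow concrete, the phases could look like this
(purely illustrative names, not existing code):

    /** Hypothetical phases of a single service deployment. */
    enum ServiceDeploymentPhase {
        /** 1. Initial discovery message: every node learns about the request,
         *  the chosen nodes start Service.init() outside the discovery thread. */
        DEPLOYMENT_REQUESTED,

        /** 2. Each hosting node sends its init() result (success or error)
         *  to the coordinator over the communication protocol. */
        RESULTS_SENT_TO_COORDINATOR,

        /** 3. Final discovery message from the coordinator: all nodes learn the
         *  outcome and the locations of the deployed service instances. */
        DEPLOYMENT_FINISHED
    }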

Another problem is failover when some nodes fail during deployment or
afterwards.
The following cases should be handled:

   1. coordinator failure during deployment;
   2. failure of nodes that were chosen to host the service, during
   deployment;
   3. failure of nodes that contain deployed services, after the
   deployment.

The first case may be resolved either by continuing the deployment with a
new coordinator, or by cancelling it.
The second case will require another node to be chosen and notified; maybe
another discovery message will be needed.
The third case will require redeployment, so the coordinator should track
topology changes and redeploy failed services.
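
As a rough illustration of how the coordinator could react to these cases
(all names are hypothetical; this only sketches the bookkeeping, not
concrete classes):

    import java.util.Map;
    import java.util.Set;
    import java.util.UUID;

    public class ServiceFailoverHandler {
        /** Deployments in progress: request id -> nodes we still wait for. */
        private final Map<UUID, Set<UUID>> pendingDeployments;
        /** Finished deployments: service name -> nodes hosting instances. */
        private final Map<String, Set<UUID>> serviceTopology;

        public ServiceFailoverHandler(Map<UUID, Set<UUID>> pendingDeployments,
            Map<String, Set<UUID>> serviceTopology) {
            this.pendingDeployments = pendingDeployments;
            this.serviceTopology = serviceTopology;
        }

        /** Called on every NODE_FAILED / NODE_LEFT event. */
        public void onNodeFailed(UUID failedNodeId, boolean wasCoordinator) {
            if (wasCoordinator) {
                // Case 1: a new coordinator either re-collects pending results
                // and continues the deployment, or cancels it.
            }

            for (Set<UUID> waitingFor : pendingDeployments.values()) {
                if (waitingFor.remove(failedNodeId)) {
                    // Case 2: pick a replacement node and notify it,
                    // possibly with one more discovery message.
                }
            }

            for (Map.Entry<String, Set<UUID>> e : serviceTopology.entrySet()) {
                if (e.getValue().remove(failedNodeId)) {
                    // Case 3: the coordinator redeploys the lost instances
                    // of service e.getKey() on the remaining nodes.
                }
            }
        }
    }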

Another good improvement would be service versioning. This matter was
already discussed in another thread:
http://apache-ignite-developers.2346864.n4.nabble.com/Service-versioning-td20858.html
Let's resume this discussion and state the final decision here.
This feature is closely connected to peer class loading, which currently
does not work for services.
So service versioning should be implemented along with peer class loading.
JIRA ticket for versioning:
https://issues.apache.org/jira/browse/IGNITE-6069
Peer class loading: https://issues.apache.org/jira/browse/IGNITE-975

Please share your thoughts. Constructive criticism is highly appreciated.

Denis
