Re: Operation block on Cluster recovery/rebalance.

John Smith Fri, 14 Aug 2020 16:14:47 -0700

Yeah, I can maybe use the Vert.x event bus or something to do this... But now I
have to tie the Ignite instance to the IgniteCache repository I wrote.


But on client reconnect, doesn't that mean it will still block until the
cluster is active, even if I get a new IgniteCache instance?
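
A minimal sketch of the "set a flag to skip the operation" idea raised
earlier in this thread, using only the JDK. ClusterGate and its method names
are hypothetical and are not part of Ignite or Vert.x. The assumption is that
some listener flips the flag (for example on Ignite's
EVT_CLIENT_NODE_DISCONNECTED / EVT_CLIENT_NODE_RECONNECTED events); whether
any such signal covers the window where the cluster is still inactive is
exactly the open question above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical fail-fast guard; not an Ignite or Vert.x class. */
final class ClusterGate {
    // true while the client believes the cluster can serve operations
    private final AtomicBoolean available = new AtomicBoolean(true);

    /** Assumed to be called from a disconnect/deactivation listener. */
    void markUnavailable() {
        available.set(false);
    }

    /** Assumed to be called from a reconnect/activation listener. */
    void markAvailable() {
        available.set(true);
    }

    boolean isAvailable() {
        return available.get();
    }

    /**
     * Invokes op only while the flag is set. Otherwise returns an already
     * failed future without calling op, so the caller never reaches the
     * potentially blocking cache call.
     */
    <T> CompletableFuture<T> guard(Supplier<CompletableFuture<T>> op) {
        if (!available.get()) {
            CompletableFuture<T> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                new IllegalStateException("Cluster unavailable, operation skipped"));
            return failed;
        }
        return op.get();
    }
}
```

The repository would wrap each cache call in guard(...), so an HTTP request
gets an immediate failure instead of parking a worker thread. This only
avoids starting new operations; it does not unblock a call that is already
stuck.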

On Fri, 14 Aug 2020 at 18:22, Denis Magda <[email protected]> wrote:

> @Evgenii Zhuravlev <[email protected]>, @Ilya Kasnacheev
> <[email protected]>, any thoughts on this?
>
> As a dirty workaround, you can update your cache references on client
> reconnect events. Calling ignite.cache(cacheName) throws an exception
> while the cluster is not yet activated.
> Does this work for you?
>
> -
> Denis
>
>
> On Fri, Aug 14, 2020 at 3:12 PM John Smith <[email protected]> wrote:
>
>> Is there any workaround? I can't have an HTTP server block on all
>> requests.
>>
>> 1- I need to figure out why I lose server nodes every few weeks; rebooting
>> them leaves the cluster in the inactive state until they are back....
>>
>> 2- Implement some kind of logic on the client side not to block the HTTP
>> part...
>>
>> Can an IgniteCache instance be notified of disconnect events so I can
>> maybe tell the repository class to set a flag to skip the operation?
>>
>>
>> On Fri., Aug. 14, 2020, 5:17 p.m. Denis Magda, <[email protected]> wrote:
>>
>>> My guess is that it's standard behavior for all operations (SQL, key-value,
>>> compute, etc.). But I'll let the maintainers of those modules clarify.
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Fri, Aug 14, 2020 at 1:44 PM John Smith <[email protected]>
>>> wrote:
>>>
>>>> Hi Denis, so just to understand: is it all operations or just the query?
>>>>
>>>> On Fri., Aug. 14, 2020, 12:53 p.m. Denis Magda, <[email protected]>
>>>> wrote:
>>>>
>>>>> John,
>>>>>
>>>>> Ok, we nailed it. That's the current expected behavior. Generally, I
>>>>> agree with you that the platform should support an option for operations
>>>>> to fail if the cluster is deactivated. Could you propose the change by
>>>>> starting a discussion on the dev list? You can point to this user list
>>>>> discussion for reference. Let me know if you need help with this.
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 5:55 PM John Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> No, I reuse the instance. The cache instance is created once at
>>>>>> startup of the application and I pass it to my "repository" class:
>>>>>>
>>>>>> public abstract class AbstractIgniteRepository<K,V> implements CacheRepository<K, V> {
>>>>>>     public final long DEFAULT_OPERATION_TIMEOUT = 2000;
>>>>>>
>>>>>>     private Vertx vertx;
>>>>>>     private IgniteCache<K, V> cache;
>>>>>>
>>>>>>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>>>>>>         this.vertx = vertx;
>>>>>>         this.cache = cache;
>>>>>>     }
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>     Future<List<JsonArray>> query(final String sql, final long timeoutMs, final Object... args) {
>>>>>>         final Promise<List<JsonArray>> promise = Promise.promise();
>>>>>>
>>>>>>         // THIS FIRES IF THE BLOCK DOESN'T COMPLETE IN TIME.
>>>>>>         vertx.setTimer(timeoutMs, l -> {
>>>>>>             promise.tryFail(new TimeoutException("Cache operation did not complete within: " + timeoutMs + " Ms."));
>>>>>>         });
>>>>>>
>>>>>>         vertx.<List<JsonArray>>executeBlocking(code -> {
>>>>>>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>>>>>>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>>>>>>
>>>>>>
>>>>>>             try (QueryCursor<List<?>> cursor = cache.query(query)) { // <--- BLOCKS HERE.
>>>>>>                 List<JsonArray> rows = new ArrayList<>();
>>>>>>                 Iterator<List<?>> iterator = cursor.iterator();
>>>>>>
>>>>>>                 while(iterator.hasNext()) {
>>>>>>                     List<?> currentRow = iterator.next();
>>>>>>                     JsonArray row = new JsonArray();
>>>>>>
>>>>>>                     currentRow.forEach(o -> row.add(o));
>>>>>>
>>>>>>                     rows.add(row);
>>>>>>                 }
>>>>>>
>>>>>>                 code.complete(rows);
>>>>>>             } catch(Exception ex) {
>>>>>>                 code.fail(ex);
>>>>>>             }
>>>>>>         }, result -> {
>>>>>>             if(result.succeeded()) {
>>>>>>                 promise.tryComplete(result.result());
>>>>>>             } else {
>>>>>>                 promise.tryFail(result.cause());
>>>>>>             }
>>>>>>         });
>>>>>>
>>>>>>         return promise.future();
>>>>>>     }
>>>>>>
>>>>>>     public <T> T cache() {
>>>>>>         return (T) cache;
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, 13 Aug 2020 at 16:29, Denis Magda <[email protected]> wrote:
>>>>>>
>>>>>>> I've created a simple test and always get the exception below on an
>>>>>>> attempt to get a reference to an IgniteCache instance when the cluster
>>>>>>> is not activated:
>>>>>>>
>>>>>>> *Exception in thread "main" class org.apache.ignite.IgniteException:
>>>>>>> Can not perform the operation because the cluster is inactive. Note,
>>>>>>> that the cluster is considered inactive by default if Ignite Persistent
>>>>>>> Store is used to let all the nodes join the cluster. To activate the
>>>>>>> cluster call Ignite.active(true)*
>>>>>>>
>>>>>>> Are you trying to get a new IgniteCache reference whenever the client
>>>>>>> reconnects successfully to the cluster? My gut feeling is that
>>>>>>> currently Ignite verifies the activation status and generates the
>>>>>>> exception above whenever you get a reference to an IgniteCache or
>>>>>>> IgniteCompute. But once you have those references and try to run some
>>>>>>> operations, those get stuck if the cluster is not activated.
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The cache.query() starts to block when Ignite server nodes are being
>>>>>>>> restarted and there's no baseline topology yet. The server nodes do
>>>>>>>> not block. It's the client that blocks.
>>>>>>>>
>>>>>>>> The dump files are from the server nodes. The screenshot is from the
>>>>>>>> client app, taken with the YourKit profiler on the client side; the
>>>>>>>> threads are marked in red in YourKit.
>>>>>>>>
>>>>>>>> The app is simple: on an HTTP request, it runs a cache SQL query on
>>>>>>>> Ignite and, if it succeeds, does a put back to Ignite.
>>>>>>>>
>>>>>>>> The client disconnected exception only happens when all server nodes
>>>>>>>> in the cluster are down. The blockage only happens when the cluster
>>>>>>>> is trying to establish baseline topology.
>>>>>>>>
>>>>>>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> John,
>>>>>>>>>
>>>>>>>>> I don't see any signs of an application-caused deadlock in the
>>>>>>>>> thread dumps. Please elaborate on the following:
>>>>>>>>>
>>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>>>> ClientDisconnectedException but application still able to complete its
>>>>>>>>>> request.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What's the IP address of the server node the client app uses to
>>>>>>>>> join the cluster? If that's not the address of the 1st node, that is
>>>>>>>>> already restarted, then the client couldn't join the cluster and it's
>>>>>>>>> expected that it fails with the ClientDisconnectedException.
>>>>>>>>>
>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>>>>>>> block.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Are the operations unblocked and completed successfully when the
>>>>>>>>> third node joins the cluster and the cluster gets activated
>>>>>>>>> automatically?
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Ok Denis, here they are...
>>>>>>>>>>
>>>>>>>>>> 3 nodes, and I captured a YourKit screenshot of what it thinks are
>>>>>>>>>> deadlocks on the client app.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>>>>>>
>>>>>>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the
>>>>>>>>>>> query that blocks.
>>>>>>>>>>>
>>>>>>>>>>> My application first runs a select on the cache and then
>>>>>>>>>>> does a put to the cache.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> John,
>>>>>>>>>>>>
>>>>>>>>>>>> It sounds like a deadlock caused by the application logic. Is there
>>>>>>>>>>>> any chance that the operation you run on step 8 accesses several
>>>>>>>>>>>> keys in one order while the other operations work with the same
>>>>>>>>>>>> keys but in a different order? Deadlocks are possible when you use
>>>>>>>>>>>> the Ignite Transaction API or simply execute bulk operations such
>>>>>>>>>>>> as cache.getAll(..) or cache.putAll(..).
>>>>>>>>>>>>
>>>>>>>>>>>> Please take and attach thread dumps from all the cluster nodes
>>>>>>>>>>>> for analysis if we need to dig deeper.
>>>>>>>>>>>>
>>>>>>>>>>>> -
>>>>>>>>>>>> Denis
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Denis, I think you are right. It's the query that blocks;
>>>>>>>>>>>>> the other k/v operations are OK.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried with 2.8.1, same issue. Operations block
>>>>>>>>>>>>>> indefinitely...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>>>>>>> 2- Start client application (client = true) with
>>>>>>>>>>>>>> Ignition.start()
>>>>>>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>>>>>>> 6- Shut down 3rd node, run operation, still ok...
>>>>>>>>>>>>>> Operations start failing with ClientDisconnectedException...
>>>>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails
>>>>>>>>>>>>>> with ClientDisconnectedException but application still able to
>>>>>>>>>>>>>> complete its request.
>>>>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations
>>>>>>>>>>>>>> just block.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically the client application is an HTTP server that, on each
>>>>>>>>>>>>>> HTTP request, does a cache operation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The only time I get an exception is if the cluster is
>>>>>>>>>>>>>>> completely off; then I get ClientDisconnectedException...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put)
>>>>>>>>>>>>>>>> and compute calls fail with an exception if the cluster is
>>>>>>>>>>>>>>>> deactivated. Do those fail on your end?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>>>>>>> community members say.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi any thoughts on this?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there a way to timeout and at least have the
>>>>>>>>>>>>>>>>>> application continue and respond with an appropriate message?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi running 2.7.0
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the cluster, or the
>>>>>>>>>>>>>>>>>>> cluster is not yet activated with baseline topology, operations
>>>>>>>>>>>>>>>>>>> seem to block forever, including operations that are supposed to
>>>>>>>>>>>>>>>>>>> return IgniteFuture, i.e. putAsync, getAsync, etc. They just block
>>>>>>>>>>>>>>>>>>> until the cluster resolves its state.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
