Re: Operation block on Cluster recovery/rebalance.

John Smith Fri, 14 Aug 2020 13:44:48 -0700

Hi Denis, just to understand: is it all operations or just the query?
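
Whichever operations turn out to be affected, a caller-side guard is one way to keep the application responsive while the cluster is not yet activated. Below is a minimal sketch using only java.util.concurrent. `BoundedCall` and `callWithTimeout` are hypothetical names, not part of Ignite or Vert.x, and it is an assumption (not something established in this thread) that a stuck Ignite call reacts to the interrupt sent on timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a blocking call on a worker thread with a deadline. */
final class BoundedCall {
    // Daemon threads, so a call that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    private BoundedCall() {}

    /**
     * Returns the result of op, or throws TimeoutException if op has not
     * finished within timeoutMs milliseconds.
     */
    static <T> T callWithTimeout(Callable<T> op, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> pending = POOL.submit(op);
        try {
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker thread. A call that
            // ignores interrupts keeps occupying its thread until it returns.
            pending.cancel(true);
            throw e;
        }
    }
}
```

In the application this would wrap the blocking call, for example `BoundedCall.callWithTimeout(() -> cache.query(query).getAll(), 2000)`, so the HTTP handler can answer with an error instead of hanging. The caller gets control back, but the worker thread may stay blocked until the cluster is activated.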

On Fri., Aug. 14, 2020, 12:53 p.m. Denis Magda, <[email protected]> wrote:


> John,
>
> Ok, we nailed it. That's the current expected behavior. Generally, I agree
> with you that the platform should support an option when operations fail if
> the cluster is deactivated. Could you propose the change by starting a
> discussion on the dev list? You can refer to this user list discussion for
> reference. Let me know if you need help with this.
>
> -
> Denis
>
>
> On Thu, Aug 13, 2020 at 5:55 PM John Smith <[email protected]> wrote:
>
>> No, I reuse the instance. The cache instance is created once at startup
>> of the application and I pass it to my "repository" class
>>
>> public abstract class AbstractIgniteRepository<K,V> implements CacheRepository<K, V> {
>>     public final long DEFAULT_OPERATION_TIMEOUT = 2000;
>>
>>     private Vertx vertx;
>>     private IgniteCache<K, V> cache;
>>
>>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>>         this.vertx = vertx;
>>         this.cache = cache;
>>     }
>>
>> ...
>>
>>     Future<List<JsonArray>> query(final String sql, final long timeoutMs, final Object... args) {
>>         final Promise<List<JsonArray>> promise = Promise.promise();
>>
>>         vertx.setTimer(timeoutMs, l -> {
>>             // THIS FIRES IF THE CODE BELOW DOESN'T COMPLETE IN TIME.
>>             promise.tryFail(new TimeoutException("Cache operation did not complete within: " + timeoutMs + " Ms."));
>>         });
>>
>>         vertx.<List<JsonArray>>executeBlocking(code -> {
>>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>>
>>
>>             try (QueryCursor<List<?>> cursor = cache.query(query)) { // <--- BLOCKS HERE.
>>                 List<JsonArray> rows = new ArrayList<>();
>>                 Iterator<List<?>> iterator = cursor.iterator();
>>
>>                 while(iterator.hasNext()) {
>>                     List<?> currentRow = iterator.next();
>>                     JsonArray row = new JsonArray();
>>
>>                     currentRow.forEach(o -> row.add(o));
>>
>>                     rows.add(row);
>>                 }
>>
>>                 code.complete(rows);
>>             } catch(Exception ex) {
>>                 code.fail(ex);
>>             }
>>         }, result -> {
>>             if(result.succeeded()) {
>>                 promise.tryComplete(result.result());
>>             } else {
>>                 promise.tryFail(result.cause());
>>             }
>>         });
>>
>>         return promise.future();
>>     }
>>
>>     public <T> T cache() {
>>         return (T) cache;
>>     }
>> }
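
The repository class above races a Vert.x timer against the blocking query, but it never cancels the timer when the query finishes first. Below is a minimal sketch of the same race with the timer cancelled, written against plain java.util.concurrent so it runs without Vert.x or Ignite. `TimeoutRace` and `race` are hypothetical names, and the common pool used by `supplyAsync` stands in for the Vert.x worker pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical stand-in for the repository's timer-versus-result race. */
final class TimeoutRace {
    private static final ScheduledExecutorService TIMER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "timeout-race-timer");
            t.setDaemon(true);
            return t;
        });

    private TimeoutRace() {}

    static <T> CompletableFuture<T> race(Supplier<T> blockingOp, long timeoutMs) {
        CompletableFuture<T> result = new CompletableFuture<>();

        // Timer side: fails the result only if nothing has completed it yet.
        ScheduledFuture<?> timer = TIMER.schedule(() -> {
            result.completeExceptionally(new TimeoutException(
                "Cache operation did not complete within: " + timeoutMs + " ms"));
        }, timeoutMs, TimeUnit.MILLISECONDS);

        // Worker side: runs the blocking call off the caller's thread and
        // forwards its outcome; these calls are no-ops if the timer won.
        CompletableFuture.supplyAsync(blockingOp).whenComplete((value, error) -> {
            if (error != null) {
                result.completeExceptionally(error);
            } else {
                result.complete(value);
            }
        });

        // Whichever side wins, drop the pending timer so it does not linger.
        result.whenComplete((value, error) -> timer.cancel(false));
        return result;
    }
}
```

As in the original code, a timeout only releases the caller; the blocked worker keeps running until the underlying call returns.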
>>
>>
>>
>> On Thu, 13 Aug 2020 at 16:29, Denis Magda <[email protected]> wrote:
>>
>>> I've created a simple test and always getting the exception below on an
>>> attempt to get a reference to an IgniteCache instance in cases when the
>>> cluster is not activated:
>>>
>>> *Exception in thread "main" class org.apache.ignite.IgniteException: Can
>>> not perform the operation because the cluster is inactive. Note, that the
>>> cluster is considered inactive by default if Ignite Persistent Store is
>>> used to let all the nodes join the cluster. To activate the cluster call
>>> Ignite.active(true)*
>>>
>>> Are you trying to get a new IgniteCache reference whenever the client
>>> reconnects successfully to the cluster? My gut feeling is that, currently,
>>> Ignite verifies the activation status and generates the exception above
>>> whenever you get a reference to an IgniteCache or IgniteCompute. But once
>>> you have those references and try to run some operations, those get stuck
>>> if the cluster is not activated.
>>> -
>>> Denis
>>>
>>>
>>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <[email protected]>
>>> wrote:
>>>
>>>> The cache.query() starts to block when ignite server nodes are being
>>>> restarted and there's no baseline topology yet. The server nodes do not
>>>> block. It's the client that blocks.
>>>>
>>>> The dump files are from the server nodes. The screenshot is from the
>>>> client app, taken with the YourKit profiler on the client side; the
>>>> threads are marked red in YourKit.
>>>>
>>>> The app is simple: it receives an HTTP request, runs a cache SQL query on
>>>> Ignite and, if that succeeds, does a put back to Ignite.
>>>>
>>>> The Client disconnected exception only happens when all server nodes in
>>>> the cluster are down. The blockage only happens when the cluster is trying
>>>> to establish baseline topology.
>>>>
>>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <[email protected]>
>>>> wrote:
>>>>
>>>>> John,
>>>>>
>>>>> I don't see any signs of an application-caused deadlock in the thread
>>>>> dumps. Please elaborate on the following:
>>>>>
>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>> ClientDisconnectedException but application still able to complete its
>>>>>> request.
>>>>>
>>>>>
>>>>> What's the IP address of the server node the client app uses to join
>>>>> the cluster? If that's not the address of the 1st node, which has already
>>>>> been restarted, then the client couldn't join the cluster and it's
>>>>> expected that it fails with the ClientDisconnectedException.
>>>>>
>>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>>> block.
>>>>>
>>>>>
>>>>> Are the operations unblocked and completed successfully when the third
>>>>> node joins the cluster and the cluster gets activated automatically?
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Ok Denis here they are...
>>>>>>
>>>>>> 3 nodes, and I captured a YourKit screenshot of what it thinks are
>>>>>> deadlocks on the client app.
>>>>>>
>>>>>>
>>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>>
>>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the query
>>>>>>> that blocks.
>>>>>>>
>>>>>>> My application first runs a select on the cache and then does
>>>>>>> a put to the cache.
>>>>>>>
>>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <[email protected]> wrote:
>>>>>>>
>>>>>>>> John,
>>>>>>>>
>>>>>>>> It sounds like a deadlock caused by the application logic. Is there
>>>>>>>> any chance that the operation you run on step 8 accesses several keys
>>>>>>>> in one order while the other operations work with the same keys but in
>>>>>>>> a different order? Deadlocks are possible when you use the Ignite
>>>>>>>> Transaction API or simply execute bulk operations such as
>>>>>>>> cache.getAll(..) or cache.putAll(..).
>>>>>>>>
>>>>>>>> Please take and attach thread dumps from all the cluster nodes for
>>>>>>>> analysis if we need to dig deeper.
>>>>>>>>
>>>>>>>> -
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Denis, I think you are right. It's the query that blocks; the
>>>>>>>>> other k/v operations are OK.
>>>>>>>>>
>>>>>>>>> Any thoughts on this?
>>>>>>>>>
>>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I tried with 2.8.1, same issue. Operations block indefinitely...
>>>>>>>>>>
>>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>>> 2- Start client application client = true with Ignition.start()
>>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>>> 6- Shut down 3rd node, run operation, still ok...
>>>>>>>>>> Operations start failing with ClientDisconnectedException...
>>>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>>>> ClientDisconnectedException but application still able to complete
>>>>>>>>>> its request.
>>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations
>>>>>>>>>> just block.
>>>>>>>>>>
>>>>>>>>>> Basically the client application is an HTTP server that, on each HTTP
>>>>>>>>>> request, does a cache operation.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>>
>>>>>>>>>>> The only time I get an exception is if the cluster is completely
>>>>>>>>>>> off; then I get ClientDisconnectedException...
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put) and
>>>>>>>>>>>> compute calls fail with an exception if the cluster is
>>>>>>>>>>>> deactivated. Do those fail on your end?
>>>>>>>>>>>>
>>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>>> community members say.
>>>>>>>>>>>>
>>>>>>>>>>>> -
>>>>>>>>>>>> Denis
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi any thoughts on this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a way to timeout and at least have the application
>>>>>>>>>>>>>> continue and respond with an appropriate message?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi running 2.7.0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the cluster, or
>>>>>>>>>>>>>>> the cluster is not yet activated with a baseline topology,
>>>>>>>>>>>>>>> operations that are supposed to return an IgniteFuture (e.g.
>>>>>>>>>>>>>>> putAsync, getAsync, etc.) seem to block forever. They just
>>>>>>>>>>>>>>> block until the cluster resolves its state.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
