Re: Operation block on Cluster recovery/rebalance.

Denis Magda Fri, 14 Aug 2020 09:53:16 -0700

John,

Ok, we nailed it. That's the current expected behavior. Generally, I agree
with you that the platform should support an option to make operations fail
if the cluster is deactivated. Could you propose the change by starting a
discussion on the dev list? You can point to this user list discussion for
reference. Let me know if you need help with this.
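
Until such an option exists, one application-side workaround is to bound
each blocking call with a deadline so the caller fails fast instead of
hanging. The sketch below uses only JDK classes; BoundedCall, call and
OperationTimeoutException are hypothetical names for illustration, not part
of Ignite or Vert.x. It is similar in spirit to the Vert.x timer in the
quoted code below, and the Callable you pass in would wrap a call such as
cache.query(...).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (assumption): not an Ignite or Vert.x API.
final class BoundedCall {

    /** Unchecked marker so callers can tell a timeout from other failures. */
    static final class OperationTimeoutException extends RuntimeException {
        OperationTimeoutException(String message) {
            super(message);
        }
    }

    private BoundedCall() {
    }

    /** Runs op on the pool and waits at most timeoutMs for its result. */
    static <T> T call(ExecutorService pool, Callable<T> op, long timeoutMs) {
        Future<T> future = pool.submit(op);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Frees the caller; the worker is only interrupted, not killed.
            future.cancel(true);
            throw new OperationTimeoutException(
                "Operation did not complete within " + timeoutMs + " ms");
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            // Rethrow the operation's own failure where possible.
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            throw new IllegalStateException("Operation failed", cause);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // A call that returns quickly passes straight through.
            String fast = call(pool, () -> "ok", 1000);
            System.out.println(fast);

            // Thread.sleep stands in for an operation that blocks.
            call(pool, () -> { Thread.sleep(5000); return "late"; }, 100);
        } catch (OperationTimeoutException e) {
            System.out.println("Timed out: " + e.getMessage());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that cancel(true) only requests an interrupt. If the blocked call does
not respond to interruption, the worker thread stays occupied until the
cluster is activated, so the pool should be sized with that in mind.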


-
Denis


On Thu, Aug 13, 2020 at 5:55 PM John Smith <[email protected]> wrote:

> No, I reuse the instance. The cache instance is created once at startup of
> the application and I pass it to my "repository" class:
>
> public abstract class AbstractIgniteRepository<K, V> implements CacheRepository<K, V> {
>     public final long DEFAULT_OPERATION_TIMEOUT = 2000;
>
>     private Vertx vertx;
>     private IgniteCache<K, V> cache;
>
>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>         this.vertx = vertx;
>         this.cache = cache;
>     }
>
> ...
>
>     Future<List<JsonArray>> query(final String sql, final long timeoutMs, final Object... args) {
>         final Promise<List<JsonArray>> promise = Promise.promise();
>
>         vertx.setTimer(timeoutMs, l -> {
>             // THIS FIRES IF THE BLOCK BELOW DOESN'T COMPLETE IN TIME.
>             promise.tryFail(new TimeoutException("Cache operation did not complete within: " + timeoutMs + " Ms."));
>         });
>
>         vertx.<List<JsonArray>>executeBlocking(code -> {
>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>
>
>             try (QueryCursor<List<?>> cursor = cache.query(query)) { // <--- BLOCKS HERE.
>                 List<JsonArray> rows = new ArrayList<>();
>                 Iterator<List<?>> iterator = cursor.iterator();
>
>                 while(iterator.hasNext()) {
>                     List currentRow = iterator.next();
>                     JsonArray row = new JsonArray();
>
>                     currentRow.forEach(o -> row.add(o));
>
>                     rows.add(row);
>                 }
>
>                 code.complete(rows);
>             } catch(Exception ex) {
>                 code.fail(ex);
>             }
>         }, result -> {
>             if(result.succeeded()) {
>                 promise.tryComplete(result.result());
>             } else {
>                 promise.tryFail(result.cause());
>             }
>         });
>
>         return promise.future();
>     }
>
>     public <T> T cache() {
>         return (T) cache;
>     }
> }
>
>
>
> On Thu, 13 Aug 2020 at 16:29, Denis Magda <[email protected]> wrote:
>
>> I've created a simple test and always getting the exception below on an
>> attempt to get a reference to an IgniteCache instance in cases when the
>> cluster is not activated:
>>
>> *Exception in thread "main" class org.apache.ignite.IgniteException: Can
>> not perform the operation because the cluster is inactive. Note, that the
>> cluster is considered inactive by default if Ignite Persistent Store is
>> used to let all the nodes join the cluster. To activate the cluster call
>> Ignite.active(true)*
>>
>> Are you trying to get a new IgniteCache reference whenever the client
>> reconnects successfully to the cluster? My gut feeling is that Ignite
>> currently verifies the activation status and generates the exception above
>> whenever you get a reference to an IgniteCache or IgniteCompute. But once
>> you have those references and try to run operations, the operations get
>> stuck if the cluster is not activated.
>> -
>> Denis
>>
>>
>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <[email protected]>
>> wrote:
>>
>>> The cache.query() starts to block when ignite server nodes are being
>>> restarted and there's no baseline topology yet. The server nodes do not
>>> block. It's the client that blocks.
>>>
>>> The dump files are from the server nodes. The screenshot is from the
>>> client app, taken with the YourKit profiler on the client side; the
>>> threads are marked red in YourKit.
>>>
>>> The app is simple: it receives an HTTP request, runs a cache SQL query on
>>> Ignite and, if that succeeds, does a put back to Ignite.
>>>
>>> The Client disconnected exception only happens when all server nodes in
>>> the cluster are down. The blockage only happens when the cluster is trying
>>> to establish baseline topology.
>>>
>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <[email protected]>
>>> wrote:
>>>
>>>> John,
>>>>
>>>> I don't see any signs of an application-caused deadlock in the thread
>>>> dumps. Please elaborate on the following:
>>>>
>>>> 7- Restart 1st node, run operation, operation fails with
>>>>> ClientDisconnectedException but application is still able to complete its
>>>>> request.
>>>>
>>>>
>>>> What's the IP address of the server node the client app uses to join
>>>> the cluster? If that's not the address of the 1st node, that is already
>>>> restarted, then the client couldn't join the cluster and it's expected that
>>>> it fails with the ClientDisconnectedException.
>>>>
>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>> block.
>>>>
>>>>
>>>> Are the operations unblocked and completed successfully when the third
>>>> node joins the cluster and the cluster gets activated automatically?
>>>>
>>>> -
>>>> Denis
>>>>
>>>>
>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <[email protected]>
>>>> wrote:
>>>>
>>>>> Ok Denis here they are...
>>>>>
>>>>> 3 nodes, and I captured a YourKit screenshot of what it thinks are
>>>>> deadlocks on the client app.
>>>>>
>>>>>
>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>
>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the query
>>>>>> that blocks.
>>>>>>
>>>>>> My application first runs a select on the cache and then does a
>>>>>> put to cache.
>>>>>>
>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <[email protected]> wrote:
>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> It sounds like a deadlock caused by the application logic. Is there
>>>>>>> any chance that the operation you run on step 8 accesses several keys in
>>>>>>> one order while the other operations work with the same keys but in a
>>>>>>> different order? Deadlocks are possible when you use the Ignite
>>>>>>> Transaction API or simply execute bulk operations such as
>>>>>>> cache.getAll() or cache.putAll(..).
>>>>>>>
>>>>>>> Please take and attach thread dumps from all the cluster nodes for
>>>>>>> analysis if we need to dig deeper.
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Denis, I think you are right. It's the query that blocks; the
>>>>>>>> other k/v operations are OK.
>>>>>>>>
>>>>>>>> Any thoughts on this?
>>>>>>>>
>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I tried with 2.8.1, same issue. Operations block indefinitely...
>>>>>>>>>
>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>> 2- Start client application (client = true) with Ignition.start()
>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>> 6- Shut down 3rd node, run operation, still ok... Operations start
>>>>>>>>> failing with ClientDisconnectedException...
>>>>>>>>> 7- Restart 1st node, run operation, operation fails
>>>>>>>>> with ClientDisconnectedException but application is still able to
>>>>>>>>> complete its request.
>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>>>>>> block.
>>>>>>>>>
>>>>>>>>> Basically, the client application is an HTTP server that performs a
>>>>>>>>> cache operation on each HTTP request.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>
>>>>>>>>>> The only time I get an exception is if the cluster is completely
>>>>>>>>>> off; then I get ClientDisconnectedException...
>>>>>>>>>>
>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put) and
>>>>>>>>>>> compute calls fail with an exception if the cluster is deactivated.
>>>>>>>>>>> Do those fail on your end?
>>>>>>>>>>>
>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>> community members say.
>>>>>>>>>>>
>>>>>>>>>>> -
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi any thoughts on this?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>
>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>
>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a way to timeout and at least have the application
>>>>>>>>>>>>> continue and respond with an appropriate message?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, running 2.7.0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the cluster, or
>>>>>>>>>>>>>> the cluster is not yet activated with baseline topology,
>>>>>>>>>>>>>> operations seem to block forever, including operations that are
>>>>>>>>>>>>>> supposed to return IgniteFuture, i.e. putAsync, getAsync, etc.
>>>>>>>>>>>>>> They just block until the cluster resolves its state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
