Re: Operation block on Cluster recovery/rebalance.

John Smith Tue, 18 Aug 2020 08:37:54 -0700

Hi Denis, for everyone's reference:
https://issues.apache.org/jira/browse/IGNITE-13372
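
For anyone who lands on this thread before the ticket is resolved, here is a minimal sketch of the client-side guard discussed further down: a flag that lets the repository skip operations, plus a hard timeout around the blocking call. It is plain Java with no Ignite or Vert.x dependency. ClusterGuard, OperationTimedOutException and all method names are hypothetical and only for illustration. Tripping the flag on a timeout is an assumption of this sketch, since the thread suggests the client gets no disconnect error while the cluster is merely inactive. Whether a stuck cache.query() reacts to the interrupt sent by cancel(true) has not been verified, so a worker thread may stay parked until the cluster activates.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper for illustration; not part of Ignite or Vert.x.
final class ClusterGuard {
    // Unchecked, so an HTTP handler can map it to an error response directly.
    static final class OperationTimedOutException extends RuntimeException {
        OperationTimedOutException(String msg) { super(msg); }
    }

    private final AtomicBoolean available = new AtomicBoolean(true);
    private final ExecutorService pool;

    ClusterGuard(ExecutorService pool) { this.pool = pool; }

    static ClusterGuard withCachedPool() {
        return new ClusterGuard(Executors.newCachedThreadPool());
    }

    // Meant to be called from disconnect/reconnect handlers or a health probe
    // (that wiring is an assumption and is not shown here).
    void markUnavailable() { available.set(false); }
    void markAvailable() { available.set(true); }
    boolean isAvailable() { return available.get(); }

    void shutdown() { pool.shutdownNow(); }

    // Runs op on the pool; fails fast if the flag is down, gives up after timeoutMs.
    <T> T call(Callable<T> op, long timeoutMs) {
        if (!available.get())
            throw new IllegalStateException("Cluster marked unavailable; operation skipped");

        Future<T> pending = pool.submit(op);
        try {
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true);   // best effort; the blocked call may ignore the interrupt
            available.set(false);   // later calls fail fast until markAvailable() is called
            throw new OperationTimedOutException(
                "Cache operation did not complete within " + timeoutMs + " ms");
        } catch (ExecutionException e) {
            throw new CompletionException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for cache operation", e);
        }
    }
}
```

In the repository quoted below, the body passed to executeBlocking could go through guard.call(() -> ..., timeoutMs) instead of calling cache.query(query) directly. markAvailable() would then need a trigger, for example a listener for EVT_CLIENT_NODE_RECONNECTED or a periodic probe; which of these fits is left open here.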


On Mon, 17 Aug 2020 at 14:28, Denis Magda <[email protected]> wrote:

>> But on client reconnect, doesn't it mean it will still block until the
>> cluster is active even if I get a new IgniteCache instance?
>
>
> No, the client will be getting an exception on an attempt to get an
> IgniteCache instance.
>
> -
> Denis
>
>
> On Fri, Aug 14, 2020 at 4:14 PM John Smith <[email protected]> wrote:
>
>> Yeah, I can maybe use the Vert.x event bus or something to do this... But
>> now I have to tie the Ignite instance to the IgniteCache repository I wrote.
>>
>> But on client reconnect, doesn't it mean it will still block until the
>> cluster is active even if I get a new IgniteCache instance?
>>
>> On Fri, 14 Aug 2020 at 18:22, Denis Magda <[email protected]> wrote:
>>
>>> @Evgenii Zhuravlev <[email protected]>, @Ilya Kasnacheev
>>> <[email protected]>, any thoughts on this?
>>>
>>> As a dirty workaround, you can update your cache references on client
>>> reconnect events. You will get an exception when calling
>>> ignite.cache(cacheName) while the cluster is not yet activated.
>>> Does this work for you?
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Fri, Aug 14, 2020 at 3:12 PM John Smith <[email protected]>
>>> wrote:
>>>
>>>> Is there any workaround? I can't have an HTTP server block on all
>>>> requests.
>>>>
>>>> 1- I need to figure out why I lose a server node every few weeks;
>>>> rebooting the nodes then leaves the cluster inactive until they are
>>>> back...
>>>>
>>>> 2- Implement some kind of logic on the client side not to block the
>>>> HTTP part...
>>>>
>>>> Can an IgniteCache instance be notified of disconnect events, so I can
>>>> maybe set a flag in my repository class to skip the operation?
>>>>
>>>>
>>>> On Fri., Aug. 14, 2020, 5:17 p.m. Denis Magda, <[email protected]>
>>>> wrote:
>>>>
>>>>> My guess is that it's standard behavior for all operations (SQL,
>>>>> key-value, compute, etc.). But I'll let the maintainers of those modules
>>>>> clarify.
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Fri, Aug 14, 2020 at 1:44 PM John Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Denis, so just to understand: is it all operations or just the query?
>>>>>>
>>>>>> On Fri., Aug. 14, 2020, 12:53 p.m. Denis Magda, <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> Ok, we nailed it. That's the current expected behavior. Generally, I
>>>>>>> agree with you that the platform should support an option where
>>>>>>> operations fail if the cluster is deactivated. Could you propose the
>>>>>>> change by starting a discussion on the dev list? You can point to this
>>>>>>> user list discussion for reference. Let me know if you need help with
>>>>>>> this.
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 5:55 PM John Smith <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> No, I reuse the instance. The cache instance is created once at
>>>>>>>> startup of the application and I pass it to my "repository" class:
>>>>>>>>
>>>>>>>> public abstract class AbstractIgniteRepository<K, V> implements CacheRepository<K, V> {
>>>>>>>>     public final long DEFAULT_OPERATION_TIMEOUT = 2000;
>>>>>>>>
>>>>>>>>     private Vertx vertx;
>>>>>>>>     private IgniteCache<K, V> cache;
>>>>>>>>
>>>>>>>>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>>>>>>>>         this.vertx = vertx;
>>>>>>>>         this.cache = cache;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>     Future<List<JsonArray>> query(final String sql, final long timeoutMs, final Object... args) {
>>>>>>>>         final Promise<List<JsonArray>> promise = Promise.promise();
>>>>>>>>
>>>>>>>>         // THIS FIRES IF THE BLOCK BELOW DOESN'T COMPLETE IN TIME.
>>>>>>>>         vertx.setTimer(timeoutMs, l -> {
>>>>>>>>             promise.tryFail(new TimeoutException("Cache operation did not complete within: " + timeoutMs + " Ms."));
>>>>>>>>         });
>>>>>>>>
>>>>>>>>         vertx.<List<JsonArray>>executeBlocking(code -> {
>>>>>>>>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>>>>>>>>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>>>>>>>>
>>>>>>>>             try (QueryCursor<List<?>> cursor = cache.query(query)) { // <--- BLOCKS HERE.
>>>>>>>>                 List<JsonArray> rows = new ArrayList<>();
>>>>>>>>                 Iterator<List<?>> iterator = cursor.iterator();
>>>>>>>>
>>>>>>>>                 while(iterator.hasNext()) {
>>>>>>>>                     List<?> currentRow = iterator.next();
>>>>>>>>                     JsonArray row = new JsonArray();
>>>>>>>>
>>>>>>>>                     currentRow.forEach(o -> row.add(o));
>>>>>>>>
>>>>>>>>                     rows.add(row);
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 code.complete(rows);
>>>>>>>>             } catch(Exception ex) {
>>>>>>>>                 code.fail(ex);
>>>>>>>>             }
>>>>>>>>         }, result -> {
>>>>>>>>             if(result.succeeded()) {
>>>>>>>>                 promise.tryComplete(result.result());
>>>>>>>>             } else {
>>>>>>>>                 promise.tryFail(result.cause());
>>>>>>>>             }
>>>>>>>>         });
>>>>>>>>
>>>>>>>>         return promise.future();
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     public <T> T cache() {
>>>>>>>>         return (T) cache;
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 13 Aug 2020 at 16:29, Denis Magda <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I've created a simple test and I always get the exception below
>>>>>>>>> on an attempt to get a reference to an IgniteCache instance when
>>>>>>>>> the cluster is not activated:
>>>>>>>>>
>>>>>>>>> *Exception in thread "main" class
>>>>>>>>> org.apache.ignite.IgniteException: Can not perform the operation
>>>>>>>>> because the cluster is inactive. Note, that the cluster is
>>>>>>>>> considered inactive by default if Ignite Persistent Store is used
>>>>>>>>> to let all the nodes join the cluster. To activate the cluster call
>>>>>>>>> Ignite.active(true)*
>>>>>>>>>
>>>>>>>>> Are you trying to get a new IgniteCache reference whenever the
>>>>>>>>> client reconnects successfully to the cluster? My gut feeling is
>>>>>>>>> that, currently, Ignite verifies the activation status and generates
>>>>>>>>> the exception above whenever you get a reference to an IgniteCache
>>>>>>>>> or IgniteCompute. But once you have those references and try to run
>>>>>>>>> operations, those get stuck if the cluster is not activated.
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The cache.query() starts to block when Ignite server nodes are
>>>>>>>>>> being restarted and there's no baseline topology yet. The server
>>>>>>>>>> nodes do not block. It's the client that blocks.
>>>>>>>>>>
>>>>>>>>>> The dump files are from the server nodes. The screenshot is from
>>>>>>>>>> the client app, taken with the YourKit profiler; on the client side
>>>>>>>>>> the threads are marked red in YourKit.
>>>>>>>>>>
>>>>>>>>>> The app is simple: on an HTTP request, it runs a cache SQL query on
>>>>>>>>>> Ignite and, if that succeeds, does a put back to Ignite.
>>>>>>>>>>
>>>>>>>>>> The client disconnected exception only happens when all server
>>>>>>>>>> nodes in the cluster are down. The blockage only happens when the
>>>>>>>>>> cluster is trying to establish baseline topology.
>>>>>>>>>>
>>>>>>>>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> I don't see any signs of an application-caused deadlock in the
>>>>>>>>>>> thread dumps. Please elaborate on the following:
>>>>>>>>>>>
>>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>>>>>> ClientDisconnectedException but application still able to
>>>>>>>>>>>> complete its request.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What's the IP address of the server node the client app uses to
>>>>>>>>>>> join the cluster? If that's not the address of the 1st node, which
>>>>>>>>>>> has already been restarted, then the client can't join the cluster
>>>>>>>>>>> and it's expected to fail with the ClientDisconnectedException.
>>>>>>>>>>>
>>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations
>>>>>>>>>>>> just block.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Are the operations unblocked and completed successfully when the
>>>>>>>>>>> third node joins the cluster and the cluster gets activated
>>>>>>>>>>> automatically?
>>>>>>>>>>>
>>>>>>>>>>> -
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok Denis here they are...
>>>>>>>>>>>>
>>>>>>>>>>>> 3 nodes, and I captured a YourKit screenshot of what it thinks
>>>>>>>>>>>> are deadlocks on the client app.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the
>>>>>>>>>>>>> query that blocks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My application first runs a select on the cache and then
>>>>>>>>>>>>> does a put to the cache.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> John,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It sounds like a deadlock caused by the application logic. Is
>>>>>>>>>>>>>> there any chance that the operation you run on step 8 accesses
>>>>>>>>>>>>>> several keys in one order while the other operations work with
>>>>>>>>>>>>>> the same keys but in a different order? Deadlocks are possible
>>>>>>>>>>>>>> when you use the Ignite transaction API or simply execute bulk
>>>>>>>>>>>>>> operations such as cache.getAll(..) or cache.putAll(..).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please take and attach thread dumps from all the cluster
>>>>>>>>>>>>>> nodes for analysis if we need to dig deeper.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Denis, I think you are right. It's the query that blocks;
>>>>>>>>>>>>>>> the other k/v operations are ok.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried with 2.8.1, same issue. Operations block
>>>>>>>>>>>>>>>> indefinitely...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>>>>>>>>> 2- Start the client application (client = true) with
>>>>>>>>>>>>>>>> Ignition.start()
>>>>>>>>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>>>>>>>>> 6- Shut down 3rd node, run operation; operations start
>>>>>>>>>>>>>>>> failing with ClientDisconnectedException...
>>>>>>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails
>>>>>>>>>>>>>>>> with ClientDisconnectedException but application still able
>>>>>>>>>>>>>>>> to complete its request.
>>>>>>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all
>>>>>>>>>>>>>>>> operations just block.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Basically, the client application is an HTTP server that, on
>>>>>>>>>>>>>>>> each HTTP request, does a cache operation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The only time I get an exception is if the cluster is
>>>>>>>>>>>>>>>>> completely off; then I get ClientDisconnectedException...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put)
>>>>>>>>>>>>>>>>>> and compute calls fail with an exception if the cluster is
>>>>>>>>>>>>>>>>>> deactivated. Do those fail on your end?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>>>>>>>>> community members say.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi, any thoughts on this?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is there a way to time out and at least have the
>>>>>>>>>>>>>>>>>>>> application continue and respond with an appropriate
>>>>>>>>>>>>>>>>>>>> message?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi, I'm running 2.7.0.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the
>>>>>>>>>>>>>>>>>>>>> cluster, or the cluster is not yet activated with a
>>>>>>>>>>>>>>>>>>>>> baseline topology, operations seem to block forever,
>>>>>>>>>>>>>>>>>>>>> even operations that are supposed to return an
>>>>>>>>>>>>>>>>>>>>> IgniteFuture, i.e. putAsync, getAsync, etc. They just
>>>>>>>>>>>>>>>>>>>>> block until the cluster resolves its state.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
