Re: Operation block on Cluster recovery/rebalance.

Denis Magda Thu, 13 Aug 2020 13:29:56 -0700

I've created a simple test and always get the exception below when I try
to get a reference to an IgniteCache instance while the cluster is not
activated:


*Exception in thread "main" class org.apache.ignite.IgniteException: Can
not perform the operation because the cluster is inactive. Note, that the
cluster is considered inactive by default if Ignite Persistent Store is
used to let all the nodes join the cluster. To activate the cluster call
Ignite.active(true)*
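
Given that message, one option on the client side is to wait for activation
before asking for a cache reference. Below is a minimal sketch, not a
definitive implementation. ActivationGate and awaitTrue are hypothetical names
and not part of the Ignite API. The sketch also assumes that the activation
flag (for example ignite.cluster().active()) can be read from a client node
without blocking, which I have not verified for every version.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of the Ignite API. */
final class ActivationGate {
    private ActivationGate() {}

    /**
     * Polls the check every pollMs until it returns true or timeoutMs elapses.
     * Returns true if the condition was met, false on timeout or interrupt.
     */
    static boolean awaitTrue(BooleanSupplier check, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the condition never became true
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

Assumed usage in the client app: call
ActivationGate.awaitTrue(() -> ignite.cluster().active(), 5_000, 200) and
answer the HTTP request with an error if it returns false, instead of
touching the cache.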

Are you trying to get a new IgniteCache reference whenever the client
reconnects successfully to the cluster? My gut feeling is that Ignite
currently verifies the activation status and throws the exception above
whenever you get a reference to an IgniteCache or IgniteCompute. But if you
already hold those references and try to run operations on them, the
operations get stuck while the cluster is not activated.
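
If operations on an already obtained reference do get stuck, the application
can at least stop waiting and respond with an error. Below is a minimal
sketch using only the JDK. BoundedCall and callOrFallback are hypothetical
names, not Ignite API. The sketch assumes the blocking Ignite call (for
example the cache.query(..) from your earlier message) is passed in as a
Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of the Ignite API. */
final class BoundedCall {
    // Daemon threads, so a call that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    private BoundedCall() {}

    /**
     * Runs a blocking call on a worker thread and waits at most timeoutMs.
     * Returns fallback if the call does not finish in time.
     */
    static <T> T callOrFallback(Callable<T> call, long timeoutMs, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; the call may ignore it
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        }
    }
}
```

Keep in mind that this only frees the calling thread. If the underlying
operation ignores the interrupt, the worker thread stays blocked until the
cluster resolves its state, so repeated timeouts can pile up worker threads.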
-
Denis


On Thu, Aug 13, 2020 at 6:37 AM John Smith <[email protected]> wrote:

> The cache.query() starts to block when ignite server nodes are being
> restarted and there's no baseline topology yet. The server nodes do not
> block. It's the client that blocks.
>
> The dump files are of the server nodes. The screenshot is from the client
> app, taken with the YourKit profiler on the client side; the blocked
> threads are marked in red in YourKit.
>
> The app is simple: on an HTTP request it runs a cache SQL query on Ignite
> and, if that succeeds, does a put back to Ignite.
>
> The Client disconnected exception only happens when all server nodes in
> the cluster are down. The blockage only happens when the cluster is trying
> to establish baseline topology.
>
> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <[email protected]> wrote:
>
>> John,
>>
>> I don't see any signs of an application-caused deadlock in the thread
>> dumps. Please elaborate on the following:
>>
>>> 7- Restart 1st node, run operation, operation fails with
>>> ClientDisconnectedException but application still able to complete its
>>> request.
>>
>>
>> What's the IP address of the server node the client app uses to join the
>> cluster? If that's not the address of the 1st node, that is already
>> restarted, then the client couldn't join the cluster and it's expected that
>> it fails with the ClientDisconnectedException.
>>
>>> 8- Start 2nd node, run operation, from here on all operations just block.
>>
>>
>> Are the operations unblocked and completed successfully when the third
>> node joins the cluster and the cluster gets activated automatically?
>>
>> -
>> Denis
>>
>>
>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <[email protected]>
>> wrote:
>>
>>> Ok Denis here they are...
>>>
>>> 3 nodes, and I captured a YourKit screenshot of what it thinks are
>>> deadlocks on the client app.
>>>
>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>
>>> On Wed, 12 Aug 2020 at 11:07, John Smith <[email protected]> wrote:
>>>
>>>> Hi Denis. I will ASAP, but I think you were right: it is the query
>>>> that blocks.
>>>>
>>>> My application first runs a select on the cache and then does a
>>>> put to the cache.
>>>>
>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <[email protected]> wrote:
>>>>
>>>>> John,
>>>>>
>>>>> It sounds like a deadlock caused by the application logic. Is there
>>>>> any chance that the operation you run on step 8 accesses several keys in
>>>>> one order while the other operations work with the same keys but in a
>>>>> different order? Deadlocks are possible when you use the Ignite
>>>>> Transaction API or simply execute bulk operations such as
>>>>> cache.getAll(..) or cache.putAll(..).
>>>>>
>>>>> Please take and attach thread dumps from all the cluster nodes for
>>>>> analysis if we need to dig deeper.
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Denis, I think you are right. It's the query that blocks; the other
>>>>>> k/v operations are ok.
>>>>>>
>>>>>> Any thoughts on this?
>>>>>>
>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I tried with 2.8.1, same issue. Operations block indefinitely...
>>>>>>>
>>>>>>> 1- Start 3 node cluster
>>>>>>> 2- Start client application client = true with Ignition.start()
>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>> 6- Shut down 3rd node, run operation, still ok... Operations start
>>>>>>> failing with ClientDisconnectedException...
>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>> ClientDisconnectedException but application still able to complete its
>>>>>>> request.
>>>>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>>>> block.
>>>>>>>
>>>>>>> Basically the client application is an HTTP server that, on each HTTP
>>>>>>> request, does a cache operation.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>
>>>>>>>> The only time I get an exception is if the cluster is completely off;
>>>>>>>> then I get ClientDisconnectedException...
>>>>>>>>
>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put) and
>>>>>>>>> compute calls fail with an exception if the cluster is deactivated. Do
>>>>>>>>> those fail on your end?
>>>>>>>>>
>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>> community members say.
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi any thoughts on this?
>>>>>>>>>>
>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>
>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>
>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>
>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>
>>>>>>>>>>> Is there a way to timeout and at least have the application
>>>>>>>>>>> continue and respond with an appropriate message?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi running 2.7.0
>>>>>>>>>>>>
>>>>>>>>>>>> When I reboot a node and it begins to rejoin the cluster, or the
>>>>>>>>>>>> cluster is not yet activated with baseline topology, operations
>>>>>>>>>>>> seem to block forever, even operations that are supposed to return
>>>>>>>>>>>> an IgniteFuture, i.e. putAsync, getAsync etc. They just block
>>>>>>>>>>>> until the cluster resolves its state.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
