Hi Denis, for everyone's reference: https://issues.apache.org/jira/browse/IGNITE-13372
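Until that ticket is resolved, here is a minimal sketch of the client-side guard discussed in the thread below: skip the cache call while the cluster is flagged as unavailable, and bound every call with a timeout so an HTTP handler is never held indefinitely. It uses only the JDK (Java 9+). The class name GuardedCacheCalls and its methods are hypothetical, and wiring markUnavailable()/markAvailable() to Ignite's client disconnect/reconnect events is an assumption that is not shown here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical helper; not part of Ignite or Vert.x. */
final class GuardedCacheCalls implements AutoCloseable {
    // Flag the application flips when it learns the cluster is (un)available.
    private final AtomicBoolean available = new AtomicBoolean(true);

    // Daemon threads so a call that stays blocked cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "cache-call");
        t.setDaemon(true);
        return t;
    });

    void markUnavailable() { available.set(false); }

    void markAvailable() { available.set(true); }

    /**
     * Runs a blocking call (for example a cache query) off the caller's thread.
     * Fails fast if the flag is down, and fails with TimeoutException if the
     * call does not finish within timeoutMs. The timeout only releases the
     * caller; the worker thread stays blocked until the call itself returns.
     */
    <T> CompletableFuture<T> call(Supplier<T> blockingCall, long timeoutMs) {
        if (!available.get()) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("Cluster unavailable, operation skipped"));
        }
        return CompletableFuture.supplyAsync(blockingCall, pool)
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() { pool.shutdownNow(); }
}
```

In the repository class, the body of the blocking query would be passed as the Supplier, and a failed future would be turned into an error response. Because blocked workers are not released by the timeout, the flag matters: once the first calls time out, flipping it stops new threads from piling up.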
On Mon, 17 Aug 2020 at 14:28, Denis Magda <[email protected]> wrote:
> But on client reconnect, doesn't it mean it will still block until the
>> cluster is active even if I get new IgniteCache instance?
>
>
> No, the client will be getting an exception on an attempt to get an
> IgniteCache instance.
>
> -
> Denis
>
>
> On Fri, Aug 14, 2020 at 4:14 PM John Smith <[email protected]> wrote:
>
>> Yeah I can maybe use vertx event bus or something to do this... But now I
>> have to tie the ignite instance to the IgniteCache repository I wrote.
>>
>> But on client reconnect, doesn't it mean it will still block until the
>> cluster is active even if I get new IgniteCache instance?
>>
>> On Fri, 14 Aug 2020 at 18:22, Denis Magda <[email protected]> wrote:
>>
>>> @Evgenii Zhuravlev <[email protected]>, @Ilya Kasnacheev
>>> <[email protected]>, any thoughts on this?
>>>
>>> As a dirty workaround, you can update your cache references on client
>>> reconnect events. You will be getting an exception by calling
>>> ignite.cache(cacheName) while the cluster is not activated yet.
>>> Does this work for you?
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Fri, Aug 14, 2020 at 3:12 PM John Smith <[email protected]> wrote:
>>>
>>>> Is there any workaround? I can't have an HTTP server block on all
>>>> requests.
>>>>
>>>> 1- I need to figure out why I lose a server node every few weeks,
>>>> which when rebooting the nodes causes the inactive state until they are
>>>> back....
>>>>
>>>> 2- Implement some kind of logic on the client side not to block the
>>>> HTTP part...
>>>>
>>>> Can IgniteCache instance be notified of disconnected events so I can
>>>> maybe tell the repository class I have to set a flag to skip the operation?
>>>>
>>>>
>>>> On Fri., Aug. 14, 2020, 5:17 p.m. Denis Magda, <[email protected]> wrote:
>>>>
>>>>> My guess is that it's standard behavior for all operations (SQL,
>>>>> key-value, compute, etc.). But I'll let the maintainers of those modules
>>>>> clarify.
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Fri, Aug 14, 2020 at 1:44 PM John Smith <[email protected]> wrote:
>>>>>
>>>>>> Hi Denis, so to understand, is it all operations or just the query?
>>>>>>
>>>>>> On Fri., Aug. 14, 2020, 12:53 p.m. Denis Magda, <[email protected]> wrote:
>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> Ok, we nailed it. That's the current expected behavior. Generally, I
>>>>>>> agree with you that the platform should support an option when operations
>>>>>>> fail if the cluster is deactivated. Could you propose the change by
>>>>>>> starting a discussion on the dev list? You can refer to this user list
>>>>>>> discussion for reference. Let me know if you need help with this.
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 5:55 PM John Smith <[email protected]> wrote:
>>>>>>>
>>>>>>>> No, I reuse the instance. The cache instance is created once at
>>>>>>>> startup of the application and I pass it to my "repository" class
>>>>>>>>
>>>>>>>> public abstract class AbstractIgniteRepository<K,V> implements
>>>>>>>>         CacheRepository<K, V> {
>>>>>>>>     public final long DEFAULT_OPERATION_TIMEOUT = 2000;
>>>>>>>>
>>>>>>>>     private Vertx vertx;
>>>>>>>>     private IgniteCache<K, V> cache;
>>>>>>>>
>>>>>>>>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>>>>>>>>         this.vertx = vertx;
>>>>>>>>         this.cache = cache;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     ...
>>>>>>>>
>>>>>>>>     Future<List<JsonArray>> query(final String sql, final long
>>>>>>>>             timeoutMs, final Object... args) {
>>>>>>>>         final Promise<List<JsonArray>> promise = Promise.promise();
>>>>>>>>
>>>>>>>>         vertx.setTimer(timeoutMs, l -> {
>>>>>>>>             promise.tryFail(new TimeoutException("Cache operation did not complete within: " + timeoutMs + " Ms.")); // THIS FIRE IF THE BLOE DOESN'T COMPLETE IN TIME.
>>>>>>>>         });
>>>>>>>>
>>>>>>>>         vertx.<List<JsonArray>>executeBlocking(code -> {
>>>>>>>>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>>>>>>>>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>>>>>>>>
>>>>>>>>             try (QueryCursor<List<?>> cursor = cache.query(query)) { // <--- BLOCKS HERE.
>>>>>>>>                 List<JsonArray> rows = new ArrayList<>();
>>>>>>>>                 Iterator<List<?>> iterator = cursor.iterator();
>>>>>>>>
>>>>>>>>                 while(iterator.hasNext()) {
>>>>>>>>                     List currentRow = iterator.next();
>>>>>>>>                     JsonArray row = new JsonArray();
>>>>>>>>
>>>>>>>>                     currentRow.forEach(o -> row.add(o));
>>>>>>>>
>>>>>>>>                     rows.add(row);
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 code.complete(rows);
>>>>>>>>             } catch(Exception ex) {
>>>>>>>>                 code.fail(ex);
>>>>>>>>             }
>>>>>>>>         }, result -> {
>>>>>>>>             if(result.succeeded()) {
>>>>>>>>                 promise.tryComplete(result.result());
>>>>>>>>             } else {
>>>>>>>>                 promise.tryFail(result.cause());
>>>>>>>>             }
>>>>>>>>         });
>>>>>>>>
>>>>>>>>         return promise.future();
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     public <T> T cache() {
>>>>>>>>         return (T) cache;
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 13 Aug 2020 at 16:29, Denis Magda <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I've created a simple test and am always getting the exception below
>>>>>>>>> on an attempt to get a reference to an IgniteCache instance in cases
>>>>>>>>> when the cluster is not activated:
>>>>>>>>>
>>>>>>>>> *Exception in thread "main" class
>>>>>>>>> org.apache.ignite.IgniteException: Can not perform the operation
>>>>>>>>> because the cluster is inactive. Note, that the cluster is considered
>>>>>>>>> inactive by default if Ignite Persistent Store is used to let all the
>>>>>>>>> nodes join the cluster. To activate the cluster call Ignite.active(true)*
>>>>>>>>>
>>>>>>>>> Are you trying to get a new IgniteCache reference whenever the
>>>>>>>>> client reconnects successfully to the cluster?
>>>>>>>>> My gut feeling is that currently, Ignite verifies the activation status
>>>>>>>>> and generates the exception above whenever you're getting a reference to
>>>>>>>>> an IgniteCache or IgniteCompute. But once you got those references and
>>>>>>>>> try to run some operations then those get stuck if the cluster is not
>>>>>>>>> activated.
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> The cache.query() starts to block when ignite server nodes are
>>>>>>>>>> being restarted and there's no baseline topology yet. The server
>>>>>>>>>> nodes do not block. It's the client that blocks.
>>>>>>>>>>
>>>>>>>>>> The dump files are of the server nodes. The screenshot is from
>>>>>>>>>> the client app using the YourKit profiler; on the client side the
>>>>>>>>>> threads are marked as red in YourKit.
>>>>>>>>>>
>>>>>>>>>> The app is simple: on an HTTP request it runs a cache SQL query on
>>>>>>>>>> Ignite and, if it succeeds, does a put back to Ignite.
>>>>>>>>>>
>>>>>>>>>> The Client disconnected exception only happens when all server
>>>>>>>>>> nodes in the cluster are down. The blockage only happens when the
>>>>>>>>>> cluster is trying to establish baseline topology.
>>>>>>>>>>
>>>>>>>>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> I don't see any traits of an application-caused deadlock in the
>>>>>>>>>>> thread dumps. Please elaborate on the following:
>>>>>>>>>>>
>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>>>>>> ClientDisconnectedException but application still able to complete
>>>>>>>>>>>> its request.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What's the IP address of the server node the client app uses to
>>>>>>>>>>> join the cluster?
>>>>>>>>>>> If that's not the address of the 1st node, which is already
>>>>>>>>>>> restarted, then the client couldn't join the cluster and it's
>>>>>>>>>>> expected that it fails with the ClientDisconnectedException.
>>>>>>>>>>>
>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations
>>>>>>>>>>>> just block.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Are the operations unblocked and completed successfully when the
>>>>>>>>>>> third node joins the cluster and the cluster gets activated
>>>>>>>>>>> automatically?
>>>>>>>>>>>
>>>>>>>>>>> -
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok Denis here they are...
>>>>>>>>>>>>
>>>>>>>>>>>> 3 nodes, and I captured a YourKit screenshot of what it thinks
>>>>>>>>>>>> are deadlocks on the client app.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the
>>>>>>>>>>>>> query that blocks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My application first runs a select on the cache and then
>>>>>>>>>>>>> does a put to cache.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> John,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It sounds like a deadlock caused by the application logic. Is
>>>>>>>>>>>>>> there any chance that the operation you run on step 8 accesses
>>>>>>>>>>>>>> several keys in one order while the other operations work with the
>>>>>>>>>>>>>> same keys but in a different order?
>>>>>>>>>>>>>> The deadlocks are possible when you use the Ignite Transaction
>>>>>>>>>>>>>> API or simply execute bulk operations such as cache.getAll(..) or
>>>>>>>>>>>>>> cache.putAll(..).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please take and attach thread dumps from all the cluster
>>>>>>>>>>>>>> nodes for analysis if we need to dig deeper.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Denis, I think you are right. It's the query that blocks;
>>>>>>>>>>>>>>> the other k/v operations are ok.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried with 2.8.1, same issue. Operations block
>>>>>>>>>>>>>>>> indefinitely...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>>>>>>>>> 2- Start client application client = true with
>>>>>>>>>>>>>>>> Ignition.start()
>>>>>>>>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>>>>>>>>> 6- Shut down 3rd node, run operation, still ok...
>>>>>>>>>>>>>>>> Operations start failing with ClientDisconnectedException...
>>>>>>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails
>>>>>>>>>>>>>>>> with ClientDisconnectedException but application still able to
>>>>>>>>>>>>>>>> complete its request.
>>>>>>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all
>>>>>>>>>>>>>>>> operations just block.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Basically the client application is an HTTP server that on each
>>>>>>>>>>>>>>>> HTTP request does a cache operation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Only time I get exception is if the cluster is
>>>>>>>>>>>>>>>>> completely off, then I get ClientDisconnectedException...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put)
>>>>>>>>>>>>>>>>>> and compute calls fail with an exception if the cluster is
>>>>>>>>>>>>>>>>>> deactivated. Do those fail on your end?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>>>>>>>>> community members say.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi any thoughts on this?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is there a way to timeout and at least have the
>>>>>>>>>>>>>>>>>>>> application continue and respond with an appropriate
>>>>>>>>>>>>>>>>>>>> message?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi running 2.7.0
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the
>>>>>>>>>>>>>>>>>>>>> cluster or the cluster is not yet activated with baseline
>>>>>>>>>>>>>>>>>>>>> topology, operations seem to block forever, including operations
>>>>>>>>>>>>>>>>>>>>> that are supposed to return IgniteFuture, i.e. putAsync,
>>>>>>>>>>>>>>>>>>>>> getAsync etc... They just block until the cluster resolves
>>>>>>>>>>>>>>>>>>>>> its state.
>>>>>>>>>>>>>>>>>>>>>
