Olivier,

The main answer to your question is: if there isn't a connection to a server, the client won't automatically detect that it is gone.

To work around this issue, you should be able to set min-connections="N" in the pool configuration, where N is equal to or greater than the number of servers. I think I would set it higher than the number of servers just to be sure that connections are established to all of them.

Below are some details (probably more than you need) and logging that shows the behavior.

The default value for min-connections is 1, so by default only one connection to the servers is made when the client starts.

When the function is initially executed, the client has no metadata about the servers. The client metadata is a mapping between servers and partitioned region buckets. It is retrieved from the server asynchronously, so initially it is empty. Because of this, one connection to any server will be used to execute the function: either the one created initially or potentially another one, depending on the timing of the GetClientPartitionAttributesOp. In either case, it's only going to use one connection. In the meantime, the ClientMetadataService has retrieved the metadata from the server and populated the client metadata. So, after the function has executed the first time, the metadata is populated, and a connection to one server has been made.

The next part of this is that whenever a connection is made to a server, a PingTask is created to periodically ping that server. It pings by default every 10 seconds (controlled by the ping-interval pool attribute). So, the connection to the one server will be pinged every 10 seconds to ensure that server is still alive. No other servers are being pinged since no connections are made to them.

The second time the function is executed, the metadata is known and used. One SingleHopOperationCallable is created for each server on which the function is to be executed.
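
To make the pinging behavior described above concrete, here is a toy Java model. It is not GemFire code: PingModel, pingCycle and crashDetectedByPing are names made up for this illustration, and the sketch assumes the startup connections land one per server, which the real pool does not guarantee (hence the advice to set min-connections higher than the number of servers). It only shows that a crashed server goes unnoticed by the pinger when the client holds no connection to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only. PingModel and its methods are hypothetical names,
// not GemFire classes; the real pool logic is more involved.
public class PingModel {
    // Servers that are actually running.
    private final Set<String> liveServers = new LinkedHashSet<>();
    // Servers the client holds a connection to (and therefore pings).
    private final Set<String> pingedServers = new LinkedHashSet<>();
    // Stands in for the client's server-to-bucket metadata.
    private boolean metadataValid = true;

    public PingModel(List<String> servers, int minConnections) {
        liveServers.addAll(servers);
        // Assumption: startup connections are spread one per server,
        // up to minConnections. The real pool may distribute them differently.
        for (String server : servers) {
            if (pingedServers.size() < minConnections) {
                pingedServers.add(server);
            }
        }
    }

    public void kill(String server) {
        liveServers.remove(server);
    }

    // One ping cycle: only servers with a connection are checked.
    // Returns the crashed servers that the cycle detected.
    public List<String> pingCycle() {
        List<String> crashed = new ArrayList<>();
        for (String server : pingedServers) {
            if (!liveServers.contains(server)) {
                crashed.add(server);
            }
        }
        if (!crashed.isEmpty()) {
            pingedServers.removeAll(crashed);
            metadataValid = false; // the metadata is reset on a detected crash
        }
        return crashed;
    }

    public boolean isMetadataValid() {
        return metadataValid;
    }

    // Kills the third of three servers and reports whether one ping cycle notices.
    public static boolean crashDetectedByPing(int minConnections) {
        PingModel model = new PingModel(
                Arrays.asList("server-1", "server-2", "server-3"), minConnections);
        model.kill("server-3");
        return !model.pingCycle().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("min-connections=1 detects crash: " + crashDetectedByPing(1));
        System.out.println("min-connections=3 detects crash: " + crashDetectedByPing(3));
    }
}
```

In this model, a single connection leaves the killed server unpinged, so the crash only surfaces when something tries to connect to it. With one connection per server, the ping cycle catches it first.
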
If one of the servers has crashed, then when a connection to that server is attempted, a ServerConnectivityException is thrown, causing the function to be invoked again without any metadata and with isPossibleDuplicate set to true.

This is basically the behavior you're seeing.

With min-connections set to at least the number of servers, as described above, there will be a PingTask for each server that will detect when it crashes and reset the metadata.

Below is some debugging with and without min-connections set. Feel free to skip it if you want.

Without min-connections
-----------------------

When the client starts, a Connection and PingTask are created to one server:

poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:64669
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:64669

The function is executed using that connection:

main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@65b2b2f0; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669 ]@818633134
main: Executed function with 0 keys in 1227 ms

The client metadata is retrieved using a connection to another server:

Function Execution Thread-1: ConnectionImpl.connect creating connection to 192.168.2.14:64685
Function Execution Thread-1: LiveServerPinger.endpointNowInUse created ping task for 192.168.2.14:64685
Function Execution Thread-1: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@47ee14d4; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685 ]@669037784

That connection is closed (due to idle-timeout and min-connections="1"):

poolTimer-pool-4: ConnectionImpl.close closing Connection[192.168.2.14:64685 ]@669037784

(The server is killed here)

The function is executed again with single hop. Three threads are created (one for each of the servers). Function Execution Thread-3 fails with a ServerConnectivityException, causing the function to be re-executed without single hop.

main: ExecuteRegionFunctionSingleHopOp.execute invoked
Function Execution Thread-1: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@40fe544; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669 ]@818633134
Function Execution Thread-2: ConnectionImpl.connect creating connection to 192.168.2.14:64685
Function Execution Thread-3: ConnectionImpl.connect creating connection to 192.168.2.14:64691
Function Execution Thread-2: LiveServerPinger.endpointNowInUse created ping task for 192.168.2.14:64685
Function Execution Thread-2: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@622b6f7; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685 ]@1629752892
main: SingleHopClientExecutor.submitAllHA caught java.util.concurrent.ExecutionException: com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not create a new connection to server 192.168.2.14:64691 with cause com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not create a new connection to server 192.168.2.14:64691
main: ExecuteRegionFunctionSingleHopOp.execute reexecuting ExecuteRegionFunctionOp
Function Execution Thread-2: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@77b90119; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685 ]@1629752892
main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@12f74db7; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669 ]@818633134
main: Executed function with 0 keys in 25 ms

With min-connections=3
----------------------

When the client starts, a Connection and PingTask are created for each server:

poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65034
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65034
poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65050
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65050
poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65056
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65056

The function is executed using one of the connections:

main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@38c57101; conn=Pooled Connection to 192.168.2.14:65034: Connection[192.168.2.14:65034 ]@823328318
main: Executed function with 0 keys in 1236 ms

The client metadata is retrieved using another of the connections:

Function Execution Thread-1: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@3b56f75d; conn=Pooled Connection to 192.168.2.14:65056: Connection[192.168.2.14:65056 ]@653151156

(The server is killed here)

The PingTask for that server realizes the server is gone and handles it (it is removed from the list of servers, and the metadata is cleared):

poolTimer-pool-6: PingTask.run2 about ping endpoint=192.168.2.14:65050
poolTimer-pool-6: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.PingOp$PingOpImpl@75d46202; conn=Pooled Connection to 192.168.2.14:65050: Connection[192.168.2.14:65050 ]@752413939
poolTimer-pool-6: EndpointManagerImpl.serverCrashed endpoint= 192.168.2.14:65050

The function is executed using one of the connections with no retry:

main: ExecuteRegionFunctionOp.execute invoked
main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@77fd8595; conn=Pooled Connection to 192.168.2.14:65056: Connection[192.168.2.14:65056 ]@1414071838
main: Executed function with 0 keys in 10 ms

Thanks,
Barry Oglesby

On Tue, Jul 26, 2016 at 1:44 AM, Olivier Mallassi <olivier.malla...@gmail.com> wrote:

> Hi all,
>
> I would need your help to better understand the behavior I have observed
> (regarding function execution with node failure)
>
> - I have a function (optimizeForWrite=true, hasResult=true, isHA=true)
> that is executed (onRegion(mypartitionedRegion)) every two minutes (poll
> frequency has been increased for test)
> - then, just after a execution of the function I kill -9 one of the
> member (member-timeout=1)
> - then, the function is executed again (around 2 min later). In that case,
> the function is executed twice (on the remaining members).
> In that case, the context.isDuplicate() returns true so that I just exit
> the function
>
> if (functionContext.isPossibleDuplicate()) {
>     logger.warning(....
>     //exit
>     functionContext.getResultSender().lastResult(null);
> }
>
> The function being HA, this is the expected behavior.
>
> Yet, what I do not understand is that it seems the "node failure" is
> detected only when the function is executed where as the node failure has
> already been broadcasted (Membership cluster). Can someone give me more
> insights on this? Is this a misconfig between client / locator so that
> client are still not aware of the node failure?
>
> Many thx.
>
> oliv/