Barry, as always, crystal clear and precise :) Thank you for all these details! I appreciate it.
Oliv/

On Mon, Aug 1, 2016 at 11:53 PM, Barry Oglesby <bogle...@pivotal.io> wrote:
> Olivier,
>
> The main answer to your question is: if there isn't a connection to a server, the client won't automatically detect that it is gone.
>
> To work around this issue, you should be able to set min-connections="N" in the pool configuration, where N is equal to or greater than the number of servers. I think I would set it higher than the number of servers just to be sure that connections are established to all of them.
>
> Below are some details (probably more than you need) and logging that shows the behavior.
>
> The default value for min-connections is 1, so by default only one connection to the servers is made when the client starts.
>
> When the function is initially executed, the client has no metadata about the servers. The client metadata is a mapping between servers and partitioned region buckets. It is retrieved from the server asynchronously, so initially it is empty. Because of this, one connection to any server will be used to execute the function - either the one created initially or potentially another one, depending on the timing of the GetClientPartitionAttributesOp. In either case, it's only going to use one connection.
>
> In the meantime, the ClientMetadataService has retrieved the metadata from the server and populated the client metadata.
>
> So, after the function has executed the first time, the metadata is populated, and a connection to one server has been made.
>
> The next part of this is that whenever a connection is made to a server, a PingTask is created to periodically ping that server. It pings by default every 10 seconds (controlled by the ping-interval pool attribute). So, the connection to the one server will be pinged every 10 seconds to ensure that server is still alive. No other servers are being pinged, since no connections have been made to them.
>
> The second time the function is executed, the metadata is known and used. One SingleHopOperationCallable is created for each server on which the function is to be executed. If one of the servers has crashed, then when a connection to that server is attempted, a ServerConnectivityException is thrown, causing the function to be invoked again without any metadata and with isPossibleDuplicate set to true.
>
> This is basically the behavior you're seeing.
>
> To work around it, set min-connections="N" in the pool configuration as described above, with N equal to or greater than the number of servers.
>
> If you do this, there will be a PingTask for each server that will detect when it crashes and reset the metadata.
>
> Below is some debugging with and without min-connections set. Feel free to skip it if you want.
>
> Without min-connections
> -----------------------
> When the client starts, a Connection and PingTask are created to one server:
>
> poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:64669
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:64669
>
> The function is executed using that connection:
>
> main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@65b2b2f0; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669]@818633134
> main: Executed function with 0 keys in 1227 ms
>
> The client metadata is retrieved using a connection to another server:
>
> Function Execution Thread-1: ConnectionImpl.connect creating connection to 192.168.2.14:64685
> Function Execution Thread-1: LiveServerPinger.endpointNowInUse created ping task for 192.168.2.14:64685
> Function Execution Thread-1: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@47ee14d4; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685]@669037784
>
> That connection is closed (due to idle-timeout and min-connections="1"):
>
> poolTimer-pool-4: ConnectionImpl.close closing Connection[192.168.2.14:64685]@669037784
>
> (The server is killed here.)
>
> The function is executed again with single hop. Three threads are created (one for each of the servers). Function Execution Thread-2 fails with a ServerConnectivityException, causing the function to be re-executed without single hop.
>
> main: ExecuteRegionFunctionSingleHopOp.execute invoked
> Function Execution Thread-1: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@40fe544; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669]@818633134
> Function Execution Thread-2: ConnectionImpl.connect creating connection to 192.168.2.14:64685
> Function Execution Thread-3: ConnectionImpl.connect creating connection to 192.168.2.14:64691
> Function Execution Thread-2: LiveServerPinger.endpointNowInUse created ping task for 192.168.2.14:64685
> Function Execution Thread-2: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@622b6f7; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685]@1629752892
> main: SingleHopClientExecutor.submitAllHA caught java.util.concurrent.ExecutionException: com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not create a new connection to server 192.168.2.14:64691 with cause com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not create a new connection to server 192.168.2.14:64691
> main: ExecuteRegionFunctionSingleHopOp.execute reexecuting ExecuteRegionFunctionOp
> Function Execution Thread-2: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@77b90119; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685]@1629752892
> main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@12f74db7; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669]@818633134
> main: Executed function with 0 keys in 25 ms
>
> With min-connections=3
> ----------------------
> When the client starts, a Connection and PingTask are created for each server:
>
> poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65034
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65034
> poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65050
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65050
> poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65056
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65056
>
> The function is executed using one of the connections:
>
> main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@38c57101; conn=Pooled Connection to 192.168.2.14:65034: Connection[192.168.2.14:65034]@823328318
> main: Executed function with 0 keys in 1236 ms
>
> The client metadata is retrieved using another of the connections:
>
> Function Execution Thread-1: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@3b56f75d; conn=Pooled Connection to 192.168.2.14:65056: Connection[192.168.2.14:65056]@653151156
>
> (The server is killed here.)
>
> The PingTask for that server realizes the server is gone and handles it (it is removed from the list of servers, and the metadata is cleared):
>
> poolTimer-pool-6: PingTask.run2 about ping endpoint=192.168.2.14:65050
> poolTimer-pool-6: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.PingOp$PingOpImpl@75d46202; conn=Pooled Connection to 192.168.2.14:65050: Connection[192.168.2.14:65050]@752413939
> poolTimer-pool-6: EndpointManagerImpl.serverCrashed endpoint=192.168.2.14:65050
>
> The function is executed using one of the connections with no retry:
>
> main: ExecuteRegionFunctionOp.execute invoked
> main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@77fd8595; conn=Pooled Connection to 192.168.2.14:65056: Connection[192.168.2.14:65056]@1414071838
> main: Executed function with 0 keys in 10 ms
>
> Thanks,
> Barry Oglesby
>
>
> On Tue, Jul 26, 2016 at 1:44 AM, Olivier Mallassi <olivier.malla...@gmail.com> wrote:
>> Hi all,
>>
>> I would need your help to better understand the behavior I have observed (regarding function execution with node failure).
>>
>> - I have a function (optimizeForWrite=true, hasResult=true, isHA=true) that is executed (onRegion(mypartitionedRegion)) every two minutes (the poll frequency has been increased for the test).
>> - Then, just after an execution of the function, I kill -9 one of the members (member-timeout=1).
>> - Then, the function is executed again (around 2 minutes later). In that case, the function is executed twice (on the remaining members), and functionContext.isPossibleDuplicate() returns true, so I just exit the function:
>>
>> if (functionContext.isPossibleDuplicate()) {
>>     logger.warning(....
>>     //exit
>>     functionContext.getResultSender().lastResult(null);
>> }
>>
>> The function being HA, this is the expected behavior.
>>
>> Yet, what I do not understand is that it seems the node failure is detected only when the function is executed, whereas the node failure has already been broadcast (cluster membership).
>> Can someone give me more insight on this? Is this a misconfiguration between the client and the locator, so that clients are still not aware of the node failure?
>>
>> Many thx.
>>
>> oliv/