Barry, as always, crystal clear and precise :)

Thank you for all these details! I appreciate it.

Oliv/

On Mon, Aug 1, 2016 at 11:53 PM, Barry Oglesby <bogle...@pivotal.io> wrote:

> Olivier,
>
> The main answer to your question is: If there isn't a connection to a
> server, the client won't automatically detect that it is gone.
>
> To work around this issue, you should be able to set min-connections="N"
> in the pool configuration where N is equal to or greater than the number of
> servers. I think I would set it higher than the number of servers just to
> be sure that connections are established to all of them.
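>
> For reference, here is roughly what that would look like via the Java pool
> API (a sketch, not your exact configuration: the locator host/port and pool
> name are placeholders, and setMinConnections(4) assumes three servers):
>
> import com.gemstone.gemfire.cache.client.Pool;
> import com.gemstone.gemfire.cache.client.PoolFactory;
> import com.gemstone.gemfire.cache.client.PoolManager;
>
> PoolFactory pf = PoolManager.createFactory();
> pf.addLocator("localhost", 10334); // placeholder locator address
> pf.setMinConnections(4);           // >= number of servers (3), one extra to be safe
> Pool pool = pf.create("pool");     // same effect as min-connections="4" in cache.xml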
>
> Below are some details (probably more than you need) and logging that
> shows the behavior.
>
> The default value for min-connections is 1, so by default only one
> connection to the servers is made when the client starts.
>
> When the function is initially executed, the client has no metadata about
> the servers. The client metadata is a mapping between servers and
> partitioned region buckets. It is retrieved from the server asynchronously,
> so initially it is empty. Because of this, one connection to any server
> will be used to execute the function - either the one created initially or
> potentially another one, depending on the timing of the
> GetClientPartitionAttributesOp. In either case, it's only going to use one
> connection.
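>
> For context, the execution itself is just a normal onRegion call. Here is
> a minimal sketch, assuming an existing ClientCache named clientCache and a
> function registered on the servers under the placeholder id "MyFunction":
>
> import com.gemstone.gemfire.cache.Region;
> import com.gemstone.gemfire.cache.execute.FunctionService;
> import com.gemstone.gemfire.cache.execute.ResultCollector;
>
> Region<Object, Object> region = clientCache.getRegion("mypartitionedRegion");
> ResultCollector<?, ?> rc =
>     FunctionService.onRegion(region).execute("MyFunction");
> rc.getResult(); // this first call goes over a single connection;
>                 // the bucket metadata arrives asynchronously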
>
> In the meantime, the ClientMetadataService has retrieved the metadata from
> the server and populated the client metadata.
>
> So, after the function has executed the first time, the metadata is
> populated, and a connection to one server has been made.
>
> The next part of this is that whenever a connection is made to a server, a
> PingTask is created to periodically ping that server. It pings by default
> every 10 seconds (controlled by the ping-interval pool attribute). So, the
> connection to the one server will be pinged every 10 seconds to ensure that
> server is still alive. No other servers are being pinged since no
> connections are made to them.
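>
> If you need to detect a dead server faster, the ping interval can be
> lowered when the pool is created - a sketch, reusing the PoolFactory pf
> from above (5000 ms is an arbitrary choice):
>
> pf.setPingInterval(5000); // ping every 5 seconds instead of the default 10000 ms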
>
> The second time the function is executed, the metadata is known and used.
> One SingleHopOperationCallable is created for each server on which the
> function is to be executed. If one of the servers has crashed, then when a
> connection to that server is attempted, a ServerConnectivityException is
> thrown, causing the function to be invoked again without any metadata and
> with isPossibleDuplicate set to true.
>
> This is basically the behavior you're seeing.
>
> As I said above, the workaround is to set min-connections="N" in the pool
> configuration, where N is equal to or greater than the number of servers
> (ideally a bit higher, to be sure that connections are established to all
> of them).
>
> If you do this, there will be a PingTask for each server that will detect
> when it crashes and reset the metadata.
>
> Below is some debugging with and without min-connections set. Feel free to
> skip it if you want.
>
> Without min-connections
> -----------------------
> When the client starts, a Connection and PingTask are created to one
> server:
>
> poolTimer-pool-3: ConnectionImpl.connect creating connection to
> 192.168.2.14:64669
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for
> 192.168.2.14:64669
>
> The function is executed using that connection:
>
> main: OpExecutorImpl.execute
> op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@65b2b2f0;
> conn=Pooled Connection to 192.168.2.14:64669:
> Connection[192.168.2.14:64669]@818633134
> main: Executed function with 0 keys in 1227 ms
>
> The client metadata is retrieved using a connection to another server:
>
> Function Execution Thread-1: ConnectionImpl.connect creating connection to
> 192.168.2.14:64685
> Function Execution Thread-1: LiveServerPinger.endpointNowInUse created
> ping task for 192.168.2.14:64685
> Function Execution Thread-1: OpExecutorImpl.execute
> op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@47ee14d4;
> conn=Pooled Connection to 192.168.2.14:64685:
> Connection[192.168.2.14:64685]@669037784
>
> That connection is closed (due to idle-timeout and min-connections="1"):
>
> poolTimer-pool-4: ConnectionImpl.close closing
> Connection[192.168.2.14:64685]@669037784
>
> (The server is killed here)
>
> The function is executed again with single hop. Three threads are created
> (one for each of the servers). Function Execution Thread-3 fails with a
> ServerConnectivityException (it cannot create a connection to the killed
> server), causing the function to be re-executed without single hop.
>
> main: ExecuteRegionFunctionSingleHopOp.execute invoked
> Function Execution Thread-1: OpExecutorImpl.executeOnServer
> op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@40fe544;
> conn=Pooled Connection to 192.168.2.14:64669:
> Connection[192.168.2.14:64669]@818633134
> Function Execution Thread-2: ConnectionImpl.connect creating connection to
> 192.168.2.14:64685
> Function Execution Thread-3: ConnectionImpl.connect creating connection to
> 192.168.2.14:64691
> Function Execution Thread-2: LiveServerPinger.endpointNowInUse created
> ping task for 192.168.2.14:64685
> Function Execution Thread-2: OpExecutorImpl.executeOnServer
> op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@622b6f7;
> conn=Pooled Connection to 192.168.2.14:64685:
> Connection[192.168.2.14:64685]@1629752892
> main: SingleHopClientExecutor.submitAllHA caught
> java.util.concurrent.ExecutionException:
> com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not
> create a new connection to server 192.168.2.14:64691 with cause
> com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not
> create a new connection to server 192.168.2.14:64691
> main: ExecuteRegionFunctionSingleHopOp.execute reexecuting
> ExecuteRegionFunctionOp
> Function Execution Thread-2: OpExecutorImpl.execute
> op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@77b90119;
> conn=Pooled Connection to 192.168.2.14:64685:
> Connection[192.168.2.14:64685]@1629752892
> main: OpExecutorImpl.execute
> op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@12f74db7;
> conn=Pooled Connection to 192.168.2.14:64669:
> Connection[192.168.2.14:64669]@818633134
> main: Executed function with 0 keys in 25 ms
>
> With min-connections=3
> ----------------------
> When the client starts, a Connection and PingTask are created for each
> server:
>
> poolTimer-pool-3: ConnectionImpl.connect creating connection to
> 192.168.2.14:65034
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for
> 192.168.2.14:65034
> poolTimer-pool-3: ConnectionImpl.connect creating connection to
> 192.168.2.14:65050
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for
> 192.168.2.14:65050
> poolTimer-pool-3: ConnectionImpl.connect creating connection to
> 192.168.2.14:65056
> poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for
> 192.168.2.14:65056
>
> The function is executed using one of the connections:
>
> main: OpExecutorImpl.execute
> op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@38c57101;
> conn=Pooled Connection to 192.168.2.14:65034:
> Connection[192.168.2.14:65034]@823328318
> main: Executed function with 0 keys in 1236 ms
>
> The client metadata is retrieved using another of the connections:
>
> Function Execution Thread-1: OpExecutorImpl.execute
> op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@3b56f75d;
> conn=Pooled Connection to 192.168.2.14:65056:
> Connection[192.168.2.14:65056]@653151156
>
> (The server is killed here)
>
> The PingTask for that server realizes the server is gone and handles it
> (it is removed from the list of servers, and the metadata is cleared):
>
> poolTimer-pool-6: PingTask.run2 about ping endpoint=192.168.2.14:65050
> poolTimer-pool-6: OpExecutorImpl.executeOnServer
> op=com.gemstone.gemfire.cache.client.internal.PingOp$PingOpImpl@75d46202;
> conn=Pooled Connection to 192.168.2.14:65050:
> Connection[192.168.2.14:65050]@752413939
> poolTimer-pool-6: EndpointManagerImpl.serverCrashed endpoint=
> 192.168.2.14:65050
>
> The function is executed using one of the connections with no retry:
>
> main: ExecuteRegionFunctionOp.execute invoked
> main: OpExecutorImpl.execute
> op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@77fd8595;
> conn=Pooled Connection to 192.168.2.14:65056:
> Connection[192.168.2.14:65056]@1414071838
> main: Executed function with 0 keys in 10 ms
>
> Thanks,
> Barry Oglesby
>
>
> On Tue, Jul 26, 2016 at 1:44 AM, Olivier Mallassi <
> olivier.malla...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would need your help to better understand the behavior I have observed
>> (regarding function execution with node failure).
>>
>> - I have a function (optimizeForWrite=true, hasResult=true, isHA=true)
>> that is executed (onRegion(mypartitionedRegion)) every two minutes (the
>> poll frequency has been increased for the test)
>> - then, just after an execution of the function, I kill -9 one of the
>> members (member-timeout=1)
>> - then, the function is executed again (around 2 min later). In that
>> case, the function is executed twice (on the remaining members), and
>> context.isPossibleDuplicate() returns true, so I just exit the function:
>>
>>
>> if (functionContext.isPossibleDuplicate()) {
>>     logger.warning(...); // log the possible duplicate (message elided)
>>     // exit without redoing the work
>>     functionContext.getResultSender().lastResult(null);
>>     return;
>> }
>>
>>
>> The function being HA, this is the expected behavior.
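>>
>> For completeness, the function is declared roughly like this (a
>> simplified sketch of my setup; the class name and id are placeholders):
>>
>> import com.gemstone.gemfire.cache.execute.FunctionAdapter;
>> import com.gemstone.gemfire.cache.execute.FunctionContext;
>>
>> public class MyPollingFunction extends FunctionAdapter {
>>     @Override
>>     public void execute(FunctionContext context) {
>>         if (context.isPossibleDuplicate()) {
>>             // re-execution after a member failure: skip the work
>>             context.getResultSender().lastResult(null);
>>             return;
>>         }
>>         // ... the real work on the local data goes here ...
>>         context.getResultSender().lastResult(null);
>>     }
>>     @Override public String getId() { return "MyPollingFunction"; }
>>     @Override public boolean hasResult() { return true; }
>>     @Override public boolean isHA() { return true; }
>>     @Override public boolean optimizeForWrite() { return true; }
>> }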
>>
>> Yet, what I do not understand is that the "node failure" seems to be
>> detected only when the function is executed, whereas the node failure has
>> already been broadcast (cluster membership). Can someone give me more
>> insight on this? Is this a misconfiguration between client / locator, so
>> that clients are still not aware of the node failure?
>>
>>
>> Many thx.
>>
>> oliv/
>>
>
>
