Re: Function Execution and HA

Barry Oglesby Mon, 01 Aug 2016 14:54:07 -0700

Olivier,

The short answer to your question is: if the client has no connection to a
server, it won't automatically detect that the server is gone.

To work around this issue, you should be able to set min-connections="N" in
the pool configuration where N is equal to or greater than the number of
servers. I think I would set it higher than the number of servers just to
be sure that connections are established to all of them.
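
As a rough illustration, a client cache.xml pool definition with this setting might look like the fragment below. It assumes three servers, so min-connections is set to 4. The pool name, locator host and locator port are placeholders, and ping-interval is shown at its default of 10000 ms only to make it visible.

```xml
<client-cache>
  <!-- Placeholder names and addresses; adjust to your own cluster. -->
  <!-- min-connections is set above the assumed server count of 3. -->
  <pool name="examplePool" min-connections="4" ping-interval="10000">
    <locator host="locator-host" port="10334"/>
  </pool>
</client-cache>
```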

Below are some details (probably more than you need) and logging that shows
the behavior.

The default value for min-connections is 1 so by default only 1 connection
to the servers is made when the client starts.

When the function is initially executed, the client has no metadata about
the servers. The client metadata is a mapping between servers and
partitioned region buckets. It is retrieved from the server asynchronously,
so initially it is empty. Because of this, one connection to any server
will be used to execute the function: either the one created initially or
potentially another one, depending on the timing of the
GetClientPartitionAttributesOp. In either case, it's only going to use one
connection.

In the meantime, the ClientMetadataService has retrieved the metadata from
the server and populated the client metadata.

So, after the function has executed the first time, the metadata is
populated, and a connection to one server has been made.

The next part of this is that whenever a connection is made to a server, a
PingTask is created to periodically ping that server. It pings by default
every 10 seconds (controlled by the ping-interval pool attribute). So, the
connection to the one server will be pinged every 10 seconds to ensure that
server is still alive. No other servers are being pinged since no
connections are made to them.

The second time the function is executed, the metadata is known and used.
One SingleHopOperationCallable is created for each server on which the
function is to be executed. If one of the servers has crashed, then when a
connection to that server is attempted, a ServerConnectivityException is
thrown, causing the function to be invoked again without any metadata and
with isPossibleDuplicate set to true.
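
The retry path described above can be sketched in plain Java. This is only an illustration of the control flow, not the GemFire implementation: SingleHopRetrySketch, ConnectFailure and ServerCall are hypothetical names, and ConnectFailure merely stands in for ServerConnectivityException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch only; none of these types are GemFire classes. */
final class SingleHopRetrySketch {

    /** Stand-in for a connectivity failure such as ServerConnectivityException. */
    static final class ConnectFailure extends RuntimeException {
        ConnectFailure(String message) {
            super(message);
        }
    }

    /** Hypothetical function invocation on one server. */
    interface ServerCall {
        String run(String server, boolean possibleDuplicate);
    }

    static List<String> execute(List<String> servers, String fallbackServer, ServerCall call) {
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, servers.size()));
        try {
            // Single-hop attempt: one task per server, not flagged as a duplicate.
            List<Future<String>> futures = new ArrayList<>();
            for (String server : servers) {
                futures.add(executor.submit(() -> call.run(server, false)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (ExecutionException e) {
            if (!(e.getCause() instanceof ConnectFailure)) {
                throw new IllegalStateException(e.getCause());
            }
            // A server was unreachable: run the function again through a single
            // server, this time flagged as a possible duplicate.
            List<String> results = new ArrayList<>();
            results.add(call.run(fallbackServer, true));
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

Note that in this sketch the servers that were reachable have already run the function once before the retry happens, which is why the second invocation carries the possible-duplicate flag.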

This is basically the behavior you're seeing.

That is why setting min-connections to at least the number of servers (as
suggested above) should work around the issue. With connections established
to all of the servers, there will be a PingTask for each server that will
detect when it crashes and reset the metadata.
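
To make the difference concrete, here is a toy model in plain Java. It is not GemFire code: ToyClientPool and its methods are hypothetical, and it assumes (more strictly than the real pool guarantees) that the first min-connections servers each get exactly one connection and therefore one ping task.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model, not GemFire code: a client that only pings servers it holds a connection to. */
final class ToyClientPool {
    // Servers the client still believes are alive.
    private final Set<String> believedAlive = new LinkedHashSet<>();
    // Servers with an open connection, and therefore a ping task.
    private final Set<String> pinged = new LinkedHashSet<>();
    // Ground truth: servers that have actually crashed.
    private final Set<String> crashed = new LinkedHashSet<>();

    /** Assumption: the first minConnections servers each get one connection. */
    ToyClientPool(List<String> servers, int minConnections) {
        believedAlive.addAll(servers);
        for (String server : servers) {
            if (pinged.size() < minConnections) {
                pinged.add(server);
            }
        }
    }

    void crash(String server) {
        crashed.add(server);
    }

    /** One ping round: crashed servers that are pinged are dropped from the client's view. */
    void runPingRound() {
        Set<String> detected = new LinkedHashSet<>(pinged);
        detected.retainAll(crashed);
        pinged.removeAll(detected);
        believedAlive.removeAll(detected);
    }

    /** True if the next execution would still target a crashed server and so need a retry. */
    boolean nextExecutionNeedsRetry() {
        for (String server : believedAlive) {
            if (crashed.contains(server)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("server1", "server2", "server3");
        for (int minConnections : new int[] {1, 3}) {
            ToyClientPool pool = new ToyClientPool(servers, minConnections);
            pool.crash("server3");
            pool.runPingRound();
            System.out.println("min-connections=" + minConnections
                + " retry needed: " + pool.nextExecutionNeedsRetry());
        }
    }
}
```

With one connection, the crash of an unpinged server goes unnoticed until the next execution; with a connection per server, the ping round removes the crashed server first.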

Below is some debugging with and without min-connections set. Feel free to
skip it if you want.

Without min-connections
-----------------------
When the client starts, a Connection and PingTask are created for one server:

poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:64669
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:64669

The function is executed using that connection:

main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@65b2b2f0; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669]@818633134
main: Executed function with 0 keys in 1227 ms

The client metadata is retrieved using a connection to another server:

Function Execution Thread-1: ConnectionImpl.connect creating connection to 192.168.2.14:64685
Function Execution Thread-1: LiveServerPinger.endpointNowInUse created ping task for 192.168.2.14:64685
Function Execution Thread-1: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@47ee14d4; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685]@669037784

That connection is closed (due to idle-timeout and min-connections="1"):

poolTimer-pool-4: ConnectionImpl.close closing Connection[192.168.2.14:64685]@669037784

(The server is killed here)

The function is executed again with single hop. Three threads are created
(one for each of the servers). Function Execution Thread-3, which tries to
connect to the killed server (192.168.2.14:64691), fails with a
ServerConnectivityException, causing the function to be re-executed without
single hop.

main: ExecuteRegionFunctionSingleHopOp.execute invoked
Function Execution Thread-1: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@40fe544; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669]@818633134
Function Execution Thread-2: ConnectionImpl.connect creating connection to 192.168.2.14:64685
Function Execution Thread-3: ConnectionImpl.connect creating connection to 192.168.2.14:64691
Function Execution Thread-2: LiveServerPinger.endpointNowInUse created ping task for 192.168.2.14:64685
Function Execution Thread-2: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionSingleHopOp$ExecuteRegionFunctionSingleHopOpImpl@622b6f7; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685]@1629752892
main: SingleHopClientExecutor.submitAllHA caught java.util.concurrent.ExecutionException: com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not create a new connection to server 192.168.2.14:64691 with cause com.gemstone.gemfire.cache.client.ServerConnectivityException: Could not create a new connection to server 192.168.2.14:64691
main: ExecuteRegionFunctionSingleHopOp.execute reexecuting ExecuteRegionFunctionOp
Function Execution Thread-2: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@77b90119; conn=Pooled Connection to 192.168.2.14:64685: Connection[192.168.2.14:64685]@1629752892
main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@12f74db7; conn=Pooled Connection to 192.168.2.14:64669: Connection[192.168.2.14:64669]@818633134
main: Executed function with 0 keys in 25 ms

With min-connections=3
----------------------
When the client starts, a Connection and PingTask are created for each
server:

poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65034
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65034
poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65050
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65050
poolTimer-pool-3: ConnectionImpl.connect creating connection to 192.168.2.14:65056
poolTimer-pool-3: LiveServerPinger.endpointNowInUse created PingTask for 192.168.2.14:65056

The function is executed using one of the connections:

main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@38c57101; conn=Pooled Connection to 192.168.2.14:65034: Connection[192.168.2.14:65034]@823328318
main: Executed function with 0 keys in 1236 ms

The client metadata is retrieved using another of the connections:

Function Execution Thread-1: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.GetClientPRMetaDataOp$GetClientPRMetaDataOpImpl@3b56f75d; conn=Pooled Connection to 192.168.2.14:65056: Connection[192.168.2.14:65056]@653151156

(The server is killed here)

The PingTask for that server realizes the server is gone and handles it (it
is removed from the list of servers, and the metadata is cleared):

poolTimer-pool-6: PingTask.run2 about ping endpoint=192.168.2.14:65050
poolTimer-pool-6: OpExecutorImpl.executeOnServer op=com.gemstone.gemfire.cache.client.internal.PingOp$PingOpImpl@75d46202; conn=Pooled Connection to 192.168.2.14:65050: Connection[192.168.2.14:65050]@752413939
poolTimer-pool-6: EndpointManagerImpl.serverCrashed endpoint=192.168.2.14:65050

The function is executed using one of the connections with no retry:

main: ExecuteRegionFunctionOp.execute invoked
main: OpExecutorImpl.execute op=com.gemstone.gemfire.cache.client.internal.ExecuteRegionFunctionOp$ExecuteRegionFunctionOpImpl@77fd8595; conn=Pooled Connection to 192.168.2.14:65056: Connection[192.168.2.14:65056]@1414071838
main: Executed function with 0 keys in 10 ms

Thanks,
Barry Oglesby

On Tue, Jul 26, 2016 at 1:44 AM, Olivier Mallassi <
olivier.malla...@gmail.com> wrote:

> Hi all,
>
> I would need your help to better understand the behavior I have observed
> (regarding function execution with node failure)
>
> - I have a function (optimizeForWrite=true, hasResult=true, isHA=true)
> that is executed (onRegion(mypartitionedRegion)) every two minutes (poll
> frequency has been increased for test)
> - then, just after an execution of the function, I kill -9 one of the
> members (member-timeout=1)
> - then, the function is executed again (around 2 min later). In that case,
> the function is executed twice (on the remaining members).
> In that case, functionContext.isPossibleDuplicate() returns true so that I
> just exit the function
>
>
> if (functionContext.isPossibleDuplicate()) {
>     logger.warning(....
>     //exit
>     functionContext.getResultSender().lastResult(null);
> }
>
>
> The function being HA, this is the expected behavior.
>
> Yet, what I do not understand is that it seems the "node failure" is
> detected only when the function is executed, whereas the node failure has
> already been broadcast (Membership cluster). Can someone give me more
> insights on this? Is this a misconfiguration between client and locator, so
> that clients are still not aware of the node failure?
>
>
> Many thx.
>
> oliv/
>
