[ https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855662#comment-17855662 ]
Arnaud Linz commented on KUDU-2679:
-----------------------------------
(The same happens with Flink streaming applications that use a Kudu sink: after
the {{--authn_token_validity_seconds}} period elapses, we have to restart the
application.)
> In some scenarios, a Spark Kudu application can be devoid of fresh authn tokens
> -------------------------------------------------------------------------------
>
> Key: KUDU-2679
> URL: https://issues.apache.org/jira/browse/KUDU-2679
> Project: Kudu
> Issue Type: Bug
> Components: client, security, spark
> Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1
> Reporter: Alexey Serbin
> Priority: Major
>
> When running in {{cluster}} mode, tasks run as part of a Spark Kudu client
> application can be left without new (i.e. non-expired) authentication
> tokens, even if the tasks themselves run for a very short time. Essentially,
> if the driver runs longer than the authn token expiration interval and has a
> particular pattern of making RPC calls to Kudu masters and tablet servers,
> all tasks scheduled to run after the authn token expiration interval are
> supplied with expired authn tokens, so every such task fails. The only ways
> to fix that are restarting the application or dropping the long-established
> connections from the driver to Kudu masters/tservers.
> Below are some details explaining why that can happen.
> Let's assume the following holds true for a Spark Kudu application:
> * The application is running against a secured Kudu cluster.
> * The application is running in the {{cluster}} mode.
> * There are no primary authentication credentials on the machines for the
> user under which the Spark executors are running (i.e. {{kinit}} hasn't been
> run on those executor machines for the corresponding user, or the Kerberos
> credentials have already expired there).
> * The masters' {{--authn_token_validity_seconds}} flag is set to {{X}}
> seconds (the default is 60 * 60 * 24 * 7 seconds, i.e. 7 days).
> * The {{--rpc_default_keepalive_time_ms}} flag for masters (and for tablet
> servers, if they are involved in the communication between the driver
> process and the Kudu backend) is set to {{Y}} milliseconds (the default is
> 65000 ms).
> * The application is running for longer than {{X}} seconds.
> * The driver process makes requests to Kudu masters at least every {{Y}}
> milliseconds.
> * The driver either doesn't make requests to Kudu tablet servers or makes
> such requests at least every {{Y}} milliseconds to each of the involved
> tablet servers.
> * The executors are running tasks that keep connections to tablet servers
> idle for longer than {{Y}} milliseconds, or the driver schedules tasks on an
> executor more than {{Y}} milliseconds after that executor completed its last
> task.
> Essentially, this describes a Spark Kudu application where the driver
> process keeps its once-opened connections active while the executors need to
> open new connections to Kudu tablet servers (and/or masters). Also, the
> executor machines don't have Kerberos credentials for the OS user under which
> the executor processes run.
> In such scenarios, the application's tasks spawned more than {{X}} seconds
> after the application start fail because of expired authentication tokens,
> while the driver process never re-acquires its authn token, keeping the
> expired token in {{KuduContext}} forever.
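The interplay between {{X}} and {{Y}} described above can be illustrated with a toy timeline model. This is a hypothetical sketch, not Kudu client code, and it rests on assumptions of its own: the driver obtains a fresh authn token only when it has to re-open its connection to a master (re-authenticating with its Kerberos credentials), that connection is closed once it stays idle for longer than {{Y}}, and every executor task opens a new connection and can present only the token taken from the driver. The names Token, driver_token_at and executor_task_succeeds, as well as the small time values, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Token:
    issued_at: int   # seconds since application start
    validity_s: int  # plays the role of --authn_token_validity_seconds (X)

    def is_expired(self, now: int) -> bool:
        return now >= self.issued_at + self.validity_s


def driver_token_at(now: int, master_request_times: Iterable[int],
                    validity_s: int, keepalive_s: int) -> Token:
    """Token held by the driver at time `now` (hypothetical model).

    Assumption: the driver's master connection is closed once it stays idle
    longer than keepalive_s (the role of --rpc_default_keepalive_time_ms, Y),
    and only re-opening that connection yields a fresh token.
    """
    token = Token(issued_at=0, validity_s=validity_s)  # acquired at start
    last_activity = 0
    for t in sorted(master_request_times):
        if t > now:
            break
        if t - last_activity > keepalive_s:
            # Connection was dropped while idle: renegotiate, get a new token.
            token = Token(issued_at=t, validity_s=validity_s)
        # Otherwise the connection is reused and the cached token is kept.
        last_activity = t
    return token


def executor_task_succeeds(now: int, master_request_times: Iterable[int],
                           validity_s: int, keepalive_s: int) -> bool:
    """An executor without Kerberos credentials opens a new connection and
    can authenticate only with the token shipped from the driver."""
    token = driver_token_at(now, master_request_times, validity_s, keepalive_s)
    return not token.is_expired(now)


if __name__ == "__main__":
    X, Y = 100, 65               # toy values: validity 100 s, keepalive 65 s
    busy = range(60, 1000, 60)   # driver talks to masters every 60 s (< Y)
    idle = range(70, 1000, 70)   # driver talks to masters every 70 s (> Y)
    print(executor_task_succeeds(50, busy, X, Y))   # True
    print(executor_task_succeeds(150, busy, X, Y))  # False: t=0 token expired
    print(executor_task_succeeds(150, idle, X, Y))  # True: token from t=140
```

In this model, a driver that stays busy keeps the token it got at start-up forever, while a driver whose connection is dropped (the workaround of dropping long-established connections mentioned above) picks up a fresh token on reconnect.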
--
This message was sent by Atlassian Jira
(v8.20.10#820010)