[ 
https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855804#comment-17855804
 ] 

Arnaud Linz commented on KUDU-2679:
-----------------------------------

It has a similar topology (driver = job manager; executor = task manager).
However, the "job manager" does not query the Kudu tables; only the task
managers do, so strictly speaking it is not the same issue. It is somewhat
related, though: the Kudu client is initialised on each task manager when
the task starts, and an insert/upsert is made each time a small batch of
rows is received. These writes start failing after 7 days.

> In some scenarios, a Spark Kudu application can be devoid of fresh authn 
> tokens
> -------------------------------------------------------------------------------
>
>                 Key: KUDU-2679
>                 URL: https://issues.apache.org/jira/browse/KUDU-2679
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, security, spark
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1
>            Reporter: Alexey Serbin
>            Priority: Major
>
> When running in {{cluster}} mode, tasks run as part of a Spark Kudu client
> application can be left without new (i.e. non-expired) authentication
> tokens even if they run for a very short time.  Essentially, if the driver
> runs longer than the authn token expiration interval and has a particular 
> pattern of making RPC calls to Kudu masters and tablet servers, all tasks 
> scheduled to run after the authn token expiration interval will be supplied 
> with expired authn tokens, making every task fail.  The only way to fix that 
> is restarting the application or dropping long-established connections from 
> the driver to Kudu masters/tservers.
> Below are some details, explaining why that can happen.
> Let's assume the following holds true for a Spark Kudu application:
> * The application is running against a secured Kudu cluster.
> * The application is running in the {{cluster}} mode.
> * There are no primary authentication credentials at the machines for the 
> user under which the Spark executors are running (i.e. {{kinit}} hasn't been 
> run at those executor machines for the corresponding user or the Kerberos 
> credentials have already expired there).
> * The {{--authn_token_validity_seconds}} masters' flag is set to {{X}} 
> seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days).
> * The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet 
> servers, if they are involved in the communication between the driver
> process and the Kudu backend) is set to {{Y}} milliseconds (default is 65000 
> ms).
> * The application is running for longer than {{X}} seconds.
> * The driver process makes requests to Kudu masters at least every {{Y}} 
> milliseconds.
> * The driver either doesn't make requests to Kudu tablet servers or makes 
> such requests at least every {{Y}} milliseconds to each of the involved 
> tablet servers.
> * The executors are running tasks that keep connections to tablet servers
> idle for longer than {{Y}} milliseconds, or the driver spawns a task at an
> executor more than {{Y}} milliseconds after that executor completed its
> last task.
> Essentially, this is a Spark Kudu application where the driver process
> keeps its already-opened connections active and the executors need to open
> new connections to Kudu tablet servers (and/or masters).  Also, the executor
> machines don't have Kerberos credentials for the OS user under which the
> executor processes are run.
> In such scenarios, the application's tasks spawned after {{X}} seconds from 
> the application start will fail because of expired authentication tokens, 
> while the driver process will never re-acquire its authn token, keeping the 
> expired token in {{KuduContext}} forever.
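
The conditions listed in the description can be restated as a small
predicate. The following is only an illustrative sketch in Python, not Kudu
or Spark code: the Scenario class, its field names, and the three helper
functions are hypothetical, and the comparisons are a simplified reading of
the bullet points above. The two default values are the ones quoted in the
description.

```python
from dataclasses import dataclass

# Defaults quoted in the issue description.
AUTHN_TOKEN_VALIDITY_S = 60 * 60 * 24 * 7  # X: --authn_token_validity_seconds
KEEPALIVE_MS = 65000                       # Y: --rpc_default_keepalive_time_ms


@dataclass
class Scenario:
    """Hypothetical summary of one application run (not a Kudu API)."""
    app_runtime_s: float           # time since the application started
    driver_max_gap_ms: float       # longest pause between driver RPCs to Kudu
    executor_idle_ms: float        # idle time on an executor before its next task
    executors_have_kerberos: bool  # whether executors hold valid Kerberos credentials
    token_validity_s: float = AUTHN_TOKEN_VALIDITY_S
    keepalive_ms: float = KEEPALIVE_MS


def driver_keeps_connections_open(s: Scenario) -> bool:
    # Assumption: a driver that calls Kudu at least every Y ms never has its
    # connections closed as idle, so it never re-negotiates them.
    return s.driver_max_gap_ms <= s.keepalive_ms


def executor_must_reconnect(s: Scenario) -> bool:
    # Assumption: an executor idle for longer than Y ms has to open a new
    # connection, which requires a valid token or Kerberos credentials.
    return s.executor_idle_ms > s.keepalive_ms


def tasks_fail(s: Scenario) -> bool:
    """True when every condition described in the issue holds at once."""
    return (
        s.app_runtime_s > s.token_validity_s
        and driver_keeps_connections_open(s)
        and executor_must_reconnect(s)
        and not s.executors_have_kerberos
    )


if __name__ == "__main__":
    eight_days = 8 * 24 * 3600
    run = Scenario(eight_days, driver_max_gap_ms=1000,
                   executor_idle_ms=120000, executors_have_kerberos=False)
    print(tasks_fail(run))
```

With the default flags, the sketch flags a run only once the application
has been up for more than 604800 seconds (7 days) and all of the other
conditions hold; changing any single condition makes tasks_fail return
False.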



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
