[ 
https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856067#comment-17856067
 ] 

Alexey Serbin commented on KUDU-2679:
-------------------------------------

[~ArnaudL],

{quote}
It has a similar topology (driver = job manager ; executor = task manager) ; 
however the "job manager" does not query the kudu tables ; only the task 
managers do ; so effectively it is not the same issue.
{quote}

One distinctive trait of this issue is that Spark executors never have Kerberos 
credentials, even at the time when a task starts.  They receive the authn 
credentials in the form of authentication tokens from the Spark driver, but on 
their own they can never acquire Kudu authn tokens, and that's how it's 
supposed to be.  It's the Spark driver that is supposed to re-acquire authn 
tokens and spawn new tasks with new, non-expired tokens.  However, when the 
Spark driver keeps its RPC connection to the Kudu master open and its authn 
token has expired, this issue happens.

This issue isn't about authn tokens expiring in long-running tasks.  It is 
about the specific conditions under which the Spark driver fails to recognize 
that the authn token it uses to start tasks on executors has expired, and 
doesn't automatically re-acquire the token, even if it has the required 
Kerberos credentials.  That's the issue this JIRA is about.
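
A toy model of this failure mode, as a rough sketch in Python.  This is not 
Kudu or Spark code: {{ToyToken}}, {{ToyDriver}} and {{run_task_on_executor}} 
are made-up names, and the model assumes that a token is checked only when a 
new connection is negotiated, which is inferred from the workaround of 
dropping long-established connections mentioned in the issue description.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60
TOKEN_VALIDITY_SECS = 7 * DAY  # default authn token validity (7 days)


@dataclass
class ToyToken:
    """Hypothetical stand-in for an authn token."""
    issued_at: int  # seconds since the application started

    def expired(self, now: int) -> bool:
        return now >= self.issued_at + TOKEN_VALIDITY_SECS


class ToyDriver:
    """Hypothetical stand-in for the Spark driver, which has Kerberos credentials."""

    def __init__(self) -> None:
        self.token = ToyToken(issued_at=0)
        self.connection_open = True

    def call_master(self, now: int) -> None:
        # Model assumption: the token is looked at only while negotiating a
        # new connection, so an open connection hides the token's expiration.
        if not self.connection_open:
            if self.token.expired(now):
                self.token = ToyToken(issued_at=now)  # re-acquired via Kerberos
            self.connection_open = True

    def drop_connection(self) -> None:
        self.connection_open = False


def run_task_on_executor(token: ToyToken, now: int) -> bool:
    # The executor has no Kerberos credentials and always opens a new
    # connection, so an expired token makes the task fail.
    return not token.expired(now)


driver = ToyDriver()
now = 8 * DAY                  # past the 7-day validity interval
driver.call_master(now)        # connection was never closed: token stays stale
stale_ok = run_task_on_executor(driver.token, now)

driver.drop_connection()       # the workaround from the issue description
driver.call_master(now)        # new connection: token gets re-acquired
fresh_ok = run_task_on_executor(driver.token, now)
```

In this model {{stale_ok}} ends up false and {{fresh_ok}} ends up true, which 
mirrors the behavior described above: only the driver can refresh the token, 
and it does so only once it notices the expiration.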

{quote}
However it is somewhat related since the kudu client initialisation occurs on 
each task manager when the task starts and each time a small batch of rows is 
received an insertion/upsertion is made, failing after 7 days.
{quote}

Please open a separate JIRA for that "somewhat related" issue, and describe the 
problem and how it manifests itself.  From what I understood so far, the issue 
you hit isn't going to be fixed when this JIRA is addressed, so I don't see a 
reason to mention that "somewhat related" issue here.

For more context, Kudu authentication tokens are designed to expire after 
the configured time interval (default is 7 days) -- [that's by 
design|https://github.com/apache/kudu/blob/master/docs/security.adoc#authentication-tokens],
 and it's working exactly as expected.  The Kudu client libraries automatically 
re-acquire authn tokens upon detecting their expiration, but valid Kerberos 
credentials are necessary to do so.
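
The rule in the last sentence can be sketched as follows.  This is an 
illustration in Python, not the Kudu client API: {{refreshed_expiry}} and its 
parameters are hypothetical names, and the only figure taken from Kudu is the 
default validity interval of 7 days.

```python
DEFAULT_TOKEN_VALIDITY_SECS = 60 * 60 * 24 * 7  # 7 days


def refreshed_expiry(expires_at: int, now: int, has_kerberos_credentials: bool,
                     validity_secs: int = DEFAULT_TOKEN_VALIDITY_SECS) -> int:
    """Return the expiration time of the token to use at time `now`.

    Hypothetical helper: all times are in seconds on an arbitrary clock.
    """
    if now < expires_at:
        # The current token is still valid, nothing to do.
        return expires_at
    if not has_kerberos_credentials:
        # Without primary credentials there is no way to get a new token.
        raise PermissionError("authn token expired, no Kerberos credentials")
    # With Kerberos credentials a new token is issued for a full interval.
    return now + validity_secs
```

A process with Kerberos credentials (such as the Spark driver) gets a new 
expiration time from this helper, while a process without them (such as a 
Spark executor) gets an error once the token has expired.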

> In some scenarios, a Spark Kudu application can be devoid of fresh authn 
> tokens
> -------------------------------------------------------------------------------
>
>                 Key: KUDU-2679
>                 URL: https://issues.apache.org/jira/browse/KUDU-2679
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, security, spark
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1
>            Reporter: Alexey Serbin
>            Priority: Major
>
> When running in {{cluster}} mode, tasks run as part of a Spark Kudu client 
> application can be left without new (i.e. non-expired) authentication 
> tokens even if they run for a very short time.  Essentially, if the driver 
> runs longer than the authn token expiration interval and has a particular 
> pattern of making RPC calls to Kudu masters and tablet servers, all tasks 
> scheduled to run after the authn token expiration interval will be supplied 
> with expired authn tokens, making every task fail.  The only way to fix that 
> is restarting the application or dropping long-established connections from 
> the driver to Kudu masters/tservers.
> Below are some details, explaining why that can happen.
> Let's assume the following holds true for a Spark Kudu application:
> * The application is running against a secured Kudu cluster.
> * The application is running in the {{cluster}} mode.
> * There are no primary authentication credentials at the machines for the 
> user under which the Spark executors are running (i.e. {{kinit}} hasn't been 
> run at those executor machines for the corresponding user or the Kerberos 
> credentials have already expired there). 
> * The {{--authn_token_validity_seconds}} masters' flag is set to {{X}} 
> seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days).
> * The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet 
> servers, if they are involved in the communication between the driver 
> process and the Kudu backend) is set to {{Y}} milliseconds (default is 65000 
> ms).
> * The application is running for longer than {{X}} seconds.
> * The driver process makes requests to Kudu masters at least every {{Y}} 
> milliseconds.
> * The driver either doesn't make requests to Kudu tablet servers or makes 
> such requests at least every {{Y}} milliseconds to each of the involved 
> tablet servers.
> * The executors are running tasks that keep connections to tablet servers 
> idle for longer than {{Y}} milliseconds, or the driver spawns tasks at an 
> executor more than {{Y}} milliseconds after the last task was completed by 
> the executor.
> Essentially, that's about a Spark Kudu application where the driver process 
> keeps its once-opened connections active and the executors need to open new 
> connections to Kudu tablet servers (and/or masters).  Also, the executor 
> machines don't have Kerberos credentials for the OS user under which the 
> executor processes are run.
> In such scenarios, the application's tasks spawned after {{X}} seconds from 
> the application start will fail because of expired authentication tokens, 
> while the driver process will never re-acquire its authn token, keeping the 
> expired token in {{KuduContext}} forever.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
