[ https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855804#comment-17855804 ]
Arnaud Linz commented on KUDU-2679:
-----------------------------------

It has a similar topology (driver = job manager; executor = task manager).
However, the "job manager" does not query the Kudu tables; only the task
managers do, so strictly speaking it is not the same issue. It is somewhat
related, though: the Kudu client is initialised on each task manager when the
task starts, and an insert/upsert is made each time a small batch of rows is
received. These writes start failing after 7 days.

> In some scenarios, a Spark Kudu application can be devoid of fresh authn tokens
> -------------------------------------------------------------------------------
>
>                 Key: KUDU-2679
>                 URL: https://issues.apache.org/jira/browse/KUDU-2679
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, security, spark
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1
>            Reporter: Alexey Serbin
>            Priority: Major
>
> When running in {{cluster}} mode, tasks run as part of a Spark Kudu client
> application can fail to get new (i.e. non-expired) authentication tokens
> even if they run for a very short time. Essentially, if the driver runs
> longer than the authn token expiration interval and has a particular
> pattern of making RPC calls to Kudu masters and tablet servers, all tasks
> scheduled to run after the authn token expiration interval will be supplied
> with expired authn tokens, making every task fail. The only way to fix that
> is to restart the application or to drop the long-established connections
> from the driver to Kudu masters/tservers.
>
> Below are some details explaining why that can happen.
>
> Let's assume the following holds true for a Spark Kudu application:
> * The application is running against a secured Kudu cluster.
> * The application is running in {{cluster}} mode.
> * There are no primary authentication credentials at the machines for the
> user under which the Spark executors are running (i.e. {{kinit}} hasn't been
> run at those executor machines for the corresponding user, or the Kerberos
> credentials have already expired there).
> * The {{--authn_token_validity_seconds}} masters' flag is set to {{X}}
> seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days).
> * The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet
> servers, if they are involved in the communication between the driver
> process and the Kudu backend) is set to {{Y}} milliseconds (default is 65000
> ms).
> * The application is running for longer than {{X}} seconds.
> * The driver process makes requests to Kudu masters at least every {{Y}}
> milliseconds.
> * The driver either doesn't make requests to Kudu tablet servers or makes
> such requests at least every {{Y}} milliseconds to each of the involved
> tablet servers.
> * The executors are running tasks that keep connections to tablet servers
> idle for longer than {{Y}} milliseconds, or the driver spawns tasks at an
> executor more than {{Y}} milliseconds after the last task completed at that
> executor.
>
> Essentially, that's about a Spark Kudu application where the driver process
> keeps once-opened connections active and the executors need to open new
> connections to Kudu tablet servers (and/or masters). Also, the executor
> machines don't have Kerberos credentials for the OS user under which the
> executor processes are run.
>
> In such scenarios, the application's tasks spawned after {{X}} seconds from
> the application start will fail because of expired authentication tokens,
> while the driver process will never re-acquire its authn token, keeping the
> expired token in {{KuduContext}} forever.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
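
The conditions listed in the issue description can be restated as a small timing model, which may help when checking whether a given deployment matches the scenario. The sketch below is only an illustration: `Scenario`, its field names, and both functions are hypothetical and are not part of any Kudu or Spark API. The two default values are the ones quoted in the description. As a simplification, the driver's behaviour towards masters and tablet servers is collapsed into a single "longest gap between RPCs" figure.

```python
from dataclasses import dataclass, replace

# Defaults quoted in the issue description.
DEFAULT_TOKEN_VALIDITY_S = 60 * 60 * 24 * 7  # X: --authn_token_validity_seconds
DEFAULT_KEEPALIVE_MS = 65000                 # Y: --rpc_default_keepalive_time_ms


@dataclass
class Scenario:
    """Hypothetical description of one application run (not a Kudu API)."""
    secured_cluster: bool
    cluster_mode: bool
    executors_have_kerberos_creds: bool
    app_uptime_s: float
    # Longest pause between driver RPCs on any connection it keeps using.
    driver_max_rpc_gap_ms: float
    # True if executor connections sat idle for longer than Y, or a task was
    # spawned more than Y after the previous one finished on that executor.
    executor_needs_new_connection: bool
    token_validity_s: int = DEFAULT_TOKEN_VALIDITY_S
    keepalive_ms: int = DEFAULT_KEEPALIVE_MS


def driver_keeps_connections_alive(s: Scenario) -> bool:
    # "At least every Y milliseconds": the connection is never idle long
    # enough to be dropped, so the driver never has to re-authenticate.
    return s.driver_max_rpc_gap_ms <= s.keepalive_ms


def tasks_get_expired_tokens(s: Scenario) -> bool:
    # All conditions from the description must hold at the same time.
    return (
        s.secured_cluster
        and s.cluster_mode
        and not s.executors_have_kerberos_creds
        and s.app_uptime_s > s.token_validity_s
        and driver_keeps_connections_alive(s)
        and s.executor_needs_new_connection
    )


# An application that has been up for 8 days with a chatty driver.
eight_days = Scenario(
    secured_cluster=True,
    cluster_mode=True,
    executors_have_kerberos_creds=False,
    app_uptime_s=8 * 24 * 3600,
    driver_max_rpc_gap_ms=30000,
    executor_needs_new_connection=True,
)
print(tasks_get_expired_tokens(eight_days))  # True
# With Kerberos credentials on the executors the scenario no longer applies.
print(tasks_get_expired_tokens(
    replace(eight_days, executors_have_kerberos_creds=True)))  # False
```

In this model, changing any single condition (for example an uptime shorter than {{X}}, or a driver that lets its connections go idle for longer than {{Y}}) makes the predicate false, which mirrors the workaround mentioned in the description of dropping the driver's long-established connections.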