[ https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855666#comment-17855666 ]
Alexey Serbin commented on KUDU-2679:
-------------------------------------

{quote}
The same happens with Flink streaming applications with a Kudu Sink. After the --authn_token_validity_seconds period we have to restart the application.
{quote}

[~ArnaudL],

The crux of this issue is the presence of two actors of different types in Spark: the driver and the executors. Does Flink have a similar topology when assigning tasks? If not, then it's not the same issue.

> In some scenarios, a Spark Kudu application can be devoid of fresh authn tokens
> -------------------------------------------------------------------------------
>
>                 Key: KUDU-2679
>                 URL: https://issues.apache.org/jira/browse/KUDU-2679
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, security, spark
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1
>            Reporter: Alexey Serbin
>            Priority: Major
>
> When running in {{cluster}} mode, tasks run as part of a Spark Kudu client
> application can be unable to obtain new (i.e. non-expired) authentication
> tokens even if they run for a very short time. Essentially, if the driver
> runs longer than the authn token expiration interval and has a particular
> pattern of making RPC calls to Kudu masters and tablet servers, all tasks
> scheduled to run after the authn token expiration interval will be supplied
> with expired authn tokens, making every task fail. The only way to fix that
> is restarting the application or dropping long-established connections from
> the driver to Kudu masters/tservers.
>
> Below are some details explaining why that can happen.
>
> Let's assume the following holds true for a Spark Kudu application:
> * The application is running against a secured Kudu cluster.
> * The application is running in {{cluster}} mode.
> * There are no primary authentication credentials on the machines for the
>   user under which the Spark executors are running (i.e. {{kinit}} hasn't
>   been run on those executor machines for the corresponding user, or the
>   Kerberos credentials have already expired there).
> * The {{--authn_token_validity_seconds}} masters' flag is set to {{X}}
>   seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days).
> * The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet
>   servers, if they are involved in the communication between the driver
>   process and the Kudu backend) is set to {{Y}} milliseconds (default is
>   65000 ms).
> * The application is running for longer than {{X}} seconds.
> * The driver process makes requests to Kudu masters at least every {{Y}}
>   milliseconds.
> * The driver either doesn't make requests to Kudu tablet servers or makes
>   such requests at least every {{Y}} milliseconds to each of the involved
>   tablet servers.
> * The executors are running tasks that keep connections to tablet servers
>   idle for longer than {{Y}} milliseconds, or the driver spawns tasks at an
>   executor more than {{Y}} milliseconds after the last task completed on
>   that executor.
>
> Essentially, that's about a Spark Kudu application where the driver process
> keeps its once-opened connections active and the executors need to open new
> connections to Kudu tablet servers (and/or masters). Also, the executor
> machines don't have Kerberos credentials for the OS user under which the
> executor processes are run.
>
> In such scenarios, the application's tasks spawned after {{X}} seconds from
> the application start will fail because of expired authentication tokens,
> while the driver process will never re-acquire its authn token, keeping the
> expired token in {{KuduContext}} forever.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)