[ https://issues.apache.org/jira/browse/KUDU-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249267#comment-17249267 ]
ASF subversion and git services commented on KUDU-3217:
-------------------------------------------------------

Commit fcd737e6ccedd5a91367dc119e773f0b1904b4fa in kudu's branch
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=fcd737e ]

[rpc] add slow timing logging for WrapSaslCall

In the context of KUDU-3217, this patch adds logging on slow SASL calls.
The reason for setting the threshold at 250ms is that 250ms is more than
enough for a local call, and it should be long enough for a call to a
robust KDC in the local network.

Change-Id: I3f8cff71c35a4aab0307456b1dc4270520966ece
Reviewed-on: http://gerrit.cloudera.org:8080/16868
Reviewed-by: Grant Henke <granthe...@apache.org>
Reviewed-by: Bankim Bhavsar <ban...@cloudera.com>
Tested-by: Kudu Jenkins

> Limit the amount of time spent by SASL method run by WrapSaslCall
> -----------------------------------------------------------------
>
>                 Key: KUDU-3217
>                 URL: https://issues.apache.org/jira/browse/KUDU-3217
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, security, tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>
> There were reports of connection negotiation issues in a secure Kudu cluster:
> Kudu clients (e.g., the {{kudu cluster ksck}} tool) would fail to establish a
> connection to a tablet server or master. The issue happened rather rarely,
> but in a busy cluster it was a major nuisance because the target server
> would not accept any new connections for a very long time, and usually the
> solution was to restart the server ({{kudu-tserver}} or {{kudu-master}},
> respectively).
> The stack traces collected from the diagnostics files pointed to a situation
> where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read
> mode and got stuck for a long time (several minutes), while the Kerberos
> creds renewal thread was waiting to acquire the same lock in write mode.
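For illustration, the slow-call timing logging described in the commit message above could be sketched roughly as follows. This is hypothetical code, not the actual patch: the name {{TimedSaslCall}} and the log format are made up, and only the 250ms threshold comes from the commit message.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Hypothetical sketch of slow-call timing logging: measure the wall-clock
// time of the wrapped call and log it when it crosses the 250ms threshold
// mentioned in the commit message.
constexpr std::chrono::milliseconds kSlowCallThreshold{250};

int TimedSaslCall(const std::function<int()>& call, const char* description) {
  const auto start = std::chrono::steady_clock::now();
  const int rc = call();  // the wrapped SASL call
  const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);
  if (elapsed >= kSlowCallThreshold) {
    // A real implementation would use the project's logging facility.
    std::cerr << "slow SASL call '" << description << "': "
              << elapsed.count() << " ms\n";
  }
  return rc;  // the call's result is passed through unchanged
}
```

Note that this only makes slow calls visible in the logs; it does not bound their duration, which is what the issue below asks for.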
> Consequently, all other connection negotiation threads were blocked on the
> same mutex after that, since the mutex is of
> {{RWMutex::Priority::PREFER_WRITING}} priority.
> The stacks of the related threads looked like the following:
> {noformat}
> tids=[380992]
>     0x7f060e1e45e0 <unknown>
>     0x7f060c4afa3d __GI___poll
>     0x7f0605813ffa <unknown>
>     0x7f06058147b5 <unknown>
>     0x7f0605814b07 sss_pac_make_request
>     0x7f0605813dab <unknown>
>     0x7f060df31deb <unknown>
>     0x7f060df53661 <unknown>
>     0x7f060df5407a krb5_rd_req_decoded
>     0x7f060d3d190c <unknown>
>     0x7f060d3d2dea <unknown>
>     0x7f060d3d2f49 <unknown>
>     0x7f060d3c0606 gss_accept_sec_context
>     0x7f060765c04c <unknown>
>     0x7f060d610b9b sasl_server_step
>     0x7f060d611109 sasl_server_start
> tids=[170406,168067,167927,170163,168917,169251]
>     0x7f060e1e45e0 <unknown>
>     0x7f060e1e0092 __GI___pthread_rwlock_rdlock
>     0x1d09883 kudu::rpc::WrapSaslCall()
>     0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
>     0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
>     0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
>     0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
>     0x1cfa905 kudu::internal::Invoker<>::Run()
>     0x1e9234f kudu::ThreadPool::DispatchThread()
>     0x1e88014 kudu::Thread::SuperviseThread()
>     0x7f060e1dce25 start_thread
>     0x7f060c4ba34d __clone
> tids=[380520]
>     0x7f060e1e45e0 <unknown>
>     0x7f060e1e02ac __GI___pthread_rwlock_wrlock
>     0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
>     0x1e88014 kudu::Thread::SuperviseThread()
>     0x7f060e1dce25 start_thread
>     0x7f060c4ba34d __clone
> {noformat}
> Thread {{380992}} is the thread that acquired the mutex as a reader and got
> stuck in a SASL call (the latter went through the SSSD PAC plugin). Thread
> {{380520}} is the Kerberos creds renewal thread, trying to acquire the mutex
> as a writer. The rest are connection negotiation threads trying to acquire
> the lock as readers.
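The blocking pattern described above can be reproduced in isolation. The demo below is hypothetical code, not Kudu's: it assumes a glibc writer-preferring {{pthread_rwlock_t}} (kind {{PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP}}) as an analogue of {{RWMutex::Priority::PREFER_WRITING}}, and shows that once a writer is queued behind a stuck reader, new readers can no longer acquire the lock even though only a reader currently holds it.

```cpp
#include <cerrno>
#include <chrono>
#include <pthread.h>
#include <thread>

// Hypothetical demo (glibc-specific, not Kudu code): one reader holds a
// writer-preferring rwlock for a long time, a writer queues behind it, and a
// subsequent non-blocking read attempt fails -- mirroring how the negotiation
// threads piled up behind the creds-renewal thread in the stacks above.
bool NewReaderBlocksBehindQueuedWriter() {
  pthread_rwlock_t lock;
  pthread_rwlockattr_t attr;
  pthread_rwlockattr_init(&attr);
  pthread_rwlockattr_setkind_np(
      &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
  pthread_rwlock_init(&lock, &attr);

  // Stands in for the negotiation thread stuck in a SASL call while
  // holding the lock in read mode.
  std::thread stuck_reader([&] {
    pthread_rwlock_rdlock(&lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(600));
    pthread_rwlock_unlock(&lock);
  });

  // Stands in for the creds-renewal thread queuing for the lock in write mode.
  std::thread writer([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    pthread_rwlock_wrlock(&lock);
    pthread_rwlock_unlock(&lock);
  });

  // A new reader: with writer preference, the read attempt fails (EBUSY)
  // because a writer is already queued, even though only a reader holds
  // the lock right now.
  std::this_thread::sleep_for(std::chrono::milliseconds(300));
  const bool blocked = (pthread_rwlock_tryrdlock(&lock) == EBUSY);
  if (!blocked) pthread_rwlock_unlock(&lock);

  stuck_reader.join();
  writer.join();
  pthread_rwlock_destroy(&lock);
  pthread_rwlockattr_destroy(&attr);
  return blocked;
}
```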
> Further investigation revealed [an issue in
> SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck
> thread looks exactly the same as the stack of {{380992}} (the latter didn't
> have debug symbols to show information on every function in the stack):
> {noformat}
> #0  0x00007f29342dcdfd in poll () from /lib64/libc.so.6
> #1  0x00007f2901e722ba in sss_cli_make_request_nochecks () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #2  0x00007f2901e72a75 in sss_cli_check_socket () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #3  0x00007f2901e72e07 in sss_pac_make_request () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #4  0x00007f2901e71feb in sssdpac_verify () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #5  0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
> #6  0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
> #7  0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
> #8  0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
> #9  0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from /lib64/libgssapi_krb5.so.2
> #10 0x00007f292d594359 in krb5_gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2
> #11 0x00007f292d5816d6 in gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2
> #12 0x00007f292d7c3edc in gssapi_server_mech_step () from /usr/lib64/sasl2/libgssapiv2.so
> #13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
> #14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
> {noformat}
> Given that there might be many other bugs in that path and a KDC might be
> slow to respond to a particular request, it would be great to limit the
> amount of time spent by the SASL call run by {{WrapSaslCall()}}.
> If it's over the limit, the code would return a {{Status::TimedOut()}} or
> {{Status::ServiceUnavailable()}} status and the client side could handle the
> response appropriately, but at least Kudu masters and tablet servers would be
> able to accept new connections and handle those new requests in a timely
> manner.
> Also, it doesn't seem like a very good idea to acquire a lock and then issue
> a SASL call, since the latter often turns out to be a remote call.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
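A minimal sketch of the time limit proposed above, assuming the blocking call can be shifted onto a separate thread. This is hypothetical code, not the eventual Kudu implementation: {{RunSaslCallWithDeadline}} and the simplified {{Status}} struct are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <string>
#include <thread>

// Simplified stand-in for Kudu's Status class.
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical sketch: run a potentially blocking SASL-style call on a worker
// thread and wait with a deadline. On timeout the caller gets a TimedOut-style
// status back immediately and can keep accepting new connections.
Status RunSaslCallWithDeadline(std::function<int()> call,
                               std::chrono::milliseconds budget) {
  std::packaged_task<int()> task(std::move(call));
  std::future<int> result = task.get_future();
  // Abandoned if we time out below: the thread keeps running until the call
  // returns, so a real implementation would need a bounded pool or a
  // cancellation story for such threads.
  std::thread(std::move(task)).detach();

  if (result.wait_for(budget) == std::future_status::timeout) {
    return Status{false, "Timed out: SASL call exceeded its time budget"};
  }
  return result.get() == 0
      ? Status{true, "OK"}
      : Status{false, "Runtime error: SASL call failed"};
}
```

Note the trade-off baked into this sketch: timing out does not stop the underlying call (there is no safe way to cancel a thread stuck inside a SASL plugin), it only stops waiting for it, which is exactly what frees the server to serve other negotiation requests.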