[ 
https://issues.apache.org/jira/browse/KUDU-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3217:
--------------------------------
    Description: 
There have been reports of connection negotiation issues in a secure Kudu 
cluster: Kudu clients (e.g., the {{kudu cluster ksck}} tool) would fail to 
establish a connection to a tablet server or master.  The issue happened rather 
rarely, but in a busy cluster it was a major nuisance: the target server would 
not accept any new connections for a very long time, and usually the only 
remedy was to restart the server ({{kudu-tserver}} or {{kudu-master}}, 
respectively).

The stack traces collected from the diagnostic files pointed to a situation 
where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read 
mode and got stuck for a long time (several minutes), while the Kerberos creds 
renewal thread was waiting to acquire the same lock in write mode.  
Consequently, all other connection negotiation threads blocked on the same 
mutex, since the mutex is created with {{RWMutex::Priority::PREFER_WRITING}} 
priority.
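
To make the failure mode concrete, below is a minimal, hypothetical C++ sketch 
(the helper name is made up) of how a writer-preferring rwlock is typically 
configured on glibc, which is the behavior 
{{RWMutex::Priority::PREFER_WRITING}} selects, assuming {{RWMutex}} maps onto 
a {{pthread_rwlock_t}}:
{noformat}
// Hypothetical sketch, not Kudu's actual RWMutex code.  With the
// PREFER_WRITER_NONRECURSIVE policy, pthread_rwlock_rdlock() blocks as
// soon as any writer is queued: reader R holds the lock, writer W queues
// behind it, and every subsequent reader waits behind W even though R is
// the only holder.  That is exactly the pile-up described above.
#include <pthread.h>

void InitWriterPreferringLock(pthread_rwlock_t* lock) {
  pthread_rwlockattr_t attr;
  pthread_rwlockattr_init(&attr);
  pthread_rwlockattr_setkind_np(
      &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
  pthread_rwlock_init(lock, &attr);
  pthread_rwlockattr_destroy(&attr);
}
{noformat}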

The stacks of the related threads looked like the following:
{noformat}
  tids=[380992]
      0x7f060e1e45e0 <unknown>
      0x7f060c4afa3d __GI___poll
      0x7f0605813ffa <unknown>
      0x7f06058147b5 <unknown>
      0x7f0605814b07 sss_pac_make_request
      0x7f0605813dab <unknown>
      0x7f060df31deb <unknown>
      0x7f060df53661 <unknown>
      0x7f060df5407a krb5_rd_req_decoded
      0x7f060d3d190c <unknown>
      0x7f060d3d2dea <unknown>
      0x7f060d3d2f49 <unknown>
      0x7f060d3c0606 gss_accept_sec_context
      0x7f060765c04c <unknown>
      0x7f060d610b9b sasl_server_step
      0x7f060d611109 sasl_server_start
  tids=[170406,168067,167927,170163,168917,169251]
      0x7f060e1e45e0 <unknown>
      0x7f060e1e0092 __GI___pthread_rwlock_rdlock
           0x1d09883 kudu::rpc::WrapSaslCall()
           0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
           0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
           0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
           0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
           0x1cfa905 kudu::internal::Invoker<>::Run()
           0x1e9234f kudu::ThreadPool::DispatchThread()
           0x1e88014 kudu::Thread::SuperviseThread()
      0x7f060e1dce25 start_thread
      0x7f060c4ba34d __clone
  tids=[380520]
      0x7f060e1e45e0 <unknown>
      0x7f060e1e02ac __GI___pthread_rwlock_wrlock
           0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
           0x1e88014 kudu::Thread::SuperviseThread()
      0x7f060e1dce25 start_thread
      0x7f060c4ba34d __clone
{noformat} 

Thread {{380992}} is the one that acquired the mutex as a reader and got stuck 
in a SASL call (the latter went through the SSSD PAC plugin).  Thread 
{{380520}} is the Kerberos creds renewal thread, trying to acquire the mutex as 
a writer.  The rest are connection negotiation threads trying to acquire the 
lock as readers.

Further investigation revealed [an issue in 
SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck 
thread looks exactly the same as the stack of thread {{380992}} above (the 
latter didn't have debug symbols to show every function in the stack):
{noformat}
#0  0x00007f29342dcdfd in poll () from /lib64/libc.so.6
#1  0x00007f2901e722ba in sss_cli_make_request_nochecks () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#2  0x00007f2901e72a75 in sss_cli_check_socket () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#3  0x00007f2901e72e07 in sss_pac_make_request () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#4  0x00007f2901e71feb in sssdpac_verify () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#5  0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
#6  0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
#7  0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
#8  0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
#9  0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from 
/lib64/libgssapi_krb5.so.2
#10 0x00007f292d594359 in krb5_gss_accept_sec_context () from 
/lib64/libgssapi_krb5.so.2
#11 0x00007f292d5816d6 in gss_accept_sec_context () from 
/lib64/libgssapi_krb5.so.2
#12 0x00007f292d7c3edc in gssapi_server_mech_step () from 
/usr/lib64/sasl2/libgssapiv2.so
#13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
#14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
{noformat}

Given that there might be many other bugs in that path and a KDC might be slow 
to respond to a particular request, it would be great to limit the amount of 
time spent in the SASL call run by {{WrapSaslCall()}}.  If the call exceeds 
the limit, the code would return a {{Status::TimedOut()}} or 
{{Status::ServiceUnavailable()}} status and the client side could handle the 
response appropriately, but at least Kudu masters and tablet servers would be 
able to accept new connections and handle new requests in a timely manner.

Also, it doesn't seem like a very good idea to acquire a lock and then issue a 
SASL call while holding it, since the latter often turns out to be a remote 
call.
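
As a rough illustration of the proposed time limit, here is a hypothetical 
sketch (the names and the {{Status}} enum are stand-ins, not Kudu's actual 
code) that runs the blocking SASL callback on a separate thread and bounds how 
long the negotiation thread waits for it:
{noformat}
// Hypothetical sketch of the time-limiting idea.  SASL_OK is 0 in Cyrus
// SASL; Status below is a stand-in for kudu::Status.
#include <chrono>
#include <functional>
#include <future>
#include <thread>

enum class Status { kOk, kSaslError, kTimedOut };

Status SaslCallWithDeadline(std::function<int()> sasl_call,
                            std::chrono::milliseconds timeout) {
  std::packaged_task<int()> task(std::move(sasl_call));
  std::future<int> result = task.get_future();
  // Detached on purpose: a future from std::async would block in its
  // destructor until the stuck call returns.  A real implementation
  // would use a dedicated thread pool instead of a raw detached thread.
  std::thread(std::move(task)).detach();
  if (result.wait_for(timeout) == std::future_status::timeout) {
    // The call is still stuck (e.g. in the SSSD PAC plugin or waiting on
    // a slow KDC).  Fail only this negotiation; note that the orphaned
    // thread still holds whatever it acquired, so the real fix must also
    // avoid holding the rwlock across the call.  This only bounds the
    // caller's wait so the server keeps accepting new connections.
    return Status::kTimedOut;
  }
  return result.get() == 0 ? Status::kOk : Status::kSaslError;
}
{noformat}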

  was:
There were reports of connection negotiation issues in a secure Kudu cluster: 
Kudu clients (e.g., the {{kudu cluster ksck}} tool) would fail to establish a 
connection to a tablet server or master.  The issue happened rather rarely, but 
in a busy cluster it was a major nuisance: the target server (master or tablet 
server) would not accept any new connections for a very long time, and usually 
the only remedy was to restart the process.

The stack traces collected from the diagnostic files pointed to a situation 
where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read 
mode and got stuck for a long time (at least a few minutes), while the Kerberos 
creds renewal thread was waiting to acquire the same lock in write mode.  
Consequently, all other connection negotiation threads blocked on the same 
mutex, since the mutex is created with {{RWMutex::Priority::PREFER_WRITING}} 
priority.

The stacks of the related threads looked like the following:
{noformat}
  tids=[380992]
      0x7f060e1e45e0 <unknown>
      0x7f060c4afa3d __GI___poll
      0x7f0605813ffa <unknown>
      0x7f06058147b5 <unknown>
      0x7f0605814b07 sss_pac_make_request
      0x7f0605813dab <unknown>
      0x7f060df31deb <unknown>
      0x7f060df53661 <unknown>
      0x7f060df5407a krb5_rd_req_decoded
      0x7f060d3d190c <unknown>
      0x7f060d3d2dea <unknown>
      0x7f060d3d2f49 <unknown>
      0x7f060d3c0606 gss_accept_sec_context
      0x7f060765c04c <unknown>
      0x7f060d610b9b sasl_server_step
      0x7f060d611109 sasl_server_start
  tids=[170406,168067,167927,170163,168917,169251]
      0x7f060e1e45e0 <unknown>
      0x7f060e1e0092 __GI___pthread_rwlock_rdlock
           0x1d09883 kudu::rpc::WrapSaslCall()
           0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
           0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
           0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
           0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
           0x1cfa905 kudu::internal::Invoker<>::Run()
           0x1e9234f kudu::ThreadPool::DispatchThread()
           0x1e88014 kudu::Thread::SuperviseThread()
      0x7f060e1dce25 start_thread
      0x7f060c4ba34d __clone
  tids=[380520]
      0x7f060e1e45e0 <unknown>
      0x7f060e1e02ac __GI___pthread_rwlock_wrlock
           0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
           0x1e88014 kudu::Thread::SuperviseThread()
      0x7f060e1dce25 start_thread
      0x7f060c4ba34d __clone
{noformat} 

Thread {{380992}} is the one that acquired the mutex in read mode and got 
stuck in a SASL call (the latter went through SSSD).  Thread {{380520}} is the 
Kerberos creds renewal thread, trying to lock the mutex in write mode.  The 
rest are connection negotiation threads trying to acquire the lock in read 
mode.

Further investigation revealed an [issue in 
SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck 
thread looks exactly the same as the stack of {{380992}} (though the latter 
didn't have debug symbols to show information on every function in the stack):
{noformat}
#0  0x00007f29342dcdfd in poll () from /lib64/libc.so.6
#1  0x00007f2901e722ba in sss_cli_make_request_nochecks () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#2  0x00007f2901e72a75 in sss_cli_check_socket () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#3  0x00007f2901e72e07 in sss_pac_make_request () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#4  0x00007f2901e71feb in sssdpac_verify () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#5  0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
#6  0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
#7  0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
#8  0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
#9  0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from 
/lib64/libgssapi_krb5.so.2
#10 0x00007f292d594359 in krb5_gss_accept_sec_context () from 
/lib64/libgssapi_krb5.so.2
#11 0x00007f292d5816d6 in gss_accept_sec_context () from 
/lib64/libgssapi_krb5.so.2
#12 0x00007f292d7c3edc in gssapi_server_mech_step () from 
/usr/lib64/sasl2/libgssapiv2.so
#13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
#14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
{noformat}

Given that there might be many other bugs in that path and a remote KDC might 
be very slow to respond if overloaded in a larger cluster, it would be great 
to limit the amount of time spent in the SASL call run by {{WrapSaslCall()}}; 
if the call exceeds the limit, the code would return a {{Status::TimedOut()}} 
or {{Status::ServiceUnavailable()}} status.

Also, it doesn't seem like a very good idea to lock a mutex and then issue a 
SASL call while holding it, since the latter often turns out to be a call to a 
remote KDC.


> Limit the amount of time spent by SASL method run by WrapSaslCall
> -----------------------------------------------------------------
>
>                 Key: KUDU-3217
>                 URL: https://issues.apache.org/jira/browse/KUDU-3217
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, security, tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
