On 19 Oct 2016, at 00:18, Michael Segel <msegel_had...@hotmail.com> wrote:
> (Sorry sent reply via wrong account.. )
>
> Steve, kinda hijacking the thread, but I promise it's still on topic to the OP's issue.. ;-)
>
> Usually you will end up having a local Kerberos set up per cluster. So your machine accounts (hive, yarn, hbase, etc…) are going to be local to the cluster.

Not necessarily... you can share a KDC. And in a land of Active Directory you'd need some trust.

> So you will have to set up some sort of realm trusts between the clusters. If you're going to be setting up security (Kerberos … ick! shivers… ;-) you're going to want to keep the machine accounts isolated to the cluster. And the OP said that he didn't control the other cluster, which makes me believe that they are separate.

Good point; you may not be able to get the tickets for cluster C accounts. But if you can log in as a user of that cluster, the token route below may still work.

> I would also think that you would have trouble with the credential… isn't it tied to a user at a specific machine?

There are two types of Kerberos identity: the simple "hdfs@REALM" and the server-specific "hdfs/server@REALM". The simple ones work just as well in small clusters; it's just that in larger clusters your KDCs (especially AD) tend to interpret an attempt by 200 machines to log in as user "hdfs@REALM" within 30s as an attempt to brute-force a password, and start rejecting logins. The separation into the hdfs/_HOST@REALM style avoids that, and may reduce the damage if a keytab leaks. (There's a rough sketch of the two styles at the end of this mail.)

If the user submitting work is logged into the KDC of cluster C, e.g.

    kinit user@CLUSTERC

and Spark is configured to ask for the extra namenode tokens,

    spark.yarn.access.namenodes hdfs://cluster-c:8020

...then Spark MAY ask for those tokens, pass them up to cluster B, and so have them available for talking to cluster C. The submitted job is using the block tokens, so it doesn't need to log in to Kerberos itself, and, as cluster B is insecure, doesn't need to worry about credentials and identity there. The HDFS client code just returns the block token to talk to cluster C when an attempt to talk to a DN of cluster C is rejected with an "authenticate yourself" response.

The main issue to me is: will that token get picked up and propagated to an insecure cluster, so as to support this operation? There's a risk that the ubiquitous static method UserGroupInformation.isSecurityEnabled() is being checked in places, and, as the cluster itself isn't secure (hadoop.security.authentication in core-site.xml is "simple"), those checks will conclude no tokens are needed. It looks like org.apache.spark.deploy.yarn.security.HDFSCredentialProvider is doing exactly that (as do HBase and Hive), meaning job submission doesn't fetch tokens unless security is enabled in the client's own Hadoop configuration. (The second sketch below shows the pattern.)

One thing that could be attempted would be turning authentication on to Kerberos just in the job launch config, and seeing if that will collect all the required tokens *without* getting confused by the fact that YARN and HDFS on cluster B don't need them:

    spark.hadoop.hadoop.security.authentication

I have no idea if this works; you'd have to try it and see.

> (It's been a while since I looked at this and I drank heavily to forget Kerberos… so I may be a bit fuzzy here.)

Denying all knowledge of Kerberos is always a good tactic.
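PS: for the curious, here's what the two principal styles look like from code. This is a minimal sketch only — the realm, hostname, and keytab paths are invented — using the real UserGroupInformation keytab-login API:

    import org.apache.hadoop.security.UserGroupInformation

    // Simple identity: every host logs in as the same principal.
    // Fine on small clusters; risks KDC lockout at scale (see above).
    val simple = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "hdfs@EXAMPLE.COM",                          // invented realm
      "/etc/security/keytabs/hdfs.keytab")         // invented path

    // Per-host identity: the hdfs/_HOST@REALM pattern expanded for one
    // machine, so 200 hosts appear as 200 distinct principals to the KDC.
    val perHost = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "hdfs/worker03.example.com@EXAMPLE.COM",     // invented host/realm
      "/etc/security/keytabs/hdfs.service.keytab") // invented path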
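And a sketch of the isSecurityEnabled() guard pattern described above. This is not the actual Spark/HBase/Hive source, just the shape of the check; UserGroupInformation.isSecurityEnabled() and FileSystem.addDelegationTokens() are real Hadoop APIs, everything else here is invented for illustration:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.security.{Credentials, UserGroupInformation}

    // Tokens are only fetched when the *client's* configuration says
    // security is on; a secure remote filesystem reached from an insecure
    // default cluster gets skipped — the failure mode discussed above.
    def obtainTokens(namenodes: Seq[Path], conf: Configuration,
                     creds: Credentials): Unit = {
      if (UserGroupInformation.isSecurityEnabled) { // driven by local core-site.xml
        namenodes.foreach { nn =>
          val fs = nn.getFileSystem(conf)
          fs.addDelegationTokens("yarn", creds)     // "yarn" renewer: invented
        }
      }
      // else: silently returns with no tokens collected
    }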
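Finally, the experiment suggested at the end of the mail, as it might look at submission time. Untested — the whole question is whether it works — and the app name is a placeholder; the namenode URI is the one from the discussion above:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("cross-cluster-read")            // placeholder
      // tell the Hadoop client libs used at submit time that security is
      // on, without touching cluster B's own core-site.xml
      .set("spark.hadoop.hadoop.security.authentication", "kerberos")
      // ask the YARN client to collect HDFS delegation tokens for cluster C
      .set("spark.yarn.access.namenodes", "hdfs://cluster-c:8020")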