On 19 Oct 2016, at 00:18, Michael Segel <msegel_had...@hotmail.com> wrote:
> (Sorry sent reply via wrong account.. )
>
> Steve, kinda hijacking the thread, but I promise it's still on topic to the OP's issue.. ;-)
>
> Usually you will end up having a local Kerberos set up per cluster. So your machine accounts (hive, yarn, hbase, etc…) are going to be local to the cluster.

Not necessarily... you can share a KDC. And in a land of Active Directory you'd need some trust.

> So you will have to set up some sort of realm trusts between the clusters. If you're going to be setting up security (Kerberos … ick! shivers… ;-) you're going to want to keep the machine accounts isolated to the cluster. And the OP said that he didn't control the other cluster, which makes me believe that they are separate.

Good point; you may not be able to get the tickets for cluster C accounts. But if you can log in as a user of that cluster, the token route below may still work.

> I would also think that you would have trouble with the credential… isn't it tied to a user at a specific machine?

There are two types of Kerberos identity: the simple "hdfs@REALM" and the server-specific "hdfs/server@REALM". The simple ones work just as well in small clusters; it's just that in larger clusters your KDCs (especially AD) tend to interpret an attempt by 200 machines to log in as user "hdfs@REALM" within 30s as an attempt to brute-force a password, and start rejecting logins. The separation into the hdfs/_HOST@REALM style avoids that, and may reduce the damage if a keytab leaks. (There's a rough sketch of the two styles at the end of this mail.)

If the user submitting work is logged into the KDC of cluster C, e.g.

    kinit user@CLUSTERC

and Spark is configured to ask for the extra namenode tokens,

    spark.yarn.access.namenodes hdfs://cluster-c:8020

...then Spark MAY ask for those tokens, pass them up to cluster B, and so have them available for talking to cluster C. The submitted job is using the block tokens, so it doesn't need to log in to Kerberos itself, and, as cluster B is insecure, doesn't need to worry about credentials and identity there. The HDFS client code just returns the block token to talk to cluster C when an attempt to talk to a DN of cluster C is rejected with an "authenticate yourself" response.

The main issue to me is: will that token get picked up and propagated to an insecure cluster, so as to support this operation? There's a risk that the ubiquitous static method UserGroupInformation.isSecurityEnabled() is being checked in places, and, as the cluster itself isn't secure (hadoop.security.authentication in core-site.xml is "simple"), those checks will conclude no tokens are needed. It looks like org.apache.spark.deploy.yarn.security.HDFSCredentialProvider is doing exactly that (as do HBase and Hive), meaning job submission doesn't fetch tokens unless security is enabled in the client's own Hadoop configuration. (The second sketch below shows the pattern.)

One thing that could be attempted would be turning authentication on to Kerberos just in the job launch config, and seeing if that will collect all the required tokens *without* getting confused by the fact that YARN and HDFS on cluster B don't need them:

    spark.hadoop.hadoop.security.authentication

I have no idea if this works; you'd have to try it and see.

> (It's been a while since I looked at this and I drank heavily to forget Kerberos… so I may be a bit fuzzy here.)

Denying all knowledge of Kerberos is always a good tactic.
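PS: for the curious, here's what the two principal styles look like from code. This is a minimal sketch only — the realm, hostname, and keytab paths are invented — using the real UserGroupInformation keytab-login API:

    import org.apache.hadoop.security.UserGroupInformation

    // Simple identity: every host logs in as the same principal.
    // Fine on small clusters; risks KDC lockout at scale (see above).
    val simple = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "hdfs@EXAMPLE.COM",                          // invented realm
      "/etc/security/keytabs/hdfs.keytab")         // invented path

    // Per-host identity: the hdfs/_HOST@REALM pattern expanded for one
    // machine, so 200 hosts appear as 200 distinct principals to the KDC.
    val perHost = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "hdfs/worker03.example.com@EXAMPLE.COM",     // invented host/realm
      "/etc/security/keytabs/hdfs.service.keytab") // invented path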
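And a sketch of the isSecurityEnabled() guard pattern described above. This is not the actual Spark/HBase/Hive source, just the shape of the check; UserGroupInformation.isSecurityEnabled() and FileSystem.addDelegationTokens() are real Hadoop APIs, everything else here is invented for illustration:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.security.{Credentials, UserGroupInformation}

    // Tokens are only fetched when the *client's* configuration says
    // security is on; a secure remote filesystem reached from an insecure
    // default cluster gets skipped — the failure mode discussed above.
    def obtainTokens(namenodes: Seq[Path], conf: Configuration,
                     creds: Credentials): Unit = {
      if (UserGroupInformation.isSecurityEnabled) { // driven by local core-site.xml
        namenodes.foreach { nn =>
          val fs = nn.getFileSystem(conf)
          fs.addDelegationTokens("yarn", creds)     // "yarn" renewer: invented
        }
      }
      // else: silently returns with no tokens collected
    }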
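Finally, the experiment suggested at the end of the mail, as it might look at submission time. Untested — the whole question is whether it works — and the app name is a placeholder; the namenode URI is the one from the discussion above:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("cross-cluster-read")            // placeholder
      // tell the Hadoop client libs used at submit time that security is
      // on, without touching cluster B's own core-site.xml
      .set("spark.hadoop.hadoop.security.authentication", "kerberos")
      // ask the YARN client to collect HDFS delegation tokens for cluster C
      .set("spark.yarn.access.namenodes", "hdfs://cluster-c:8020")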