Hi everyone, I've been trying to set up Spark so that it can read data from HDFS when the HDFS cluster is integrated with Kerberos authentication.
I've been using the Spark shell to attempt to read from HDFS, in local mode. I've set all of the appropriate properties in core-site.xml and hdfs-site.xml, and they appear to be correct since I can access and write data using the Hadoop command line utilities. I've also set HADOOP_CONF_DIR to point to the directory where core-site.xml and hdfs-site.xml live.

I used UserGroupInformation.setConfiguration(conf) and UserGroupInformation.loginUserFromKeytab() to set up the Kerberos credentials, and then called SparkContext.newAPIHadoopFile(conf) instead of SparkContext.textFile(), which I would think would not pass the appropriate configuration along with the Kerberos credentials. When I do that, I get this stack trace (sorry about the color):

java.io.IOException: Can't get Master Kerberos principal for use as renewer
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

I was wondering if anyone has had any experience setting up Spark to read from Kerberized HDFS. What configurations need to be set in spark-env.sh? What am I missing? Also, will I have an issue if I try to access HDFS in distributed mode, using a standalone setup? (A rough sketch of the shell session I'm running is below my signature.)

Thanks,

-Matt Cheah
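P.S. For reference, here is roughly what I'm running in the Spark shell. This is only a minimal sketch; the config file paths, principal, keytab path, and HDFS URL are placeholders, not my actual values.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.hadoop.security.UserGroupInformation

    // Load the cluster config explicitly; these paths stand in for wherever
    // HADOOP_CONF_DIR points on my machine.
    val conf = new Configuration()
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))

    // Kerberos login from a keytab; principal and keytab path are placeholders.
    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab("someuser@EXAMPLE.COM", "/path/to/someuser.keytab")

    // Read through the new Hadoop API so the Configuration (and with it the
    // Kerberos credentials) is passed to the input format.
    val rdd = sc.newAPIHadoopFile(
      "hdfs://namenode:8020/user/someuser/input.txt",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)

    // getSplits() runs when the first action is triggered, which is where the
    // exception above surfaces.
    rdd.count()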