Hi, I'm new to Spark and I'm trying to integrate it into my cluster. It's a small set of nodes running CDH 4.7 with Kerberos. The other services work fine with the authentication, but I'm having some trouble with Spark.
First, I used the parcel available in Cloudera Manager (SPARK 0.9.0-1.cdh4.6.0.p0.98). Since the cluster runs CDH 4.7 (not 4.6), I'm not sure whether this can create problems. I've also tried the new Spark 1.0.0, with no luck.

I've configured the environment as described in
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
and I'm using a standalone deployment.

When launching spark-shell (for testing), everything seems fine: the process gets registered with the master. But when I try to execute the example reported in the installation page, Kerberos blocks access to HDFS:

scala> val file = sc.textFile("hdfs://m1hadoop.polito.it:8020/user/finamore/data")
14/06/16 22:28:36 INFO storage.MemoryStore: ensureFreeSpace(135653) called with curMem=0, maxMem=308713881
14/06/16 22:28:36 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 132.5 KB, free 294.3 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
java.io.IOException: Can't get Master Kerberos principal for use as renewer
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
        at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
        at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
        at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:14)
        at $iwC$$iwC$$iwC.<init>(<console>:19)
        at $iwC$$iwC.<init>(<console>:21)
        at $iwC.<init>(<console>:23)
        at <init>(<console>:25)
        at .<init>(<console>:29)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
        at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)

Of course, I executed kinit before starting the shell, and the same user can also access HDFS from the command line. My guess is that Spark is not properly reading the Hadoop configuration.

As written in the Cloudera documentation, I've specified
DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
...which also has the proper definition of the Kerberos principal.

Any idea of what I'm missing?

Thanks!

--
--------------------------------------------------
Alessandro Finamore, PhD
Politecnico di Torino
--
Office: +39 0115644127
Mobile: +39 3280251485
SkypeId: alessandro.finamore
---------------------------------------------------
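PS: in case it helps, this is the kind of check I was planning to run from the shell to see whether spark-shell actually picks up the cluster's *-site.xml files. The two principal-related property names are just my guesses at what TokenCache might look for (YARN vs. MR1), so please correct me if they are the wrong keys:

scala> val hconf = sc.hadoopConfiguration
scala> hconf.get("fs.defaultFS")                              // should match hdfs://m1hadoop.polito.it:8020 if core-site.xml is read
scala> hconf.get("hadoop.security.authentication")            // should be "kerberos" on a secure cluster
scala> hconf.get("yarn.resourcemanager.principal")            // my guess at the renewer principal key on YARN
scala> hconf.get("mapreduce.jobtracker.kerberos.principal")   // my guess at the renewer principal key on MR1

If these come back null, I suppose that would confirm the shell is not reading the cluster configuration at all.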