Hi,

I'm new to Spark and I'm trying to integrate it into my cluster.
It's a small set of nodes running CDH 4.7 with Kerberos.
The other services work fine with the authentication, but I'm having some
trouble with Spark.

First, I used the parcel available in Cloudera Manager (SPARK
0.9.0-1.cdh4.6.0.p0.98). Since the cluster runs CDH 4.7 (not 4.6), I'm not
sure whether this can create problems.
I've also tried the new Spark 1.0.0, with no luck ...

I've configured the environment as reported in
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
I'm using a standalone deployment.
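In case it matters, what I did from that page boils down to pointing Spark at
the parcel's Hadoop home in spark-env.sh and then attaching the shell to the
standalone master, roughly like this (the master hostname and port below are
placeholders, not my actual values):

# in spark-env.sh, as per the Cloudera guide
DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop

# then launch the shell against the standalone master (placeholder host)
MASTER=spark://<master-host>:7077 spark-shell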

When I launch spark-shell (for testing), everything seems fine: the process
gets registered with the master. But when I try to execute the example
reported in the installation page, Kerberos blocks access to HDFS:
scala> val file = sc.textFile("hdfs://m1hadoop.polito.it:8020/user/finamore/data")
14/06/16 22:28:36 INFO storage.MemoryStore: ensureFreeSpace(135653) called with curMem=0, maxMem=308713881
14/06/16 22:28:36 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 132.5 KB, free 294.3 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
java.io.IOException: Can't get Master Kerberos principal for use as renewer
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:14)
at $iwC$$iwC$$iwC.<init>(<console>:19)
at $iwC$$iwC.<init>(<console>:21)
at $iwC.<init>(<console>:23)
at <init>(<console>:25)
at .<init>(<console>:29)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)


Of course, I executed kinit before starting the shell, and the same user can
also access HDFS from the command line.
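Concretely, the check I do before launching the shell is roughly the
following (the principal is a placeholder for my actual one):

kinit <user>@<REALM>      # obtain a ticket (placeholder principal)
klist                     # a valid TGT is listed
hdfs dfs -ls hdfs://m1hadoop.polito.it:8020/user/finamore/data   # works fine from the CLI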
I guess Spark is not properly reading the Hadoop configuration.
As written in the Cloudera documentation, I've specified
DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
...which also contains the proper definition of the Kerberos principal.
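For reference, the Kerberos-related entries in the Hadoop client
configuration shipped with that parcel look roughly like this (summarised as
key/value pairs; the realm is a placeholder and the values are approximate):

# core-site.xml / hdfs-site.xml (excerpt, approximate)
hadoop.security.authentication  = kerberos
hadoop.security.authorization   = true
dfs.namenode.kerberos.principal = hdfs/_HOST@<REALM>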

Any idea of what I'm missing?

Thanks!

-- 
--------------------------------------------------
Alessandro Finamore, PhD
Politecnico di Torino
--
Office:    +39 0115644127
Mobile:   +39 3280251485
SkypeId: alessandro.finamore
---------------------------------------------------
