Update. I've reconfigured the environment to use Spark 1.0.0 and the example finally worked! :)
The difference for me was that Spark 1.0.0 only requires specifying the Hadoop conf dir (HADOOP_CONF_DIR=/etc/hadoop/conf/). I guess that with 0.9 there were problems in locating this dir, but I'm not sure why.

On 16 June 2014 23:03, Finamore A. <[email protected]> wrote:

> Hi,
>
> I'm a new user of Spark and I'm trying to integrate it into my cluster.
> It's a small set of nodes running CDH 4.7 with Kerberos.
> The other services are fine with the authentication, but I'm having some trouble with Spark.
>
> First, I used the parcel available in Cloudera Manager (SPARK 0.9.0-1.cdh4.6.0.p0.98).
> Since the cluster has CDH 4.7 (not 4.6), I'm not sure if this can create problems.
> I've also tried the new Spark 1.0.0, with no luck...
>
> I've configured the environment as reported in
> http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
> I'm using a standalone deployment.
>
> When launching spark-shell (for testing), everything seems fine (the process gets registered with the master).
> But when I try to execute the example reported in the installation page, Kerberos blocks the access to HDFS:
>
> scala> val file = sc.textFile("hdfs://m1hadoop.polito.it:8020/user/finamore/data")
> 14/06/16 22:28:36 INFO storage.MemoryStore: ensureFreeSpace(135653) called with curMem=0, maxMem=308713881
> 14/06/16 22:28:36 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 132.5 KB, free 294.3 MB)
> file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
>
> scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
> java.io.IOException: Can't get Master Kerberos principal for use as renewer
>     at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>     at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>     at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
>     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>     at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>     at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>     at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>     at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
>     at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354)
>     at $iwC$$iwC$$iwC$$iwC.<init>(<console>:14)
>     at $iwC$$iwC$$iwC.<init>(<console>:19)
>     at $iwC$$iwC.<init>(<console>:21)
>     at $iwC.<init>(<console>:23)
>     at <init>(<console>:25)
>     at .<init>(<console>:29)
>     at .<clinit>(<console>)
>     at .<init>(<console>:7)
>     at .<clinit>(<console>)
>     at $print(<console>)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:616)
>     at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>     at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>     at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>     at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>     at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>     at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>     at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>     at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>     at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>     at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>     at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>     at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>     at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>     at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>     at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>     at org.apache.spark.repl.Main$.main(Main.scala:31)
>     at org.apache.spark.repl.Main.main(Main.scala)
>
> Of course, I executed kinit before starting the shell, and the user can also access HDFS from the command line.
> I guess Spark is not properly reading the configuration.
> As written in the Cloudera documentation, I've specified
> DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
> ...which also has the proper definition of the Kerberos principal.
>
> Any idea of what I'm missing?
>
> Thanks!
>
> --
> --------------------------------------------------
> Alessandro Finamore, PhD
> Politecnico di Torino
> --
> Office: +39 0115644127
> Mobile: +39 3280251485
> SkypeId: alessandro.finamore
> ---------------------------------------------------

--
--------------------------------------------------
Alessandro Finamore, PhD
Politecnico di Torino
--
Office: +39 0115644127
Mobile: +39 3280251485
SkypeId: alessandro.finamore
---------------------------------------------------
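
A rough sketch for anyone who hits the same error: before starting spark-shell, it can help to confirm that the directory given as HADOOP_CONF_DIR really holds the secured client configuration. The function name check_hadoop_conf is made up for illustration, and the check assumes the usual layout in which core-site.xml sets hadoop.security.authentication to kerberos. Passing it does not prove that Spark will pick the settings up; it only rules out a wrong or empty directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark or CDH): sanity-check the directory
# that HADOOP_CONF_DIR points at.

check_hadoop_conf() {
    conf_dir="$1"

    # The directory must contain the client-side core-site.xml.
    if [ ! -f "$conf_dir/core-site.xml" ]; then
        echo "missing: $conf_dir/core-site.xml"
        return 1
    fi

    # Approximate test: the property name and a kerberos value both appear
    # somewhere in the file. This is a text match, not an XML parse.
    if ! grep -q "hadoop.security.authentication" "$conf_dir/core-site.xml" ||
       ! grep -q "<value>kerberos</value>" "$conf_dir/core-site.xml"; then
        echo "no kerberos authentication setting in $conf_dir/core-site.xml"
        return 1
    fi

    echo "ok: $conf_dir"
    return 0
}

# Usage, with the path from the message above (assumes it is exported in the
# environment that starts the shell, and that kinit has already been run):
#   export HADOOP_CONF_DIR=/etc/hadoop/conf/
#   check_hadoop_conf "$HADOOP_CONF_DIR" && ./bin/spark-shell
```

The check is deliberately a plain text match so that it runs with nothing but a POSIX shell and grep; a file that spreads the value tag over several lines would need a real XML parser.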
