Re: Kerberos setup in Apache Spark connecting to remote HDFS/YARN

2016-06-17 Thread akhandeshi
A little more progress... I added a few environment variables; now I get the following error message: InvocationTargetException: Can't get Master Kerberos principal for use as renewer -> [Help 1]
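For context: this particular error is usually raised by Hadoop's TokenCache when the client-side configuration does not define the YARN ResourceManager principal, so no token renewer can be chosen. A minimal sketch of one common remedy, assuming stock config-file locations and a hypothetical principal (substitute your cluster's values):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path

  val hadoopConf = new Configuration()
  // Pull in the remote cluster's settings (assumed locations):
  hadoopConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
  hadoopConf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))
  hadoopConf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"))
  // The renewer lookup needs this property (placeholder principal shown):
  hadoopConf.set("yarn.resourcemanager.principal", "yarn/_HOST@EXAMPLE.COM")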

Re: Kerberos setup in Apache Spark connecting to remote HDFS/YARN

2016-06-16 Thread akhandeshi
Rest of the stack trace: [WARNING] java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke

Kerberos setup in Apache Spark connecting to remote HDFS/YARN

2016-06-16 Thread akhandeshi
I am trying to set up my IDE for a Scala Spark application. I want to access HDFS files from a remote Hadoop server that has Kerberos enabled. My understanding is that I should be able to do that from Spark. Here is my code so far: val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
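One approach often suggested for this situation is to log in through Hadoop's UserGroupInformation before creating the SparkContext. A minimal sketch, assuming a keytab is available; the principal and paths are placeholders:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.security.UserGroupInformation
  import org.apache.spark.{SparkConf, SparkContext}

  val hadoopConf = new Configuration()
  hadoopConf.set("hadoop.security.authentication", "kerberos")
  UserGroupInformation.setConfiguration(hadoopConf)
  // Log in from a keytab so HDFS tokens can be obtained without an interactive kinit:
  UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")

  val sparkConf = new SparkConf().setAppName("kerberos-test").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)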

This post has NOT been accepted by the mailing list yet.

2015-10-07 Thread akhandeshi
I seem to see this for many of my posts... does anyone have a solution?

Re: SparkR Error in sparkR.init(master="local") in RStudio

2015-10-06 Thread akhandeshi
It seems it is failing at path <- tempfile(pattern = "backend_port"). I do not see the backend_port directory created...

Re: SparkR Error in sparkR.init(master="local") in RStudio

2015-10-06 Thread akhandeshi
I couldn't get this working... I have JAVA_HOME set and have defined SPARK_HOME (note the backslashes must be doubled in R strings):

  Sys.setenv(SPARK_HOME="c:\\DevTools\\spark-1.5.1")
  .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
  library("SparkR", lib.loc="c:\\DevTools\\spark-1.5.1\\lib")
  library(SparkR)
  sc <- sparkR.init(master="local")

"Loading" status

2015-02-02 Thread akhandeshi
I am not sure what the LOADING status means, followed by RUNNING. In the application UI, under Executor Summary (Executor ID / Worker / Cores / Memory / State / Logs), I see: 1 worker-20150202144112-hadoop-w-1.c.fi-mdd-poc.internal-3887416 83971 LOADING stdout stderr 0 worker-20150202144

ExternalSorter - spilling in-memory map

2015-01-13 Thread akhandeshi
I am using Spark 1.2, and I see a lot of messages like: ExternalSorter: Thread 66 spilling in-memory map of 5.0 MB to disk (13160 times so far). I seem to have a lot of memory: URL: spark://hadoop-m:7077, Workers: 4, Cores: 64 Total (64 Used), Memory: 328.0 GB Total (327.0 GB Used)
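Worth noting: in Spark 1.x the heap available to the shuffle before it spills is bounded by spark.shuffle.memoryFraction (0.2 of the executor heap by default), not by total cluster memory, so frequent small spills can happen even on a lightly loaded cluster. A sketch of one tuning direction; the exact values are assumptions to experiment with:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("shuffle-tuning-sketch")
    // Give the shuffle more room before spilling (Spark 1.x default is 0.2):
    .set("spark.shuffle.memoryFraction", "0.4")
    // Shrink the cache share so both fractions still fit in the heap:
    .set("spark.storage.memoryFraction", "0.4")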

Re: Job getting killed

2015-01-13 Thread akhandeshi
Were you able to resolve this issue? I am seeing a similar problem! It seems to be connected to using OFF_HEAP persist. Thanks, Ami

Re: Any ideas why a few tasks would stall

2014-12-04 Thread akhandeshi
This did not work for me, that is, rdd.coalesce(200, forceShuffle). Does anyone have ideas on how to distribute your data evenly and co-locate partitions of interest?
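For keyed data, one way to both spread records evenly and co-locate the records that belong together is to partition explicitly by key instead of coalescing. A minimal sketch, assuming an existing SparkContext sc; the data and partition count are placeholders:

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark < 1.3

  val pairs = sc.parallelize(1 to 100000).map(i => (i % 1000, 1))
  // All records sharing a key land in one partition, and hashing spreads
  // the distinct keys roughly evenly across the 200 partitions:
  val partitioned = pairs.partitionBy(new HashPartitioner(200))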

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
I think the memory calculation is correct; what I didn't account for is the memory already used. I am still puzzled as to how I can successfully process the RDD in Spark.

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
hmm.. 33.6 GB is the sum of the memory used by the two RDDs that are cached. You're right: when I put serialized RDDs in the cache, the memory footprint for these RDDs becomes a lot smaller. Serialized memory footprint shown below: RDD Name / Storage Level / Cached Partitions / Fraction Cached / S
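For reference, a minimal sketch of serialized caching, assuming an existing SparkContext sc and a placeholder path. MEMORY_ONLY_SER stores each cached partition as a single serialized byte array, which is what shrinks the footprint, at the cost of deserializing on every access:

  import org.apache.spark.storage.StorageLevel

  val cached = sc.textFile("/path/to/data").persist(StorageLevel.MEMORY_ONLY_SER)
  cached.count() // materialize the cache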

Help understanding - Not enough space to cache rdd

2014-12-02 Thread akhandeshi
I am running in local mode on a Google n1-highmem-16 (16 vCPU, 104 GB memory) machine. I have allocated SPARK_DRIVER_MEMORY=95g. I see Memory: 33.6 GB Used (73.7 GB Total) for the executor. In the log output below, I see that 33.6 GB of blocks are used by the 2 RDDs that I have cached.

packaging from source gives protobuf compatibility issues.

2014-12-01 Thread akhandeshi
scala> textFile.count()
java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CompleteRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
I tried ./make-distribution.sh -Dhadoop.version=2.5.0 and /usr/local/apache-

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-11-17 Thread akhandeshi
"only option is to split you problem further by increasing parallelism" My understanding is by increasing the number of partitions, is that right? That didn't seem to help because it is seem the partitions are not uniformly sized. My observation is when I increase the number of partitions, it c

Help with processing multiple RDDs

2014-11-11 Thread akhandeshi
I have been struggling to process a set of RDDs. Conceptually, it is not a large data set. It seems that no matter how much memory I give the JVM, or how I partition, I can't seem to process this data. I am caching the RDD. I have tried persist(disk and memory), persist(memory) and persist(off_heap) with no su
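One pattern that sometimes helps when several cached RDDs compete for the same heap is to release each intermediate explicitly once its downstream result is materialized. A small sketch, assuming an existing SparkContext sc; the data is a placeholder:

  import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark < 1.3
  import org.apache.spark.storage.StorageLevel

  val input = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val step1 = input.persist(StorageLevel.MEMORY_AND_DISK_SER)
  val step2 = step1.reduceByKey(_ + _).persist(StorageLevel.MEMORY_AND_DISK_SER)
  step2.count()     // materialize step2 before dropping its parent
  step1.unpersist() // free step1's blocks so later stages have headroom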

SparkSubmitDriverBootstrapper and JVM parameters

2014-11-06 Thread akhandeshi
/usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper When I execute /usr/local/spark-1.1.0/bin/spark-submit local[32] for my app, I see two processes get spun off. One is org.apache.spark.deploy.SparkSubmitDriverBootstrapper and the other is org.apache.spa

OOM - Requested array size exceeds VM limit

2014-11-03 Thread akhandeshi
I am running local (client). My VM has 16 CPUs/108 GB RAM. My configuration is as follows:

  spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+UseCompressedOops -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+DisableExplicitGC -XX:MaxPermSize=1024m
  spark.daemon.memory=20g
  spark.driver.memory
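For what it's worth, "Requested array size exceeds VM limit" means a single allocation (often the serialized buffer for one partition) tried to exceed the JVM's roughly 2 GB array-length ceiling, so it is usually cured by more, smaller partitions rather than a bigger heap. A sketch, assuming an existing SparkContext sc; the path and partition counts are placeholders:

  // Ask for more input splits up front, or re-split an existing RDD:
  val lines  = sc.textFile("/path/to/big/input", 512)
  val evened = lines.repartition(512)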

Re: "CANNOT FIND ADDRESS"

2014-11-03 Thread akhandeshi
No luck :(! Still observing the same behavior!

Re: "CANNOT FIND ADDRESS"

2014-10-31 Thread akhandeshi
Thanks for the pointers! I did try, but it didn't seem to help... In my latest attempt, I am doing spark-submit local, but I see the same message in the Spark app UI (4040): localhost CANNOT FIND ADDRESS. In the logs, I see a lot of in-memory map spills to disk. I don't understand why that is the case. There

Re: "CANNOT FIND ADDRESS"

2014-10-29 Thread akhandeshi
Thanks... hmm, it seems to be a timeout issue, perhaps? Not sure what is causing it or how to debug. I see the following error message... 14/10/29 13:26:04 ERROR ContextCleaner: Error cleaning broadcast 9 akka.pattern.AskTimeoutException: Timed out at akka.pattern.PromiseActorRef$$anonf
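In the Spark 1.x era, these ContextCleaner timeouts were often worked around by raising the Akka ask timeout, though that treats the symptom rather than the cause. A sketch; the value is an assumption to tune:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("cleaner-timeout-sketch")
    // Default ask timeout was 30 seconds; give slow cleanup RPCs more slack:
    .set("spark.akka.askTimeout", "120")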

Spark Performance

2014-10-29 Thread akhandeshi
I am relatively new to Spark processing. I am using the Spark Java API to process data. I am having trouble processing a data set that I don't think is significantly large. It is joining datasets that are around 3-4 GB each (around 12 GB of data). The workflow is: x = RDD1.keyBy(x).partitionBy(new HashP
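The usual advice for this workflow is to partition both inputs with the same partitioner before joining, so the join itself does not reshuffle either side. A minimal Scala sketch of the idea (the thread uses the Java API; paths, key extractor, and partition count are placeholders):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark < 1.3

  val rdd1 = sc.textFile("/path/to/left")
  val rdd2 = sc.textFile("/path/to/right")
  val part = new HashPartitioner(64)
  // Partition both keyed RDDs identically and cache them, so join() can
  // co-locate matching keys without an extra shuffle of either input:
  val left   = rdd1.keyBy(_.take(8)).partitionBy(part).cache()
  val right  = rdd2.keyBy(_.take(8)).partitionBy(part).cache()
  val joined = left.join(right)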

"CANNOT FIND ADDRESS"

2014-10-29 Thread akhandeshi
The Spark application UI shows "CANNOT FIND ADDRESS" for one of the executors, under Aggregated Metrics by Executor (Executor ID / Address / Task Time / Total Tasks / Failed Tasks / Succeeded Tasks / Input / Shuffle Read / Shuffle

Re: Spark-submit job "Killed"

2014-10-28 Thread akhandeshi
I did have it as rdd.saveAsTextFile("RDD"); and now I have it as: Log.info("RDD Counts" + rdd.persist(StorageLevel.MEMORY_AND_DISK_SER()).count());

Spark-submit job "Killed"

2014-10-28 Thread akhandeshi
I recently started seeing this new problem where spark-submit is terminated with a "Killed" message but no error message indicating what happened. I have enabled logging in the Spark configuration. Has anyone seen this, or does anyone know how to troubleshoot it?