Re: Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-18 Thread abhiguruvayya
Hello Mayur, the 3 in new RangePartitioner(3, partitionedFile) is also a hard-coded value for the number of partitions. Did you find a way I can avoid that? And besides the cluster size, the number of partitions also depends on the input data size. Correct me if I am wrong.
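(A rough sizing sketch in Java of deriving the partition count from input size and cluster resources instead of hard-coding it. All the constants, the `pairs` RDD, and the use of HashPartitioner in place of the RangePartitioner are assumptions for illustration, not the poster's actual code.)

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    // All numbers below are assumptions, not measurements of the actual job.
    long inputBytes       = 8L * 1024 * 1024 * 1024;   // suppose ~8 GB of input
    long bytesPerSplit    = 128L * 1024 * 1024;        // target ~128 MB per partition
    int  numExecutors     = 10;                        // assumed cluster size
    int  coresPerExecutor = 3;

    int bySize  = (int) Math.max(1, inputBytes / bytesPerSplit);
    int byCores = 3 * numExecutors * coresPerExecutor; // rule of thumb quoted later in this digest
    int numPartitions = Math.max(bySize, byCores);

    // The computed value replaces the hard-coded 3. Shown with HashPartitioner here,
    // since a RangePartitioner is awkward to construct from Java; the same integer
    // can be handed to a RangePartitioner on the Scala side.
    JavaPairRDD<String, String> repartitioned =
        pairs.partitionBy(new HashPartitioner(numPartitions)); // `pairs` is a placeholder RDD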

Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-15 Thread abhiguruvayya
My use case is as follows: 1. Read the input data from the local file system using sparkContext.textFile(input path). 2. Partition the input data (80 million records) using RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer function. Without using coalesce() or rep...

Re: Spark job tracker.

2014-08-04 Thread abhiguruvayya
I am trying to create an asynchronous thread using a Java ExecutorService and launch the JavaSparkContext in that thread, but it is failing with exit code 0 (zero). I basically want to submit the Spark job in one thread and continue doing something else after submitting. Any help on this? Thanks.
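(A minimal sketch of that pattern, assuming the actual transformations live in a method called runJob; that method, the app name, and the single-threaded pool are placeholders. One thing worth checking in this situation is whether the driver returns before the background job finishes; blocking on the Future before shutdown keeps it alive.)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BackgroundSubmit {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();

            // Run the whole Spark job on a background thread.
            Future<?> handle = pool.submit(new Runnable() {
                @Override
                public void run() {
                    SparkConf conf = new SparkConf().setAppName("background-job");
                    JavaSparkContext sc = new JavaSparkContext(conf);
                    try {
                        runJob(sc);   // hypothetical method holding the real transformations
                    } finally {
                        sc.stop();    // keep the context alive until the job is done
                    }
                }
            });

            // ... the main thread is free to do other work here ...

            handle.get();     // block (or poll) when the result is eventually needed
            pool.shutdown();
        }

        private static void runJob(JavaSparkContext sc) {
            // placeholder body
            System.out.println(sc.parallelize(java.util.Arrays.asList(1, 2, 3)).count());
        }
    }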

mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-24 Thread abhiguruvayya
Can anyone help me understand the key differences between the mapToPair, flatMapToPair, and flatMap functions, and when to apply each of them?
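(A rough word-count style sketch of the three, written against the Spark 1.x Java API of this era, where the flatMap variants return an Iterable; in Spark 2.x they return an Iterator instead. `sc` and the input path are placeholders.)

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.PairFlatMapFunction;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    JavaRDD<String> lines = sc.textFile("input.txt");   // placeholder path, `sc` assumed

    // flatMap: one input element -> zero or more output elements, no key/value structure.
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
        public Iterable<String> call(String line) {
            return Arrays.asList(line.split(" "));
        }
    });

    // mapToPair: one input element -> exactly one (key, value) pair.
    JavaPairRDD<String, Integer> ones =
        words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

    // flatMapToPair: one input element -> zero or more (key, value) pairs in a single step.
    JavaPairRDD<String, Integer> onesDirect =
        lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
            public Iterable<Tuple2<String, Integer>> call(String line) {
                List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
                for (String w : line.split(" ")) {
                    out.add(new Tuple2<String, Integer>(w, 1));
                }
                return out;
            }
        });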

Re: Spark job tracker.

2014-07-23 Thread abhiguruvayya
Is there anything equivalent to the Hadoop "Job" (org.apache.hadoop.mapreduce.Job) in Spark? Once I submit the Spark job I want to concurrently read the SparkListener interface implementation methods where I can grab the job status. I am trying to find a way to wrap the Spark submit object into one t...
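(The closest analogue discussed in this thread is a SparkListener registered on the driver. A rough sketch follows; the class name and message formats are mine, and it is written against the newer API where SparkListener is a base class with no-op defaults — older releases may require implementing every callback or using an adapter class.)

    import org.apache.spark.scheduler.SparkListener;
    import org.apache.spark.scheduler.SparkListenerJobEnd;
    import org.apache.spark.scheduler.SparkListenerJobStart;
    import org.apache.spark.scheduler.SparkListenerStageCompleted;

    class JobStatusListener extends SparkListener {
        @Override
        public void onJobStart(SparkListenerJobStart jobStart) {
            System.out.println("Job " + jobStart.jobId() + " started with "
                + jobStart.stageInfos().size() + " stages");
        }

        @Override
        public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
            System.out.println("Stage " + stageCompleted.stageInfo().stageId()
                + " completed (" + stageCompleted.stageInfo().numTasks() + " tasks)");
        }

        @Override
        public void onJobEnd(SparkListenerJobEnd jobEnd) {
            System.out.println("Job " + jobEnd.jobId() + " finished: " + jobEnd.jobResult());
        }
    }

    // Registration on the driver, before any actions run:
    // sc.sc().addSparkListener(new JobStatusListener());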

Re: Spark job tracker.

2014-07-22 Thread abhiguruvayya
Thanks, I am able to load the file now. Can I turn off specific logs using log4j.properties? I don't want to see the logs below. How can I do this? 14/07/22 14:01:24 INFO scheduler.TaskSetManager: Starting task 2.0:129 as TID 129 on executor 3: ** (NODE_LOCAL) 14/07/22 14:01:24 INFO scheduler.T...
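(Spark ships a template at conf/log4j.properties.template; copying it to conf/log4j.properties and raising the noisy loggers above INFO is the usual route. A sketch — the logger names match the lines quoted above, adjust to whatever classes are flooding your output.)

    # Based on conf/log4j.properties.template, with the scheduler loggers quieted.
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # Raise these above INFO so the per-task "Starting task ..." lines disappear.
    log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
    log4j.logger.org.apache.spark.scheduler.TaskSchedulerImpl=WARN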

Re: Spark job tracker.

2014-07-22 Thread abhiguruvayya
I fixed the yarn-client mode error that I mentioned in my earlier post. Now I want to edit log4j.properties to filter some of the unnecessary logs. Can you let me know where I can find this properties file?

Need info on log4j.properties for apache spark.

2014-07-22 Thread abhiguruvayya
Hello all, I need to edit log4j.properties to filter some of the unnecessary logs in Spark on yarn-client mode. I am not sure where to find the log4j.properties file (its location). Can anyone help me with this?

Re: Spark job tracker.

2014-07-21 Thread abhiguruvayya
Also, I am facing one issue. If I run the program in yarn-cluster mode it works absolutely fine, but if I change it to yarn-client mode I get the error below. Application application_1405471266091_0055 failed 2 times due to AM Container for appattempt_1405471266091_0055_02 exited with exitCo...

Re: Spark job tracker.

2014-07-20 Thread abhiguruvayya
Hello Marcelo Vanzin, can you explain a bit more about this? I tried using client mode, but can you explain how I can use this port to write the log or output to it? Thanks in advance!

Re: Spark job tracker.

2014-07-10 Thread abhiguruvayya
Hi Mayur, thanks so much for the explanation; it did help me. Is there a way I can display these details on the console rather than in the log? As of now, once I start my application I see this: 14/07/10 00:48:20 INFO yarn.Client: Application report from ASM: application identifier: ap...

Re: Spark job tracker.

2014-07-08 Thread abhiguruvayya
Hello Mayur, how can I implement the methods mentioned below? If you have any clue on this, please let me know. public void onJobStart(SparkListenerJobStart arg0) { } @Override public void onStageCompleted(SparkListenerStageCompleted arg0) { }

Re: Spark job tracker.

2014-07-02 Thread abhiguruvayya
Spark displays job status information on port 4040 using JobProgressListener. Does anyone know how to hook into this port and read the details?

Re: Spark job tracker.

2014-06-27 Thread abhiguruvayya
I know this is a very trivial question to ask, but I'm a complete newbie to this stuff so I don't have any clue about it. Any help is much appreciated. For example, if I have a class like the one below, when I run it through the command line I want to see the progress status, something like: 10% completed...

Re: Spark job tracker.

2014-06-27 Thread abhiguruvayya
Hello Mayur, are you using the SparkListener interface Java API? I tried using it but was unsuccessful, so I need a few more inputs.

Re: Spark job tracker.

2014-06-26 Thread abhiguruvayya
I don't want to track it on the cluster UI. Once I launch the job I would like to print the status.

Spark job tracker.

2014-06-26 Thread abhiguruvayya
How do I track map/reduce tasks in real time? In Hadoop MapReduce I do this by creating a Job and printing the status of the running application in real time. Is there a similar way to do this in Spark? Please let me know.

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread abhiguruvayya
Does JavaPairRDD.saveAsHadoopFile store data as a sequence file? If so, what is the significance of RDD.saveAsSequenceFile?
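(A hedged Java sketch of writing a JavaPairRDD as a sequence file through saveAsHadoopFile with SequenceFileOutputFormat; `counts` and the output path are placeholders, and keys/values must be Hadoop Writable types. saveAsSequenceFile on the Scala RDD API is essentially shorthand that performs the Writable conversion for you.)

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // Convert a plain (String, Integer) pair RDD into Writable key/value types first.
    JavaPairRDD<Text, IntWritable> writables =
        counts.mapToPair(new PairFunction<Tuple2<String, Integer>, Text, IntWritable>() {
            public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> t) {
                return new Tuple2<Text, IntWritable>(new Text(t._1()), new IntWritable(t._2()));
            }
        });

    // Writing with SequenceFileOutputFormat produces a Hadoop sequence file.
    writables.saveAsHadoopFile(
        "hdfs:///tmp/output-seq",        // placeholder output path
        Text.class,
        IntWritable.class,
        SequenceFileOutputFormat.class);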

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread abhiguruvayya
Any inputs on this would be helpful.

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread abhiguruvayya
No. My understanding from reading the code is that RDD.saveAsObjectFile uses Java serialization, while RDD.saveAsSequenceFile uses Writable, which is tied to the Writable serialization framework in HDFS.

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-19 Thread abhiguruvayya
1. Once you have generated the final RDD, before submitting it to the reducer try to repartition it into a known number of partitions using either coalesce(partitions) or repartition(). 2. Rule of thumb for the number of data partitions: 3 * num_executors * cores_per_executor.
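(The rule of thumb spelled out with assumed numbers; the executor and core counts, and `finalRdd`, are examples only.)

    // 10 executors x 3 cores each, so the rule of thumb gives 3 * 10 * 3 = 90 partitions.
    int numExecutors     = 10;   // assumed
    int coresPerExecutor = 3;    // assumed
    int numPartitions    = 3 * numExecutors * coresPerExecutor;   // = 90

    // repartition() shuffles into exactly this many partitions; coalesce(n) only
    // merges existing partitions downward unless its shuffle flag is set.
    JavaRDD<String> balanced = finalRdd.repartition(numPartitions);   // `finalRdd` is a placeholder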

How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread abhiguruvayya
I want to store a JavaRDD as a sequence file instead of a text file, but I don't see any Java API for that. Is there a way to do this? Please let me know. Thanks!

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
Perfect!! That makes so much sense to me now. Thanks a ton.

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I found the main reason to be that I was using coalesce instead of repartition. coalesce was shrinking the partitioning, so there were too few tasks to be executed by all of the executors. Can you help me understand when to use coalesce and when to use repartition? In my application, coale...
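(The distinction, sketched on a hypothetical `rdd` that currently has 100 partitions.)

    // coalesce(n) is a narrow transformation: it merges existing partitions without a
    // shuffle, so it can only reduce the partition count (and may leave the data skewed).
    JavaRDD<String> fewer = rdd.coalesce(10);

    // Passing shuffle = true makes coalesce behave like a full shuffle.
    JavaRDD<String> fewerBalanced = rdd.coalesce(10, true);

    // repartition(n) always shuffles, can grow or shrink the count, and spreads the data
    // evenly -- which is why it kept all executors busy where coalesce did not.
    JavaRDD<String> more = rdd.repartition(300);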

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
My use case was to read 3000 files from 3000 different HDFS directories, so I was reading each file, creating an RDD, adding it to an array of JavaRDD, and then doing a union(rdd...). Because of this my program was very slow (5 minutes). After I replaced this logic with textFile(path1,path2,path3) it is working...
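(For reference, a sketch of the faster pattern; the directory list and `sc` are placeholders. textFile accepts a comma-separated list of paths, so all 3000 directories end up in a single RDD instead of 3000 small RDDs glued together by union.)

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    // `dirs` stands for the 3000 HDFS directory paths; only three are shown here.
    List<String> dirs = Arrays.asList("hdfs:///data/dir1", "hdfs:///data/dir2", "hdfs:///data/dir3");

    // Join the paths with commas so one textFile call builds a single RDD whose
    // splits come from every directory.
    StringBuilder joined = new StringBuilder();
    for (String d : dirs) {
        if (joined.length() > 0) joined.append(",");
        joined.append(d);
    }
    JavaRDD<String> all = sc.textFile(joined.toString());   // `sc` is an existing JavaSparkContext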

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I did try creating more partitions by overriding the default number of partitions determined by the HDFS splits. The problem is that in this case the program runs forever. I have the same set of inputs for MapReduce and Spark; where MapReduce takes 2 minutes, Spark takes 5 minutes to complete the job. I though...

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
Can someone help me with this? Any help is appreciated.

Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I am creating around 10 executors with 12 cores and 7g of memory, but when I launch a task not all executors are being used. For example, if my job has 9 tasks, only 3 executors are used with 3 tasks each, and I believe this is making my app slower than the MapReduce program for the same use case. Can...

Re: Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-17 Thread abhiguruvayya
Try repartitioning the RDD using coalesce(int partitions) before performing any transforms.