Re: An attempt to implement dbscan algorithm on top of Spark

2014-06-12 Thread Vipul Pandey
Great! I was going to implement one of my own - but I may not need to do that any more :) I haven't had a chance to look deep into your code but I would recommend accepting an RDD[(Double, Double)] as well, instead of just a file. val data = IOHelper.readDataset(sc, "/path/to/my/data.csv") And othe
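The suggestion above — letting callers pass an already-constructed RDD of points instead of only a CSV path — could be sketched as below. `IOHelper.readDataset` is the project's own API from the thread; the overload and the `parsePoint` helper shown here are hypothetical illustrations, not the actual library code.

```scala
// Pure parsing core: "x,y" -> (x, y). Usable both by a file-based
// reader and directly on lines the caller already has in hand.
def parsePoint(line: String): (Double, Double) = {
  val Array(x, y) = line.split(",").map(_.trim)
  (x.toDouble, y.toDouble)
}

// Hypothetical overloads in the spirit of the suggestion (Spark types
// shown in comments; not run here):
//   def readDataset(sc: SparkContext, path: String): RDD[(Double, Double)] =
//     sc.textFile(path).map(parsePoint)
//   def readDataset(points: RDD[(Double, Double)]): RDD[(Double, Double)] =
//     points   // caller supplies data from any source, not just a file
```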

AppMaster OOME on YARN

2014-08-21 Thread Vipul Pandey
Hi, I'm running Spark on YARN, carrying out a simple reduceByKey followed by another reduceByKey after some transformations. After completing the first stage my Master runs out of memory. I have 20G assigned to the master, 145 executors (12G each + 4G overhead), around 90k input files, 10+TB d
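One reason a driver/AM heap can blow up at this scale: shuffle bookkeeping such as map-output statuses grows roughly with mappers × reducers, and ~90k input files means ~90k map tasks. The figures below are illustrative arithmetic, not taken from the thread.

```scala
// Rough scaling of shuffle-status bookkeeping held on the driver/AM:
// one entry per (map task, reduce task) pair. Numbers are examples.
def shuffleStatusEntries(mapTasks: Long, reduceTasks: Long): Long =
  mapTasks * reduceTasks

val entries = shuffleStatusEntries(90000L, 2000L)
// 180 million entries; even a few bytes per entry is gigabytes of heap.

// Mitigation sketch (Spark, not run here): shrink the map side first,
// e.g. sc.textFile(path).coalesce(2000).map(...).reduceByKey(...)
```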

Re: AppMaster OOME on YARN

2014-08-22 Thread Vipul Pandey
This is all that I see related to spark.MapOutputTrackerMaster in the master logs after OOME 14/08/21 13:24:45 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-27] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: Java heap space Exception

GraphX : AssertionError

2014-09-10 Thread Vipul Pandey
Hi, I have a small graph with about 3.3M vertices and close to 7.5M edges. It's a pretty innocent graph with a max degree of 8. Unfortunately, graph.triangleCount is failing on me with the exception below. I'm running a spark-shell on CDH5.1 with the following params : SPARK_DRIVER_MEM=10g A

Re: LZO support in Spark 1.0.0 - nothing seems to work

2014-09-17 Thread Vipul Pandey
It works for me : export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/l
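Collected into one snippet for readability; the parcel path is the CDH layout quoted in the thread, so adjust it to wherever your native LZO libraries actually live.

```shell
# Native LZO libraries for Spark on a CDH cluster. The parcel path is
# taken from the thread above -- point it at your own install if it
# differs (it should contain libgplcompression.so).
LZO_NATIVE=/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$LZO_NATIVE
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LZO_NATIVE
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$LZO_NATIVE
# The hadoop-lzo jar must also be on the Spark classpath; the exact
# mechanism (SPARK_CLASSPATH, --jars, ...) depends on your Spark version.
```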

Re: Problem with giving memory to executors on YARN

2014-09-19 Thread Vipul Pandey
How many cores do you have in your boxes? Looks like you are assigning 32 cores "per" executor - is that what you want? Are there other applications running on the cluster? You might want to check the YARN UI to see how many containers are getting allocated to your application. On Sep 19, 2014, a

Re: Logging in Spark through YARN.

2014-09-24 Thread Vipul Pandey
Archit, Are you able to get it to work with 1.0.0? I tried the --files suggestion from Marcelo and it just changed logging for my client and the appmaster and executors were still the same. ~Vipul On Jul 30, 2014, at 9:59 PM, Archit Thakur wrote: > Hi Marcelo, > > Thanks for your quick comme

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-21 Thread Vipul Pandey
Any word on this one? I would like to get this done as well. Although, my real use case is to do something on each executor right at the beginning - and I was trying to hack it using broadcasts, by broadcasting an object of my own and doing whatever I want in the readObject method. Any other way
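A common alternative to the broadcast/readObject hack described above is a lazy JVM singleton: each executor is its own JVM, so a `lazy val` body runs at most once per executor, the first time any task on it is touched. This is a sketch under that assumption; the setup body is a stand-in for whatever per-executor work you need.

```scala
// Per-executor one-time initialization via a lazy singleton. The body
// of `init` runs at most once per JVM (i.e. once per executor).
object ExecutorSetup {
  private var runs = 0
  lazy val init: Unit = {
    runs += 1          // stand-in: load config, open a native library,
                       // warm a cache, etc.
  }
  def timesRun: Int = runs
}

// In a Spark job (not run here) you would touch it from each partition:
//   rdd.mapPartitions { it => ExecutorSetup.init; it.map(f) }
```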

Re: Partition sorting by Spark framework

2014-11-05 Thread Vipul Pandey
One option is that, after partitioning, you call setKeyOrdering explicitly on a new ShuffledRDD : val rdd = ... // your rdd val srdd = new org.apache.spark.rdd.ShuffledRDD(rdd, rdd.partitioner.get).setKeyOrdering(Ordering[Int]) // assuming the key type is Int Give it a try and see if it works. I have
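What `setKeyOrdering` buys you can be simulated without Spark: hash-partition key/value pairs, then sort each partition by key with an `Ordering`. This is an illustrative model of the behavior, not Spark's actual ShuffledRDD code.

```scala
// Simulate "partition, then sort each partition by key": the effect a
// ShuffledRDD has when a key ordering is supplied.
def partitionAndSort[V](data: Seq[(Int, V)], numPartitions: Int,
                        ord: Ordering[Int] = Ordering[Int]): Vector[Seq[(Int, V)]] =
  data
    .groupBy { case (k, _) => Math.floorMod(k.hashCode, numPartitions) } // hash partitioner
    .toVector.sortBy(_._1)                                               // order partitions by index
    .map { case (_, part) => part.sortBy(_._1)(ord) }                    // sort keys within a partition
```

Passing `Ordering[Int].reverse` as `ord` would give descending keys per partition, mirroring what a custom ordering passed to `setKeyOrdering` would do.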

Re: Lzo + Protobuf

2014-03-12 Thread Vipul Pandey
f); As you can see this is just a kluge to get things running. Is there a neater way to write out the original "myRDD" as block compressed lzo? Thanks, Vipul On Jan 29, 2014, at 9:40 AM, Issac Buenrostro wrote: > Good! I'll keep your experience in mind in case we have prob

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-30 Thread Vipul Pandey
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue. Any word on this one? On Mar 27, 2014, at 6:41 PM, Kanwaldeep wrote: > We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar

batching the output

2014-03-30 Thread Vipul Pandey
Hi, I need to batch the values in my final RDD before writing out to hdfs. The idea is to batch multiple "rows" into a protobuf and write those batches out - mostly to save some space, as a lot of the metadata is the same. e.g. given 1,2,3,4,5,6 just batch them as (1,2), (3,4), (5,6) and save three records ins
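The batching described above has a pure core in `Iterator.grouped`; in Spark you would apply it per partition so each batch stays local to one task. The `buildProtobufBatch` name in the comment is a hypothetical stand-in for the sender's protobuf builder.

```scala
// Batch consecutive values in fixed-size groups, e.g. pairs. The pure
// core is Iterator.grouped; Spark usage is shown in the comment below.
def batchValues[T](values: Iterator[T], size: Int): Iterator[Seq[T]] =
  values.grouped(size).map(_.toSeq)

// batchValues(Iterator(1, 2, 3, 4, 5, 6), 2).toList
//   -> List(Seq(1, 2), Seq(3, 4), Seq(5, 6))

// On an RDD (not run here), keeping each batch within a partition:
//   rdd.mapPartitions(_.grouped(2).map(buildProtobufBatch))
```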

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
park version and other libraries. > > - Patrick > > > On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey wrote: > I'm using ScalaBuff (which depends on protobuf2.5) and facing the same issue. > any word on this one? > On Mar 27, 2014, at 6:41 PM, Kanwaldeep wrote: &g

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58) at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354) On Apr 1, 2014, at 12:53 AM, Vipul Pandey wrote: >> Spark now shades i

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly That's all I do. On Apr 1, 2014, at 11:41 AM, Patrick Wendell wrote: > Vidal - could you show exactly what flags/commands you are using when you > build spark to produce this assembly? > > > On Tue, Apr 1, 2014 at 12

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly On Apr 1, 2014, at 4:13 PM, Patrick Wendell wrote: > Do you get the same problem if you build with maven? > > > On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey wrote: > SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-02 Thread Vipul Pandey
/MessageLiteOrBuilder.class 1112 Wed Apr 02 00:20:00 PDT 2014 com/google/protobuf/MessageOrBuilder.class On Apr 1, 2014, at 11:44 PM, Patrick Wendell wrote: > It's this: mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean package > > > On Tue, Apr 1, 2014 at 5:15 PM, Vi

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-03 Thread Vipul Pandey
Any word on this one ? On Apr 2, 2014, at 12:26 AM, Vipul Pandey wrote: > I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus > generated also has both shaded and real version of protobuf classes > > Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv

Re: different in spark on yarn mode and standalone mode

2014-05-15 Thread Vipul Pandey
So here's a followup question : What's the preferred mode? We have a new cluster coming up with petabytes of data and we intend to take Spark to production. We are trying to figure out which mode would be safe and stable for a production-like environment. Pros and cons? Anyone? Any reasons why o

Re: different in spark on yarn mode and standalone mode

2014-05-16 Thread Vipul Pandey
And I thought I sent it to the right list! Here you go again - Question below : On May 14, 2014, at 3:06 PM, Vipul Pandey wrote: > So here's a followup question : What's the preferred mode? > We have a new cluster coming up with petabytes of data and we intend to take >

Re: different in spark on yarn mode and standalone mode

2014-05-16 Thread Vipul Pandey
e number of executors to use. > * YARN is the only cluster manager for Spark that supports security and > Kerberized clusters. > > Some advantages of using standalone: > * It has been around for longer, so it is likely a little more stable. > * Many report faster startup time