Great! I was going to implement one of my own - but I may not need to do that
any more :)
I haven't had a chance to look deeply into your code, but I would recommend
accepting an RDD[(Double, Double)] as well, instead of just a file:
val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
And othe
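Roughly what I had in mind - a hypothetical sketch only (IOHelper's internals and the (Double, Double) element type are my assumptions, not taken from your code):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical shape only: keep the file-based entry point and add one that takes
// an already-built RDD of (x, y) pairs.
object IOHelper {
  def readDataset(sc: SparkContext, path: String): RDD[(Double, Double)] =
    sc.textFile(path).map { line =>
      val Array(x, y) = line.split(",")
      (x.toDouble, y.toDouble)
    }

  // suggested addition: accept the data directly
  def readDataset(data: RDD[(Double, Double)]): RDD[(Double, Double)] = data
}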
Hi,
I'm running Spark on YARN carrying out a simple reduceByKey followed by another
reduceByKey after some transformations. After completing the first stage my
Master runs out of memory.
I have 20G assigned to the master, 145 executors (12G each + 4G overhead),
around 90k input files, 10+TB d
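For reference, a minimal sketch of the job shape (names, types, and the transformation are placeholders, not the actual job):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair RDD functions on Spark 1.x

// Illustrative shape only: two reduceByKey (shuffle) stages with a transformation in between.
def runJob(sc: SparkContext, inputPath: String): Unit = {
  val firstStage = sc.textFile(inputPath)
    .map(line => (line.split(",")(0), 1L))
    .reduceByKey(_ + _)                                   // first shuffle

  val secondStage = firstStage
    .map { case (key, count) => (key.take(3), count) }    // placeholder transformation
    .reduceByKey(_ + _)                                   // second shuffle

  secondStage.saveAsTextFile(inputPath + "_counts")
}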
This is all that I see related to spark.MapOutputTrackerMaster in the master
logs after OOME
14/08/21 13:24:45 ERROR ActorSystemImpl: Uncaught fatal error from thread
[spark-akka.actor.default-dispatcher-27] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: Java heap space
Exception
Hi,
I have a small graph with about 3.3M vertices and close to 7.5M edges. It's a
pretty innocent graph with a max degree of 8.
Unfortunately, graph.triangleCount is failing on me with the exception below.
I'm running a spark-shell on CDH5.1 with the following params:
SPARK_DRIVER_MEM=10g A
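For context, the call is just the standard GraphX pattern - a minimal sketch with a placeholder edge-list path; note that triangleCount expects edges in canonical orientation (srcId < dstId) and a partitioned graph:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Minimal sketch (placeholder edge-list path); sc is provided by spark-shell.
// triangleCount requires edges in canonical orientation (srcId < dstId) and a
// partitioned graph.
val graph = GraphLoader
  .edgeListFile(sc, "/path/to/edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triCounts = graph.triangleCount().vertices
triCounts.take(5).foreach(println)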
It works for me:
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/l
How many cores do you have in your boxes?
Looks like you are assigning 32 cores "per" executor - is that what you want?
Are there other applications running on the cluster? You might want to check the
YARN UI to see how many containers are getting allocated to your application.
On Sep 19, 2014, a
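For reference, a hedged sketch of pinning the executor footprint explicitly in code (the numbers are placeholders; the same can be passed as spark-submit flags):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only - values are placeholders and must match what YARN can actually allocate.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.cores", "8")        // cores per executor (not per node)
  .set("spark.executor.memory", "12g")
  .set("spark.executor.instances", "20")   // executors requested from YARN
val sc = new SparkContext(conf)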
Archit, are you able to get it to work with 1.0.0?
I tried the --files suggestion from Marcelo and it just changed logging for my
client; the AppMaster and executors were still the same.
~Vipul
On Jul 30, 2014, at 9:59 PM, Archit Thakur wrote:
> Hi Marcelo,
>
> Thanks for your quick comme
Any word on this one? I would like to get this done as well.
Although, my real use case is to do something on each executor right at the
beginning - and I was trying to hack it using broadcasts, by broadcasting an
object of my own and doing whatever I want in the readObject method.
Any other way
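For what it's worth, a rough sketch of the hack I described (class and method names are made up; it relies on the broadcast value going through Java serialization):

import java.io.ObjectInputStream

// Rough sketch of the kludge: a Serializable payload whose readObject does the
// per-executor setup when the broadcast value is first deserialized on an executor.
// Caveat: this only fires if the broadcast goes through Java serialization (not Kryo).
class ExecutorInit extends Serializable {
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    ExecutorInit.setupOnce()   // hypothetical per-executor initialization
  }
}

object ExecutorInit {
  @volatile private var done = false
  def setupOnce(): Unit = synchronized {
    if (!done) {
      // e.g. load a native library, configure logging, warm a local cache ...
      done = true
    }
  }
}

// usage sketch:
//   val initBc = sc.broadcast(new ExecutorInit)
//   rdd.map { x => initBc.value; x }   // first access on each executor triggers readObject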
One option is that, after partitioning, you call setKeyOrdering explicitly on a new
ShuffledRDD:
val rdd = // your rdd
val srdd = new org.apache.spark.rdd.ShuffledRDD(rdd, rdd.partitioner.get).setKeyOrdering(Ordering[Int]) // assuming the key type is Int
Give it a try and see if it works. I have
f);
As you can see this is just a kludge to get things running. Is there a neater
way to write out the original "myRDD" as block-compressed LZO?
Thanks,
Vipul
On Jan 29, 2014, at 9:40 AM, Issac Buenrostro wrote:
> Good! I'll keep your experience in mind in case we have prob
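For anyone searching later, a hedged sketch of one way to do it directly, assuming a Spark version whose saveAsHadoopFile accepts a codec class and with myRDD taken to be an RDD[String] (path and types are placeholders):

import com.hadoop.compression.lzo.LzoCodec
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.SparkContext._   // pair RDD functions on older Spark versions

// Sketch only: myRDD is assumed to be an RDD[String]; hadoop-lzo must be on the classpath.
// Passing a codec class to saveAsHadoopFile should give BLOCK-compressed SequenceFile
// output - worth verifying on your Spark version.
val pairs = myRDD.map(line => (NullWritable.get(), new Text(line)))

pairs.saveAsHadoopFile(
  "/path/to/output",                        // placeholder path
  classOf[NullWritable],
  classOf[Text],
  classOf[SequenceFileOutputFormat[NullWritable, Text]],
  classOf[LzoCodec])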
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue.
Any word on this one?
On Mar 27, 2014, at 6:41 PM, Kanwaldeep wrote:
> We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with
> Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar
Hi,
I need to batch the values in my final RDD before writing out to HDFS. The idea
is to batch multiple "rows" in a protobuf and write those batches out - mostly
to save some space, as a lot of the metadata is the same.
e.g. given 1,2,3,4,5,6, just batch them as (1,2), (3,4), (5,6) and save three records
ins
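A minimal sketch of one way to do the batching, grouping within partitions (finalRDD, buildBatch and the batch size are placeholders):

// Sketch: group every `batchSize` consecutive values within a partition into one output record.
// finalRDD and buildBatch are placeholders - buildBatch stands in for packing a group of
// rows into a single protobuf message.
val batchSize = 2

val batched = finalRDD.mapPartitions { iter =>
  iter.grouped(batchSize).map(group => buildBatch(group))
}

batched.saveAsObjectFile("/path/to/output")   // or whatever output you actually need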
park version and other libraries.
>
> - Patrick
>
>
> On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey wrote:
> I'm using ScalaBuff (which depends on protobuf2.5) and facing the same issue.
> any word on this one?
> On Mar 27, 2014, at 6:41 PM, Kanwaldeep wrote:
>
org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
        at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
        at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354)
On Apr 1, 2014, at 12:53 AM, Vipul Pandey wrote:
>> Spark now shades i
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly
That's all I do.
On Apr 1, 2014, at 11:41 AM, Patrick Wendell wrote:
> Vidal - could you show exactly what flags/commands you are using when you
> build spark to produce this assembly?
>
>
> On Tue, Apr 1, 2014 at 12
mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly
On Apr 1, 2014, at 4:13 PM, Patrick Wendell wrote:
> Do you get the same problem if you build with maven?
>
>
> On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey wrote:
> SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly
/MessageLiteOrBuilder.class
1112 Wed Apr 02 00:20:00 PDT 2014 com/google/protobuf/MessageOrBuilder.class
On Apr 1, 2014, at 11:44 PM, Patrick Wendell wrote:
> It's this: mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean package
>
>
> On Tue, Apr 1, 2014 at 5:15 PM, Vi
Any word on this one ?
On Apr 2, 2014, at 12:26 AM, Vipul Pandey wrote:
> I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus
> generated also has both shaded and real version of protobuf classes
>
> Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv
So here's a follow-up question: what's the preferred mode?
We have a new cluster coming up with petabytes of data and we intend to take
Spark to production. We are trying to figure out what mode would be safe and
stable for a production-like environment.
Pros and cons? Anyone?
Any reasons why o
And I thought I sent it to the right list! Here you go again - question below:
On May 14, 2014, at 3:06 PM, Vipul Pandey wrote:
> So here's a followup question : What's the preferred mode?
> We have a new cluster coming up with petabytes of data and we intend to take
>
e number of executors to use.
> * YARN is the only cluster manager for Spark that supports security and
> Kerberized clusters.
>
> Some advantages of using standalone:
> * It has been around for longer, so it is likely a little more stable.
> * Many report faster startup time