Re: Spark driver getting out of memory

2016-07-24 Thread Raghava Mutharaju
Saurav, We have the same issue. Our application runs fine on 32 nodes with 4 cores each and 256 partitions, but gives an OOM on the driver when run on 64 nodes with 512 partitions. Did you find out the reason behind this behavior, or the relation between the number of partitions and driver RAM usage?

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
lly on OutOfMemory error. Analyze it and > share your finding :) > > > > On Wed, Jun 22, 2016 at 4:33 PM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: > >> Ok. Would be able to shed more light on what exact meta data it manages >> and what is the relati

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
age scheduling across all > your executors. I assume with 64 nodes you have more executors as well. > Simple way to test is to increase driver memory. > > On Wed, Jun 22, 2016 at 10:10 AM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: > >> It is an i

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/> > > <http://in.linkedin.com/in/sonalgoyal> > > > > On Wed, Jun 22, 2016 at 9:57 PM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: > >> Hello All, >> >> We have a S

OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
Hello All, We have a Spark cluster where the driver and master are running on the same node. We are using the Spark Standalone cluster manager. If the number of nodes (and partitions) is increased, the same dataset that used to run to completion on a smaller number of nodes now gives an out-of-memory
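
One of the replies above suggests increasing driver memory as a quick first test. A minimal sketch of that experiment, assuming standalone client mode; the 8g figure, the parallelism value, and the input path are illustrative, not taken from the thread:

    // Driver memory must be set before the driver JVM starts, so in client
    // mode pass it to spark-submit rather than setting it in code, e.g.:
    //   spark-submit --driver-memory 8g --class MyApp myapp.jar
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("partition-oom-test")
      .set("spark.executor.memory", "4g")   // executor-side settings can stay in code
    val sc = new SparkContext(conf)

    // The driver tracks per-task metadata for every partition of every stage,
    // so halving the partition count is another way to probe whether the
    // partition count itself is what drives the OOM.
    val data = sc.textFile("hdfs:///path/to/input").repartition(256)
    println(data.count())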

Spark 2.0.0-snapshot: IllegalArgumentException: requirement failed: chunks must be non-empty

2016-05-13 Thread Raghava Mutharaju
(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) On Fri, May 13, 2016 at 6:33 AM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > Thank you for the response. > > I use

Re: sbt for Spark build with Scala 2.11

2016-05-13 Thread Raghava Mutharaju
some modules/profiles for your build. What command did > you use to build ? > > On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: > >> Hello All, >> >> I built Spark from the source code available at >> https://github.com

sbt for Spark build with Scala 2.11

2016-05-12 Thread Raghava Mutharaju
Hello All, I built Spark from the source code available at https://github.com/apache/spark/. Although I did not specify the "-Dscala-2.11" option (to build with Scala 2.11), from the build messages I see that it ended up using Scala 2.11. Now, for my application's sbt build, what should be the Spark version
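
For reference, a minimal build.sbt sketch for an application compiled against a locally built Spark; the Scala and Spark version strings are assumptions and should match what the local build actually produced:

    // build.sbt -- versions are assumptions; align them with the local build output
    scalaVersion := "2.11.8"

    // If the locally built Spark was published with `build/sbt publishLocal`
    // from the Spark source tree, sbt resolves this SNAPSHOT from the local
    // Ivy repository without any extra resolver.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT" % "provided"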

Re: partitioner aware subtract

2016-05-10 Thread Raghava Mutharaju
(leftItr, rightItr) => > leftItr.filter(p => !rightItr.contains(p)) > } > sum.foreach(println) > > > > Regards, > Rishitesh Mishra, > SnappyData . (http://www.snappydata.io/) > > https://in.linkedin.com/in/rishiteshmishra > > On Mon, May 9, 2016 at 7:35

Re: partitioner aware subtract

2016-05-09 Thread Raghava Mutharaju
at 6:27 AM, ayan guha wrote: > How about outer join? > On 9 May 2016 13:18, "Raghava Mutharaju" > wrote: > >> Hello All, >> >> We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key >> (number of partitions are same for both the RDDs).

partitioner aware subtract

2016-05-08 Thread Raghava Mutharaju
Hello All, We have two PairRDDs (rdd1, rdd2) which are hash partitioned on the key (the number of partitions is the same for both RDDs). We would like to subtract rdd2 from rdd1. The subtract code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala seems to
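
A reply above sketches a zipPartitions-based subtract; a self-contained version of that idea, assuming an existing SparkContext sc, sample data, and a partition count of 8:

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(8)
    val rdd1 = sc.parallelize(Seq((1, 2), (3, 4), (5, 6))).partitionBy(part)
    val rdd2 = sc.parallelize(Seq((3, 4))).partitionBy(part)

    // Both RDDs share the same partitioner, so matching keys sit in the same
    // partition index and the subtraction can run partition-by-partition
    // without a shuffle.
    val diff = rdd1.zipPartitions(rdd2, preservesPartitioning = true) { (leftItr, rightItr) =>
      val right = rightItr.toSet
      leftItr.filter(p => !right.contains(p))
    }
    diff.collect().foreach(println)   // expect (1,2) and (5,6)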

Re: executor delay in Spark

2016-04-28 Thread Raghava Mutharaju
ection, which would, I think, only > complicate your trace debugging. > > I've attached a python script that dumps relevant info from the Spark > JSON logs into a CSV for easier analysis in your language of choice; > hopefully it can aid in finer-grained debugging (the headers of the

Re: executor delay in Spark

2016-04-24 Thread Raghava Mutharaju
this other than creating a dummy task to synchronize the executors, but > hopefully someone from there can suggest other possibilities. > > Mike > On Apr 23, 2016 5:53 AM, "Raghava Mutharaju" > wrote: > >> Mike, >> >> It turns out the executor delay, as

Re: executor delay in Spark

2016-04-22 Thread Raghava Mutharaju
a dummy task to synchronize the executors, but > hopefully someone from there can suggest other possibilities. > > Mike > On Apr 23, 2016 5:53 AM, "Raghava Mutharaju" > wrote: > >> Mike, >> >> It turns out the executor delay, as you mentioned, is the cau

executor delay in Spark

2016-04-22 Thread Raghava Mutharaju
> 3. Make the tasks longer, i.e. with some silly computational work. > > Mike > > > On 4/17/16, Raghava Mutharaju wrote: > > Yes, it's the same data. > > > > 1) The number of partitions is the same (8, which is an argument to the > > HashPartitioner). In
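
The "dummy stage" idea mentioned in this thread can be sketched as a trivial job that forces every executor to register and run at least one task before the real work starts; the partition count here is an assumption:

    // Warm-up stage: nothing is kept, it only synchronizes executor start-up
    // so later stage timings are not skewed by late-registering executors.
    sc.parallelize(1 to 10000, 16).map(_ * 2).count()

    // ... real computation follows ...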

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
om/in/sonalgoyal> > > > > On Mon, Apr 18, 2016 at 6:26 PM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: > >> Mike, >> >> We tried that. This map task is actually part of a larger set of >> operations. I pointed out this map task since i

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
onfirm: > >> 1. Examine the starting times of the tasks alongside their > >> executor > >> 2. Make a "dummy" stage execute before your real stages to > >> synchronize the executors by creating and materializing any random RDD

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
max will help. Also, for > 52MB of data you need not have 12GB allocated to executors. Better to > assign 512MB or so and increase the number of executors per worker node. > Try reducing that executor memory to 512MB or so for this case. > > On Mon, Apr 18, 2016 at 9:07 AM,

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
un via Scala program? > > On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: > >> Hello All, >> >> We are using HashPartitioner in the following way on a 3 node cluster (1 >> master and 2 worker nodes). >> >&

strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
Hello All, We are using HashPartitioner in the following way on a 3-node cluster (1 master and 2 worker nodes): val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) } }).partitionBy(new HashParti
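
The preview cuts the statement off; a reconstruction of the full line, where the partition count of 8 is taken from a quoted reply elsewhere in this digest:

    import org.apache.spark.HashPartitioner

    val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
      .map[(Int, Int)](line => line.split("\\|") match {
        case Array(x, y) => (y.toInt, x.toInt)
      })
      .partitionBy(new HashPartitioner(8))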

DataFrames - Kryo registration issue

2016-03-10 Thread Raghava Mutharaju
Hello All, If Kryo serialization is enabled, doesn't Spark take care of registering the built-in classes, i.e., are we only supposed to register our custom classes? When using DataFrames, this does not seem to be the case. I had to register the following classes: conf.registerKryoClasses(Arra
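
The registered class list is cut off in the preview. A hedged sketch of the registration call; MyCaseClass stands in for the application's own types, and whether Spark-internal classes also have to be listed depends on the Spark version and on spark.kryo.registrationRequired being enabled:

    import org.apache.spark.SparkConf

    case class MyCaseClass(x: Int, y: Int)   // stand-in for the application's classes

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // With registrationRequired, every class Kryo touches must be listed
      // explicitly, which is when DataFrame internals start to surface.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(classOf[MyCaseClass]))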

Dataset takes more memory compared to RDD

2016-02-12 Thread Raghava Mutharaju
Hello All, I implemented an algorithm using both the RDD and Dataset APIs (in Spark 1.6). The Dataset version takes a lot more memory than the RDDs. Is this normal? Even for very small input data, it runs out of memory and I get a Java heap exception. I tried the Kryo serializer by registering
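
For Datasets the serialization format is driven by the Encoder rather than by RDD-style Kryo registration; one way to experiment with that is Encoders.kryo, sketched below with an assumed case class and sample data (on Spark 1.6, sqlContext is used rather than a SparkSession):

    import org.apache.spark.sql.{Encoder, Encoders}

    case class Record(x: Long, y: Long)   // assumed shape of the data

    // The default product encoder stores case classes in Spark's compact
    // columnar format; Encoders.kryo instead stores each object as a
    // kryo-serialized blob, which is useful when comparing memory footprints.
    implicit val recordEnc: Encoder[Record] = Encoders.kryo[Record]

    val ds = sqlContext.createDataset(Seq(Record(1L, 2L), Record(3L, 4L)))
    ds.persist()
    println(ds.count())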

Re: Dataset joinWith condition

2016-02-10 Thread Raghava Mutharaju
"...") > e.g. if your DataSet has two columns, you can write: > ds.select(expr("_2 / _1").as[Int]) > > where _1 refers to first column and _2 refers to second. > > On Tue, Feb 9, 2016 at 3:31 PM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: &

Re: Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
> ds1.joinWith(ds2, $"a.value" === $"b.value", "inner"), > > On Tue, Feb 9, 2016 at 7:07 AM, Raghava Mutharaju < > m.vijayaragh...@gmail.com> wrote: > >> Hello All, >> >> joinWith() method in Dataset takes a condition of type C

Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
Hello All, The joinWith() method in Dataset takes a condition of type Column. Without converting a Dataset to a DataFrame, how can we get a specific column? For example: case class Pair(x: Long, y: Long) A and B are Datasets of type Pair, and I want to join A.x with B.y: A.joinWith(B, A.toDF().col("x") == B.
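
The fix the replies above converge on can be sketched end-to-end as follows; the sample data and the implicit-import line are assumptions:

    // Assumes the SQL implicits are in scope, e.g.
    //   import sqlContext.implicits._  (Spark 1.6)  or  import spark.implicits._  (2.x)
    case class Pair(x: Long, y: Long)

    val A = Seq(Pair(1L, 2L), Pair(3L, 4L)).toDS()
    val B = Seq(Pair(5L, 1L)).toDS()

    // Alias each side so the two Pair schemas can be told apart, and use ===
    // (which produces a Column) rather than ==.
    val joined = A.as("a").joinWith(B.as("b"), $"a.x" === $"b.y", "inner")
    joined.show()   // pairs where A.x equals B.y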