Re: cutting 1.6.3 release candidate

2016-10-14 Thread Alexander Pivovarov
Hi Reynold, Spark 1.6.x has a serious bug related to shuffle functionality: https://issues.apache.org/jira/browse/SPARK-14560 https://issues.apache.org/jira/browse/SPARK-4452 Shuffle throws OOM under heavy load. I've seen this error several times in my heavy jobs: java.lang.OutOfMemoryError: Unable to

Re: cutting 1.6.3 release candidate

2016-10-14 Thread Alexander Pivovarov
Also, can you include the MaxPermSize fix in spark-1.6.3? https://issues.apache.org/jira/browse/SPARK-15067 Literally just one word needs to be replaced: https://github.com/apache/spark/pull/12985/files On Fri, Oct 14, 2016 at 1:57 PM, Alexander Pivovarov wrote: > Hi Reynold > > Spark

How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-11-25 Thread Alexander Pivovarov
Hi Everyone, I noticed that the spark-ec2 script is outdated. How do I add 1.5.2 support to ec2/spark_ec2.py? What else (besides updating the Spark version in the script) should be done to add 1.5.2 support? We also need to update Scala to 2.10.4 (currently it's 2.10.3). Alex

Re: How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-11-30 Thread Alexander Pivovarov
Just want to follow up. On Nov 25, 2015 9:19 PM, "Alexander Pivovarov" wrote: > Hi Everyone > > I noticed that spark ec2 script is outdated. > How to add 1.5.2 support to ec2/spark_ec2.py? > What else (except of updating spark version in the script) should be done >

Re: How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-12-01 Thread Alexander Pivovarov
> https://github.com/apache/spark/commit/97956669053646f00131073358e53b05d0c3d5d0#diff-ada66bbeb2f1327b508232ef6c3805a5 > to the master branch as well > > Thanks > Shivaram > > > > On Mon, Nov 30, 2015 at 11:38 PM, Alexander Pivovarov > wrote: > > just want to

Re: How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-12-01 Thread Alexander Pivovarov
Did it 6 min ago. On Tue, Dec 1, 2015 at 12:49 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > Yeah - that needs to be changed as well. Could you send a PR to fix this ? > > Shivaram > > On Tue, Dec 1, 2015 at 12:32 AM, Alexander Pivovarov > wrote: > &

[no subject]

2015-12-01 Thread Alexander Pivovarov

Spark fails after 6000s because of akka

2015-12-20 Thread Alexander Pivovarov
I run Spark 1.5.2 on YARN (EMR). I noticed that my long-running jobs always failed after 1 h 40 min (6000 s) with the exceptions below. Then I found that Spark has spark.akka.heartbeat.pauses=6000s by default. I changed the settings to the following and it solved my issue. "spark.akka.heartbeat.pau
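
The exact values used are cut off above; a minimal sketch of raising these Akka timeouts through SparkConf (these spark.akka.* settings apply to the Spark 1.x line) could look like this, with illustrative numbers only:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values: keep the heartbeat interval well below the tolerated pause
val conf = new SparkConf()
  .setAppName("long-running-job")
  .set("spark.akka.heartbeat.pauses", "60000s")
  .set("spark.akka.heartbeat.interval", "1000s")

val sc = new SparkContext(conf)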

Re: Spark fails after 6000s because of akka

2015-12-20 Thread Alexander Pivovarov
more On Sun, Dec 20, 2015 at 10:42 AM, Alexander Pivovarov wrote: > I run Spark 1.5.2 on YARN (EMR) > > I noticed that my long running jobs always failed after 1h 40 min (6000s) > with the exceptions below. > > Then I found that Spark has spark.akka.heartbeat.pauses=6000s

Re: Spark fails after 6000s because of akka

2015-12-20 Thread Alexander Pivovarov
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Sun, Dec 20, 2015 at 11:28 AM, Alexander Pivovarov wrote: > it can also fail with the following message > > Exception in thread "main" org.apache.spar

Re: Spark fails after 6000s because of akka

2015-12-20 Thread Alexander Pivovarov
it$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Command exiting with ret '1' On Sun, Dec 20, 2015 at 11:29 AM, Alexander Pivovarov wrote: > Or this m

Re: Spark fails after 6000s because of akka

2015-12-20 Thread Alexander Pivovarov
scover / track? Thanks! > > On Sun, Dec 20, 2015 at 11:35 AM Alexander Pivovarov > wrote: > >> Usually Spark EMR job fails with the following exception in 1 hour 40 min >> - Job cancelled because SparkContext was shut down >> >> java.util.c

GraphX does not unpersist RDDs

2016-01-04 Thread Alexander Pivovarov
// open spark-shell 1.5.2 // run import org.apache.spark.graphx._ val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1) val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1) val g0 = Graph(vert, edges) val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2) val
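
The preview above is cut off; a fuller reproduction, plus one way to see which RDDs stay cached afterwards, could be sketched like this in the 1.5/1.6 spark-shell (the counts and the getPersistentRDDs check are additions not shown in the original mail):

import org.apache.spark.graphx._

val vert  = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1)
val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1)
val g0 = Graph(vert, edges)
val g  = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2)

g.vertices.count()   // materialize the partitioned graph
g.edges.count()

g.unpersist(blocking = false)    // explicitly drop both graphs ...
g0.unpersist(blocking = false)

// ... then list what the context still tracks as persistent; SPARK-12655 reports
// that intermediate vertex/edge RDDs linger here
sc.getPersistentRDDs.foreach { case (id, rdd) => println(s"$id -> $rdd") }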

Re: GraphX does not unpersist RDDs

2016-01-06 Thread Alexander Pivovarov
The same issue exists in spark-1.6.0; I've opened a Jira ticket for it: https://issues.apache.org/jira/browse/SPARK-12655 On Mon, Jan 4, 2016 at 11:30 PM, Alexander Pivovarov wrote: > // open spark-shell 1.5.2 > // run > > import org.apache.spark.graphx._ > > val vert =

spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Alexander Pivovarov
Let's say that YARN has 53 GB of memory available on each slave. The Spark AM container needs 896 MB (512 + 384). I see two options to configure Spark: 1. Configure Spark executors to use 52 GB and leave 1 GB on each box. So some box will also run the AM container, and 1 GB of memory will not be used on all slave
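
As a back-of-the-envelope sketch of the two options (numbers from the mail: 53 GB of YARN memory per slave; 512 + 384 MB matches the Spark 1.x default spark.yarn.am.memory plus the minimum memory overhead; executors' own memoryOverhead ignored for simplicity):

val yarnMemPerNodeMb = 53 * 1024
val amContainerMb    = 512 + 384                 // = 896

// Option 1: executors use 52 GB, so the AM fits next to one of them,
// but ~1 GB stays idle on every other node
val executorOption1Mb = yarnMemPerNodeMb - 1024  // 52 GB

// Option 2: executors take the whole node, so the 896 MB AM ends up
// occupying a node of its own
val executorOption2Mb = yarnMemPerNodeMb         // 53 GB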

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
> > be worse than wasting a lot of memory on a single node. > > > > So yeah, I also don't like either option. Is this just the price you pay > for > > running on YARN? > > > > > > ~ Jonathan > > > > On Mon, Feb 8, 2016 at 9:03 PM Alexand

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
or > the AM as the only way to satisfy the request. > > On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov > wrote: > > If I add additional small box to the cluster can I configure yarn to > select > > small box to run am container? > > > > > > On Mon, Fe

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
an Owen" wrote: > >> > >> If it's too small to run an executor, I'd think it would be chosen for > >> the AM as the only way to satisfy the request. > >> > >> On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov > >> wrote: > >

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
It would be nice if we had an executorsPerBox setting in EMR. I have a case where I need to run 2 or 4 executors on r3.2xlarge. On Tue, Feb 9, 2016 at 9:56 AM, Alexander Pivovarov wrote: > I use hadoop 2.7.1 > On Feb 9, 2016 9:54 AM, "Marcelo Vanzin" wrote: > >>

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
I mean Jonathan On Tue, Feb 9, 2016 at 10:41 AM, Alexander Pivovarov wrote: > I decided to do YARN over-commit and add 896 > to yarn.nodemanager.resource.memory-mb > it was 54,272 > now I set it to 54,272+896 = 55,168 > > Kelly, can I ask you couple questions > 1. it is

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
our >> version of YARN supports node labels (and you've added a label to the >> node where you want the AM to run). >> >> On Tue, Feb 9, 2016 at 9:51 AM, Alexander Pivovarov >> wrote: >> > Am container starts first and yarn selects random computer

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
Can you add the ability to set custom YARN labels instead, or in addition to that? On Feb 9, 2016 3:28 PM, "Jonathan Kelly" wrote: > Oh, sheesh, how silly of me. I copied and pasted that setting name without > even noticing the "mapreduce" in it. Yes, I guess that would mean that > Spark AMs are probably r
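
If the YARN version in use supports node labels, Spark can already target the AM at a labeled node via spark.yarn.am.nodeLabelExpression (Spark 1.6+). A sketch, assuming a hypothetical label named "small" has been attached to the extra small box:

import org.apache.spark.SparkConf

// "small" must already exist in YARN and be assigned to the node meant for the AM.
// In yarn-cluster mode pass it on spark-submit instead
// (--conf spark.yarn.am.nodeLabelExpression=small), since the AM starts before user code runs.
val conf = new SparkConf()
  .set("spark.yarn.am.nodeLabelExpression", "small")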

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
rations parameter of > http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_InstanceGroupConfig.html. > Unfortunately, it's not currently possible to specify per-instance-group > configurations via the CLI though, only cluster wide configurations. > > ~ Jonathan > > > On T

spark yarn exec container fails if yarn.nodemanager.local-dirs value starts with file://

2016-02-27 Thread Alexander Pivovarov
The Spark YARN executor container fails if yarn.nodemanager.local-dirs starts with file://, e.g. yarn.nodemanager.local-dirs = file:///data01/yarn/nm,file:///data02/yarn/nm. Other applications, e.g. Hadoop MR and Hive, work normally. Spark works only if yarn.nodemanager.local-dirs does not hav

What should be spark.local.dir in spark on yarn?

2016-02-29 Thread Alexander Pivovarov
I have Spark on YARN. I defined yarn.nodemanager.local-dirs to be /data01/yarn/nm,/data02/yarn/nm. When I look at the YARN executor container log I see that block manager files are created in /data01/yarn/nm,/data02/yarn/nm. But output files to upload to S3 are still created in /tmp on the slaves. I do not want Spa
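
On YARN (Spark 1.0+), executor scratch space, including the block manager directories, comes from the directories YARN hands to each container, so spark.local.dir is overridden there. The /tmp files seen during S3 uploads are a different knob; one common suspect is the S3 connector's local buffer directory. A sketch under that assumption (spark-shell style, property depends on which S3 filesystem is in use):

// Assumption: the /tmp files are the S3 connector buffering output locally before upload.
// Point the buffer at the big local disks instead:
sc.hadoopConfiguration.set("fs.s3.buffer.dir",  "/data01/tmp,/data02/tmp")   // s3 / s3n
sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/data01/tmp,/data02/tmp")   // s3a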

Re: What should be spark.local.dir in spark on yarn?

2016-03-01 Thread Alexander Pivovarov
data. What do you mean "But output files to > upload to s3 still created in /tmp on slaves" ? You should have control on > where to store your output data if that means your job's output. > > On Tue, Mar 1, 2016 at 3:12 AM, Alexander Pivovarov > wrote: > >> I

Will not store rdd_16_4383 as it would require dropping another block from the same RDD

2016-04-15 Thread Alexander Pivovarov
I run Spark 1.6.1 on YARN (EMR-4.5.0). I call RDD.count on a MEMORY_ONLY_SER cached RDD (spark.serializer is KryoSerializer). After the count task is done I noticed that the Spark UI shows the RDD's Fraction Cached as only 6%, with Size in Memory = 65.3 GB. I looked at the executors' stderr on the Spark UI and saw lots of
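
The "Will not store rdd_..." messages mean the executors' storage memory is full, so further blocks of that RDD are dropped rather than evicting its earlier blocks. A hedged sketch of the usual levers in Spark 1.6's unified memory model (values illustrative; bigger executors are often the real fix):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "20g")          // illustrative: more heap means more storage memory
  .set("spark.memory.storageFraction", "0.6")   // share of the unified region protected from eviction (default 0.5)

// Or let overflow blocks go to local disk instead of being dropped entirely:
// rdd.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)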

Spark uses disk instead of memory to store RDD blocks

2016-05-12 Thread Alexander Pivovarov
Hello Everyone, I use Spark 1.6.0 on YARN (EMR-4.3.0). I use the MEMORY_AND_DISK_SER StorageLevel for my RDD, and I use the Kryo serializer. I noticed that Spark uses disk to store some RDD blocks even though the executors have lots of memory available. See the screenshot http://postimg.org/image/gxpsw1fk1/ Any ide

Re: Spark uses disk instead of memory to store RDD blocks

2016-05-12 Thread Alexander Pivovarov
> spills > some blocks of RDDs into disk when execution memory runs short. > > // maropu > > On Fri, May 13, 2016 at 6:16 AM, Alexander Pivovarov > wrote: > >> Hello Everyone >> >> I use Spark 1.6.0 on YARN (EMR-4.3.0) >> >> I use MEMORY_AND
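
As the quoted reply says, cached blocks can be pushed to disk when execution memory runs short, even while total memory looks free. One way to confirm from the shell where each cached RDD's blocks actually live (sc is the usual spark-shell context; getRDDStorageInfo is a developer API):

// Prints, per cached RDD, how many bytes sit in memory versus on local disk
sc.getRDDStorageInfo.foreach { info =>
  println(s"rdd_${info.id} '${info.name}': memory=${info.memSize} B, disk=${info.diskSize} B")
}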

combitedTextFile and CombineTextInputFormat

2016-05-14 Thread Alexander Pivovarov
Hello Everyone, Do you think it would be useful to add a combinedTextFile method (which uses CombineTextInputFormat) to SparkContext? It allows one task to read data from multiple text files and lets you control the number of RDD partitions by setting mapreduce.input.fileinputformat.split.maxsize def combine
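
The original "def combine..." is cut off above; for reference, a sketch of what such a helper could look like today on top of the existing newAPIHadoopFile API (the method and parameter names here are illustrative):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: CombineTextInputFormat packs many small files into each split,
// and the split-size cap controls how many partitions the resulting RDD gets.
// Note: this mutates the SparkContext's shared Hadoop configuration.
def combinedTextFile(sc: SparkContext, path: String, maxSplitBytes: Long): RDD[String] = {
  sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplitBytes)
  sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text])
    .map(_._2.toString)
}

// e.g. combinedTextFile(sc, "s3://bucket/many-small-files/", 256L * 1024 * 1024)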

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Alexander Pivovarov
ariants. > > Note that this is something we are doing automatically in Spark SQL for > file sources (Dataset/DataFrame). > > > On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov > wrote: > >> Hello Everyone >> >> Do you think it would be useful to add c

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Alexander Pivovarov
hao wrote: > From my understanding I think newAPIHadoopFile or hadoopFIle is generic > enough for you to support any InputFormat you wanted. IMO it is not so > necessary to add a new API for this. > > On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov < > apivova...@gmail.com

Dataset API agg question

2016-06-07 Thread Alexander Pivovarov
I'm trying to switch from the RDD API to the Dataset API. My question is about the reduceByKey method; e.g. in the following example I'm trying to rewrite sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10) using the DS API. This is what I have so far: Seq(1->2, 1->5, 3->6).toDS.groupBy(_._1).ag
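
For reference, a Dataset-side equivalent of the reduceByKey line can be sketched as follows (written against the Spark 2.x names groupByKey/reduceGroups; in 1.6 the corresponding calls were groupBy/reduce on GroupedDataset):

val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
import spark.implicits._

// RDD version from the mail
val fromRdd = spark.sparkContext
  .parallelize(Seq(1 -> 2, 1 -> 5, 3 -> 6))
  .reduceByKey(math.max)
  .collect()                                    // Array((1,5), (3,6)), order may vary

// Dataset equivalent: group on the key, keep the pair with the larger value
val fromDs = Seq(1 -> 2, 1 -> 5, 3 -> 6).toDS()
  .groupByKey(_._1)
  .reduceGroups((a, b) => if (a._2 >= b._2) a else b)
  .map { case (_, pair) => pair }
  .collect()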

Re: Dataset API agg question

2016-06-07 Thread Alexander Pivovarov
Ted, it does not work like that; you have to .map(toAB).toDS On Tue, Jun 7, 2016 at 4:07 PM, Ted Yu wrote: > Have you tried the following ? > > Seq(1->2, 1->5, 3->6).toDS("a", "b") > > then you can refer to columns by name. > > FYI > &g

Kryo registration for Tuples?

2016-06-08 Thread Alexander Pivovarov
If my RDD is RDD[(String, (Long, MyClass))], do I need to register classOf[MyClass] and classOf[(Any, Any)], or classOf[MyClass], classOf[(Long, MyClass)] and classOf[(String, (Long, MyClass))]?

Re: Kryo registration for Tuples?

2016-06-08 Thread Alexander Pivovarov
ne 8, 2016, Ted Yu wrote: > >> I think the second group (3 classOf's) should be used. >> >> Cheers >> >> On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov > > wrote: >> >>> if my RDD is RDD[(String, (Long, MyClass))] >>> &g
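
Following the suggestion quoted above (the second group: register the concrete shapes that actually appear in the RDD), a minimal sketch; MyClass is a stand-in for the real class:

import org.apache.spark.SparkConf

class MyClass(val n: Long) extends Serializable   // placeholder for the user's type

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // due to type erasure the two tuple entries both register Tuple2 itself,
  // which is why listing the concrete shapes is cheap
  .registerKryoClasses(Array(
    classOf[MyClass],
    classOf[(Long, MyClass)],
    classOf[(String, (Long, MyClass))]
  ))

// optional: fail fast on anything left unregistered
// conf.set("spark.kryo.registrationRequired", "true")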

rdd.distinct with Partitioner

2016-06-08 Thread Alexander Pivovarov
Most of the RDD methods which shuffle data take a Partitioner as a parameter, but rdd.distinct does not have such a signature. Should I open a PR for that? /** * Return a new RDD containing the distinct elements in this RDD. */ def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null
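
Until such an overload exists, the same effect can be had by hand, since distinct is just a reduceByKey underneath and reduceByKey already accepts a Partitioner; a sketch (distinctWith and its parameters are illustrative names):

import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Keep one representative per element by reducing over a dummy value,
// shuffling with the caller-supplied Partitioner.
def distinctWith[T: ClassTag](rdd: RDD[T], part: Partitioner): RDD[T] =
  rdd.map(x => (x, 1)).reduceByKey(part, (a, _) => a).map(_._1)

// usage, e.g.: distinctWith(myRdd, new org.apache.spark.HashPartitioner(32))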

Re: rdd.distinct with Partitioner

2016-06-08 Thread Alexander Pivovarov
> I have created a notebook and you can check it out. ( > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html > ) > > Best regards, > > Yang > > > On Jun 9, 2016, at 11 AM