Hi Reynold
Spark 1.6.x has a serious bug related to shuffle functionality:
https://issues.apache.org/jira/browse/SPARK-14560
https://issues.apache.org/jira/browse/SPARK-4452
The shuffle throws an OOM under heavy load. I've seen this error several times in
my heavy jobs:
java.lang.OutOfMemoryError: Unable to
Also, can you include the MaxPermSize fix in spark-1.6.3?
https://issues.apache.org/jira/browse/SPARK-15067
Literally, just one word needs to be replaced:
https://github.com/apache/spark/pull/12985/files
On Fri, Oct 14, 2016 at 1:57 PM, Alexander Pivovarov
wrote:
> Hi Reynold
>
> Spark
Hi Everyone
I noticed that the spark-ec2 script is outdated.
How to add 1.5.2 support to ec2/spark_ec2.py?
What else (besides updating the Spark version in the script) should be done
to add 1.5.2 support?
We also need to update Scala to 2.10.4 (currently it's 2.10.3).
Alex
Just wanted to follow up.
On Nov 25, 2015 9:19 PM, "Alexander Pivovarov" wrote:
> Hi Everyone
>
> I noticed that spark ec2 script is outdated.
> How to add 1.5.2 support to ec2/spark_ec2.py?
> What else (except of updating spark version in the script) should be done
>
> https://github.com/apache/spark/commit/97956669053646f00131073358e53b05d0c3d5d0#diff-ada66bbeb2f1327b508232ef6c3805a5
> to the master branch as well
>
> Thanks
> Shivaram
>
>
>
> On Mon, Nov 30, 2015 at 11:38 PM, Alexander Pivovarov
> wrote:
> > just want to
Did it 6 min ago.
On Tue, Dec 1, 2015 at 12:49 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:
> Yeah - that needs to be changed as well. Could you send a PR to fix this ?
>
> Shivaram
>
> On Tue, Dec 1, 2015 at 12:32 AM, Alexander Pivovarov
> wrote:
> >
I run Spark 1.5.2 on YARN (EMR)
I noticed that my long-running jobs always failed after 1h 40 min (6000s)
with the exceptions below.
Then I found that Spark has spark.akka.heartbeat.pauses=6000s by default.
I changed the settings to the following and it solved my issue.
"spark.akka.heartbeat.pau
On Sun, Dec 20, 2015 at 10:42 AM, Alexander Pivovarov
wrote:
> I run Spark 1.5.2 on YARN (EMR)
>
> I noticed that my long running jobs always failed after 1h 40 min (6000s)
> with the exceptions below.
>
> Then I found that Spark has spark.akka.heartbeat.pauses=6000s
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
On Sun, Dec 20, 2015 at 11:28 AM, Alexander Pivovarov
wrote:
> it can also fail with the following message
>
> Exception in thread "main" org.apache.spar
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
On Sun, Dec 20, 2015 at 11:29 AM, Alexander Pivovarov
wrote:
> Or this m
scover / track? Thanks!
>
> On Sun, Dec 20, 2015 at 11:35 AM Alexander Pivovarov
> wrote:
>
>> Usually Spark EMR job fails with the following exception in 1 hour 40 min
>> - Job cancelled because SparkContext was shut down
>>
>> java.util.c
// open spark-shell 1.5.2
// run
import org.apache.spark.graphx._
val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1)
val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1)
val g0 = Graph(vert, edges)
val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2)
val
The same issue exists in spark-1.6.0
I've opened a Jira ticket for that:
https://issues.apache.org/jira/browse/SPARK-12655
On Mon, Jan 4, 2016 at 11:30 PM, Alexander Pivovarov
wrote:
> // open spark-shell 1.5.2
> // run
>
> import org.apache.spark.graphx._
>
> val vert =
Let's say that YARN has 53 GB of memory available on each slave.
The Spark AM container needs 896 MB (512 + 384).
I see two options to configure Spark (the numbers are sketched below):
1. Configure Spark executors to use 52 GB and leave 1 GB on each box. Some box
will also run the AM container, so 1 GB of memory will not be used on all the
other slaves
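For concreteness, here is the same arithmetic written out (the 512 + 384 split is taken from the message above; treating 53 GB as 54,272 MB is an assumption):

// All values in MB; a sketch of the sizing described above.
val yarnMemoryPerNode = 53 * 1024                 // 54,272 MB available to YARN per slave
val amMemory          = 512                       // spark.yarn.am.memory
val amOverhead        = 384                       // AM memory overhead (max(384, 10%) by default)
val amContainer       = amMemory + amOverhead     // 896 MB for the AM container

// Option 1: executors sized to 52 GB, so every box keeps ~1 GB free;
// only the box that ends up hosting the AM actually uses 896 MB of that spare gigabyte.
val executorContainer = 52 * 1024
val sparePerNode      = yarnMemoryPerNode - executorContainer   // 1,024 MB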
> > be worse than wasting a lot of memory on a single node.
> >
> > So yeah, I also don't like either option. Is this just the price you pay
> for
> > running on YARN?
> >
> >
> > ~ Jonathan
> >
> > On Mon, Feb 8, 2016 at 9:03 PM Alexand
or
> the AM as the only way to satisfy the request.
>
> On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov
> wrote:
> > If I add additional small box to the cluster can I configure yarn to
> select
> > small box to run am container?
> >
> >
> > On Mon, Fe
an Owen" wrote:
> >>
> >> If it's too small to run an executor, I'd think it would be chosen for
> >> the AM as the only way to satisfy the request.
> >>
> >> On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov
> >> wrote:
> >
It would be nice if we had an
executorsPerBox setting in EMR.
I have a case where I need to run 2 or 4 executors on r3.2xlarge.
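There is no such setting today; as a hedged sizing sketch, the per-node executor count falls out of how executor memory (plus overhead) and cores divide the node's YARN resources. For an r3.2xlarge (8 vCPUs; the 54,272 MB YARN memory figure mentioned later in the thread), the figures below are illustrative, not a recommendation:

import org.apache.spark.SparkConf

// Two executors per node: each container must fit in ~27,136 MB (54,272 / 2).
val twoPerNode = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "24g")   // + ~2.4 GB default overhead, ~26.4 GB per container

// Four executors per node: each container must fit in ~13,568 MB (54,272 / 4).
val fourPerNode = new SparkConf()
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "12g")   // + ~1.2 GB default overhead, ~13.2 GB per container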
On Tue, Feb 9, 2016 at 9:56 AM, Alexander Pivovarov
wrote:
> I use hadoop 2.7.1
> On Feb 9, 2016 9:54 AM, "Marcelo Vanzin" wrote:
>
>>
I mean Jonathan
On Tue, Feb 9, 2016 at 10:41 AM, Alexander Pivovarov
wrote:
> I decided to do YARN over-commit and add 896
> to yarn.nodemanager.resource.memory-mb
> it was 54,272
> now I set it to 54,272+896 = 55,168
>
> Kelly, can I ask you couple questions
> 1. it is
our
>> version of YARN supports node labels (and you've added a label to the
>> node where you want the AM to run).
>>
>> On Tue, Feb 9, 2016 at 9:51 AM, Alexander Pivovarov
>> wrote:
>> > Am container starts first and yarn selects random computer
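For reference, Spark on YARN exposes node-label expressions for exactly what the reply above describes; a hedged sketch (the label names are made up and must already exist in YARN and be assigned to nodes):

import org.apache.spark.SparkConf

// Hypothetical label names.
val conf = new SparkConf()
  .set("spark.yarn.am.nodeLabelExpression", "am-node")           // restrict where the AM is scheduled
  .set("spark.yarn.executor.nodeLabelExpression", "worker-node") // restrict where executors are scheduled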
Can you add an ability to set custom YARN labels, either instead of or in addition to the current ones?
On Feb 9, 2016 3:28 PM, "Jonathan Kelly" wrote:
> Oh, sheesh, how silly of me. I copied and pasted that setting name without
> even noticing the "mapreduce" in it. Yes, I guess that would mean that
> Spark AMs are probably r
rations parameter of
> http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_InstanceGroupConfig.html.
> Unfortunately, it's not currently possible to specify per-instance-group
> configurations via the CLI though, only cluster wide configurations.
>
> ~ Jonathan
>
>
> On T
The Spark YARN executor container fails if yarn.nodemanager.local-dirs starts
with file://, e.g.:
yarn.nodemanager.local-dirs
file:///data01/yarn/nm,file:///data02/yarn/nm
Other applications, e.g. Hadoop MR and Hive, work normally.
Spark works only if yarn.nodemanager.local-dirs does not hav
I have Spark on YARN.
I defined yarn.nodemanager.local-dirs to be /data01/yarn/nm,/data02/yarn/nm
When I look at the YARN executor container log, I see that BlockManager files
are created in /data01/yarn/nm,/data02/yarn/nm
But the output files to be uploaded to S3 are still created in /tmp on the slaves.
I do not want Spa
data. What do you mean "But output files to
> upload to s3 still created in /tmp on slaves" ? You should have control on
> where to store your output data if that means your job's output.
>
> On Tue, Mar 1, 2016 at 3:12 AM, Alexander Pivovarov
> wrote:
>
>> I
I run Spark 1.6.1 on YARN (EMR-4.5.0)
I call RDD.count on a MEMORY_ONLY_SER-cached RDD (spark.serializer is
KryoSerializer).
After the count task is done, I noticed that the Spark UI shows the RDD
Fraction Cached is only 6%.
Size in Memory = 65.3 GB
I looked at the executors' stderr in the Spark UI and saw lots of
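For context, a minimal reconstruction of the setup described above (the input path is a stand-in; Kryo is assumed to be configured via spark.serializer as stated):

import org.apache.spark.storage.StorageLevel

// Hypothetical input path.
val rdd = sc.textFile("s3://my-bucket/input").persist(StorageLevel.MEMORY_ONLY_SER)

// count() materializes the cache. With MEMORY_ONLY_SER, partitions that do not fit in
// storage memory are simply not cached (hence a low "Fraction Cached"); they are
// recomputed on later actions rather than spilled to disk.
rdd.count()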
Hello Everyone
I use Spark 1.6.0 on YARN (EMR-4.3.0)
I use the MEMORY_AND_DISK_SER storage level for my RDD, and I use the Kryo serializer.
I noticed that Spark uses disk to store some RDD blocks even though the executors
have lots of memory available. See the screenshot:
http://postimg.org/image/gxpsw1fk1/
Any ide
> spills
> some blocks of RDDs into disk when execution memory runs short.
>
> // maropu
>
> On Fri, May 13, 2016 at 6:16 AM, Alexander Pivovarov > wrote:
>
>> Hello Everyone
>>
>> I use Spark 1.6.0 on YARN (EMR-4.3.0)
>>
>> I use MEMORY_AND
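As a hedged illustration of the reply above: in Spark 1.6's unified memory manager, the two settings below (shown with their 1.6 defaults) control how much cached data is protected from being pushed to disk when execution needs memory. This is a sketch, not a tuning recommendation:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.75")        // share of the heap used for execution + storage
  .set("spark.memory.storageFraction", "0.5")  // part of that region immune to eviction; cached
                                               // MEMORY_AND_DISK_SER blocks beyond it can be dropped
                                               // to disk when execution memory runs short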
Hello Everyone
Do you think it would be useful to add a combinedTextFile method (which uses
CombineTextInputFormat) to SparkContext?
It allows one task to read data from multiple text files, and the number of RDD
partitions can be controlled by setting
mapreduce.input.fileinputformat.split.maxsize.
def combine
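A rough sketch of what such a method could look like on top of the existing newAPIHadoopFile (the helper name, its shape and the 256 MB default are assumptions, not the actual proposal):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: one task reads many (small) files, combined up to maxSplitBytes.
def combinedTextFile(sc: SparkContext, path: String,
                     maxSplitBytes: Long = 256L * 1024 * 1024): RDD[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplitBytes)
  sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)
}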
ariants.
>
> Note that this is something we are doing automatically in Spark SQL for
> file sources (Dataset/DataFrame).
>
>
> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov > wrote:
>
>> Hello Everyone
>>
>> Do you think it would be useful to add c
hao wrote:
> From my understanding I think newAPIHadoopFile or hadoopFile is generic
> enough for you to support any InputFormat you wanted. IMO it is not so
> necessary to add a new API for this.
>
> On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov <
> apivova...@gmail.com
I'm trying to switch from the RDD API to the Dataset API.
My question is about the reduceByKey method.
E.g. in the following example I'm trying to rewrite
sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
using the Dataset API. This is what I have so far:
Seq(1->2, 1->5,
3->6).toDS.groupBy(_._1).ag
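One way to finish that expression in the typed API is to reduce whole pairs and then drop the duplicated key. A sketch, assuming the grouped Dataset's reduce method (in Spark 2.x the grouping call is named groupByKey):

// Typed-API equivalent of reduceByKey(math.max) on the values.
val result = Seq(1 -> 2, 1 -> 5, 3 -> 6).toDS()
  .groupBy(_._1)                                 // group by the key
  .reduce((a, b) => if (a._2 >= b._2) a else b)  // keep the pair with the larger value
  .map { case (_, (k, v)) => (k, v) }            // (key, (key, value)) -> (key, value)

result.collect()   // Array((1,5), (3,6)), in some order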
Ted, It does not work like that
you have to .map(toAB).toDS
On Tue, Jun 7, 2016 at 4:07 PM, Ted Yu wrote:
> Have you tried the following ?
>
> Seq(1->2, 1->5, 3->6).toDS("a", "b")
>
> then you can refer to columns by name.
>
> FYI
>
>
If my RDD is RDD[(String, (Long, MyClass))],
do I need to register
classOf[MyClass]
classOf[(Any, Any)]
or
classOf[MyClass]
classOf[(Long, MyClass)]
classOf[(String, (Long, MyClass))]
?
ne 8, 2016, Ted Yu wrote:
>
>> I think the second group (3 classOf's) should be used.
>>
>> Cheers
>>
>> On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov > > wrote:
>>
>>> if my RDD is RDD[(String, (Long, MyClass))]
>>>
>
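Following the suggestion above (the second group of classOf's), registration would look roughly like this; MyClass is just a stand-in for the real class:

import org.apache.spark.SparkConf

case class MyClass(x: Int)   // stand-in for the actual class from the question

// registerKryoClasses also switches spark.serializer to KryoSerializer.
val conf = new SparkConf().registerKryoClasses(Array(
  classOf[MyClass],
  classOf[(Long, MyClass)],              // erases to Tuple2, but mirrors the question
  classOf[(String, (Long, MyClass))]
))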
Most of the RDD methods that shuffle data take a Partitioner as a parameter,
but rdd.distinct has no such signature.
Should I open a PR for that?
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null
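In the meantime, a hedged standalone sketch with the proposed signature, implemented the same way RDD.distinct(numPartitions) is (map to pairs, reduceByKey with the partitioner, drop the padding):

import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical helper mirroring the proposed distinct(partitioner) overload.
def distinctWithPartitioner[T: ClassTag](rdd: RDD[T], partitioner: Partitioner): RDD[T] =
  rdd.map(x => (x, null))                     // pair each element with a dummy value
     .reduceByKey(partitioner, (x, _) => x)   // shuffle/dedup with the supplied partitioner
     .map(_._1)                               // drop the dummy value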
> I have created a notebook and you can check it out. (
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
> )
>
> Best regards,
>
> Yang
>
>
> On Jun 9, 2016, at 11 a.m.