Cool. Using Ambari to monitor and scale up/down the cluster sounds
promising. Thanks for the pointer!
Mingyu
From: Deepak Sharma
Date: Monday, December 14, 2015 at 1:53 AM
To: cs user
Cc: Mingyu Kim , "user@spark.apache.org"
Subject: Re: Autoscaling of Spark YARN cluster
An
Hi all,
Has anyone tried out autoscaling a Spark YARN cluster on a public cloud (e.g.
EC2) based on workload? To be clear, I'm interested in scaling the cluster
itself up and down by adding and removing YARN nodes based on the cluster
resource utilization (e.g. # of applications queued, # of resourc
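For concreteness, here is a minimal sketch of the kind of polling loop that could drive this, assuming the ResourceManager's REST metrics endpoint (/ws/v1/cluster/metrics) is reachable at a hypothetical rm-host:8088; the thresholds, interval, and the scaleUp/scaleDown hooks are placeholders for whatever cloud-provider calls would add or decommission NodeManagers.

import scala.io.Source

object AutoscaleSketch {
  // Hypothetical ResourceManager address; the metrics resource returns JSON
  // with fields such as appsPending and availableMB.
  def clusterMetrics(): String =
    Source.fromURL("http://rm-host:8088/ws/v1/cluster/metrics").mkString

  // Crude field extraction, just to keep the sketch dependency-free.
  def longField(json: String, field: String): Long = {
    val pattern = ("\"" + field + "\"\\s*:\\s*(\\d+)").r
    pattern.findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(0L)
  }

  // Placeholders: in practice these would call the cloud provider's API and
  // then add or gracefully decommission YARN NodeManagers.
  def scaleUp(nodes: Int): Unit = ()
  def scaleDown(nodes: Int): Unit = ()

  def main(args: Array[String]): Unit = {
    while (true) {
      val json = clusterMetrics()
      if (longField(json, "appsPending") > 0) scaleUp(1)
      else if (longField(json, "availableMB") > 64L * 1024) scaleDown(1)
      Thread.sleep(60 * 1000)
    }
  }
}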
Hi all,
I filed https://issues.apache.org/jira/browse/SPARK-11081. Since Jersey’s
surface area is relatively small and seems to be only used for Spark UI and
json API, shading the dependency might make sense similar to what’s done for
Jetty dependencies at https://issues.apache.org/jira/browse/
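Until something like that lands, one user-side workaround is to shade Jersey inside the application jar instead; below is a sketch using sbt-assembly, with the relocation target package purely illustrative.

// build.sbt fragment, assuming the sbt-assembly plugin is on the build.
// "shaded.jersey" is an illustrative relocation target, not what Spark uses.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.sun.jersey.**" -> "shaded.jersey.@1").inAll
)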
p.mapred.JobContext.
>
>Is there something obvious that I might be doing wrong (or messed up in
>the translation from Scala to Java) or something I should look into? I'm
>using Spark 1.2 with hadoop 2.4.
>
>
>Thanks.
>
>Darin.
>
>
>__
I didn’t get any response. It’d be really appreciated if anyone using a special
OutputCommitter for S3 could comment on this!
Thanks,
Mingyu
From: Mingyu Kim <m...@palantir.com>
Date: Monday, February 16, 2015 at 1:15 AM
To: "user@spark.apache.org<mailto:user@sp
Hi all,
The default OutputCommitter used by RDD, which is FileOutputCommitter, seems to
require moving files at the commit step, which is not a constant-time operation in
S3, as discussed in
http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E.
People se
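One idea that comes up for S3 is a committer that writes task output directly to the final location so that the commit step has nothing to move. A rough sketch of that idea against the old mapred API is below; it is illustrative only, since skipping the temporary-directory/rename scheme also gives up FileOutputCommitter's protection against failed or speculative task attempts.

import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

// Illustrative "direct" committer: tasks write straight to the final output
// path, so there is no rename/move at commit time (relevant for S3, where a
// rename is really a copy). Not safe for jobs that rely on task-attempt
// isolation (failures, speculation).
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}

With the old-API save paths, such a class would be wired in through the job configuration (mapred.output.committer.class), though the exact hook is worth double-checking against the Spark and Hadoop versions in use.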
OK, cool. This seems to be a general issue in JVMs with very large heaps. I
agree that the best workaround would be to keep the heap size below 32GB.
Thanks guys!
Mingyu
From: Arun Ahuja
Date: Monday, October 6, 2014 at 7:50 AM
To: Andrew Ash
Cc: Mingyu Kim , "user@spark.apache.org"
This issue definitely needs more investigation, but I just wanted to quickly
check if anyone has run into this problem or has general guidance around it.
We've seen a performance degradation with a large heap on a simple map task
(i.e. no shuffle). We've seen the slowness starting from around 50GB
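In case it helps anyone hitting the same thing: since the slowdown shows up with very large heaps, one sketch of a workaround is to keep each executor JVM under the ~32GB compressed-oops threshold and run more, smaller executors instead. The property names below are the standard Spark ones; the values are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative settings only: stay below ~32GB per executor JVM so compressed
// oops stay enabled, and spell the JVM flag out explicitly.
val conf = new SparkConf()
  .setAppName("large-heap-workaround")
  .set("spark.executor.memory", "30g")
  .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
val sc = new SparkContext(conf)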
Hence, writing to HDFS / S3 is idempotent.
>
> Now this logic is already implemented within Hadoop's MapReduce logic, and
> Spark just uses it directly.
>
> TD
>
>
> On Tue, Jul 15, 2014 at 2:33 PM, Mingyu Kim wrote:
>> Thanks for the explanation, guys.
>>
>> TaskSetManager.handleSuccessfulTask -> DAGScheduler.taskEnded
>>
>> in taskEnded, it will trigger the CompletionEvent message handler, where
>> DAGScheduler will check if (!job.finished(rt.outputid)) and rt.outputid is
>> the partitionid
>>
>> so even the du
Hi all,
I was curious about the details of Spark speculation. So, my understanding
is that, when "speculated" tasks are newly scheduled on other machines, the
original tasks are still running until the entire stage completes. This
seems to leave some room for duplicated work because some spark act
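For reference, the knobs involved are the speculation settings sketched below (property names per the 1.x docs; values are illustrative). A task only gets a speculative copy once the configured fraction of its stage has finished and it is running slower than the median by the configured multiplier; whichever attempt finishes first wins, and the later result is ignored.

import org.apache.spark.SparkConf

// Illustrative speculation settings.
val conf = new SparkConf()
  .set("spark.speculation", "true")           // off by default
  .set("spark.speculation.interval", "100")   // how often to check, in ms (1.x)
  .set("spark.speculation.multiplier", "1.5") // how much slower than the median
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks done before speculating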
Hi all,
I'm trying to use JavaRDD.mapToPair(), but it fails with an NPE on the
executor. The PairFunction used in the call is null for some reason. Any
comments/help would be appreciated!
My setup is,
* Java 7
* Spark 1.0.0
* Hadoop 2.0.0-mr1-cdh4.6.0
Here's the code snippet.
> import org.apache.sp
with spilling
On Fri, Jun 20, 2014 at 1:04 AM, Patrick Wendell wrote:
> Hey There,
>
> I'd like to start voting on this release shortly because there are a
> few important fixes that have queued up. We're just waiting to fix an
> akka issue. I'd guess we'll cut a
Hi all,
Is there any plan for 1.0.1 release?
Mingyu
't do.
On Wed, Apr 30, 2014 at 11:13 AM, Mingyu Kim wrote:
> Okay, that makes sense. It’d be great if this can be better documented at
> some point, because the only way to find out about the resulting RDD row
> order is by looking at the code.
>
> Thanks for the discussion!
>
2,3] // some day it could be like this, it
>wouldn't violate the contract of union
>
>AFAIK the only guarantee is that the resulting RDD will contain all elements.
>
>- Patrick
>
>On Tue, Apr 29, 2014 at 11:26 PM, Mingyu Kim wrote:
>> Yes, that’s what I meant. Sure, the num
>rdd2 = [1,4,5]
>
>rdd1.union(rdd2) = [1,2,3,1,4,5]
>
>On Tue, Apr 29, 2014 at 10:44 PM, Mingyu Kim wrote:
>> Thanks for the quick response!
>>
>> To better understand it, the reason sorted RDD has a well-defined
>>ordering
>> is because sortedRDD.getParti
>
>But that ordering is lost as soon as you transform the RDD, including
>if you union it with another RDD.
>
>On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim wrote:
>> Hi Patrick,
>>
>> I'm a little confused about your comment that RDDs are not ordered. As
>>fa
Hi Patrick,
I'm a little confused about your comment that RDDs are not ordered. As far
as I know, RDDs keep a list of partitions that are ordered, and this is why I
can call RDD.take() and get the same first k rows every time I call it and
RDD.take() returns the same entries as RDD.map().take() beca
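Here is a small local-mode sketch of the behaviour being described (it demonstrates current behaviour rather than a documented contract, which is exactly the point under discussion):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[3]").setAppName("ordering-sketch"))

val rdd = sc.parallelize(1 to 10, 3)
val first = rdd.take(3).toSeq
assert(first == rdd.take(3).toSeq)                        // take() is repeatable
assert(rdd.map(_ * 2).take(3).toSeq == first.map(_ * 2))  // and commutes with map()

val other = sc.parallelize(Seq(1, 4, 5))
// Today this prints rdd's elements followed by other's, but per the thread
// the only guarantee union() makes is that all elements are present.
println(rdd.union(other).collect().mkString(","))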
If the underlying file system returns files in a non-alphabetical order to
java.io.File.listFiles(), Spark reads the partitions out of order. Here's an
example.
var sc = new SparkContext("local[3]", "test");
var rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5));
rdd1.saveAsTextFile("file://path/to/file");
var rd
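Continuing that snippet, here is a sketch of one way to see which partition each element lands in after reading the files back (the path is the same illustrative one as above):

val readBack = sc.textFile("file://path/to/file")
readBack
  .mapPartitionsWithIndex((i, iter) => iter.map(v => (i, v)))
  .collect()
  .foreach { case (part, value) => println(s"partition $part -> $value") }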