Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Cool. Using Ambari to monitor and scale up/down the cluster sounds promising. Thanks for the pointer! Mingyu From: Deepak Sharma Date: Monday, December 14, 2015 at 1:53 AM To: cs user Cc: Mingyu Kim , "user@spark.apache.org" Subject: Re: Autoscaling of Spark YARN cluster An

Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Hi all, Has anyone tried out autoscaling a Spark YARN cluster on a public cloud (e.g. EC2) based on workload? To be clear, I'm interested in scaling the cluster itself up and down by adding and removing YARN nodes based on the cluster resource utilization (e.g. # of applications queued, # of resourc
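
A minimal sketch of the idea in Scala, assuming the YARN ResourceManager's cluster-metrics REST endpoint (http://<rm>:8088/ws/v1/cluster/metrics) and hypothetical addNode()/removeNode() hooks into whatever actually provisions the EC2 instances (e.g. an Ambari API call):

  import scala.io.Source

  // Poll the ResourceManager cluster-metrics endpoint and decide whether to
  // add or remove YARN NodeManagers. addNode()/removeNode() are placeholder
  // hooks for whatever provisions instances; the RM host/port is an assumption.
  object YarnAutoscaler {
    val metricsUrl = "http://resourcemanager:8088/ws/v1/cluster/metrics"

    // Crude extraction of a numeric field from the metrics JSON.
    def metric(json: String, field: String): Long = {
      val pattern = ("\"" + field + "\"\\s*:\\s*(\\d+)").r
      pattern.findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(0L)
    }

    def addNode(): Unit = println("scale out: provision one more NodeManager")   // placeholder
    def removeNode(): Unit = println("scale in: decommission one NodeManager")   // placeholder

    def main(args: Array[String]): Unit = {
      while (true) {
        val json = Source.fromURL(metricsUrl).mkString
        val pendingApps = metric(json, "appsPending")
        val availableMB = metric(json, "availableMB")
        val totalMB     = metric(json, "totalMB")

        if (pendingApps > 0) addNode()                                            // work is queued: scale out
        else if (totalMB > 0 && availableMB.toDouble / totalMB > 0.5) removeNode() // mostly idle: scale in

        Thread.sleep(60 * 1000)                                                   // check once a minute
      }
    }
  }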

Re: compatibility issue with Jersey2

2015-10-13 Thread Mingyu Kim
Hi all, I filed https://issues.apache.org/jira/browse/SPARK-11081. Since Jersey's surface area is relatively small and seems to be only used for the Spark UI and json API, shading the dependency might make sense, similar to what's done for the Jetty dependencies at https://issues.apache.org/jira/browse/
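
One interim workaround, until something like that lands in Spark's own build, is for the application to relocate its Jersey 2 / JAX-RS 2 classes in its assembly jar so they cannot clash with what Spark puts on the classpath. A minimal sketch assuming sbt-assembly's ShadeRule API; the relocation patterns are illustrative only:

  // build.sbt -- relocate the application's Jersey 2 / JAX-RS classes
  // (package patterns are illustrative; adjust to the artifacts actually pulled in).
  assemblyShadeRules in assembly := Seq(
    ShadeRule.rename("org.glassfish.jersey.**" -> "shaded.jersey.@1").inAll,
    ShadeRule.rename("javax.ws.rs.**"          -> "shaded.ws.rs.@1").inAll
  )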

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Mingyu Kim
p.mapred.JobContext. > Is there something obvious that I might be doing wrong (or messed up in the translation from Scala to Java) or something I should look into? I'm using Spark 1.2 with hadoop 2.4. > Thanks. > Darin.

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Mingyu Kim
I didn't get any response. It'd be really appreciated if anyone using a special OutputCommitter for S3 can comment on this! Thanks, Mingyu From: Mingyu Kim <m...@palantir.com> Date: Monday, February 16, 2015 at 1:15 AM To: "user@spark.apache.org"

Which OutputCommitter to use for S3?

2015-02-16 Thread Mingyu Kim
Hi all, The default OutputCommitter used by RDD, which is FileOutputCommitter, seems to require moving files at the commit step, which is not a constant-time operation in S3, as discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E. People se
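
Swapping the committer is a one-line Hadoop configuration on the SparkContext. A minimal sketch, where com.example.hadoop.DirectOutputCommitter is a hypothetical committer that writes task output straight to its final S3 location instead of renaming on commit:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("s3-output"))

  // Point the old mapred API at an alternative committer (class name is
  // hypothetical); FileOutputCommitter's rename-on-commit is what gets slow on S3.
  sc.hadoopConfiguration.set(
    "mapred.output.committer.class",
    "com.example.hadoop.DirectOutputCommitter")

  sc.parallelize(1 to 100).map(i => s"row-$i")
    .saveAsTextFile("s3n://my-bucket/output")  // bucket/path are placeholders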

Re: Larger heap leads to perf degradation due to GC

2014-10-06 Thread Mingyu Kim
Ok, cool. This seems to be a general issue in JVMs with very large heaps. I agree that the best workaround would be to keep the heap size below 32GB. Thanks guys! Mingyu From: Arun Ahuja Date: Monday, October 6, 2014 at 7:50 AM To: Andrew Ash Cc: Mingyu Kim , "user@spark.apache.org"
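
A minimal sketch of that workaround: cap each executor's heap below the ~32GB compressed-oops threshold and run more, smaller executors instead. The memory figures and executor count are illustrative, and spark.executor.instances applies on YARN:

  import org.apache.spark.{SparkConf, SparkContext}

  // Stay below ~32GB per JVM so HotSpot keeps using compressed oops;
  // scale out with more executors rather than one huge heap.
  val conf = new SparkConf()
    .setAppName("bounded-heap")
    .set("spark.executor.memory", "28g")   // per-executor heap, below the 32GB line
    .set("spark.executor.instances", "8")  // YARN: more, smaller executors
    .set("spark.executor.extraJavaOptions",
         "-XX:+UseCompressedOops -XX:+PrintGCDetails")

  val sc = new SparkContext(conf)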

Larger heap leads to perf degradation due to GC

2014-10-02 Thread Mingyu Kim
This issue definitely needs more investigation, but I just wanted to quickly check if anyone has run into this problem or has general guidance around it. We've seen a performance degradation with a large heap on a simple map task (i.e. no shuffle). We've seen the slowness starting from around 50GB

Re: How does Spark speculation prevent duplicated work?

2014-07-16 Thread Mingyu Kim
Hence, writing to HDFS / S3 is idempotent. > Now this logic is already implemented within Hadoop's MapReduce logic, and Spark just uses it directly. > TD > On Tue, Jul 15, 2014 at 2:33 PM, Mingyu Kim wrote: >> Thanks for the explanation, guys.

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
>> TaskSetManager.handleSuccessfulTask -> DAGScheduler.taskEnded >> in taskEnded, it will trigger the CompletionEvent message handler, where DAGScheduler will check if (!job.finished(rt.outputid)) and rt.outputid is the partitionid >> so even the du

How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
Hi all, I was curious about the details of Spark speculation. So, my understanding is that, when "speculated" tasks are newly scheduled on other machines, the original tasks are still running until the entire stage completes. This seems to leave some room for duplicated work because some spark act
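
For reference, speculation is off by default and controlled by a handful of settings; a minimal sketch of enabling it (the threshold values are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  // Re-launch stragglers on other executors; the original attempt keeps
  // running, and whichever copy finishes first wins.
  val conf = new SparkConf()
    .setAppName("speculation-demo")
    .set("spark.speculation", "true")
    .set("spark.speculation.interval", "100")    // ms between straggler checks
    .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
    .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish first

  val sc = new SparkContext(conf)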

JavaRDD.mapToPair throws NPE

2014-06-24 Thread Mingyu Kim
Hi all, I'm trying to use JavaRDD.mapToPair(), but it fails with NPE on the executor. The PairFunction used in the call is null for some reason. Any comments/help would be appreciated! My setup is, * Java 7 * Spark 1.0.0 * Hadoop 2.0.0-mr1-cdh4.6.0 Here's the code snippet. > import org.apache.sp

Re: 1.0.1 release plan

2014-06-20 Thread Mingyu Kim
with spilling On Fri, Jun 20, 2014 at 1:04 AM, Patrick Wendell wrote: > Hey There, > I'd like to start voting on this release shortly because there are a few important fixes that have queued up. We're just waiting to fix an akka issue. I'd guess we'll cut a

1.0.1 release plan

2014-06-19 Thread Mingyu Kim
Hi all, Is there any plan for a 1.0.1 release? Mingyu

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
't do. On Wed, Apr 30, 2014 at 11:13 AM, Mingyu Kim wrote: > Okay, that makes sense. It'd be great if this can be better documented at some point, because the only way to find out about the resulting RDD row order is by looking at the code. > Thanks for the discussion!

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
2,3] // some day it could be like this, it wouldn't violate the contract of union > AFAIK the only guarantee is the resulting RDD will contain all elements. > - Patrick > On Tue, Apr 29, 2014 at 11:26 PM, Mingyu Kim wrote: >> Yes, that's what I meant. Sure, the num

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
> rdd2 = [1,4,5] > rdd1.union(rdd2) = [1,2,3,1,4,5] > On Tue, Apr 29, 2014 at 10:44 PM, Mingyu Kim wrote: >> Thanks for the quick response! >> To better understand it, the reason sorted RDD has a well-defined ordering is because sortedRDD.getParti

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
> But that ordering is lost as soon as you transform the RDD, including if you union it with another RDD. > On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim wrote: >> Hi Patrick, >> I'm a little confused about your comment that RDDs are not ordered. As fa

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
Hi Patrick, I'm a little confused about your comment that RDDs are not ordered. As far as I know, RDDs keep a list of partitions that are ordered and this is why I can call RDD.take() and get the same first k rows every time I call it and RDD.take() returns the same entries as RDD.map(…).take() beca
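
A minimal sketch of the behavior in question, run in local mode:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("order"))

  val rdd = sc.parallelize(Seq(5, 3, 1, 4, 2), numSlices = 2)

  // take() walks partitions in index order, so repeated calls see the same rows...
  rdd.take(3)              // Array(5, 3, ...)
  rdd.map(_ * 10).take(3)  // same rows, transformed: Array(50, 30, ...)

  // ...but nothing in the API promises a global ordering after transformations
  // like union; only the set of elements is guaranteed.
  val other = sc.parallelize(Seq(9, 8))
  rdd.union(other).collect()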

Spark reads partitions in a wrong order

2014-04-25 Thread Mingyu Kim
If the underlying file system returns files in a non-alphabetical order to java.io.File.listFiles(), Spark reads the partitions out of order. Here's an example. var sc = new SparkContext("local[3]", "test"); var rdd1 = sc.parallelize([1,2,3,4,5]); rdd1.saveAsTextFile("file://path/to/file"); var rd
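
A minimal sketch of a workaround, assuming the part files sit on a local filesystem: list them, sort the names explicitly, and union the pieces in that order so the read-back order no longer depends on what listFiles() returns (the output directory path is a placeholder):

  import java.io.File
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setMaster("local[3]").setAppName("ordered-read"))

  // listFiles() makes no ordering promise, so sort the part-NNNNN names ourselves
  // before handing them to Spark, and union them in that explicit order.
  val partFiles = new File("/path/to/file")        // placeholder output directory
    .listFiles()
    .filter(_.getName.startsWith("part-"))
    .sortBy(_.getName)

  val rdd = sc.union(partFiles.map(f => sc.textFile("file://" + f.getAbsolutePath)).toSeq)
  rdd.take(5)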