Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Cool. Using Ambari to monitor and scale up/down the cluster sounds promising. Thanks for the pointer! Mingyu

Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
review”, and I didn’t find much else from my search. This might be a general YARN question, but I wanted to check if there’s a solution popular in the Spark community. Any sharing of experience around autoscaling will be helpful! Thanks, Mingyu

Re: compatibility issue with Jersey2

2015-10-13 Thread Mingyu Kim
/SPARK-3996. Would this be reasonable? Mingyu On 10/7/15, 11:26 AM, "Marcelo Vanzin" wrote: "Seems like you might be running into https://issues.apache.org/jira/browse/SPARK-10910"

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Mingyu Kim
Cool, we will start from there. Thanks Aaron and Josh! Darin, it’s likely because the DirectOutputCommitter is compiled with Hadoop 1 classes and you’re running it with Hadoop 2. org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and it became an interface in Hadoop 2. Mingyu
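For reference, a minimal sketch of wiring a custom committer into the old (mapred) API via the Hadoop configuration. "com.example.DirectOutputCommitter" is a placeholder, not a real class; per the note above, it must be compiled against the same Hadoop major version as the cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3-output"))
    // Route the old (mapred) API to the custom committer. The class name
    // here is illustrative and assumes a committer built for this Hadoop.
    sc.hadoopConfiguration.set(
      "mapred.output.committer.class",
      "com.example.DirectOutputCommitter")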

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Mingyu Kim
I didn’t get any response. It’d be really appreciated if anyone using a special OutputCommitter for S3 can comment on this! Thanks, Mingyu

Which OutputCommitter to use for S3?

2015-02-16 Thread Mingyu Kim
n with Spark. Thanks, Mingyu

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-23 Thread mingyu
I found a workaround. I can make my auxiliary data an RDD, partition it, and cache it. Later, I can cogroup it with other RDDs, and Spark will try to keep the cached RDD partitions where they are and not shuffle them.
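A rough sketch of that workaround, with illustrative names (aux and stream are assumed to be key-value pair RDDs defined elsewhere):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair-RDD functions (pre-1.3)

    val partitioner = new HashPartitioner(16)
    // Partition the auxiliary data once and cache it so it stays in place.
    val auxPartitioned = aux.partitionBy(partitioner).cache()
    auxPartitioned.count()  // force materialization of the cache

    // Cogrouping with the same partitioner lets Spark schedule tasks on the
    // nodes holding the cached partitions instead of reshuffling them.
    val grouped = auxPartitioned.cogroup(stream.partitionBy(partitioner))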

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
Also, setting spark.locality.wait=100 did not work for me.

How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
mount partition-specific auxiliary data for processing the stream. I noticed that the partitions move among the nodes. I cannot afford to move the large auxiliary data around. Thanks, Mingyu

Re: Larger heap leads to perf degradation due to GC

2014-10-06 Thread Mingyu Kim
Ok, cool. This seems to be a general issue in JVMs with very large heaps. I agree that the best workaround would be to keep the heap size below 32GB. Thanks guys! Mingyu
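A hedged illustration of that workaround (the value is an example, not a recommendation): staying under roughly 32GB keeps the JVM's compressed object pointers enabled, and several smaller executors usually behave better than one huge heap.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Keep each executor heap under ~32GB; run more executors instead.
      .set("spark.executor.memory", "30g")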

Larger heap leads to perf degradation due to GC

2014-10-02 Thread Mingyu Kim
heap (i.e., spark.executor.memory=50g). When we checked the CPU usage, there were just a lot of GCs going on. Has anyone seen a similar problem? Thanks, Mingyu

Re: How does Spark speculation prevent duplicated work?

2014-07-16 Thread Mingyu Kim
That makes sense. Thanks everyone for the explanations! Mingyu

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
level? Mingyu

How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
actions are not idempotent. For example, it may count a partition twice in the case of RDD.count, or write a partition to HDFS twice in the case of RDD.save*(). How does it prevent this kind of duplicated work? Mingyu
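For context, a minimal sketch of enabling speculation (thresholds are illustrative); as the replies explain, duplicate attempts are reconciled at commit time, so only one attempt's output for a partition is kept:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")
      // Re-launch a copy of any task running 1.5x slower than the median,
      // once 90% of the stage's tasks have finished (illustrative values).
      .set("spark.speculation.multiplier", "1.5")
      .set("spark.speculation.quantile", "0.9")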

JavaRDD.mapToPair throws NPE

2014-06-24 Thread Mingyu Kim
.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Mingyu

Re: 1.0.1 release plan

2014-06-20 Thread Mingyu Kim
Cool. Thanks for the note. Looking forward to it. Mingyu

1.0.1 release plan

2014-06-19 Thread Mingyu Kim
Hi all, Is there any plan for a 1.0.1 release? Mingyu

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
(and sort is really expensive.) On the other hand, if I can assume that, say, “filter” or “map” doesn’t shuffle the rows around, I can do the sort once and assume that the order is retained throughout such operations, saving a lot of time on unnecessary sorts. Mingyu

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
Okay, that makes sense. It’d be great if this could be better documented at some point, because right now the only way to find out about the resulting RDD row order is by looking at the code. Thanks for the discussion! Mingyu

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
union two RDDs, for example, rdd1 = [“a, b, c”], rdd2 = [“1, 2, 3”, “4, 5, 6”], then rdd1.union(rdd2).saveAsTextFile(…) should’ve resulted in a file with three lines, “a, b, c”, “1, 2, 3”, and “4, 5, 6”, because the partitions from the two RDDs are concatenated. Mingyu
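Spelling the example out as a runnable sketch:

    val rdd1 = sc.parallelize(Seq("a, b, c"), numSlices = 1)
    val rdd2 = sc.parallelize(Seq("1, 2, 3", "4, 5, 6"), numSlices = 1)
    // union concatenates the two partition lists, so collect() returns the
    // rdd1 lines followed by the rdd2 lines.
    rdd1.union(rdd2).collect()
    // Array("a, b, c", "1, 2, 3", "4, 5, 6")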

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
I’m not sure why union doesn’t respect the order, because the union operation simply concatenates the two lists of partitions from the two RDDs. Mingyu On 4/29/14, 10:25 PM, "Patrick Wendell" wrote: "You are right, once you sort() the RDD, then yes it has a well-defined ordering."

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
() because map preserves the partition order. RDD order is also what allows me to get the top k out of an RDD by doing RDD.sort().take(). Am I misunderstanding it? Or is it just when an RDD is written to disk that the order is not well preserved? Thanks in advance! Mingyu
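As an illustration of the top-k pattern mentioned above (assuming an RDD[Int] named rdd; sortBy is the 1.0+ spelling of the sort in question):

    // Total sort descending, then take the first k elements; take() walks
    // partitions in order, so the result is the k largest values.
    val k = 10
    val topK = rdd.sortBy(x => x, ascending = false).take(k)
    // rdd.top(k) is a built-in shortcut that avoids the full sort.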

Spark reads partitions in a wrong order

2014-04-25 Thread Mingyu Kim
design? Is this a bug? Mingyu