Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
Ok, so that worked flawlessly after I upped the number of partitions to 400 from 40. Thanks! On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung wrote: > I'll try that, as of now I have a small number of partitions in the order > of 20~40. > > It would be great if there's
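A minimal sketch of the fix this thread converged on, with a hypothetical input path and key extraction: pass an explicit partition count to reduceByKey so each reduce task holds a smaller slice of the ~75 million keys.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-partitions"))
    // Extract (key, count) pairs; the split logic is a stand-in.
    val pairs = sc.textFile("hdfs:///input").map(line => (line.split(",")(0), 1L))
    // An explicit partition count (400 instead of 40, as above) shrinks the
    // per-task state held during the reduce.
    val counts = pairs.reduceByKey(_ + _, 400)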

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
the user space). Otherwise, it's like shooting in the dark. On Fri, May 13, 2016 at 7:20 PM, Ted Yu wrote: > Have you taken a look at SPARK-11293 ? > > Consider using repartition to increase the number of partitions. > > FYI > > On Fri, May 13, 2016 at 12:14 PM, Sun

Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
Hello, I'm using Spark version 1.6.0 and have trouble with memory when trying to do reduceByKey on a dataset with as many as 75 million keys. I.e., I get the following exception when I run the task. There are 20 workers in the cluster. It is running under the standalone mode with 12 GB assigned pe

How Spark handles dead machines during a job.

2016-04-08 Thread Sung Hwan Chung
Hello, Say that I'm doing a simple rdd.map followed by collect. Say also that one of the executors finishes all of its tasks, but there are still other executors running. If the machine that hosted the finished executor gets terminated, does the master still have the results from the finished ta

Re: Executor shutdown hooks?

2016-04-06 Thread Sung Hwan Chung
interruptOnCancel set, then you > can catch the interrupt exception within the Task. > > On Wed, Apr 6, 2016 at 12:24 PM, Sung Hwan Chung > wrote: > >> Hi, >> >> I'm looking for ways to add shutdown hooks to executors: i.e., when a >> Job is forcefully terminated

Executor shutdown hooks?

2016-04-06 Thread Sung Hwan Chung
Hi, I'm looking for ways to add shutdown hooks to executors: i.e., when a Job is forcefully terminated before it finishes. The scenario goes like this: executors are running a long-running job within a 'map' function. The user decides to terminate the job, then the mappers should perform some
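A hedged sketch of the approach suggested in the reply above: cancel the job group with interruptOnCancel = true and catch the resulting thread interrupt inside the task. cleanupResources() is a hypothetical user-defined function, not a Spark API.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shutdown-hook-sketch"))
    sc.setJobGroup("long-job", "long-running mappers", interruptOnCancel = true)

    def cleanupResources(): Unit = { /* release files, sockets, etc. */ }

    val result = sc.parallelize(1 to 100, 10).map { x =>
      try {
        Thread.sleep(60000) // stand-in for long-running work
        x * 2
      } catch {
        case e: InterruptedException =>
          cleanupResources() // runs on the executor before the task dies
          throw e
      }
    }
    // From another thread on the driver: sc.cancelJobGroup("long-job")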

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
On 28 March 2016 at 23:30, Sung Hwan Chung > wrote: >> Yea, that seems to be the case. It seems that dynamically resizing a >> standalone Spark cluster is very

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
On 28 March 2016 at 22:58, Sung Hwan Chung > wrote: >> It seems that the conf/slaves file is only for consumption by the >> following scripts: >> >> sbin/start-slaves.sh >> sbin/stop-slaves.sh

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
It seems that the conf/slaves file is only for consumption by the following scripts: sbin/start-slaves.sh sbin/stop-slaves.sh sbin/start-all.sh sbin/stop-all.sh I.e., conf/slaves file doesn't affect a running cluster. Is this true? On Mon, Mar 28, 2016 at 9:31 PM, Sung Hwan Chung

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
On 28 March 2016 at 22:06, Sung Hwan Chung > wrote: >> Hello, >> >> I found that I could dynamically add/remove new workers to a runn

Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
Hello, I found that I could dynamically add/remove new workers to a running standalone Spark cluster by simply triggering: start-slave.sh (SPARK_MASTER_ADDR) and stop-slave.sh E.g., I could instantiate a new AWS instance and just add it to a running cluster without needing to add it to slaves

Parquet StringType column readable as plain-text despite being Gzipped

2016-02-03 Thread Sung Hwan Chung
Hello, We are using the default compression codec for Parquet when we store our dataframes. The dataframe has a StringType column whose values can be up to several MB large. The funny thing is that once it's stored, we can browse the file content with a plain-text editor and see large portions of
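A small sketch, assuming Spark 1.6 SQL APIs, that pins the Parquet codec explicitly instead of relying on the default; one hedged possibility for the readable text is that it comes from Parquet's uncompressed column statistics and dictionary entries rather than the data pages themselves. Paths are hypothetical.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // assumes an existing SparkContext `sc`
    // Pin the codec rather than relying on the default.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    val df = sqlContext.read.json("hdfs:///input.json") // hypothetical source
    df.write.parquet("hdfs:///tmp/gzipped-output")      // hypothetical path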

Re: Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
yes, you can add/remove instances to the cluster on the fly (CORE instances >> support add only, TASK instances - add and remove) >> >> >> >> On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung < >> coded...@cs.stanford.edu> wrote: >> >>> I noticed

Re: Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
Alexander Pivovarov wrote: > you can use EMR-4.3.0 run on spot instances to control the price > > yes, you can add/remove instances to the cluster on the fly (CORE instances > support add only, TASK instances - add and remove) > > > > On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwa

Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
I noticed that in the main branch, the ec2 directory along with the spark-ec2 script is no longer present. Is spark-ec2 going away in the next release? If so, what would be the best alternative at that time? A couple of additional questions: 1. Is there any way to add/remove additional workers

Re: java.io.IOException Error in task deserialization

2014-10-10 Thread Sung Hwan Chung
this issue? That will help us a lot to improve > TorrentBroadcast. > > Thanks! > > On Fri, Oct 10, 2014 at 8:46 AM, Sung Hwan Chung > wrote: > > I haven't seen this at all since switching to HttpBroadcast. It seems > > TorrentBroadcast might have some issues?

Spark job (not Spark Streaming) doesn't delete unneeded checkpoints.

2014-10-10 Thread Sung Hwan Chung
Unneeded checkpoints are not getting automatically deleted in my application. I.e., the lineage looks something like this, and checkpoints simply accumulate in a temporary directory (every lineage point, however, does zip with a globally permanent RDD): PermanentRDD:Global zips with all the interm
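A hedged workaround sketch rather than a Spark API: since non-streaming jobs do not prune old checkpoints automatically, checkpoint directories no longer reachable from the current lineage can be deleted by hand. The path is hypothetical.

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    // Recursively delete a checkpoint directory the lineage no longer needs.
    fs.delete(new Path("hdfs:///checkpoints/rdd-42"), true)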

Re: java.io.IOException Error in task deserialization

2014-10-10 Thread Sung Hwan Chung
I haven't seen this at all since switching to HttpBroadcast. It seems TorrentBroadcast might have some issues? On Thu, Oct 9, 2014 at 4:28 PM, Sung Hwan Chung wrote: > I don't think that I saw any other error message. This is all I saw. > > I'm currently experimenti
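A sketch of the workaround mentioned above, assuming a 1.x-era Spark where HttpBroadcastFactory is still available:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("http-broadcast-workaround")
      // Switch off TorrentBroadcast in favor of the HTTP implementation.
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
    val sc = new SparkContext(conf)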

Intermittent checkpointing failure.

2014-10-09 Thread Sung Hwan Chung
I'm getting a DFS closed-channel exception every now and then when I run checkpointing. I checkpoint every 15 minutes or so. This usually happens after running the job for 1~2 hours. Has anyone seen this before? Job aborted due to stage failure: Task 6 in stage 70.0 failed 4 times, most recent failur

Re: java.io.IOException Error in task deserialization

2014-10-09 Thread Sung Hwan Chung
I don't think that I saw any other error message. This is all I saw. I'm currently experimenting to see if this can be alleviated by using HttpBroadcastFactory instead of TorrentBroadcast. So far, with HttpBroadcast, I haven't seen this recurring as of yet. I'll keep you posted. On Thu, Oct 9, 20

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sung Hwan Chung
makes the ordering > deterministic. > > On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung > wrote: > > Let's say you have some rows in a dataset (say X partitions initially). > > > > A > > B > > C > > D > > E > > . > > . > > . > >

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Sung Hwan Chung
preserve the ordering of an RDD. > > On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung > wrote: > > I noticed that repartition will result in non-deterministic lineage > because > > it'll result in changed orders for rows. > > > > So for instance, if you do thin

Re: java.io.IOException Error in task deserialization

2014-10-08 Thread Sung Hwan Chung
This is also happening to me on a regular basis, when the job is large and uses relatively large serialized objects in each RDD lineage. What makes it worse is that this exception always stops the whole job. On Fri, Sep 26, 2014 at 11:17 AM, Brad Miller wrote: > FWIW I suspect that each cou

coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Sung Hwan Chung
I noticed that repartition results in a non-deterministic lineage because it changes the order of rows. So for instance, if you do things like: val data = read(...) val k = data.repartition(5) val h = k.repartition(5) It seems that this results in different orderings of rows for 'k'
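One way to pin row order, sketched under the assumption that rows are sortable: a sort after each repartition makes partition contents deterministic, so a recomputed partition matches the original. Input path is hypothetical.

    val data = sc.textFile("hdfs:///input")
    // sortBy imposes a deterministic order after the shuffle.
    val k = data.repartition(5).sortBy(identity)
    val h = k.repartition(5).sortBy(identity)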

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Sung Hwan Chung
12:13 PM, Akshat Aranya wrote: > > Using a var for RDDs in this way is not going to work. In this example, > tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after > that, you change what tx2 means, so you would end up having a circular > dependency.
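A sketch of the pattern this reply warns against, with hypothetical RDDs: zipping against a var and then reassigning that var; binding each iteration's result to a fresh val keeps the lineage straight.

    val tx1 = sc.parallelize(Seq(1.0, 2.0, 3.0))
    var tx2 = sc.parallelize(Seq(0.0, 0.0, 0.0))
    // Discouraged: reassigning the var the zipped RDD was built from.
    // tx2 = tx1.zip(tx2).map { case (x, state) => state + x }
    // Clearer: each iteration produces a fresh, immutable binding.
    val tx2Next = tx1.zip(tx2).map { case (x, state) => state + x }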

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Sung Hwan Chung
Using a var for RDDs in this way is not going to work. In this example, > tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after > that, you change what tx2 means, so you would end up having a circular > dependency. > >> On Wed, Oct 8, 2014 at 12:01 PM, Sung H

Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Sung Hwan Chung
My job is not being fault-tolerant (e.g., when there's a fetch failure or something). The lineage of the RDDs is constantly updated every iteration. However, I think that when there's a failure, the lineage information is not being correctly reapplied. It goes something like this: val rawRDD = read

Is RDD partition index consistent?

2014-10-06 Thread Sung Hwan Chung
Is the RDD partition index you get when you call mapPartitionsWithIndex consistent under fault-tolerance conditions? I.e. 1. Say index is 1 for one of the partitions when you call data.mapPartitionsWithIndex((index, rows) => ...) // Say index is 1 2. The partition fails (maybe along with a bunch o
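A small sketch with hypothetical data: the index passed to mapPartitionsWithIndex identifies the partition itself rather than the task attempt, so a re-executed task for partition 1 sees index 1 again.

    val data = sc.parallelize(1 to 100, 4)
    val tagged = data.mapPartitionsWithIndex { (index, rows) =>
      rows.map(r => (index, r)) // tag each row with its partition's index
    }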

Re: Accessing spark context from executors?

2014-07-31 Thread Sung Hwan Chung
Nevermind. Just creating an empty Hadoop configuration from the executors did the trick. On Thu, Jul 31, 2014 at 6:16 PM, Sung Hwan Chung wrote: > Is there any way to get the SparkContext object from an executor? Or the Hadoop > configuration, etc. The reason is that I would like to write to HDF
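A sketch of that workaround: SparkContext exists only on the driver, but a task can build its own Hadoop Configuration (picking up cluster defaults from the classpath) and write to HDFS directly. The rdd and output path are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    rdd.foreachPartition { rows =>
      val hadoopConf = new Configuration() // fresh config built on the executor
      val fs = FileSystem.get(hadoopConf)
      val out = fs.create(new Path(s"hdfs:///tmp/part-${java.util.UUID.randomUUID}"))
      rows.foreach(r => out.writeBytes(r.toString + "\n"))
      out.close()
    }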

Accessing spark context from executors?

2014-07-31 Thread Sung Hwan Chung
Is there any way to get the SparkContext object from an executor? Or the Hadoop configuration, etc. The reason is that I would like to write to HDFS from the executors.

Re: Getting the number of slaves

2014-07-30 Thread Sung Hwan Chung
you will get back 11 for both statuses. > > > > > 2014-07-28 15:40 GMT-07:00 Sung Hwan Chung : > > >> Do getExecutorStorageStatus and getExecutorMemoryStatus both return the >> number of executors + the driver? >> E.g., if I submit a job with 10

A task gets stuck after the following messages in the stderr log:

2014-07-30 Thread Sung Hwan Chung
14/07/30 16:08:00 INFO Executor: Running task ID 1199 14/07/30 16:08:00 INFO BlockManager: Found block broadcast_0 locally 14/07/30 16:08:00 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/30 16:08:00 INFO BlockFetcherIterator$Basic

Spark fault tolerance after an executor failure.

2014-07-30 Thread Sung Hwan Chung
I sometimes see that after fully caching the data, if one of the executors fails for some reason, that portion of the cache gets lost and does not get re-cached, even though there are plenty of resources. Is this a bug or normal behavior (v1.0.1)?

Re: Getting the number of slaves

2014-07-28 Thread Sung Hwan Chung
Do getExecutorStorageStatus and getExecutorMemoryStatus both return the number of executors + the driver? E.g., if I submit a job with 10 executors, I get 11 for getExecutorStorageStatus.length and getExecutorMemoryStatus.size On Thu, Jul 24, 2014 at 4:53 PM, Nicolas Mai wrote: > Thanks, this i
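A sketch based on the answer in this thread: both status collections include the driver's BlockManager, so subtract one for the executor count.

    val numExecutors = sc.getExecutorStorageStatus.length - 1
    val alsoNumExecutors = sc.getExecutorMemoryStatus.size - 1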

Re: collect on partitions gets very slow near the last few partitions.

2014-06-28 Thread Sung Hwan Chung
now groupBy followed by collect is very fast. On Sat, Jun 28, 2014 at 12:03 AM, Sung Hwan Chung wrote: > I'm finding the following messages in the driver. Can this potentially > have anything to do with these drastic slowdowns? > > > 14/06/28 00:00:17 INFO ShuffleBlockMana

Re: collect on partitions gets very slow near the last few partitions.

2014-06-28 Thread Sung Hwan Chung
14 at 11:35 PM, Sung Hwan Chung wrote: > I'm doing something like this: > > rdd.groupBy.map().collect() > > The workload on the final map is pretty much evenly distributed. > > When collect happens, say on 60 partitions, the first 55 or so partitions > finish very quic

collect on partitions gets very slow near the last few partitions.

2014-06-27 Thread Sung Hwan Chung
I'm doing something like this: rdd.groupBy.map().collect() The workload on the final map is pretty much evenly distributed. When collect happens, say on 60 partitions, the first 55 or so partitions finish very quickly, say within 10 seconds. However, the last 5, particularly the very last one, typic
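A quick diagnostic sketch for this straggler pattern: measure per-partition row counts after the groupBy to check whether a few hot keys concentrate most of the data in the last partitions. `grouped` stands in for the rdd.groupBy result.

    val sizes = grouped.mapPartitionsWithIndex { (i, rows) =>
      Iterator((i, rows.size)) // count rows without collecting them
    }.collect().sortBy(-_._2)
    sizes.take(5).foreach { case (i, n) => println(s"partition $i: $n rows") }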

Spark executor error

2014-06-25 Thread Sung Hwan Chung
I'm seeing the following message in the log of an executor. Has anyone seen this error? After this, the executor seems to lose the cache, but besides that the whole thing slows down drastically - i.e., it gets stuck in a reduce phase for 40+ minutes, whereas before it was finishing reduces in 2~3 se

Does Spark restart cached workers even without failures?

2014-06-25 Thread Sung Hwan Chung
I'm doing coalesce with shuffle, cache, and then thousands of iterations. I noticed that sometimes Spark would, for no particular reason, perform a partial coalesce again after running for a long time - and there was no exception or failure on the workers' part. Why is this happening?

Number of executors smaller than requested in YARN.

2014-06-25 Thread Sung Hwan Chung
Hi, When I try requesting a large number of executors - e.g. 242 - it doesn't seem to actually reach that number. E.g., under the executors tab, I only see executor IDs of up to 234. This is despite the fact that there's plenty more memory available, as well as CPU cores, etc., in the system. In fact,

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Sung Hwan Chung
I suppose what I want is the memory efficiency of toLocalIterator and the speed of collect. Is there any such thing? On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung wrote: > Hello, > > I noticed that the final reduce function happens in the driver node with a > code that lo

Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Sung Hwan Chung
Hello, I noticed that the final reduce function happens on the driver node with code that looks like the following: val outputMap = rdd.mapPartitions(doSomething).reduce((a: Map, b: Map) => a.merge(b)) although individual outputs from the mappers are small. Over time the aggregated result outputMap co
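A hedged sketch of one way to get both properties asked about above: build per-partition maps, merge them in the cluster with reduceByKey instead of in the driver's reduce, then stream the result back one partition at a time with toLocalIterator. The counting logic is a stand-in for the real aggregation, and rdd is assumed to be an RDD[String].

    val merged = rdd.mapPartitions { rows =>
      val m = scala.collection.mutable.Map.empty[String, Long]
      rows.foreach(r => m(r) = m.getOrElse(r, 0L) + 1L) // stand-in aggregation
      m.iterator
    }.reduceByKey(_ + _) // merge happens in the cluster, not on the driver

    // Only one partition's results are resident in the driver at a time.
    merged.toLocalIterator.foreach { case (k, v) => println(s"$k -> $v") }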

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
Additionally, I've encountered a confusing situation where the locality level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache the data. I wonder if some implicit caching happens even without the user specifying it. On Thu, Jun 5, 2014 at 3:50 PM, Su

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
want to prevent this from happening entirely, you can set the > values to ridiculously high numbers. The documentation also mentions that > "0" has special meaning, so you can try that as well. > > Good luck! > Andrew > > > On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwa
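A sketch of that tuning, with illustrative values (milliseconds in 1.x-era Spark):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Keep waiting for a PROCESS_LOCAL slot instead of downgrading.
      .set("spark.locality.wait", "1000000")
      .set("spark.locality.wait.process", "1000000")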

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
Is this the case? On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung wrote: > I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume > that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. > > When these happen things get extremely slow. > > Does thi

When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. When these happen things get extremely slow. Does this mean that the executor got terminated and restarted? Is there a way to prevent this from happening (ba

Re: Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
Nevermind, it turns out that this is a problem with the Pivotal Hadoop distribution that we are trying to compile against. On Wed, Jun 4, 2014 at 4:16 PM, Sung Hwan Chung wrote: > When I run sbt/sbt assembly, I get the following exception. Is anyone else > experiencing a similar p

Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
When I run sbt/sbt assembly, I get the following exception. Is anyone else experiencing a similar problem? .. [info] Resolving org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016 ... [info] Updating {file:/Users/Sung/Projects/spark_06_04_14/}assembly... [info] Resolving org.fuses

Checking Spark cache percentage programmatically. And how to clear the cache.

2014-05-28 Thread Sung Hwan Chung
Hi, Is there a programmatic way of checking whether an RDD has been 100% cached or not? I'd like to do this to have two different pathways. Additionally, how do you clear the cache (e.g. if you want to cache different RDDs and you'd like to clear an existing cached RDD)? Thanks!
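A sketch, assuming an RDD `data` with cache() already called on it: read the cached fraction from the context's storage info, and call unpersist() to evict it before caching something else.

    val info = sc.getRDDStorageInfo.find(_.id == data.id)
    val cachedFraction = info
      .map(i => i.numCachedPartitions.toDouble / i.numPartitions)
      .getOrElse(0.0)
    if (cachedFraction >= 1.0) {
      // fully cached path
    }
    data.unpersist() // frees the cache for a different RDD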

Re: is it okay to reuse objects across RDDs?

2014-04-28 Thread Sung Hwan Chung
what I said came through. RDD zip is not hacky at all, as >> it only depends on a user not changing the partitioning. Basically, you >> would keep your losses as an RDD[Double] and zip those with the RDD of >> examples, and update the losses. You're doing a copy (and GC) on
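A sketch of the zip pattern described here, with hypothetical types and paths: keep per-row losses as their own RDD, zip with the examples each iteration, and produce a new losses RDD instead of mutating rows in place. zip lines up only if the partitioning of both RDDs stays unchanged.

    import org.apache.spark.rdd.RDD

    val examples: RDD[Array[Double]] =
      sc.objectFile[Array[Double]]("hdfs:///examples") // hypothetical input
    var losses: RDD[Double] = examples.map(_ => 0.0).cache()
    for (iter <- 1 to 10) {
      losses = examples.zip(losses).map { case (ex, loss) =>
        loss + ex.sum * 0.01 // stand-in for the real loss update
      }.cache()
    }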

Re: is it okay to reuse objects across RDDs?

2014-04-28 Thread Sung Hwan Chung
multiple > threads/partitions on one host, so the map should be viewed as shared > amongst a larger space. > > > > Also, with your exact description it sounds like your data should be > encoded into the RDD if it's per-record/per-row: RDD[(MyBaseData, > LastIterationSideValues)

Re: is it okay to reuse objects across RDDs?

2014-04-28 Thread Sung Hwan Chung
2014 at 1:35 AM, Sean Owen wrote: > On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung > wrote: > >> Actually, I do not know how to do something like this or whether this is >> possible - thus my suggestive statement. >> >> Can you already declare persistent memo

Re: is it okay to reuse objects across RDDs?

2014-04-28 Thread Sung Hwan Chung
need to actually serialize singletons and pass it back and forth in a weird manner. On Mon, Apr 28, 2014 at 1:23 AM, Sean Owen wrote: > On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung > wrote: >> >> e.g. something like >> >> rdd.mapPartitions((rows : Iterator[String]) =>

Re: is it okay to reuse objects across RDDs?

2014-04-28 Thread Sung Hwan Chung
Yes, this is a useful trick we found that made our algorithm implementation noticeably faster (btw, we'll send a pull request for this GLMNET implementation, so interested people could look at it). It would be nice if Spark supported something akin to this natively, as I believe that many efficien

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Sung Hwan Chung
tation. On Mon, Apr 21, 2014 at 11:15 AM, Marcelo Vanzin wrote: > Hi Sung, > > On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung > wrote: > > The goal is to keep an intermediate value per row in memory, which would > > allow faster subsequent computations. I.e., computeS

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Sung Hwan Chung
2014 at 5:54 PM, Marcelo Vanzin wrote: > Hi Sung, > > On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung > wrote: > > while (true) { > > rdd.map((row : Array[Double]) => { > > row(numCols - 1) = computeSomething(row) > > }).reduce(...) > > }

Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-18 Thread Sung Hwan Chung
Are there scenarios where the developers have to be aware of how Spark's fault tolerance works to implement correct programs? It seems that if we want to maintain any sort of mutable state in each worker through iterations, it can have some unintended effect once a machine goes down. E.g., while

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Sorry, that was incomplete information. I think Spark's compression helped (not sure how much, though), so the actual memory requirement may have been smaller. On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung wrote: > I would argue that memory in clusters is still a limited resource

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
most clusters running Spark. > > YARN integration is actually complete in CDH5.0. We support it as well as > standalone mode. > > > > > On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen wrote: >> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung >> wrote: >> > D

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
use them to > decrease features, right? > > A feature extraction algorithm like matrix factorization and its variants > could be used to decrease the feature space as well... > > > > On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung < > coded...@cs.stanford.edu>

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
I'd estimate a couple of gigs > necessary for heap space for the worker to compute/store the histograms, > and I guess 2x that on the master to do the reduce. > > Again 2GB per worker is pretty tight, because there are overheads of just > starting the JVM, launching a worker, l

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
with the > memory overheads in model training when trees get deep - if someone wants > to modify the current implementation of trees in MLlib and contribute this > optimization as a pull request, it would be welcome! > > At any rate, we'll take this feedback into account

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
does the paper also propose to grow a shallow tree? > > Thanks. > Deb > > > On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung > wrote: > >> Additionally, the 'random features per node' (or mtry in R) is a very >> important feature for Random Forest. The vari

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
have to be supported at the tree level. On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung wrote: > Well, if you read the original paper, > http://oz.berkeley.edu/~breiman/randomforest2001.pdf > "Grow the tree using CART methodology to maximum size and do not prune."

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
ting. I agree with the assessment that forests are a variance > reduction technique, but I'd be a little surprised if a bunch of hugely > deep trees don't overfit to training data. I guess I view limiting tree > depth as an analogue to regularization in linear models. > >

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
if the decision tree gets deep, but I'd recommend a lot more > memory than 2-4GB per worker for most big data workloads. > > > > > On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung < > coded...@cs.stanford.edu> wrote: > >> Debasish, we've tested the M

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker. With RF, it's very easy to get 100+ depth with even only 100,000+ rows (because trees

Re: Mutable tagging RDD rows?

2014-03-28 Thread Sung Hwan Chung
ow table semantics. -- Christopher T. Nguyen, Co-founder & CEO, Adatao. On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung > wrote: > >> Hey guys, >> >> I need to tag indiv

Mutable tagging RDD rows?

2014-03-28 Thread Sung Hwan Chung
Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDDs (I suppose this is sort of like a mutable RDD, but it's more)? If not, what would be the best way to do something like this? Basically, we need to keep mutable i

Re: YARN problem using an external jar in worker nodes

2014-03-27 Thread Sung Hwan Chung
0.9.0? >> >> >> On Wed, Mar 26, 2014 at 1:47 PM, Sandy Ryza wrote: >> >>> Hi Sung, >>> >>> Are you using yarn-standalone mode? Have you specified the --addJars >>> option with your external jars? >>> >>>

Re: YARN problem using an external jar in worker nodes

2014-03-27 Thread Sung Hwan Chung
your external jars? > > -Sandy > > > On Wed, Mar 26, 2014 at 1:17 PM, Sung Hwan Chung > wrote: > >> Hello (this is YARN-related). >> >> I'm able to load an external jar and use its classes within >> ApplicationMaster. I wish to use this jar wit

YARN problem using an external jar in worker nodes

2014-03-26 Thread Sung Hwan Chung
Hello (this is YARN-related). I'm able to load an external jar and use its classes within the ApplicationMaster. I wish to use this jar within the worker nodes, so I added sc.addJar(pathToJar) and ran. I get the following exception: org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4 times