Ok, so that worked flawlessly after I upped the number of partitions to 400
from 40.
Thanks!
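For reference, a minimal sketch of the fix described above: raising the partition count before the shuffle-heavy reduceByKey. The input path, key extraction, and pair shape are made up for illustration; it assumes an existing SparkContext sc.

import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Long)] =
  sc.textFile("hdfs:///some/input")                 // hypothetical input
    .map(line => (line.split(",")(0), 1L))
val counts = pairs
  .repartition(400)                                 // was 40; more partitions means less shuffle state per task
  .reduceByKey(_ + _)
counts.count()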
On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung
wrote:
> I'll try that, as of now I have a small number of partitions in the order
> of 20~40.
>
> It would be great if there's
the user space).
Otherwise, it's like shooting in the dark.
On Fri, May 13, 2016 at 7:20 PM, Ted Yu wrote:
> Have you taken a look at SPARK-11293 ?
>
> Consider using repartition to increase the number of partitions.
>
> FYI
>
> On Fri, May 13, 2016 at 12:14 PM, Sun
Hello,
I'm using Spark version 1.6.0 and have trouble with memory when trying to
do reduceByKey on a dataset with as many as 75 million keys. Specifically, I
get the following exception when I run the task.
There are 20 workers in the cluster. It is running under the standalone
mode with 12 GB assigned pe
Hello,
Say, that I'm doing a simple rdd.map followed by collect. Say, also, that
one of the executors finish all of its tasks, but there are still other
executors running.
If the machine that hosted the finished executor gets terminated, does the
master still have the results from the finished ta
tOnCancel set, then you
> can catch the interrupt exception within the Task.
>
> On Wed, Apr 6, 2016 at 12:24 PM, Sung Hwan Chung
> wrote:
>
>> Hi,
>>
>> I'm looking for ways to add shutdown hooks to executors : i.e., when a
>> Job is forcefully terminated
Hi,
I'm looking for ways to add shutdown hooks to executors, i.e., for when a job
is forcefully terminated before it finishes.
The scenario goes like this: executors are running a long-running job
within a 'map' function. The user decides to terminate the job, and then the
mappers should perform some
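A rough sketch of the interruptOnCancel approach mentioned in the quoted reply above: mark the job group as interruptible, and catch the interrupt inside the map to run cleanup. expensiveWork and taskCleanup are hypothetical stand-ins, not actual Spark APIs.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cancellable-job"))

def expensiveWork(x: Int): Int = { Thread.sleep(10); x * 2 }      // stand-in for the real per-record work
def taskCleanup(): Unit = println("cleaning up partial state")    // stand-in "shutdown hook"

sc.setJobGroup("long-job", "interruptible long-running map", interruptOnCancel = true)
val result = sc.parallelize(1 to 1000000, 100).map { x =>
  try {
    expensiveWork(x)
  } catch {
    case e: InterruptedException =>
      taskCleanup()          // runs inside the task when the job group is cancelled
      throw e
  }
}
result.count()

// From another thread on the driver, the user would call:
// sc.cancelJobGroup("long-job")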
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 March 2016 at 23:30, Sung Hwan Chung
> wrote:
>
>> Yea, that seems to be the case. It seems that dynamically resizing a
>> standalone Spark cluster is very
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 March 2016 at 22:58, Sung Hwan Chung
> wrote:
>
>> It seems that the conf/slaves file is only for consumption by the
>> following scripts:
>>
>> sbin/start-slaves.sh
>> sbin/stop-slaves.sh
It seems that the conf/slaves file is only for consumption by the following
scripts:
sbin/start-slaves.sh
sbin/stop-slaves.sh
sbin/start-all.sh
sbin/stop-all.sh
I.e., conf/slaves file doesn't affect a running cluster.
Is this true?
On Mon, Mar 28, 2016 at 9:31 PM, Sung Hwan Chung
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 March 2016 at 22:06, Sung Hwan Chung
> wrote:
>
>> Hello,
>>
>> I found that I could dynamically add/remove new workers to a runn
Hello,
I found that I could dynamically add/remove new workers to a running
standalone Spark cluster by simply triggering:
start-slave.sh (SPARK_MASTER_ADDR)
and
stop-slave.sh
E.g., I could instantiate a new AWS instance and just add it to a running
cluster without needing to add it to slaves
Hello,
We are using the default compression codec for Parquet when we store our
dataframes. The dataframe has a StringType column whose values can be up to
several MB in size.
The funny thing is that once it's stored, we can browse the file content
with a plain text editor and see large portions of
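For context, a small sketch of setting the Parquet codec explicitly instead of relying on the default, assuming Spark 1.x with a SQLContext; the input and output paths are made up.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)        // assumes an existing SparkContext sc
// Accepted values in 1.x include: uncompressed, snappy, gzip, lzo
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

val df = sqlContext.read.json("hdfs:///tmp/input.json")   // hypothetical source with a large string column
df.write.parquet("hdfs:///tmp/output.parquet")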
t; yes, you can add/remove instances to the cluster on fly (CORE instances
>> support add only, TASK instances - add and remove)
>>
>>
>>
>> On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> I noticed
, Alexander Pivovarov
wrote:
> you can use EMR-4.3.0 run on spot instances to control the price
>
> yes, you can add/remove instances to the cluster on fly (CORE instances
> support add only, TASK instances - add and remove)
>
>
>
> On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwa
I noticed that in the main branch, the ec2 directory along with the
spark-ec2 script is no longer present.
Is spark-ec2 going away in the next release? If so, what would be the best
alternative at that time?
A couple more additional questions:
1. Is there any way to add/remove additional workers
is issue? That will help us a lot to improve
> TorrentBroadcast.
>
> Thanks!
>
> On Fri, Oct 10, 2014 at 8:46 AM, Sung Hwan Chung
> wrote:
> > I haven't seen this at all since switching to HttpBroadcast. It seems
> > TorrentBroadcast might have some issues?
>
Un-needed checkpoints are not getting automatically deleted in my
application.
I.e., the lineage looks something like this, and checkpoints simply
accumulate in a temporary directory (every lineage point, however, does zip
with a globally permanent RDD):
PermanentRDD:Global zips with all the interm
I haven't seen this at all since switching to HttpBroadcast. It seems
TorrentBroadcast might have some issues?
On Thu, Oct 9, 2014 at 4:28 PM, Sung Hwan Chung
wrote:
> I don't think that I saw any other error message. This is all I saw.
>
> I'm currently experimenti
I'm getting a DFS closed channel exception every now and then when I run a
checkpoint. I do checkpointing every 15 minutes or so. This usually happens
after running the job for 1~2 hours. Anyone seen this before?
Job aborted due to stage failure: Task 6 in stage 70.0 failed 4 times,
most recent failur
I don't think that I saw any other error message. This is all I saw.
I'm currently experimenting to see if this can be alleviated by using
HttpBroadcastFactory instead of TorrentBroadcast. So far, with
HttpBroadcast, I haven't seen this recurring as of yet. I'll keep you
posted.
On Thu, Oct 9, 20
makes the ordering
> deterministic.
>
> On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
> wrote:
> > Let's say you have some rows in a dataset (say X partitions initially).
> >
> > A
> > B
> > C
> > D
> > E
> > .
> > .
> > .
> >
rve the ordering of an RDD.
>
> On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
> wrote:
> > I noticed that repartition will result in non-deterministic lineage
> because
> > it'll result in changed orders for rows.
> >
> > So for instance, if you do thin
This is also happening to me on a regular basis when the job is large, with
relatively large serialized objects used in each RDD lineage. What makes it
worse is that this exception always stops the whole job.
On Fri, Sep 26, 2014 at 11:17 AM, Brad Miller
wrote:
> FWIW I suspect that each cou
I noticed that repartition results in a non-deterministic lineage because
it changes the order of rows.
So for instance, if you do things like:
val data = read(...)
val k = data.repartition(5)
val h = k.repartition(5)
It seems that this results in different ordering of rows for 'k'
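A hedged sketch of one way to get a reproducible order back after repartition, along the lines of the 'makes the ordering deterministic' reply quoted above: attach a stable index to the data RDD from the example before repartitioning, then sort by it afterwards.

// Pair each row with a stable index before any repartition
val indexed = data.zipWithIndex().map { case (row, idx) => (idx, row) }

val k = indexed.repartition(5)
// Sorting by the stable index restores a deterministic row order
val kOrdered = k.sortByKey().values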
12:13 PM, Akshat Aranya wrote:
>
> Using a var for RDDs in this way is not going to work. In this example,
> tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after
> that, you change what tx2 means, so you would end up having a circular
> dependency.
>
&
ing a var for RDDs in this way is not going to work. In this example,
> tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after
> that, you change what tx2 means, so you would end up having a circular
> dependency.
>
>> On Wed, Oct 8, 2014 at 12:01 PM, Sung H
My job is not being fault-tolerant (e.g., when there's a fetch failure or
something).
The RDD lineages are constantly updated every iteration. However, I
think that when there's a failure, the lineage information is not being
correctly reapplied.
It goes something like this:
val rawRDD = read
Is the RDD partition index you get when you call mapPartitionsWithIndex
consistent under fault-tolerance conditions?
I.e.
1. Say index is 1 for one of the partitions when you call
data.mapPartitionsWithIndex((index, rows) => ...) // Say index is 1
2. The partition fails (maybe along with a bunch o
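For illustration, a minimal sketch of the mapPartitionsWithIndex call in question; whether the index stays the same across task retries is exactly what is being asked, so this only shows the tagging itself, reusing the data RDD named above.

val tagged = data.mapPartitionsWithIndex { (index, rows) =>
  rows.map(row => (index, row))   // tag each row with its partition's index
}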
Nevermind. Just creating an empty hadoop configuration from executors did
the trick.
On Thu, Jul 31, 2014 at 6:16 PM, Sung Hwan Chung
wrote:
> Is there any way to get SparkContext object from executor? Or hadoop
> configuration, etc. The reason is that I would like to write to HDF
Is there any way to get SparkContext object from executor? Or hadoop
configuration, etc. The reason is that I would like to write to HDFS from
executors.
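A rough sketch of the approach that ended up working in the follow-up above: building a Hadoop Configuration inside the executor and writing through the HDFS FileSystem API. The RDD and output path are hypothetical and error handling is minimal.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val rdd = sc.parallelize(1 to 1000, 10)   // hypothetical data to write out
rdd.foreachPartition { rows =>
  // Created on the executor, not shipped from the driver
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val out = fs.create(new Path(s"/tmp/output/part-${java.util.UUID.randomUUID()}"))
  try {
    rows.foreach(row => out.writeBytes(row.toString + "\n"))
  } finally {
    out.close()
  }
}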
> you will get back 11 for both statuses.
> >
> >
> > 2014-07-28 15:40 GMT-07:00 Sung Hwan Chung :
> >
> >> Do getExecutorStorageStatus and getExecutorMemoryStatus both return the
> >> number of executors + the driver?
> >> E.g., if I submit a job with 10
14/07/30 16:08:00 INFO Executor: Running task ID 1199
14/07/30 16:08:00 INFO BlockManager: Found block broadcast_0 locally
14/07/30 16:08:00 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/07/30 16:08:00 INFO BlockFetcherIterator$Basic
I sometimes see that after fully caching the data, if one of the executors
fails for some reason, that portion of the cache gets lost and does not get
re-cached, even though there are plenty of resources. Is this a bug or a
normal behavior (V1.0.1)?
Do getExecutorStorageStatus and getExecutorMemoryStatus both return the
number of executors + the driver?
E.g., if I submit a job with 10 executors, I get 11 for
getExecutorStorageStatus.length and getExecutorMemoryStatus.size
On Thu, Jul 24, 2014 at 4:53 PM, Nicolas Mai wrote:
> Thanks, this i
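For reference, a small sketch of that count, assuming the quoted answer (both methods include an entry for the driver, so subtract one to get the executor count) and an active SparkContext sc:

// Both of these appear to include the driver's entry in addition to the executors
val byMemory  = sc.getExecutorMemoryStatus.size       // e.g. 11 when 10 executors were requested
val byStorage = sc.getExecutorStorageStatus.length    // e.g. 11 as well
val executorCount = byMemory - 1                      // subtract the driver's entry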
, now groupby followed by collect is very fast.
On Sat, Jun 28, 2014 at 12:03 AM, Sung Hwan Chung
wrote:
> I'm finding the following messages in the driver. Can this potentially
> have anything to do with these drastic slowdowns?
>
>
> 14/06/28 00:00:17 INFO ShuffleBlockMana
14 at 11:35 PM, Sung Hwan Chung
wrote:
> I'm doing something like this:
>
> rdd.groupBy.map().collect()
>
> The work load on final map is pretty much evenly distributed.
>
> When collect happens, say on 60 partitions, the first 55 or so partitions
> finish very quic
I'm doing something like this:
rdd.groupBy.map().collect()
The workload on the final map is pretty much evenly distributed.
When collect happens, say on 60 partitions, the first 55 or so partitions
finish very quickly say within 10 seconds. However, the last 5,
particularly the very last one, typic
I'm seeing the following message in the log of an executor. Anyone
seen this error? After this, the executor seems to lose the cache, and
besides that the whole thing slows down drastically - i.e., it gets
stuck in a reduce phase for 40+ minutes, whereas before it was
finishing reduces in 2~3 se
I'm doing coalesce with shuffle, cache and then do thousands of iterations.
I noticed that sometimes Spark would, for no apparent reason, perform a
partial coalesce again after running for a long time - and there was no
exception or failure on the worker's part.
Why is this happening?
Hi,
When I try requesting a large number of executors - e.g., 242 - it doesn't
seem to actually reach that number. E.g., under the executors tab, I only
see executor IDs of up to 234.
This is despite the fact that there's plenty more memory available, as well
as CPU cores, etc., in the system. In fact,
I suppose what I want is the memory efficiency of toLocalIterator and the
speed of collect. Is there any such thing?
On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung
wrote:
> Hello,
>
> I noticed that the final reduce function happens in the driver node with a
> code that lo
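For context, a small sketch contrasting the two APIs being weighed above, assuming a hypothetical RDD[String]: collect materializes everything on the driver at once, while toLocalIterator pulls one partition at a time, trading speed for driver memory.

val rdd = sc.textFile("hdfs:///some/text")   // hypothetical RDD[String]

// Everything at once: fast, but the whole result must fit in driver memory
val all: Array[String] = rdd.collect()

// One partition at a time: far smaller driver footprint, but it runs one job per partition
rdd.toLocalIterator.foreach(println)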
Hello,
I noticed that the final reduce function happens in the driver node, with
code that looks like the following:
val outputMap = rdd.mapPartitions(doSomething).reduce { (a, b) =>
  a ++ b // merge the two partial maps
}
although individual outputs from mappers are small. Over time the
aggregated result outputMap co
Additionally, I've encountered a confusing situation where the locality
level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache
the data. I wonder if some implicit caching happens even without the user
specifying it.
On Thu, Jun 5, 2014 at 3:50 PM, Su
ant to prevent this from happening entirely, you can set the
> values to ridiculously high numbers. The documentation also mentions that
> "0" has special meaning, so you can try that as well.
>
> Good luck!
> Andrew
>
>
> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwa
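The settings being referred to are presumably the locality-wait ones; a hedged sketch with arbitrary values (milliseconds), where very large values effectively keep tasks from falling back from PROCESS_LOCAL and "0" disables the wait entirely:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-test")
  // Time to wait for a better-locality slot before falling back a level
  .set("spark.locality.wait", "600000")
  .set("spark.locality.wait.process", "600000")
  .set("spark.locality.wait.node", "600000")
  .set("spark.locality.wait.rack", "600000")
val sc = new SparkContext(conf)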
s the case?
On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung
wrote:
> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume
> that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
>
> When these happen things get extremely slow.
>
> Does thi
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume
that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
When these happen things get extremely slow.
Does this mean that the executor got terminated and restarted?
Is there a way to prevent this from happening (ba
Nevermind, it turns out that this is a problem with the Pivotal Hadoop that
we are trying to compile against.
On Wed, Jun 4, 2014 at 4:16 PM, Sung Hwan Chung
wrote:
> When I run sbt/sbt assembly, I get the following exception. Is anyone else
> experiencing a similar p
When I run sbt/sbt assembly, I get the following exception. Is anyone else
experiencing a similar problem?
..
[info] Resolving org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016
...
[info] Updating {file:/Users/Sung/Projects/spark_06_04_14/}assembly...
[info] Resolving org.fuses
Hi,
Is there a programmatic way of checking whether an RDD has been 100% cached
or not? I'd like to do this in order to take two different code paths.
Additionally, how do you clear the cache (e.g., if you want to cache
different RDDs and you'd like to clear an existing cached RDD)?
Thanks!
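A rough sketch of one programmatic check using getRDDStorageInfo (comparing cached vs. total partitions), plus unpersist() for clearing a cached RDD; the RDD itself is a made-up example and sc is assumed to exist.

val myRDD = sc.parallelize(1 to 1000000, 100).cache()   // hypothetical RDD
myRDD.count()                                           // materialize the cache

// Fully cached? Compare cached vs. total partitions in the storage info.
val info = sc.getRDDStorageInfo.find(_.id == myRDD.id)
val fullyCached = info.exists(i => i.numCachedPartitions == i.numPartitions)

if (fullyCached) {
  // fast path against the cached data
} else {
  // fallback path
}

myRDD.unpersist()   // clears this RDD's cache (blocking by default in 1.x)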
what I said came through. RDD zip is not hacky at all, as
>> it only depends on a user not changing the partitioning. Basically, you
>> would keep your losses as an RDD[Double] and zip those with the RDD of
>> examples, and update the losses. You're doing a copy (and GC) on
multiple
> threads/partitions on one host so the map should be viewed as shared
> amongst a larger space.
>
>
>
> Also with your exact description it sounds like your data should be
> encoded into the RDD if its per-record/per-row: RDD[(MyBaseData,
> LastIterationSideValues)
2014 at 1:35 AM, Sean Owen wrote:
> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung > wrote:
>
>> Actually, I do not know how to do something like this or whether this is
>> possible - thus my suggestive statement.
>>
>> Can you already declare persistent memo
ed to actually serialize singletons and pass it
back and forth in a weird manner.
On Mon, Apr 28, 2014 at 1:23 AM, Sean Owen wrote:
> On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung > wrote:
>>
>> e.g. something like
>>
>> rdd.mapPartition((rows : Iterator[String]) =>
Yes, this is a useful trick we found that made our algorithm implementation
noticeably faster (btw, we'll send a pull request for this GLMNET
implementation, so interested people could look at it).
It would be nice if Spark supported something akin to this natively, as I
believe that many efficien
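A rough sketch of the zip-based trick being described: keep the per-row side values as a separate RDD and zip it with the examples each iteration. The data, updateLoss, and the cache/unpersist choreography are illustrative assumptions, not the actual GLMNET code.

import org.apache.spark.rdd.RDD

val examples: RDD[Array[Double]] =                      // hypothetical training data
  sc.parallelize(1 to 100000, 20).map(i => Array(i.toDouble, i * 2.0)).cache()
def updateLoss(x: Array[Double], loss: Double): Double = loss + x.sum   // stand-in update rule

var losses: RDD[Double] = examples.map(_ => 0.0).cache()
val numIterations = 10
for (iter <- 1 to numIterations) {
  // examples and losses share partitioning and element counts, so zip is safe here
  val updated = examples.zip(losses).map { case (x, loss) => updateLoss(x, loss) }.cache()
  updated.count()       // materialize before dropping the previous iteration's RDD
  losses.unpersist()
  losses = updated
}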
tation.
On Mon, Apr 21, 2014 at 11:15 AM, Marcelo Vanzin wrote:
> Hi Sung,
>
> On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
> wrote:
> > The goal is to keep an intermediate value per row in memory, which would
> > allow faster subsequent computations. I.e., computeS
, 2014 at 5:54 PM, Marcelo Vanzin wrote:
> Hi Sung,
>
> On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung
> wrote:
> > while (true) {
> > rdd.map((row : Array[Double]) => {
> > row(numCols - 1) = computeSomething(row)
> > }).reduce(...)
> > }
>
Are there scenarios where the developers have to be aware of how Spark's
fault tolerance works to implement correct programs?
It seems that if we want to maintain any sort of mutable state in each
worker through iterations, it can have unintended effects once a
machine goes down.
E.g.,
while
Sorry, that was incomplete information. I think Spark's compression helped
(not sure how much, though), so the actual memory requirement may have been
smaller.
On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung
wrote:
> I would argue that memory in clusters is still a limited resource
st clusters running Spark.
>
> YARN integration is actually complete in CDH5.0. We support it as well as
> standalone mode.
>
>
>
>
> On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen wrote:
>
>> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>> wrote:
>> > D
e them to
> decrease features right ?
>
> A feature extraction algorithm like matrix factorization and it's variants
> could be used to decrease feature space as well...
>
>
>
> On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung <
> coded...@cs.stanford.edu>
'd estimate a couple of gigs
> necessary for heap space for the worker to compute/store the histograms,
> and I guess 2x that on the master to do the reduce.
>
> Again 2GB per worker is pretty tight, because there are overheads of just
> starting the jvm, launching a worker, l
with the
> memory overheads in model training when trees get deep - if someone wants
> to modify the current implementation of trees in MLlib and contribute this
> optimization as a pull request, it would be welcome!
>
> At any rate, we'll take this feedback into account
do the paper also propose to grow a shallow tree ?
>
> Thanks.
> Deb
>
>
> On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung > wrote:
>
>> Additionally, the 'random features per node' (or mtry in R) is a very
>> important feature for Random Forest. The vari
have to be supported at the tree level.
On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung
wrote:
> Well, if you read the original paper,
> http://oz.berkeley.edu/~breiman/randomforest2001.pdf
> "Grow the tree using CART methodology to maximum size and do not prune."
>
>
ting. I agree with the assessment that forests are a variance
> reduction technique, but I'd be a little surprised if a bunch of hugely
> deep trees don't overfit to training data. I guess I view limiting tree
> depth as an analogue to regularization in linear models.
>
>
f the decision tree gets deep, but I'd recommend a lot more
> memory than 2-4GB per worker for most big data workloads.
>
>
>
>
>
> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Debasish, we've tested the M
Debasish, we've tested the MLlib decision tree a bit and it eats up too
much memory for RF purposes.
Once the tree got to depth 8~9, it was easy to get a heap exception, even
with 2~4 GB of memory per worker.
With RF, it's very easy to get 100+ depth with only 100,000+
rows (because trees
ow table semantics.
>
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
>
> On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung > wrote:
>
>> Hey guys,
>>
>> I need to tag indiv
Hey guys,
I need to tag individual RDD lines with some values. This tag value would
change at every iteration. Is this possible with an RDD (I suppose this is
sort of like a mutable RDD, but it's more)?
If not, what would be the best way to do something like this? Basically, we
need to keep mutable i
0.9.0?
>>
>>
>> On Wed, Mar 26, 2014 at 1:47 PM, Sandy Ryza wrote:
>>
>>> Hi Sung,
>>>
>>> Are you using yarn-standalone mode? Have you specified the --addJars
>>> option with your external jars?
>>>
>>>
ur external jars?
>
> -Sandy
>
>
> On Wed, Mar 26, 2014 at 1:17 PM, Sung Hwan Chung > wrote:
>
>> Hello, (this is Yarn related)
>>
>> I'm able to load an external jar and use its classes within
>> ApplicationMaster. I wish to use this jar wit
Hello, (this is Yarn related)
I'm able to load an external jar and use its classes within the
ApplicationMaster. I wish to use this jar on the worker nodes, so I added
sc.addJar(pathToJar) and ran the job.
I get the following exception:
org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4
times