Re: Master registers itself at startup?

2014-04-13 Thread Gerd Koenig
@YouPeng, @Aaron, many thanks for the memory-setting hint. That solved the issue; I just increased it to the default value of 512MB. Thanks, Gerd. On 14 April 2014 03:22, YouPeng Yang wrote: > Hi > > The 512MB is the default memory size which each executor needs. and > actually, your job does not ne

Re: Huge matrix

2014-04-13 Thread Guillaume Pitel
On 04/12/2014 06:35 PM, Xiaoli Li wrote: Hi Guillaume, This sounds like a good idea to me. I am a newbie here. Could you further explain how you will determine which clusters to keep? According to the distance between each el

moving SparkContext around

2014-04-13 Thread Schein, Sagi
A few questions about the resilience of the client side of Spark. What would happen if the client process crashes; can it reconstruct its state? Suppose I just want to serialize it and reload it back: is this possible? A more advanced use case: is there a way to move a SparkContext between jvms/mac

how to count maps within a node?

2014-04-13 Thread Joe L
Hi, I want to count maps within a node and return the counts to the driver without too much shuffling. I think I can improve my performance by doing so. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-count-maps-within-a-node-tp4196.html
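A minimal sketch of one way to do this, counting records per partition with mapPartitionsWithIndex so that only the small per-partition totals travel back to the driver (the RDD below is a placeholder, not Joe's actual data):

    // In the spark-shell, where sc is already defined.
    val rdd = sc.parallelize(1 to 1000, 4)   // 4 partitions, for illustration

    // Count inside each partition; the counting happens locally and only
    // small (partitionId, count) pairs are collected, so no wide shuffle is needed.
    val counts = rdd.mapPartitionsWithIndex { (idx, iter) =>
      Iterator((idx, iter.size))
    }.collect()

    counts.foreach { case (idx, n) => println(s"partition $idx -> $n records") }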

How to set spark worker memory size?

2014-04-13 Thread Joe L
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-spark-worker-memory-size-tp4195.html

how to count maps without shuffling too much data?

2014-04-13 Thread Joe L
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-count-maps-without-shuffling-too-much-data-tp4194.html

Checkpoint Vs Cache

2014-04-13 Thread David Thomas
What is the difference between checkpointing and caching an RDD?
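A minimal spark-shell sketch showing the two calls side by side (the checkpoint path is a placeholder): cache() keeps partitions in memory but retains the full lineage, while checkpoint() writes the data to reliable storage and truncates the lineage so it no longer needs to be recomputed from the original input.

    // In the spark-shell, where sc is already defined.
    sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder path

    val rdd = sc.parallelize(1 to 100).map(_ * 2)

    rdd.cache()        // in-memory only; lineage is kept, lost partitions are recomputed
    rdd.checkpoint()   // written to the checkpoint dir; lineage is truncated

    rdd.count()        // the first action materializes both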

Re: Master registers itself at startup?

2014-04-13 Thread YouPeng Yang
Hi. The 512MB is the default memory size that each executor needs, and actually your job does not need as much as the default memory size. You can create a SparkContext with sc = new SparkContext("local-cluster[2,1,512]", "test") // suppose you use the local-cluster mode. Here the 512 is the m
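The bracketed numbers in that master URL are, in order, the number of workers, the cores per worker, and the memory per worker in MB; a small illustrative sketch:

    import org.apache.spark.SparkContext

    // local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
    // Here: 2 workers, 1 core each, 512 MB each. This mode is meant for
    // local testing of the standalone scheduler, not for production.
    val sc = new SparkContext("local-cluster[2,1,512]", "test")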

Re: Master registers itself at startup?

2014-04-13 Thread Aaron Davidson
This is usually due to a memory misconfiguration somewhere. Your job may be requesting that each executor have 512MB, and your cluster may not be able to satisfy that (if you're only allowing 64MB executors, for instance). Try setting spark.executor.memory to be the same as SPARK_WORKER_MEMORY. On
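A minimal sketch of the application-side setting Aaron mentions (values are illustrative; the standalone workers must advertise at least this much via SPARK_WORKER_MEMORY in conf/spark-env.sh, or the job will sit waiting for resources):

    import org.apache.spark.{SparkConf, SparkContext}

    // Request 512 MB per executor. If the workers offer less than this,
    // the scheduler reports that the initial job has not accepted any resources.
    val conf = new SparkConf()
      .setMaster("spark://hadoop-pg-5.cluster:7077")   // standalone master from this thread
      .setAppName("memory-config-example")
      .set("spark.executor.memory", "512m")

    val sc = new SparkContext(conf)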

[no subject]

2014-04-13 Thread ge ko
Hi, I'm just getting started with Spark and have installed the parcels in our CDH5 GA cluster. Master: hadoop-pg-5.cluster, Worker: hadoop-pg-7.cluster. Since some advice told me to use FQDNs, the settings above sound reasonable to me. Both daemons are running, and the Master web UI shows the co

Re: Master registers itself at startup?

2014-04-13 Thread Gerd Koenig
Many thanks for your explanation. So the only remaining issue is the "TaskSchedulerImpl: Initial job has not accepted any resources" message that prevents me from getting started with Spark (at least from executing the examples successfully) ;) br, Gerd On 13 April 2014 10:17, Aaron Davidson wrote: > By the

Re: function state lost when next RDD is processed

2014-04-13 Thread Chris Fregly
Or how about the updateStateByKey() operation? https://spark.apache.org/docs/0.9.0/streaming-programming-guide.html The StatefulNetworkWordCount example demonstrates how to keep state across RDDs. > On Mar 28, 2014, at 8:44 PM, Mayur Rustagi wrote: > > Are you referring to Spark Streaming? >
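A minimal sketch of updateStateByKey keeping a running word count across batches, in the spirit of the StatefulNetworkWordCount example linked above (the host, port, and checkpoint path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val conf = new SparkConf().setMaster("local[2]").setAppName("stateful-wc")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // required for stateful operations

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    // newValues holds this batch's counts; state carries the running total forward.
    val totals = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }

    totals.print()
    ssc.start()
    ssc.awaitTermination()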

Re: Creating a SparkR standalone job

2014-04-13 Thread Shivaram Venkataraman
Thanks for attaching the code. If I understand your use case right, you want to call the sentiment analysis code from Spark Streaming, right? For that I think you can just use jvmr if that works, and I don't think you need SparkR. SparkR is mainly intended as an API for large-scale jobs that are written in R.

how to use a single filter instead of multiple filters

2014-04-13 Thread Joe L
Hi, I have multiple filters as shown below; should I use a single combined filter instead? Can these filters degrade Spark's performance? -- View this message in context: http://apache-spark-user-list.1
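For what it's worth, chained filters and a single combined predicate look like this (the predicates are placeholders, since the original filters were cut from the excerpt). Both forms are narrow transformations that get pipelined into the same stage, so neither forces a shuffle; combining them mainly saves a little per-element overhead.

    // In the spark-shell, where sc is already defined.
    val nums = sc.parallelize(1 to 1000)

    // Several chained filters...
    val chained = nums.filter(_ > 10).filter(_ % 2 == 0).filter(_ < 900)

    // ...versus one filter with the conditions combined.
    val combined = nums.filter(n => n > 10 && n % 2 == 0 && n < 900)

    println(chained.count() == combined.count())   // same result either way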

Re: Spark - ready for prime time?

2014-04-13 Thread Andrew Ash
It's highly dependent on what the issue is with your particular job, but the ones I modify most commonly are: spark.storage.memoryFraction, spark.shuffle.memoryFraction, and parallelism (a parameter on many RDD calls) -- increase from the default level to get more, smaller tasks that are more likely to
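A sketch of where those knobs live (the values are placeholders, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tuning-example")
      .set("spark.storage.memoryFraction", "0.5")   // heap fraction reserved for cached RDDs
      .set("spark.shuffle.memoryFraction", "0.3")   // heap fraction for shuffle buffers
      .set("spark.default.parallelism", "200")      // default task count for shuffles

    val sc = new SparkContext(conf)

    // Parallelism can also be raised on individual RDD calls, for example:
    // rdd.reduceByKey(_ + _, 200)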

Re: Spark - ready for prime time?

2014-04-13 Thread Jim Blomo
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash wrote: > The biggest issue I've come across is that the cluster is somewhat unstable > when under memory pressure. Meaning that if you attempt to persist an RDD > that's too big for memory, even with MEMORY_AND_DISK, you'll often still get > OOMs. I h
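For reference, requesting the disk-spilling level discussed here looks like this (a sketch only; the input path is a placeholder, and as the thread notes, OOMs can still occur under memory pressure):

    import org.apache.spark.storage.StorageLevel

    // In the spark-shell, where sc is already defined.
    val big = sc.textFile("hdfs:///some/large/input")   // placeholder path

    // Partitions that do not fit in memory are spilled to local disk
    // rather than recomputed on access.
    big.persist(StorageLevel.MEMORY_AND_DISK)
    big.count()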

Re: what is the difference between persist() and cache()?

2014-04-13 Thread Andrea Esposito
AFAIK cache() is just a shortcut for the persist method with "MEMORY_ONLY" as the storage level. From the source code of RDD: > /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */ > def persist(): RDD[T] = persist(StorageLevel.MEMORY_ONLY) > > /** Persist this RDD with the de

what is the difference between persist() and cache()?

2014-04-13 Thread Joe L
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-difference-between-persist-and-cache-tp4181.html

Re: Master registers itself at startup?

2014-04-13 Thread Aaron Davidson
By the way, 64 MB of RAM per machine is really small; I'm surprised Spark can even start up on that! Perhaps you meant to set SPARK_DAEMON_MEMORY so that the actual worker process itself would be small, but SPARK_WORKER_MEMORY (which controls the amount of memory available for Spark executors) shou

Re: Master registers itself at startup?

2014-04-13 Thread Aaron Davidson
This was actually a bug in the log message itself, where the Master would print its own IP and port instead of the registered worker's. It has been fixed in 0.9.1 and 1.0.0 (here's the patch: https://github.com/apache/spark/commit/c0795cf481d47425ec92f4fd0780e2e0b3fdda85 ). Sorry about the confusi