Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Andrew Or
Hi all,

As the author of the dynamic allocation feature I can offer a few insights
here.

Gerard's explanation was both correct and concise: dynamic allocation is
not intended to be used in Spark streaming at the moment (1.4 or before).
This is because of two things:

(1) Number of receivers is necessarily fixed, and these are started in
executors. Since we need a receiver for each InputDStream, if we kill these
receivers we essentially stop the stream, which is not what we want. It
makes little sense to close and restart a stream the same way we kill and
relaunch executors.

(2) Records come in every batch, and when there is data to process your
executors are not idle. If your idle timeout is less than the batch
duration, then you'll end up having to constantly kill and restart
executors. If your idle timeout is greater than the batch duration, then
you'll never kill executors.

Long story short, with Spark streaming there is currently no
straightforward way to scale the size of your cluster. I had a long
discussion with TD (Spark streaming lead) about what needs to be done to
provide some semblance of dynamic scaling to streaming applications, e.g.
take into account the batch queue instead. We came up with a few ideas that
I will not detail here, but we are looking into this and do intend to
support it in the near future.
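
To make point (2) concrete, here is a minimal sketch with assumed values (the app
name, timeout and batch duration are illustrative only; the master URL, input
DStream and ssc.start() are omitted). Because the idle timeout is shorter than
the batch duration, executors that finish a batch early would be killed and then
relaunched for the next batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-dynamic-allocation-illustration")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")               // required by dynamic allocation
  .set("spark.dynamicAllocation.executorIdleTimeout", "30s")  // shorter than the batch below
val ssc = new StreamingContext(conf, Seconds(60))             // 60-second batches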

-Andrew



2015-05-28 8:02 GMT-07:00 Evo Eftimov :

> Probably you should ALWAYS keep the RDD storage policy set to MEMORY AND DISK
> – it will be your insurance policy against sys crashes due to memory leaks.
> As long as there is free RAM, spark streaming (spark) will NOT resort to disk.
> It only resorts to disk from time to time (i.e. when there is no free RAM),
> taking a performance hit from that, but only until RAM is freed up again
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
> *Sent:* Thursday, May 28, 2015 2:34 PM
> *To:* Evo Eftimov
> *Cc:* Gerard Maas; spark users
> *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
> sizes/rate of growth in Kafka or Spark's metrics?
>
>
>
> Evo, good points.
>
>
>
> On the dynamic resource allocation, I'm surmising this only works within a
> particular cluster setup.  So it improves the usage of current cluster
> resources but it doesn't make the cluster itself elastic. At least, that's
> my understanding.
>
>
>
> Memory + disk would be good and hopefully it'd take *huge* load on the
> system to start exhausting the disk space too.  I'd guess that falling onto
> disk will make things significantly slower due to the extra I/O.
>
>
>
> Perhaps we'll really want all of these elements eventually. I think we'd
> want to start with memory only, keeping maxRate low enough not to overwhelm
> the consumers, and then implement the cluster autoscaling. We might experiment
> with dynamic resource allocation before we get to implementing the cluster
> autoscale.
>
>
>
>
>
>
>
> On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov 
> wrote:
>
> You can also try Dynamic Resource Allocation
>
>
>
>
> https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation
>
>
>
> Also re the Feedback Loop for automatic message consumption rate
> adjustment – there is a “dumb” solution option – simply set the storage
> policy for the DStream RDDs to MEMORY AND DISK – when the memory gets
> exhausted spark streaming will resort to keeping new RDDs on disk which
> will prevent it from crashing and hence losing them. Then some memory will
> get freed and it will resort back to RAM and so on and so forth
>
>
>
>
>
> Sent from Samsung Mobile
>
>  Original message 
>
> From: Evo Eftimov
>
> Date:2015/05/28 13:22 (GMT+00:00)
>
> To: Dmitry Goldenberg
>
> Cc: Gerard Maas ,spark users
>
> Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
> in Kafka or Spark's metrics?
>
>
>
> You can always spin up new boxes in the background, bring them into the
> cluster fold when they are fully operational, and time that with the job
> relaunch and param change
>
>
>
> Kafka offsets are managed automatically for you by the Kafka clients, which
> keep them in ZooKeeper; don't worry about that as long as you shut down your
> job gracefully. Besides, managing the offsets explicitly is not a big deal if
> necessary
>
>
>
>
>
> Sent from Samsung Mobile
>
>
>
>  Original message 
>
> From: Dmitry Goldenberg
>
> Date:2015/05/28 13:16 (GMT+00:00)
>
> To: Evo Eftimov
>
> Cc: Gerard Maas ,spark users
>
> Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
> in Kafka or Spark's metrics?
>
>
>
> Thanks, Evo.  Per the last part of your comment, it sounds like we will
> need to implement a job manager which will be in control of starting the
> jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
> marking them as ones to relaunch, scaling the cluster up/down by
> adding/removing machines, and relaunching the 'suspended' (shut down) jobs.
>
>
>
> I suspect that relaunching the j

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Andrew Or
Hi Deepak,

This is a notorious bug that is being tracked at
https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one source
of this bug (it turns out Snappy had a bug in buffer reuse that caused data
corruption). There are other known sources that are being addressed in
outstanding patches currently.

Since you're using 1.3.1 my guess is that you don't have this patch:
https://github.com/apache/spark/pull/6176, which I believe should fix the
issue in your case. It's merged for 1.3.2 (not yet released) but not in
time for 1.3.1, so feel free to patch it yourself and see if it works.
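
If patching isn't practical right away, a stopgap some users rely on is to avoid
the Snappy code path altogether by switching the compression codec. This is a
hedged sketch, not the real fix, and it assumes LZF's different speed/size
trade-off is acceptable for your workload:

import org.apache.spark.SparkConf

// Route block and shuffle compression through LZF so the failing Snappy
// decompression path is not exercised.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "lzf")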

-Andrew


2015-06-01 8:00 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :

> Any suggestions ?
>
> I am using Spark 1.3.1 to read a sequence file stored in SequenceFile format
> (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
> )
>
> with this code and settings:
>
> sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
>   .partitionBy(new org.apache.spark.HashPartitioner(2053))
>
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
>   .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)
>   .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
>   .set("spark.yarn.maxAppAttempts", "0")
>   //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
>   //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
>   //.set("spark.worker.timeout", arguments.get("workerTimeout").get)
>
> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>
>
> and values are
> buffersize=128 maxbuffersize=1068 maxResultSize=200G
>
>
> And I see this exception in each executor task
>
> FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
> 54757), shuffleId=6, mapId=2810, reduceId=1117, message=
>
> org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> at
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
>
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> *Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*
>
> at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>
> at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>
> at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
>
> at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
>
> at
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
>
> at
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
>
> at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>
> at
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)
>
> at
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300)
>
> at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at scala.util.Success.map(Try.scala:206)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(Shuffle

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Andrew Or
Hi Peng,

Setting properties through --conf should still work in Spark 1.4. From the
warning it looks like the config you are trying to set does not start with
the prefix "spark.". What is the config that you are trying to set?

-Andrew

2015-06-12 11:17 GMT-07:00 Peng Cheng :

> In Spark <1.3.x, the driver's system properties could be set via the --conf
> option, which was shared between setting Spark properties and system properties.
>
> In Spark 1.4.0 this feature is removed; the driver instead logs the following
> warning:
>
> Warning: Ignoring non-spark config property: xxx.xxx=v
>
> How do I set the driver's system property in 1.4.0? Is there a reason it was
> removed without a deprecation warning?
>
> Thanks a lot for your advice.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-How-to-set-driver-s-system-property-using-spark-submit-options-tp23298.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Andrew Or
Hi Patrick,

The fix you need is SPARK-6954: https://github.com/apache/spark/pull/5704.
If possible, you may cherry-pick the following commit into your Spark
deployment and it should resolve the issue:

https://github.com/apache/spark/commit/98ac39d2f5828fbdad8c9a4e563ad1169e3b9948

Note that this commit is only for the 1.3 branch. If you could upgrade to
1.4.0 then you do not need to apply that commit yourself.

-Andrew



2015-06-13 12:01 GMT-07:00 Patrick Woody :

> Hey Sandy,
>
> I'll test it out on 1.4. Do you have a bug number or PR that I could
> reference as well?
>
> Thanks!
> -Pat
>
> Sent from my iPhone
>
> On Jun 13, 2015, at 11:38 AM, Sandy Ryza  wrote:
>
> Hi Patrick,
>
> I'm noticing that you're using Spark 1.3.1.  We fixed a bug in dynamic
> allocation in 1.4 that permitted requesting negative numbers of executors.
> Any chance you'd be able to try with the newer version and see if the
> problem persists?
>
> -Sandy
>
> On Fri, Jun 12, 2015 at 7:42 PM, Patrick Woody 
> wrote:
>
>> Hey all,
>>
>> I've recently run into an issue where spark dynamicAllocation has asked
>> for -1 executors from YARN. Unfortunately, this raises an exception that
>> kills the executor-allocation thread and the application can't request more
>> resources.
>>
>> Has anyone seen this before? It is spurious and the application usually
>> works, but when this gets hit it becomes unusable when getting stuck at
>> minimum YARN resources.
>>
>> Stacktrace below.
>>
>> Thanks!
>> -Pat
>>
>> 470 ERROR [2015-06-12 16:44:39,724] org.apache.spark.util.Utils: Uncaught
>> exception in thread spark-dynamic-executor-allocation-0
>> 471 ! java.lang.IllegalArgumentException: Attempted to request a negative
>> number of executor(s) -1 from the cluster manager. Please specify a
>> positive number!
>> 472 ! at
>> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
>> ~[spark-core_2.10-1.3.1.jar:1.
>> 473 ! at
>> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 474 ! at
>> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 475 ! at
>> org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 476 ! at 
>> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
>> ~[spark-core_2.10-1.3.1.j
>> 477 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 478 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 479 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 480 ! at
>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 481 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
>> [spark-core_2.10-1.3.1.jar:1.3.1]
>> 482 ! at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> [na:1.7.0_71]
>> 483 ! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>> [na:1.7.0_71]
>> 484 ! at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>> [na:1.7.0_71]
>> 485 ! at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> [na:1.7.0_71]
>> 486 ! at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> [na:1.7.0_71]
>> 487 ! at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> [na:1.7.0_71]
>>
>
>


Re: PySpark on YARN "port out of range"

2015-06-19 Thread Andrew Or
Hm, one thing to see is whether the same port appears many times (1315905645).
The way pyspark works today is that the JVM reads the port from the stdout
of the python process. If there is some interference in output from the
python side (e.g. any print statements, exception messages), then the Java
side will think that it's actually a port even when it's not.

I'm not sure why it fails sometimes but not others, but 2/3 of the time is
a lot...

2015-06-19 14:57 GMT-07:00 John Meehan :

> Has anyone encountered this “port out of range” error when launching
> PySpark jobs on YARN?  It is sporadic (e.g. 2/3 jobs get this error).
>
> LOG:
>
> 15/06/19 11:49:44 INFO scheduler.TaskSetManager: Lost task 0.3 in stage
> 39.0 (TID 211) on executor xxx.xxx.xxx.com:
> java.lang.IllegalArgumentException (port out of range:1315905645)
> [duplicate 7]
> Traceback (most recent call last):
>  File "", line 1, in 
> 15/06/19 11:49:44 INFO cluster.YarnScheduler: Removed TaskSet 39.0, whose
> tasks have all completed, from pool
>  File "/home/john/spark-1.4.0/python/pyspark/rdd.py", line 745, in collect
>port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>  File
> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>  File
> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError15/06/19 11:49:44 INFO
> storage.BlockManagerInfo: Removed broadcast_38_piece0 on
> 17.134.160.35:47455 in memory (size: 2.2 KB, free: 265.4 MB)
> : An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 1 in stage 39.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 39.0 (TID 210, xxx.xxx.xxx.com): java.lang.IllegalArgumentException: port
> out of range:1315905645
> at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
> at java.net.InetSocketAddress.(InetSocketAddress.java:185)
> at java.net.Socket.(Socket.java:241)
> at
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
> at
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
> at
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
> at
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> * Spark 1.4.0 build:
>
> build/mvn -Pyarn -Phive -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.4
> -DskipTests clean package
>
> LAUNCH CMD:
>
> export HADOOP_CONF_DIR=/path/to/conf
> export PYSPARK_PYTHON=/path/to/python-2.7.2/bin/python
> ~/spark-1.4.0/bin/pyspark \
> --conf
> spark.yarn.jar=/home/john/spark-1.4.0/assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.3.0-cdh5.1.4.jar
> \
> --master yarn-client \
> --num-executors 3 \
> --executor-cores 18 \
> --executor-memory 48g
>
> TEST JOB IN REPL:
>
> words = ['hi', 'there', 'yo', 'baby']
> wordsRdd = sc.parallelize(words)
> words.map

Re: Spark on Yarn - How to configure

2015-06-19 Thread Andrew Or
Hi Ashish,

For Spark on YARN, you actually only need the Spark files on one machine -
the submission client. This machine could even live outside of the cluster.
Then all you need to do is point YARN_CONF_DIR to the directory containing
your hadoop configuration files (e.g. yarn-site.xml) on that machine. All
the jars will be automatically distributed to the nodes in the cluster
accordingly.

-Andrew

2015-06-19 12:35 GMT-07:00 Ashish Soni :

> Can someone please let me know what all I need to configure to have Spark
> run using YARN?
>
> There is a lot of documentation, but none of it says which files need to be
> changed, and how.
>
> Let's say I have 4 nodes for Spark - SparkMaster, SparkSlave1, SparkSlave2,
> SparkSlave3.
>
> Now, on which nodes do which files need to be changed to make sure my master
> node is SparkMaster and the slave nodes are 1, 2, 3, and how do I tell /
> configure YARN?
>
> Ashish
>


Re: Abount Jobs UI in yarn-client mode

2015-06-19 Thread Andrew Or
Did you make sure that the YARN IP is not an internal address? If it still
doesn't work then it seems like an issue on the YARN side...

2015-06-19 8:48 GMT-07:00 Sea <261810...@qq.com>:

> Hi, all:
> I run Spark on YARN and I want to see the Jobs UI at http://ip:4040/,
> but it redirects to
> http://${yarn.ip}/proxy/application_1428110196022_924324/, which cannot be
> found. Why?
> Can anyone help?
>


Re: What files/folders/jars spark-submit script depend on ?

2015-06-19 Thread Andrew Or
Hi Elkhan,

Spark submit depends on several things: the launcher jar (1.3.0+ only), the
spark-core jar, and the spark-yarn jar (in your case). Why do you want to
put it in HDFS though? AFAIK you can't execute scripts directly from HDFS;
you need to copy them to a local file system first. I don't see clear
benefits of not just running Spark submit from source or from one of the
distributions.

-Andrew

2015-06-19 10:12 GMT-07:00 Elkhan Dadashov :

> Hi all,
>
> If I want to ship the spark-submit script to HDFS and then call it from its
> HDFS location to start a Spark job, which other files/folders/jars need to be
> transferred into HDFS along with the spark-submit script?
>
> Due to some dependency issues, we can't include Spark in our Java
> application, so instead we will allow limited usage of Spark only with
> Python files.
>
> So if I want to put spark-submit script into HDFS, and call it to execute
> Spark job in Yarn cluster, what else need to be put into HDFS with it ?
>
> (Using Spark only for execution Spark jobs written in Python)
>
> Thanks.
>
>


Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Andrew Or
Hi Raghav,

If you want to make changes to Spark and run your application with it, you
may follow these steps.

1. git clone g...@github.com:apache/spark
2. cd spark; build/mvn clean package -DskipTests [...]
3. make local changes
4. build/mvn package -DskipTests [...] (no need to clean again here)
5. bin/spark-submit --master spark://[...] --class your.main.class your.jar

No need to pass in extra --driver-java-options or --driver-extra-classpath
as others have suggested. When using spark-submit, the main jar comes from
assembly/target/scala_2.10, which is prepared through "mvn package". You
just have to make sure that you re-package the assembly jar after each
modification.

-Andrew

2015-06-18 16:35 GMT-07:00 maxdml :

> You can specify the jars of your application to be included with
> spark-submit
> with the /--jars/ switch.
>
> Otherwise, are you sure that your newly compiled spark jar assembly is in
> assembly/target/scala-2.10/?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-Applications-using-Spark-Submit-tp23352p23400.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Andrew Or
Hi Raghav,

I'm assuming you're using standalone mode. When using the Spark EC2 scripts
you need to make sure that every machine has the most updated jars. Once
you have built on one of the nodes, you must *rsync* the Spark directory to
the rest of the nodes (see /root/spark-ec2/copy-dir).

That said, I usually build it locally on my laptop and *scp* the assembly
jar to the cluster instead of building it there. The EC2 machines often
take much longer to build for some reason. Also it's cumbersome to set up
proper IDE there.

-Andrew


2015-06-19 19:11 GMT-07:00 Raghav Shankar :

> Thanks Andrew! Is this all I have to do when using the spark ec2 script to
> setup a spark cluster? It seems to be getting an assembly jar that is not
> from my project(perhaps from a maven repo). Is there a way to make the
> ec2 script use the assembly jar that I created?
>
> Thanks,
> Raghav
>
>
> On Friday, June 19, 2015, Andrew Or  wrote:
>
>> Hi Raghav,
>>
>> If you want to make changes to Spark and run your application with it,
>> you may follow these steps.
>>
>> 1. git clone g...@github.com:apache/spark
>> 2. cd spark; build/mvn clean package -DskipTests [...]
>> 3. make local changes
>> 4. build/mvn package -DskipTests [...] (no need to clean again here)
>> 5. bin/spark-submit --master spark://[...] --class your.main.class
>> your.jar
>>
>> No need to pass in extra --driver-java-options or
>> --driver-extra-classpath as others have suggested. When using spark-submit,
>> the main jar comes from assembly/target/scala_2.10, which is prepared
>> through "mvn package". You just have to make sure that you re-package the
>> assembly jar after each modification.
>>
>> -Andrew
>>
>> 2015-06-18 16:35 GMT-07:00 maxdml :
>>
>>> You can specify the jars of your application to be included with
>>> spark-submit
>>> with the /--jars/ switch.
>>>
>>> Otherwise, are you sure that your newly compiled spark jar assembly is in
>>> assembly/target/scala-2.10/?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-Applications-using-Spark-Submit-tp23352p23400.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: PySpark on YARN "port out of range"

2015-06-22 Thread Andrew Or
Unfortunately there is not a great way to do it without modifying Spark to
print more things it reads from the stream.

2015-06-20 23:10 GMT-07:00 John Meehan :

> Yes it seems to be consistently "port out of range:1315905645”.  Is there
> any way to see what the python process is actually outputting (in hopes
> that yields a clue)?
>
> On Jun 19, 2015, at 6:47 PM, Andrew Or  wrote:
>
> Hm, one thing to see is whether the same port appears many times (1315905645).
> The way pyspark works today is that the JVM reads the port from the stdout
> of the python process. If there is some interference in output from the
> python side (e.g. any print statements, exception messages), then the Java
> side will think that it's actually a port even when it's not.
>
> I'm not sure why it fails sometimes but not others, but 2/3 of the time is
> a lot...
>
> 2015-06-19 14:57 GMT-07:00 John Meehan :
>
>> Has anyone encountered this “port out of range” error when launching
>> PySpark jobs on YARN?  It is sporadic (e.g. 2/3 jobs get this error).
>>
>> LOG:
>>
>> 15/06/19 11:49:44 INFO scheduler.TaskSetManager: Lost task 0.3 in stage
>> 39.0 (TID 211) on executor xxx.xxx.xxx.com:
>> java.lang.IllegalArgumentException (port out of range:1315905645)
>> [duplicate 7]
>> Traceback (most recent call last):
>>  File "", line 1, in 
>> 15/06/19 11:49:44 INFO cluster.YarnScheduler: Removed TaskSet 39.0, whose
>> tasks have all completed, from pool
>>  File "/home/john/spark-1.4.0/python/pyspark/rdd.py", line 745, in collect
>>port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>>  File
>> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>> line 538, in __call__
>>  File
>> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>> line 300, in get_return_value
>> py4j.protocol.Py4JJavaError15/06/19 11:49:44 INFO
>> storage.BlockManagerInfo: Removed broadcast_38_piece0 on
>> 17.134.160.35:47455 in memory (size: 2.2 KB, free: 265.4 MB)
>> : An error occurred while calling
>> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 1 in stage 39.0 failed 4 times, most recent failure: Lost task 1.3 in stage
>> 39.0 (TID 210, xxx.xxx.xxx.com): java.lang.IllegalArgumentException:
>> port out of range:1315905645
>> at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
>> at java.net.InetSocketAddress.(InetSocketAddress.java:185)
>> at java.net.Socket.(Socket.java:241)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
>> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at
>> org.apache.spark.sched

Re: Submitting Spark Applications using Spark Submit

2015-06-22 Thread Andrew Or
ka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Also, in the above error it says: *connection refused to
> ec2-XXX.compute-1.amazonaws.com/10.165.103.16:7077*. I don’t
> understand where it gets the *10.165.103.16* from. I
> never specify that in the master url command line parameter. Any ideas on
> what I might be doing wrong?
>
>
> On Jun 19, 2015, at 7:19 PM, Andrew Or  wrote:
>
> Hi Raghav,
>
> I'm assuming you're using standalone mode. When using the Spark EC2
> scripts you need to make sure that every machine has the most updated jars.
> Once you have built on one of the nodes, you must rsync the Spark directory
> to the rest of the nodes (see /root/spark-ec2/copy-dir).
>
> That said, I usually build it locally on my laptop and scp the assembly
> jar to the cluster instead of building it there. The EC2 machines often
> take much longer to build for some reason. Also it's cumbersome to set up
> proper IDE there.
>
> -Andrew
>
>
> 2015-06-19 19:11 GMT-07:00 Raghav Shankar :
> Thanks Andrew! Is this all I have to do when using the spark ec2 script to
> setup a spark cluster? It seems to be getting an assembly jar that is not
> from my project(perhaps from a maven repo). Is there a way to make the ec2
> script use the assembly jar that I created?
>
> Thanks,
> Raghav
>
>
> On Friday, June 19, 2015, Andrew Or  wrote:
> Hi Raghav,
>
> If you want to make changes to Spark and run your application with it, you
> may follow these steps.
>
> 1. git clone g...@github.com:apache/spark
> 2. cd spark; build/mvn clean package -DskipTests [...]
> 3. make local changes
> 4. build/mvn package -DskipTests [...] (no need to clean again here)
> 5. bin/spark-submit --master spark://[...] --class your.main.class your.jar
>
> No need to pass in extra --driver-java-options or --driver-extra-classpath
> as others have suggested. When using spark-submit, the main jar comes from
> assembly/target/scala_2.10, which is prepared through "mvn package". You
> just have to make sure that you re-package the assembly jar after each
> modification.
>
> -Andrew
>
> 2015-06-18 16:35 GMT-07:00 maxdml :
> You can specify the jars of your application to be included with
> spark-submit
> with the /--jars/ switch.
>
> Otherwise, are you sure that your newly compiled spark jar assembly is in
> assembly/target/scala-2.10/?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-Applications-using-Spark-Submit-tp23352p23400.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: Disable heartbeat messages in REPL

2015-07-08 Thread Andrew Or
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only
affects local mode. I've filed a JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-8911
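
In the meantime, a narrower workaround than sc.setLogLevel("ERROR") (suggested
below) is to raise the level for just the logger that emits this warning. A
sketch, assuming from the "WARN Executor: ..." prefix that the message comes
from the Executor class:

import org.apache.log4j.{Level, Logger}

// Silences only this executor warning in the shell, rather than all WARN output.
Logger.getLogger("org.apache.spark.executor.Executor").setLevel(Level.ERROR)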

2015-07-08 14:20 GMT-07:00 Lincoln Atkinson :

>  Brilliant! Thanks.
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Wednesday, July 08, 2015 2:15 PM
> *To:* Lincoln Atkinson
> *Cc:* user@spark.apache.org
> *Subject:* Re: Disable heartbeat messages in REPL
>
>
>
> I was thinking the same thing! Try sc.setLogLevel("ERROR")
>
>
>
> On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson 
> wrote:
>
>  “WARN Executor: Told to re-register on heartbeat” is logged repeatedly
> in the spark shell, which is very distracting and corrupts the display of
> whatever set of commands I’m currently typing out.
>
>
>
> Is there an option to disable the logging of this message?
>
>
>
> Thanks,
>
> -Lincoln
>
>
>


Re: Spark serialization in closure

2015-07-09 Thread Andrew Or
Hi Chen,

I believe the issue is that `object foo` is a member of `object testing`,
so the only way to access `object foo` is to first pull `object testing`
into the closure, then access a pointer to get to `object foo`. There are
two workarounds that I'm aware of:

(1) Move `object foo` outside of `object testing`. This is only a problem
because of the nested objects. Also, by design it's simpler to reason about
but that's a separate discussion.

(2) Create a local variable for `foo.v`. If all your closure cares about is
the integer, then it makes sense to add a `val v = foo.v` inside `func` and
use this in your closure instead. This avoids pulling in $outer pointers
into your closure at all since it only references local variables.
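
For example, a minimal sketch of workaround (2) applied to the snippet from this
thread (only the local val is new):

object testing {
  object foo {
    val v = 42
  }
  val list = List(1, 2, 3)
  val rdd = sc.parallelize(list)   // assumes an existing SparkContext `sc`, as in the thread
  def func = {
    val localV = foo.v             // local copy: the closure captures an Int,
                                   // not a pointer back to `testing`
    rdd.foreachPartition { it =>
      println(localV)
    }
  }
}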

As others have commented, I think this is more of a Scala problem than a
Spark one.

Let me know if these work,
-Andrew

2015-07-09 13:36 GMT-07:00 Richard Marscher :

> Reading that article and applying it to your observations of what happens
> at runtime:
>
> shouldn't the closure require serializing testing? The foo singleton
> object is a member of testing, and then you call this foo value in the
> closure func and further in the foreachPartition closure. So following by
> that article, Scala will attempt to serialize the containing object/class
> testing to get the foo instance.
>
> On Thu, Jul 9, 2015 at 4:11 PM, Chen Song  wrote:
>
>> Repost the code example,
>>
>> object testing extends Serializable {
>> object foo {
>>   val v = 42
>> }
>> val list = List(1,2,3)
>> val rdd = sc.parallelize(list)
>> def func = {
>>   val after = rdd.foreachPartition {
>> it => println(foo.v)
>>   }
>> }
>>   }
>>
>> On Thu, Jul 9, 2015 at 4:09 PM, Chen Song  wrote:
>>
>>> Thanks Erik. I saw the document too. That is why I am confused because
>>> as per the article, it should be good as long as *foo *is serializable.
>>> However, what I have seen is that it would work if *testing* is
>>> serializable, even foo is not serializable, as shown below. I don't know if
>>> there is something specific to Spark.
>>>
>>> For example, the code example below works.
>>>
>>> object testing extends Serializable {
>>>
>>> object foo {
>>>
>>>   val v = 42
>>>
>>> }
>>>
>>> val list = List(1,2,3)
>>>
>>> val rdd = sc.parallelize(list)
>>>
>>> def func = {
>>>
>>>   val after = rdd.foreachPartition {
>>>
>>> it => println(foo.v)
>>>
>>>   }
>>>
>>> }
>>>
>>>   }
>>>
>>> On Thu, Jul 9, 2015 at 12:11 PM, Erik Erlandson  wrote:
>>>
 I think you have stumbled across this idiosyncrasy:


 http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/




 - Original Message -
 > I am not sure this is more of a question for Spark or just Scala but
 I am
 > posting my question here.
 >
 > The code snippet below shows an example of passing a reference to a
 closure
 > in rdd.foreachPartition method.
 >
 > ```
 > object testing {
 > object foo extends Serializable {
 >   val v = 42
 > }
 > val list = List(1,2,3)
 > val rdd = sc.parallelize(list)
 > def func = {
 >   val after = rdd.foreachPartition {
 > it => println(foo.v)
 >   }
 > }
 >   }
 > ```
 > When running this code, I got an exception
 >
 > ```
 > Caused by: java.io.NotSerializableException:
 > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$
 > Serialization stack:
 > - object not serializable (class:
 > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$, value:
 > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$@10b7e824)
 > - field (class:
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
 > name: $outer, type: class
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$)
 > - object (class
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
 > )
 > ```
 >
 > It looks like Spark needs to serialize `testing` object. Why is it
 > serializing testing even though I only pass foo (another serializable
 > object) in the closure?
 >
 > A more general question is, how can I prevent Spark from serializing
 the
 > parent class where RDD is defined, with still support of passing in
 > function defined in other classes?
 >
 > --
 > Chen Song
 >

>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>>
>> --
>> Chen Song
>>
>>
>
>
> --
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook  | LinkedIn
> 
>


Re: spark-submit

2015-07-10 Thread Andrew Or
Hi Ashutosh, I believe the class is
org.apache.spark.*examples.*graphx.Analytics?
If you're running page rank on live journal you could just use
org.apache.spark.examples.graphx.LiveJournalPageRank.

-Andrew

2015-07-10 3:42 GMT-07:00 AshutoshRaghuvanshi <
ashutosh.raghuvans...@gmail.com>:

> when I do run this command:
>
> ashutosh@pas-lab-server7:~/spark-1.4.0$ ./bin/spark-submit \
> > --class org.apache.spark.graphx.lib.Analytics \
> > --master spark://172.17.27.12:7077 \
> > assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.2.0.jar \
> > pagerank soc-LiveJournal1.txt --numEPart=100 --nverts=4847571
> --numIter=10
> > --partStrategy=EdgePartition2D
>
> I get an error:
>
> java.lang.ClassNotFoundException: org.apache.spark.graphx.lib.Analytics
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at
>
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:633)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/07/10 15:31:35 INFO Utils: Shutdown hook called
>
>
>
> where is this class, what path should I give?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-tp23761.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Starting Spark-Application without explicit submission to cluster?

2015-07-10 Thread Andrew Or
Hi Jan,

Most SparkContext constructors are there for legacy reasons. The point of
going through spark-submit is to set up all the classpaths, system
properties, and resolve URIs properly *with respect to the deployment mode*.
For instance, jars are distributed differently between YARN cluster mode
and standalone client mode, and this is not something the Spark user should
have to worry about.

As an example, if you pass jars through the SparkContext constructor, it
won't actually work in cluster mode if the jars are local. This is because
the driver is launched on the cluster and the SparkContext will try to find
the jars on the cluster in vain.
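
A sketch of that pitfall (the master URL and jar path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")               // placeholder master URL
  .setAppName("constructor-jars-example")
  .setJars(Seq("/home/me/app/target/app.jar"))    // a path on the submitting machine:
                                                  // resolvable in client mode, but not
                                                  // when the driver runs on the cluster
val sc = new SparkContext(conf)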

So the more concise answer to your question is: yes technically you don't
need to go through spark-submit, but you'll have to deal with all the
bootstrapping complexity yourself.

-Andrew

2015-07-10 3:37 GMT-07:00 algermissen1971 :

> Hi,
>
> I am a bit confused about the steps I need to take to start a Spark
> application on a cluster.
>
> So far I had this impression from the documentation that I need to
> explicitly submit the application using for example spark-submit.
>
> However, from the SparkContext constructor signature I get the impression
> that maybe I do not have to do that after all:
>
> In
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.SparkContext
> the first constructor has (among other things) a parameter 'jars' which
> indicates the "Collection of JARs to send to the cluster".
>
> To me this suggests that I can simply start the application anywhere and
> that it will deploy itself to the cluster in the same way a call to
> spark-submit would.
>
> Is that correct?
>
> If not, can someone explain why I can / need to provide master and jars
> etc. in the call to SparkContext because they essentially only duplicate
> what I would specify in the call to spark-submit.
>
> Jan
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to restrict disk space for spark caches on yarn?

2015-07-10 Thread Andrew Or
Hi Peter,

AFAIK Spark assumes infinite disk space, so there isn't really a way to
limit how much space it uses. Unfortunately I'm not aware of a simpler
workaround than to simply provision your cluster with more disk space. By
the way, are you sure that it's disk space that exceeded the limit, and not
the number of inodes? If it's the latter, maybe you could control the
ulimit of the container.

To answer your other question: if it can't persist to disk then yes it will
fail. It will only recompute from the data source if for some reason
someone evicted our blocks from memory, but that shouldn't happen in your
case since you're using MEMORY_AND_DISK_SER.
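
For reference, a minimal sketch of the storage level in question (the sample RDD
is a placeholder):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000)   // assumes an existing SparkContext `sc`
val cached = data.persist(StorageLevel.MEMORY_AND_DISK_SER)
// Partitions that don't fit in memory are serialized to the executors' local
// dirs (the yarn/nm/usercache path from the question) rather than recomputed.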

-Andrew


2015-07-10 3:51 GMT-07:00 Peter Rudenko :

>  Hi, i have a spark ML worklflow. It uses some persist calls. When i
> launch it with 1 tb dataset - it puts down all cluster, becauses it fills
> all disk space at /yarn/nm/usercache/root/appcache:
> http://i.imgur.com/qvRUrOp.png
>
> I found a yarn settings:
> *yarn*.nodemanager.localizer.*cache*.target-size-mb - Target size of
> localizer cache in MB, per nodemanager. It is a target retention size that
> only includes resources with PUBLIC and PRIVATE visibility and excludes
> resources with APPLICATION visibility
>
> But it excludes resources with APPLICATION visibility, and spark cache as
> i understood is of APPLICATION type.
>
> Is it possible to restrict a disk space for spark application? Will spark
> fail if it wouldn't be able to persist on disk
> (StorageLevel.MEMORY_AND_DISK_SER) or it would recompute from data source?
>
> Thanks,
> Peter Rudenko
>
>
>
>
>


Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Andrew Or
Yeah, we could make it log a warning instead.

2015-07-15 14:29 GMT-07:00 Kelly, Jonathan :

>  Thanks! Is there an existing JIRA I should watch?
>
>
>  ~ Jonathan
>
>   From: Sandy Ryza 
> Date: Wednesday, July 15, 2015 at 2:27 PM
> To: Jonathan Kelly 
> Cc: "user@spark.apache.org" 
> Subject: Re: Unable to use dynamicAllocation if spark.executor.instances
> is set in spark-defaults.conf
>
>   Hi Jonathan,
>
>  This is a problem that has come up for us as well, because we'd like
> dynamic allocation to be turned on by default in some setups, but not break
> existing users with these properties.  I'm hoping to figure out a way to
> reconcile these by Spark 1.5.
>
>  -Sandy
>
> On Wed, Jul 15, 2015 at 3:18 PM, Kelly, Jonathan 
> wrote:
>
>>   Would there be any problem in having spark.executor.instances (or
>> --num-executors) be completely ignored (i.e., even for non-zero values) if
>> spark.dynamicAllocation.enabled is true (i.e., rather than throwing an
>> exception)?
>>
>>  I can see how the exception would be helpful if, say, you tried to pass
>> both "-c spark.executor.instances" (or --num-executors) *and* "-c
>> spark.dynamicAllocation.enabled=true" to spark-submit on the command line
>> (as opposed to having one of them in spark-defaults.conf and one of them in
>> the spark-submit args), but currently there doesn't seem to be any way to
>> distinguish between arguments that were actually passed to spark-submit and
>> settings that simply came from spark-defaults.conf.
>>
>>  If there were a way to distinguish them, I think the ideal situation
>> would be for the validation exception to be thrown only if
>> spark.executor.instances and spark.dynamicAllocation.enabled=true were both
>> passed via spark-submit args or were both present in spark-defaults.conf,
>> but passing spark.dynamicAllocation.enabled=true to spark-submit would take
>> precedence over spark.executor.instances configured in spark-defaults.conf,
>> and vice versa.
>>
>>
>>  Jonathan Kelly
>>
>> Elastic MapReduce - SDE
>>
>> Blackfoot (SEA33) 06.850.F0
>>
>>   From: Jonathan Kelly 
>> Date: Tuesday, July 14, 2015 at 4:23 PM
>> To: "user@spark.apache.org" 
>> Subject: Unable to use dynamicAllocation if spark.executor.instances is
>> set in spark-defaults.conf
>>
>> I've set up my cluster with a pre-calculated value for
>> spark.executor.instances in spark-defaults.conf such that I can run a job
>> and have it maximize the utilization of the cluster resources by default.
>> However, if I want to run a job with dynamicAllocation (by passing -c
>> spark.dynamicAllocation.enabled=true to spark-submit), I get this exception:
>>
>>  Exception in thread "main" java.lang.IllegalArgumentException:
>> Explicitly setting the number of executors is not compatible with
>> spark.dynamicAllocation.enabled!
>> at
>> org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
>> at
>> org.apache.spark.deploy.yarn.ClientArguments.(ClientArguments.scala:59)
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
>>  …
>>
>>  The exception makes sense, of course, but ideally I would like it to
>> ignore what I've put in spark-defaults.conf for spark.executor.instances if
>> I've enabled dynamicAllocation. The most annoying thing about this is that
>> if I have spark.executor.instances present in spark-defaults.conf, I cannot
>> figure out any way to spark-submit a job with
>> spark.dynamicAllocation.enabled=true without getting this error. That is,
>> even if I pass "-c spark.executor.instances=0 -c
>> spark.dynamicAllocation.enabled=true", I still get this error because the
>> validation in ClientArguments.parseArgs() that's checking for this
>> condition simply checks for the presence of spark.executor.instances rather
>> than whether or not its value is > 0.
>>
>>  Should the check be changed to allow spark.executor.instances to be set
>> to 0 if spark.dynamicAllocation.enabled is true? That would be an OK
>> compromise, but I'd really prefer to be able to enable dynamicAllocation
>> simply by setting spark.dynamicAllocation.enabled=true rather than by also
>> having to set spark.executor.instances to 0.
>>
>>
>>  Thanks,
>>
>> Jonathan
>>
>
>


Re: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Or
Hi all,

Did you forget to restart the node managers after editing yarn-site.xml by
any chance?

-Andrew

2015-07-17 8:32 GMT-07:00 Andrew Lee :

> I have encountered the same problem after following the document.
>
> Here's my spark-defaults.conf
>
> spark.shuffle.service.enabled true
> spark.dynamicAllocation.enabled  true
> spark.dynamicAllocation.executorIdleTimeout 60
> spark.dynamicAllocation.cachedExecutorIdleTimeout 120
> spark.dynamicAllocation.initialExecutors 2
> spark.dynamicAllocation.maxExecutors 8
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.schedulerBacklogTimeout 10
>
>
>
> and yarn-site.xml configured.
>
> <property>
>   <name>yarn.nodemanager.aux-services</name>
>   <value>spark_shuffle,mapreduce_shuffle</value>
> </property>
> ...
> <property>
>   <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>   <value>org.apache.spark.network.yarn.YarnShuffleService</value>
> </property>
>
>
> and deployed the 2 JARs to NodeManager's classpath
> /opt/hadoop/share/hadoop/mapreduce/. (I also checked the NodeManager log
> and the JARs appear in the classpath). I notice that the JAR location is
> not the same as the document in 1.4. I found them under network/yarn/target
> and network/shuffle/target/ after building it with "-Phadoop-2.4 -Psparkr
> -Pyarn -Phive -Phive-thriftserver" in maven.
>
>
> spark-network-yarn_2.10-1.4.1.jar
>
> spark-network-shuffle_2.10-1.4.1.jar
>
>
> and still getting the following exception.
>
> Exception in thread "ContainerLauncher #0" java.lang.Error: 
> org.apache.spark.SparkException: Exception while starting container 
> container_1437141440985_0003_01_02 on host 
> alee-ci-2058-slave-2.test.altiscale.com
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Exception while starting 
> container container_1437141440985_0003_01_02 on host 
> alee-ci-2058-slave-2.test.altiscale.com
>   at 
> org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
>   at 
> org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   ... 2 more
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
> auxService:spark_shuffle does not exist
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>
>
> Not sure what else I am missing here or doing wrong.
>
> Appreciate any insights or feedback, thanks.
>
>
> --
> Date: Wed, 8 Jul 2015 09:25:39 +0800
> Subject: Re: The auxService:spark_shuffle does not exist
> From: zjf...@gmail.com
> To: rp...@njit.edu
> CC: user@spark.apache.org
>
>
> Did you enable the dynamic resource allocation ? You can refer to this
> page for how to configure spark shuffle service for yarn.
>
> https://spark.apache.org/docs/1.4.0/job-scheduling.html
>
>
> On Tue, Jul 7, 2015 at 10:55 PM, roy  wrote:
>
> we tried "--master yarn-client" with no different result.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Andrew Or
Hi Yuval, if you start the Workers with `spark.shuffle.service.enabled =
true` then the workers will each start a shuffle service automatically. No
need to start the shuffle services yourself separately.

-Andrew

2016-03-08 11:21 GMT-08:00 Silvio Fiorito :

> There’s a script to start it up under sbin, start-shuffle-service.sh. Run
> that on each of your worker nodes.
>
>
>
>
>
>
>
> *From: *Yuval Itzchakov 
> *Sent: *Tuesday, March 8, 2016 2:17 PM
> *To: *Silvio Fiorito ;
> user@spark.apache.org
> *Subject: *Re: Using dynamic allocation and shuffle service in Standalone
> Mode
>
>
> Actually, I assumed that setting the flag in the spark job would turn on
> the shuffle service in the workers. I now understand that assumption was
> wrong.
>
> Is there any way to set the flag via the driver? Or must I manually set it
> via spark-env.sh on each worker?
>
>
> On Tue, Mar 8, 2016, 20:14 Silvio Fiorito 
> wrote:
>
>> You’ve started the external shuffle service on all worker nodes, correct?
>> Can you confirm they’re still running and haven’t exited?
>>
>>
>>
>>
>>
>>
>>
>> *From: *Yuval.Itzchakov 
>> *Sent: *Tuesday, March 8, 2016 12:41 PM
>> *To: *user@spark.apache.org
>> *Subject: *Using dynamic allocation and shuffle service in Standalone
>> Mode
>>
>>
>> Hi,
>> I'm using Spark 1.6.0, and according to the documentation, dynamic
>> allocation and spark shuffle service should be enabled.
>>
>> When I submit a spark job via the following:
>>
>> spark-submit \
>> --master  \
>> --deploy-mode cluster \
>> --executor-cores 3 \
>> --conf "spark.streaming.backpressure.enabled=true" \
>> --conf "spark.dynamicAllocation.enabled=true" \
>> --conf "spark.dynamicAllocation.minExecutors=2" \
>> --conf "spark.dynamicAllocation.maxExecutors=24" \
>> --conf "spark.shuffle.service.enabled=true" \
>> --conf "spark.executor.memory=8g" \
>> --conf "spark.driver.memory=10g" \
>> --class SparkJobRunner
>>
>> /opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar
>>
>> I'm seeing error logs from the workers being unable to connect to the
>> shuffle service:
>>
>> 16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to
>> external
>> shuffle server, will retry 2 more times after waiting 5 seconds...
>> java.io.IOException: Failed to connect to 
>> at
>>
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>> at
>>
>> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>> at
>>
>> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>> at
>>
>> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
>> at
>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> at
>>
>> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
>> at
>> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
>> at org.apache.spark.executor.Executor.(Executor.scala:85)
>> at
>>
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
>> at
>>
>> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>> at
>>
>> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> I verified all relevant ports are open. Has anyone else experienced such a
>> failure?
>>
>> Yuval.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: No event log in /tmp/spark-events

2016-03-08 Thread Andrew Or
Hi Patrick,

I think he means just write `/tmp/sparkserverlog` instead of
`file:/tmp/sparkserverlog`. However, I think both should work. What mode
are you running in, client mode (the default) or cluster mode? If the
latter, your driver will be run on the cluster, and so your event logs won't
be on the machine you ran spark-submit from. Also, are you running
standalone, YARN or Mesos?

As Jeff commented above, if event logging is in fact enabled you should see the
log message from EventLoggingListener. If the log message is not present in
your driver logs, it's likely that the configurations in your
spark-defaults.conf are not passed correctly.
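
For reference, the relevant entries would look something like this (the
directory is only an example and must exist before the application starts):

# conf/spark-defaults.conf on the machine running spark-submit
spark.eventLog.enabled  true
spark.eventLog.dir      file:/tmp/spark-events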

-Andrew

2016-03-03 19:57 GMT-08:00 PatrickYu :

> alvarobrandon wrote
> > Just write /tmp/sparkserverlog without the file part.
>
> I don't get your point. What do you mean by 'without the file part'?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/No-event-log-in-tmp-spark-events-tp26318p26394.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Limit of application submission to cluster

2015-12-18 Thread Andrew Or
Hi Saif, have you verified that the cluster has enough resources for all 4
programs?
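
With static allocation on standalone, each running application keeps the cores
it was granted until it finishes, so if the first three applications take all
of the cluster's cores, the fourth (or its driver, in cluster mode) will sit in
the WAITING state. One thing to try is capping what each application may take,
e.g. (the numbers are only examples):

spark-submit --master spark://<master>:7077 \
  --conf spark.cores.max=4 \
  --conf spark.executor.memory=2g \
  ...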

-Andrew

2015-12-18 5:52 GMT-08:00 :

> Hello everyone,
>
> I am testing some parallel program submission to a stand alone cluster.
> Everything works alright, the problem is, for some reason, I can’t submit
> more than 3 programs to the cluster.
> The fourth one, whether legacy or REST, simply hangs until one of the
> first three completes.
> I am not sure how to debug this, I have tried to increase the number of
> connections per peer or akka number of threads with no luck, any ideas?
>
> Thanks,
> Saif
>
>


Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Andrew Or
Hi Roy,

I believe Spark just gets its application ID from YARN, so you can just do
`sc.applicationId`.
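
A minimal sketch:

// inside your Spark program, once the SparkContext exists
val appId = sc.applicationId  // e.g. "application_1450419037307_0001" on YARN (made-up id)
println("Running as " + appId)

You can then log it or hand it to "yarn application -kill" from your client
program, as Kyle describes below.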

-Andrew

2015-12-18 0:14 GMT-08:00 Deepak Sharma :

> I have never tried this, but there are YARN client APIs that you can use in
> your spark program to get the application id.
> Here is the link to the yarn client java doc:
>
> http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html
> getApplications() is the method for your purpose here.
>
> Thanks
> Deepak
>
>
> On Fri, Dec 18, 2015 at 1:31 PM, Kyle Lin  wrote:
>
>> Hello there
>>
>> I have the same requirement.
>>
>> I submit a streaming job with yarn-cluster mode.
>>
>> If I want to shutdown this endless YARN application, I should find out
>> the application id by myself and use "yarn application -kill " to
>> kill the application.
>>
>> Therefore, if I can get returned application id in my client program, it
>> will be easy for me to kill YARN application from my client program.
>>
>> Kyle
>>
>>
>>
>> 2015-06-24 13:02 GMT+08:00 canan chen :
>>
>>> I don't think there is YARN-related stuff to access in Spark. Spark
>>> doesn't depend on YARN.
>>>
>>> BTW, why do you want the yarn application id ?
>>>
>>> On Mon, Jun 22, 2015 at 11:45 PM, roy  wrote:
>>>
 Hi,

   Is there a way to get Yarn application ID inside spark application,
 when
 running spark Job on YARN ?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-application-ID-for-Spark-job-on-Yarn-tp23429.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: imposed dynamic resource allocation

2015-12-18 Thread Andrew Or
Hi Antony,

The configuration to enable dynamic allocation is per-application.

If you only wish to enable this for one of your applications, just set
`spark.dynamicAllocation.enabled` to true for that application only. The
way it works under the hood is that the application will start sending requests
to the AM asking for executors. If you did not enable this config, your
application will not make such requests.
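
In other words, something like this (a sketch) enables it for a single
submission only, provided the shuffle service is already running on the
NodeManagers:

spark-submit --master yarn-client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  ...

Applications submitted without these flags should keep their static executors.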

-Andrew

2015-12-11 14:01 GMT-08:00 Antony Mayi :

> Hi,
>
> using spark 1.5.2 on yarn (client mode) and was trying to use the dynamic
> resource allocation, but it seems that once it is enabled by the first app, any
> following application is managed that way even if it is explicitly disabled.
>
> example:
> 1) yarn configured with org.apache.spark.network.yarn.YarnShuffleService
> as spark_shuffle aux class
> 2) running first app that doesn't specify dynamic allocation / shuffle
> service - it runs as expected with static executors
> 3) running second application that enables spark.dynamicAllocation.enabled
> and spark.shuffle.service.enabled - it is dynamic as expected
> 4) running another app that doesn't enable (and even explicitly disables) dynamic
> allocation / shuffle service - still the executors are being added/removed
> dynamically throughout the runtime.
> 5) restarting nodemanagers to reset this
>
> Is this known issue or have I missed something? Can the dynamic resource
> allocation be enabled per application?
>
> Thanks,
> Antony.
>


Re: which aws instance type for shuffle performance

2015-12-18 Thread Andrew Or
Hi Rastan,

Unless you're using off-heap memory or starting multiple executors per
machine, I would recommend the r3.2xlarge option, since you don't actually
want gigantic heaps (100GB is more than enough). I've personally run Spark
on a very large scale with r3.8xlarge instances, but I've been using
off-heap, so much of the memory was actually not used.

Yes, if a shuffle file exists locally Spark just reads from disk.

-Andrew

2015-12-15 23:11 GMT-08:00 Rastan Boroujerdi :

> I'm trying to determine whether I should be using 10 r3.8xlarge or 40
> r3.2xlarge. I'm mostly concerned with shuffle performance of the
> application.
>
> If I go with r3.8xlarge I will need to configure 4 worker instances per
> machine to keep the JVM size down. The worker instances will likely contend
> with each other for network and disk I/O if they are on the same machine.
> If I go with 40 r3.2xlarge I will be able to allocate a single worker
> instance per box, allowing each worker instance to have its own dedicated
> network and disk I/O.
>
> Since shuffle performance is heavily impacted by disk and network
> throughput, it seems like going with 40 r3.2xlarge would be the better
> configuration between the two. Is my analysis correct? Are there other
> tradeoffs that I'm not taking into account? Does spark bypass the network
> transfer and read straight from disk if worker instances are on the same
> machine?
>
> Thanks,
>
> Rastan
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Hi Greg,

It's actually intentional for standalone cluster mode to not upload jars.
One of the reasons why YARN takes at least 10 seconds before running any
simple application is because there's a lot of random overhead (e.g.
putting jars in HDFS). If this missing functionality is not documented
somewhere then we should add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I
have filed https://issues.apache.org/jira/browse/SPARK-12559.

-Andrew

2015-12-29 4:18 GMT-08:00 Greg Hill :

>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the worker nodes.
> >
> >I might have understood wrong, but I thought the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.


@Annabel That's not true. There *is* a standalone cluster mode where driver
runs on one of the workers instead of on the client machine. What you're
describing is standalone client mode.

2015-12-29 11:32 GMT-08:00 Annabel Melongo :

> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar
>
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or 
> wrote:
>
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars.
> One of the reasons why YARN takes at least 10 seconds before running any
> simple application is because there's a lot of random overhead (e.g.
> putting jars in HDFS). If this missing functionality is not documented
> somewhere then we should add that.
>
> Also, the packages problem seems legitimate. Thanks for reporting it. I
> have filed https://issues.apache.org/jira/browse/SPARK-12559.
>
> -Andrew
>
> 2015-12-29 4:18 GMT-08:00 Greg Hill :
>
>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the worker nodes.
> >
> >I might have understood wrong, but I thought the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: Opening Dynamic Scaling Executors on Yarn

2015-12-29 Thread Andrew Or
>
> External shuffle service is backward compatible, so if you deployed 1.6
> shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications.


Actually, it just happens to be backward compatible because we didn't
change the shuffle file formats. This may not necessarily be the case
moving forward as Spark offers no such guarantees. Just thought it's worth
clarifying.

2015-12-27 22:34 GMT-08:00 Saisai Shao :

> External shuffle service is backward compatible, so if you deployed 1.6
> shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications.
>
> Thanks
> Saisai
>
> On Mon, Dec 28, 2015 at 2:33 PM, 顾亮亮  wrote:
>
>> Is it possible to support both spark-1.5.1 and spark-1.6.0 on one yarn
>> cluster?
>>
>>
>>
>> *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com]
>> *Sent:* Monday, December 28, 2015 2:29 PM
>> *To:* Jeff Zhang
>> *Cc:* 顾亮亮; user@spark.apache.org; 刘骋昺
>> *Subject:* Re: Opening Dynamic Scaling Executors on Yarn
>>
>>
>>
>> Replacing all the shuffle jars and restarting the NodeManagers is enough; no
>> need to restart the NN.
>>
>>
>>
>> On Mon, Dec 28, 2015 at 2:05 PM, Jeff Zhang  wrote:
>>
>> See
>> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Dec 28, 2015 at 2:00 PM, 顾亮亮  wrote:
>>
>> Hi all,
>>
>>
>>
>> SPARK-3174 (https://issues.apache.org/jira/browse/SPARK-3174) is a
>> useful feature to save resources on yarn.
>>
>> We want to open this feature on our yarn cluster.
>>
>> I have a question about the version of shuffle service.
>>
>>
>>
>> I’m now using spark-1.5.1 (shuffle service).
>>
>> If I want to upgrade to spark-1.6.0, should I replace the shuffle service
>> jar and restart all the namenode on yarn ?
>>
>>
>>
>> Thanks a lot.
>>
>>
>>
>> Mars
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo :

> Greg,
>
> Can you please send me a doc describing the standalone cluster mode?
> Honestly, I never heard about it.
>
> The three different modes, I've listed appear in the last paragraph of
> this doc: Running Spark Applications
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
>
>
>
>
>
>
> Running Spark Applications
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
> --class The FQCN of the class containing the main method of the
> application. For example, org.apache.spark.examples.SparkPi. --conf
> View on www.cloudera.com
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
> Preview by Yahoo
>
>
>
>
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or 
> wrote:
>
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>
> @Annabel That's not true. There *is* a standalone cluster mode where
> driver runs on one of the workers instead of on the client machine. What
> you're describing is standalone client mode.
>
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar
>
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or 
> wrote:
>
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars.
> One of the reasons why YARN takes at least 10 seconds before running any
> simple application is because there's a lot of random overhead (e.g.
> putting jars in HDFS). If this missing functionality is not documented
> somewhere then we should add that.
>
> Also, the packages problem seems legitimate. Thanks for reporting it. I
> have filed https://issues.apache.org/jira/browse/SPARK-12559.
>
> -Andrew
>
> 2015-12-29 4:18 GMT-08:00 Greg Hill :
>
>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the worker nodes.
> >
> >I might have understood wrong, but I thought the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
>
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Let me clarify a few things for everyone:

There are three *cluster managers*: standalone, YARN, and Mesos. Each
cluster manager can run in two *deploy modes*, client or cluster. In client
mode, the driver runs on the machine that submitted the application (the
client). In cluster mode, the driver runs on one of the worker machines in
the cluster.

When I say "standalone cluster mode" I am referring to the standalone
cluster manager running in cluster deploy mode.

Here's how the resources are distributed in each mode (omitting Mesos):

*Standalone / YARN client mode. *The driver runs on the client machine
(i.e. machine that ran Spark submit) so it should already have access to
the jars. The executors then pull the jars from an HTTP server started in
the driver.

*Standalone cluster mode. *Spark submit does *not* upload your jars to the
cluster, so all the resources you need must already be on all of the worker
machines. The executors, however, actually just pull the jars from the
driver as in client mode instead of finding them in their own local file
systems.

*YARN cluster mode. *Spark submit *does* upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read from
there. As in other deployments, the executors pull the jars from the driver.
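
So for standalone cluster mode specifically, the application jar has to live at
a path that is reachable from every worker machine, e.g. a shared filesystem or
HDFS. A sketch, with a made-up class name and jar location:

spark-submit --master spark://<master>:6066 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  hdfs:///apps/my-app-assembly.jar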


When the docs say "If your application is launched through Spark submit,
then the application jar is automatically distributed to all worker nodes," it
is actually saying that your executors get their jars from the driver. This
is true whether you're running in client mode or cluster mode.

If the docs are unclear (and they seem to be), then we should update them.
I have filed SPARK-12565 <https://issues.apache.org/jira/browse/SPARK-12565>
to track this.

Please let me know if there's anything else I can help clarify.

Cheers,
-Andrew




2015-12-29 13:07 GMT-08:00 Annabel Melongo :

> Andrew,
>
> Now I see where the confusion lays. Standalone cluster mode, your link, is
> nothing but a combination of client-mode and standalone mode, my link,
> without YARN.
>
> But I'm confused by this paragraph in your link:
>
> If your application is launched through Spark submit, then the
> application jar is automatically distributed to all worker nodes. For any
> additional jars that your
>   application depends on, you should specify them through the
> --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
>
> That can't be true; this is only the case when Spark runs on top of YARN.
> Please correct me, if I'm wrong.
>
> Thanks
>
>
>
> On Tuesday, December 29, 2015 2:54 PM, Andrew Or 
> wrote:
>
>
>
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
>
> 2015-12-29 11:48 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> Can you please send me a doc describing the standalone cluster mode?
> Honestly, I never heard about it.
>
> The three different modes, I've listed appear in the last paragraph of
> this doc: Running Spark Applications
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
>
>
>
>
>
>
> Running Spark Applications
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
> --class The FQCN of the class containing the main method of the
> application. For example, org.apache.spark.examples.SparkPi. --conf
> View on www.cloudera.com
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
> Preview by Yahoo
>
>
>
>
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or 
> wrote:
>
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>
> @Annabel That's not true. There *is* a standalone cluster mode where
> driver runs on one of the workers instead of on the client machine. What
> you're describing is standalone client mode.
>
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same

Re: Can't submit job to stand alone cluster

2015-12-30 Thread Andrew Or
Hi Jim,

Just to clarify further:

   - *Driver *is the process with SparkContext. A driver represents an
   application (e.g. spark-shell, SparkPi) so there is exactly one driver in
   each application.


   - *Executor *is the process that runs the tasks scheduled by the driver.
   There should be at least one executor in each application.


   - *Master *is the process that handles scheduling of *applications*. It
   decides where drivers and executors are launched and how many cores and how
   much memory to give to each application. This only exists in standalone
   mode.


   - *Worker *is the process that actually launches the executor and driver
   JVMs (the latter only in cluster mode). It talks to the Master to decide
   what to launch with how much memory to give to the process. This only
   exists in standalone mode.

It is actually the *driver*, not the Master, that distributes jars to
executors. The Master is largely unconcerned with individual requirements
from an application apart from cores / memory constraints. This is because
we still need to distribute jars to executors in YARN and Mesos modes, so
the common process, the driver, has to do it.

I thought the whole point of confusion is that people expect the driver to
> distribute jars but they have to be visible to the master on the file
> system local to the master?


Actually the requirement is that the jars have to be visible to the machine
running the *driver*, not the Master. In client mode, your jars have to be
visible to the machine running spark-submit. In cluster mode, your jars
have to be visible to all machines running a Worker, since the driver can
be launched on any of them.

The nice email from Greg is spot-on.

Does that make sense?

-Andrew


2015-12-30 11:23 GMT-08:00 SparkUser :

> Sorry need to clarify:
>
> When you say:
>
> *When the docs say **"If your application is launched through Spark
> submit, then the application jar is automatically distributed to all worker
> nodes,"**it is actually saying that your executors get their jars from
> the driver. This is true whether you're running in client mode or cluster
> mode.*
>
>
> Don't you mean the master, not the driver? I thought the whole point of
> confusion is that people expect the driver to distribute jars but they have
> to be visible to the master on the file system local to the master?
>
> I see a lot of people tripped up by this and a nice mail from Greg Hill to
> the list cleared this up for me but now I am confused again. I am a couple
> days away from having a way to test this myself, so I am just "in theory"
> right now.
>
> On 12/29/2015 05:18 AM, Greg Hill wrote:
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
>
> Thanks,
>
> Jim
>
>
> On 12/29/2015 04:36 PM, Daniel Valdivia wrote:
>
> That makes things more clear! Thanks
>
> Issue resolved
>
> Sent from my iPhone
>
> On Dec 29, 2015, at 2:43 PM, Annabel Melongo < 
> melongo_anna...@yahoo.com> wrote:
>
> Thanks Andrew for this awesome explanation :)
>
>
> On Tuesday, December 29, 2015 5:30 PM, Andrew Or < 
> and...@databricks.com> wrote:
>
>
> Let me clarify a few things for everyone:
>
> There are three *cluster managers*: standalone, YARN, and Mesos. Each
> cluster manager can run in two *deploy modes*, client or cluster. In
> client mode, the driver runs on the machine that submitted the application
> (the client). In cluster mode, the driver runs on one of the worker
> machines in the cluster.
>
> When I say "standalone cluster mode" I am referring to the standalone
> cluster manager running in cluster deploy mode.
>
> Here's how the resources are distributed in each mode (omitting Mesos):
>
> *Standalone / YARN client mode. *The driver runs on the client machine
> (i.e. machine that ran Spark submit) so it should already have access to
> the jars. The executors then pull the jars from an HTTP server started in
> the driver.
>
> *Standalone cluster mode. *Spark submit does *not* upload your jars to
> the cluster, so all the resources you need must already be on all of the
> worker machines. The executors, however

Re: Read Accumulator value while running

2016-01-13 Thread Andrew Or
Hi Kira,

As you suspected, accumulator values are only updated after the task
completes. We do send accumulator updates from the executors to the driver
on periodic heartbeats, but these only concern internal accumulators, not
the ones created by the user.

In short, I'm afraid there is not currently a way (in Spark 1.6 and before)
to access the accumulator values until after the tasks that updated them
have completed. This will change in Spark 2.0, the next version, however.
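
To make that concrete, a small sketch against the Spark 1.x API:

val acc = sc.accumulator(0L, "records processed")
val rdd = sc.parallelize(1 to 1000000)
rdd.foreach(_ => acc += 1L)  // the long-running action ("ac1"); updates happen on the executors
println(acc.value)           // only reliable here, after the action has returned on the driver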

Please let me know if you have more questions.
-Andrew

2016-01-13 11:24 GMT-08:00 Daniel Imberman :

> Hi Kira,
>
> I'm having some trouble understanding your question. Could you please give
> a code example?
>
>
>
> From what I think you're asking there are two issues with what you're
> looking to do. (Please keep in mind I could be totally wrong on both of
> these assumptions, but this is what I've been lead to believe)
>
> 1. The contract of an accumulator is that you can't actually read the
> value as the function is performing because the values in the accumulator
> don't actually mean anything until they are reduced. If you were looking
> for progress in a local context, you could do mapPartitions and have a
> local accumulator per partition, but I don't think it's possible to get the
> actual accumulator value in the middle of the map job.
>
> 2. As far as performing ac2 while ac1 is "always running", I'm pretty sure
>> that's not possible. The way that lazy evaluation works in Spark, the
> transformations have to be done serially. Having it any other way would
> actually be really bad because then you could have ac1 changing the data
> thereby making ac2's output unpredictable.
>
> That being said, with a more specific example it might be possible to help
> figure out a solution that accomplishes what you are trying to do.
>
> On Wed, Jan 13, 2016 at 5:43 AM Kira  wrote:
>
>> Hi,
>>
>> So I have an action on one RDD that is relatively long, let's call it ac1;
>> what I want to do is to execute another action (ac2) on the same RDD to see
>> the evolution of the first one (ac1); to this end I want to use an
>> accumulator and read its value progressively to see the changes on it (on
>> the fly) while ac1 is always running. My problem is that the accumulator is
>> only updated once ac1 has finished, which is not helpful for me
>> :/ .
>>
>> I ve seen  here
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-td15758.html
>> >
>> what may seem like a solution for me but it doesn't work: "While Spark
>> already offers support for asynchronous reduce (collect data from workers,
>> while not interrupting execution of a parallel transformation) through
>> accumulator"
>>
>> Another post suggested to use SparkListner to do that.
>>
>> are these solutions correct? If yes, can you give me a simple example?
>> are there other solutions ?
>>
>> thank you.
>> Regards
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Read-Accumulator-value-while-running-tp25960.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Andrew Or
Hi Alex,

Yes, you can set `spark.cleaner.ttl`:
http://spark.apache.org/docs/1.6.0/configuration.html, but I would not
recommend it!

We are actually removing this property in Spark 2.0 because it has caused
problems for many users in the past. In particular, if you accidentally use
a variable that has been automatically cleaned, then you will run into
problems like shuffle fetch failures or broadcast variable not found etc,
which may fail your job.

Alternatively, Spark already automatically cleans up all variables that
have been garbage collected, including RDDs, shuffle dependencies,
broadcast variables and accumulators. This context-based cleaning has been
enabled by default for many versions by now so it should be reliable. The
only caveat is that it may not work super well in a shell environment,
where some variables may never go out of scope.
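
If you just want specific RDDs gone at a known point, unpersisting them
explicitly is the safer route; a small sketch (the path is made up):

val cached = sc.textFile("hdfs:///data/input").cache()
cached.count()      // ... do the work that needs the cached data ...
cached.unpersist()  // frees the cached blocks; the RDD can still be recomputed if used again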

Please let me know if you have more questions,
-Andrew


2016-01-13 11:36 GMT-08:00 Alexander Pivovarov :

> Is it possible to automatically unpersist RDDs which are not used for 24
> hours?
>


Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Andrew Or
Hi all,

Both the history server and the shuffle service are backward compatible,
but not forward compatible. This means as long as you have the latest
version of history server / shuffle service running in your cluster then
you're fine (you don't need multiple of them).

That said, an old shuffle service (e.g. 1.2) also happens to work with say
Spark 1.4 because the shuffle file formats haven't changed. However, there
are no guarantees that this will remain the case.

-Andrew

2015-10-05 16:37 GMT-07:00 Alex Rovner :

> We are running CDH 5.4 with Spark 1.3 as our main version and that version
> is configured to use the external shuffling service. We have also installed
> Spark 1.5 and have configured it not to use the external shuffling service
> and that works well for us so far. I would be interested myself how to
> configure multiple versions to use the same shuffling service.
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
> * *
>
> On Mon, Oct 5, 2015 at 11:06 AM, Andreas Fritzler <
> andreas.fritz...@gmail.com> wrote:
>
>> Hi Steve, Alex,
>>
>> how do you handle the distribution and configuration of
>> the spark-*-yarn-shuffle.jar on your NodeManagers if you want to use 2
>> different Spark versions?
>>
>> Regards,
>> Andreas
>>
>> On Mon, Oct 5, 2015 at 4:54 PM, Steve Loughran 
>> wrote:
>>
>>>
>>> > On 5 Oct 2015, at 16:48, Alex Rovner  wrote:
>>> >
>>> > Hey Steve,
>>> >
>>> > Are you referring to the 1.5 version of the history server?
>>> >
>>>
>>>
>>> Yes. I should warn, however, that there's no guarantee that a history
>>> server running the 1.4 code will handle the histories of a 1.5+ job. In
>>> fact, I'm fairly confident it won't, as the events to get replayed are
>>> different.
>>>
>>
>>
>


Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-09 Thread Andrew Or
Hi Tom,

I believe a workaround is to set `spark.dynamicAllocation.initialExecutors`
to 0. As others have mentioned, from Spark 1.5.2 onwards this should no
longer be necessary.
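
In other words, keep your original command and add the one extra flag, e.g.:

spark-shell --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=0 \
  ... (remaining options exactly as in your original command) \
  --master yarn-client --queue default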

-Andrew

2015-11-09 8:19 GMT-08:00 Jonathan Kelly :

> Tom,
>
> You might be hitting https://issues.apache.org/jira/browse/SPARK-10790,
> which was introduced in Spark 1.5.0 and fixed in 1.5.2. Spark 1.5.2 just
> passed release candidate voting, so it should be tagged, released and
> announced soon. If you are able to build from source yourself and run with
> that, you might want to try building from the v1.5.2-rc2 tag to see if it
> fixes your issue. Otherwise, hopefully Spark 1.5.2 will be available for
> download very soon.
>
> ~ Jonathan
>
> On Mon, Nov 9, 2015 at 6:08 AM, Akhil Das 
> wrote:
>
>> Did you go through
>> http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup
>> for yarn, i guess you will have to copy the spark-1.5.1-yarn-shuffle.jar to
>> the classpath of all nodemanagers in your cluster.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Oct 30, 2015 at 7:41 PM, Tom Stewart <
>> stewartthom...@yahoo.com.invalid> wrote:
>>
>>> I am running the following command on a Hadoop cluster to launch Spark
>>> shell with DRA:
>>> spark-shell  --conf spark.dynamicAllocation.enabled=true --conf
>>> spark.shuffle.service.enabled=true --conf
>>> spark.dynamicAllocation.minExecutors=4 --conf
>>> spark.dynamicAllocation.maxExecutors=12 --conf
>>> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=120 --conf
>>> spark.dynamicAllocation.schedulerBacklogTimeout=300 --conf
>>> spark.dynamicAllocation.executorIdleTimeout=60 --executor-memory 512m
>>> --master yarn-client --queue default
>>>
>>> This is the code I'm running within the Spark Shell - just demo stuff
>>> from the web site.
>>>
>>> import org.apache.spark.mllib.clustering.KMeans
>>> import org.apache.spark.mllib.linalg.Vectors
>>>
>>> // Load and parse the data
>>> val data = sc.textFile("hdfs://ns/public/sample/kmeans_data.txt")
>>>
>>> val parsedData = data.map(s => Vectors.dense(s.split('
>>> ').map(_.toDouble))).cache()
>>>
>>> // Cluster the data into two classes using KMeans
>>> val numClusters = 2
>>> val numIterations = 20
>>> val clusters = KMeans.train(parsedData, numClusters, numIterations)
>>>
>>> This works fine on Spark 1.4.1 but is failing on Spark 1.5.1. Did
>>> something change that I need to do differently for DRA on 1.5.1?
>>>
>>> This is the error I am getting:
>>> 15/10/29 21:44:19 WARN YarnScheduler: Initial job has not accepted any
>>> resources; check your cluster UI to ensure that workers are registered and
>>> have sufficient resources
>>> 15/10/29 21:44:34 WARN YarnScheduler: Initial job has not accepted any
>>> resources; check your cluster UI to ensure that workers are registered and
>>> have sufficient resources
>>> 15/10/29 21:44:49 WARN YarnScheduler: Initial job has not accepted any
>>> resources; check your cluster UI to ensure that workers are registered and
>>> have sufficient resources
>>>
>>> That happens to be the same error you get if you haven't followed the
>>> steps to enable DRA, however I have done those and as I said if I just flip
>>> to Spark 1.4.1 on the same cluster it works with my YARN config.
>>>
>>>
>>
>


Re: create a table for csv files

2015-11-19 Thread Andrew Or
There's not an easy way. The closest thing you can do is:

import org.apache.spark.sql.functions._

val df = ...
df.withColumn("id", monotonicallyIncreasingId())
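
Note that the generated ids are unique and increasing but not consecutive, so
this is not a true auto-increment column. A fuller sketch, assuming the
spark-csv package is on the classpath (e.g. added via --packages) and using a
made-up input path:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/myfile.csv")
val withId = df.withColumn("id", monotonicallyIncreasingId())
withId.registerTempTable("my_table")  // now queryable from Spark SQL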

-Andrew

2015-11-19 8:23 GMT-08:00 xiaohe lan :

> Hi,
>
> I have some csv files in HDFS with headers like col1, col2, col3, and I want to
> add a column named id, so that a record would be 
>
> How can I do this using Spark SQL ? Can id be auto increment ?
>
> Thanks,
> Xiaohe
>


Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andrew Or
Hi Andy,

You must be running in cluster mode. The Spark Master accepts client mode
submissions on port 7077 and cluster mode submissions on port 6066. This is
because standalone cluster mode uses a REST API to submit applications by
default. If you submit to port 6066 instead the warning should go away.
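
In other words, keep everything else the same and only change the port in the
master URL, e.g.:

spark-submit --class com.pws.spark.streaming.IngestDriver \
  --master spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:6066 \
  --total-executor-cores 2 --deploy-mode cluster \
  hdfs:///home/ec2-user/build/ingest-all.jar --clusterMode --dirPath week_3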

-Andrew


2015-12-10 18:13 GMT-08:00 Andy Davidson :

> Hi Jakob
>
> The cluster was set up using the spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
> script
>
> Given my limited knowledge I think this looks okay?
>
> Thanks
>
> Andy
>
> $ sudo netstat -peant | grep 7077
>
> tcp0  0 :::172-31-30-51:7077:::*
>  LISTEN  0  311641427355/java
>
> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:57311
> ESTABLISHED 0  311591927355/java
>
> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:42333
> ESTABLISHED 0  373666427355/java
>
> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:49796
> ESTABLISHED 0  311592527355/java
>
> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:42290
> ESTABLISHED 0  311592327355/java
>
>
> $ ps -aux | grep 27355
>
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.8/FAQ
>
> ec2-user 23867  0.0  0.0 110404   872 pts/0S+   02:06   0:00 grep 27355
>
> root 27355  0.5  6.7 3679096 515836 ?  Sl   Nov26 107:04
> /usr/java/latest/bin/java -cp
> /root/spark/sbin/../conf/:/root/spark/lib/spark-assembly-1.5.1-hadoop1.2.1.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/ephemeral-hdfs/conf/
> -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip
> ec2-54-215-217-122.us-west-1.compute.amazonaws.com --port 7077
> --webui-port 8080
>
> From: Jakob Odersky 
> Date: Thursday, December 10, 2015 at 5:55 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: Warning: Master endpoint spark://ip:7077 was not a REST
> server. Falling back to legacy submission gateway instead.
>
> Is there any other process using port 7077?
>
> On 10 December 2015 at 08:52, Andy Davidson  > wrote:
>
>> Hi
>>
>> I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning?
>> My job seems to run without any problem.
>>
>> Kind regards
>>
>> Andy
>>
>> + /root/spark/bin/spark-submit --class
>> com.pws.spark.streaming.IngestDriver --master spark://
>> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077
>> --total-executor-cores 2 --deploy-mode cluster
>> hdfs:///home/ec2-user/build/ingest-all.jar --clusterMode --dirPath week_3
>>
>> Running Spark using the REST application submission protocol.
>>
>> 15/12/10 16:46:33 WARN RestSubmissionClient: Unable to connect to server
>> spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077.
>>
>> Warning: Master endpoint
>> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077 was not a REST
>> server. Falling back to legacy submission gateway instead.
>>
>
>


Re: Spark job submission REST API

2015-12-10 Thread Andrew Or
Hello,

The hidden API was implemented for use internally and there are no plans to
make it public at this point. It was originally introduced to provide
backward compatibility in submission protocol across multiple versions of
Spark. A full-fledged stable REST API for submitting applications would
require a detailed design consensus among the community.

-Andrew

2015-12-10 8:26 GMT-08:00 mvle :

> Hi,
>
> I would like to use Spark as a service through REST API calls
> for uploading and submitting a job, getting results, etc.
>
> There is a project by the folks at Ooyala:
> https://github.com/spark-jobserver/spark-jobserver
>
> I also encountered some hidden job REST APIs in Spark:
> http://arturmkrtchyan.com/apache-spark-hidden-rest-api
>
> To help determine which set of APIs to use, I would like to know
> the plans for those hidden Spark APIs.
> Will they be made public and supported at some point?
>
> Thanks,
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-submission-REST-API-tp25670.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Or
Hi Andrew,

Based on your driver logs, it seems the issue is that the shuffle service
is actually not running on the NodeManagers, but your application is trying
to provide a "spark_shuffle" secret anyway. One way to verify whether the
shuffle service is actually started is to look at the NodeManager logs for
the following lines:

*Initializing YARN shuffle service for Spark*
*Started YARN shuffle service for Spark on port X*

These should be logged under the INFO level. Also, could you verify whether
*all* the executors have this problem, or just a subset? If even one of the
NM doesn't have the shuffle service, you'll see the stack trace that you
ran into. It would be good to confirm whether the yarn-site.xml change is
actually reflected on all NMs if the log statements above are missing.
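
For comparison, the NodeManager-side configuration should contain something
along these lines (a sketch; keep whatever other aux-services, e.g.
mapreduce_shuffle, you already have in the list):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

The spark-<version>-yarn-shuffle.jar also needs to be on the NodeManager
classpath, and the NodeManagers have to be restarted after the change.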

Let me know if you can get it working. I've run the shuffle service myself
on the master branch (which will become Spark 1.5.0) recently following the
instructions and have not encountered any problems.

-Andrew


Re: Spark spark.shuffle.memoryFraction has no affect

2015-07-22 Thread Andrew Or
Hi,

The setting of 0.2 / 0.6 looks reasonable to me. Since you are not using
caching at all, have you tried something more extreme, like 0.1 /
0.9? Since disabling spark.shuffle.spill didn't cause an OOM this setting
should be fine. Also, one thing you could do is to verify the shuffle bytes
spilled on the UI before and after the change.
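
That is, in spark-defaults.conf:

spark.storage.memoryFraction   0.1
spark.shuffle.memoryFraction   0.9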

Let me know if that helped.
-Andrew

2015-07-21 13:50 GMT-07:00 wdbaruni :

> Hi
> I am testing Spark on Amazon EMR using Python and the basic wordcount
> example shipped with Spark.
>
> After running the application, I realized that in Stage 0 reduceByKey(add),
> around 2.5GB shuffle is spilled to memory and 4GB shuffle is spilled to
> disk. Since in the wordcount example I am not caching or persisting any
> data, I thought I could increase the performance of this application by
> giving more shuffle memoryFraction. So, in spark-defaults.conf, I added the
> following:
>
> spark.storage.memoryFraction0.2
> spark.shuffle.memoryFraction0.6
>
> However, I am still getting the same performance and the same amount of
> shuffle data is being spilled to disk and memory. I validated that Spark is
> reading these configurations using Spark UI/Environment and I can see my
> changes. Moreover, I tried setting spark.shuffle.spill to false and I got
> the performance I am looking for and all shuffle data was spilled to memory
> only.
>
> So, what am I getting wrong here, and why is the extra shuffle memory
> fraction not being utilized?
>
> *My environment:*
> Amazon EMR with Spark 1.3.1 running using -x argument
> 1 Master node: m3.xlarge
> 3 Core nodes: m3.xlarge
> Application: wordcount.py
> Input: 10 .gz files 90MB each (~350MB unarchived) stored in S3
>
> *Submit command:*
> /home/hadoop/spark/bin/spark-submit --deploy-mode client /mnt/wordcount.py
> s3n://
>
> *spark-defaults.conf:*
> spark.eventLog.enabled  false
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
> spark.driver.extraJavaOptions   -Dspark.driver.log.level=INFO
> spark.masteryarn
> spark.executor.instances3
> spark.executor.cores4
> spark.executor.memory   9404M
> spark.default.parallelism   12
> spark.eventLog.enabled  true
> spark.eventLog.dir  hdfs:///spark-logs/
> spark.storage.memoryFraction0.2
> spark.shuffle.memoryFraction0.6
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-spark-shuffle-memoryFraction-has-no-affect-tp23944.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan,

If the map is small enough, you can just broadcast it, can't you? It
doesn't have to be an RDD. Here's an example of broadcasting an array and
using it on the executors:
https://github.com/apache/spark/blob/c03299a18b4e076cabb4b7833a1e7632c5c0dabe/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala
.

-Andrew

2015-07-21 19:56 GMT-07:00 ayan guha :

> Either you have to do rdd.collect and then broadcast or you can do a join
> On 22 Jul 2015 07:54, "Dan Dong"  wrote:
>
>> Hi, All,
>>
>>
>> I am trying to access a Map from RDDs that are on different compute
>> nodes, but without success. The Map is like:
>>
>> val map1 = Map("aa"->1,"bb"->2,"cc"->3,...)
>>
>> All RDDs will have to check against it to see if the key is in the Map or
>> not, so seems I have to make the Map itself global, the problem is that if
>> the Map is stored as RDDs and spread across the different nodes, each node
>> will only see a piece of the Map and the info will not be complete to check
>> against the Map( an then replace the key with the corresponding value) E,g:
>>
>> val matchs= Vecs.map(term=>term.map{case (a,b)=>(map1(a),b)})
>>
>> But if the Map is not an RDD, how to share it like sc.broadcast(map1)
>>
>> Any idea about this? Thanks!
>>
>>
>> Cheers,
>> Dan
>>
>>


Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
Hi,

It would be whatever's left in the JVM. This is not explicitly controlled
by a fraction like storage or shuffle. However, the computation usually
doesn't need to use that much space. In my experience it's almost always
the caching or the aggregation during shuffles that's the most memory
intensive.

-Andrew

2015-07-21 13:47 GMT-07:00 wdbaruni :

> I am new to Spark and I understand that Spark divides the executor memory
> into the following fractions:
>
> *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or
> .cache() and can be defined by setting spark.storage.memoryFraction
> (default
> 0.6)
>
> *Shuffle and aggregation buffers:* Which Spark uses to store shuffle
> outputs. It can defined using spark.shuffle.memoryFraction. If shuffle
> output exceeds this fraction, then Spark will spill data to disk (default
> 0.2)
>
> *User code:* Spark uses this fraction to execute arbitrary user code
> (default 0.2)
>
> I am not mentioning the storage and shuffle safety fractions for
> simplicity.
>
> My question is, which memory fraction is Spark using to compute and
> transform RDDs that are not going to be persisted? For example:
>
> lines = sc.textFile("i am a big file.txt")
> count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x,
> 1)).reduceByKey(add)
> count.saveAsTextFile("output")
>
> Here Spark will not load the whole file at once and will partition the
> input
> file and do all these transformations per partition in a single stage.
> However, which memory fraction Spark will use to load the partitioned
> lines,
> compute flatMap() and map()?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Which-memory-fraction-is-Spark-using-to-compute-RDDs-that-are-not-going-to-be-persisted-tp23942.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth,

It does look like a bug. Did you set `spark.executor.cores` in your
application by any chance?

-Andrew

2015-07-22 8:05 GMT-07:00 Srikanth :

> Hello,
>
> I've set spark.deploy.spreadOut=false in spark-env.sh.
>
>> export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4
>> -Dspark.deploy.spreadOut=false"
>
>
> There are 3 workers, each with 4 cores. Spark-shell was started with no. of
> cores = 6.
> Spark UI shows that one executor was used with 6 cores.
>
> Is this a bug? This is with Spark 1.4.
>
> [image: Inline image 1]
>
> Srikanth
>


Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth,

I was able to reproduce the issue by setting `spark.cores.max` to a number
greater than the number of cores on a worker. I've filed SPARK-9260 which I
believe is already being fixed in https://github.com/apache/spark/pull/7274.

Thanks for reporting the issue!
-Andrew

2015-07-22 11:49 GMT-07:00 Andrew Or :

> Hi Srikanth,
>
> It does look like a bug. Did you set `spark.executor.cores` in your
> application by any chance?
>
> -Andrew
>
> 2015-07-22 8:05 GMT-07:00 Srikanth :
>
>> Hello,
>>
>> I've set spark.deploy.spreadOut=false in spark-env.sh.
>>
>>> export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4
>>> -Dspark.deploy.spreadOut=false"
>>
>>
>> There are 3 workers, each with 4 cores. Spark-shell was started with no. of
>> cores = 6.
>> Spark UI shows that one executor was used with 6 cores.
>>
>> Is this a bug? This is with Spark 1.4.
>>
>> [image: Inline image 1]
>>
>> Srikanth
>>
>
>


Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan,

`map2` is a broadcast variable, not your map. To access the map on the
executors you need to do `map2.value(a)`.

-Andrew

2015-07-22 12:20 GMT-07:00 Dan Dong :

> Hi, Andrew,
>   If I broadcast the Map:
> val map2=sc.broadcast(map1)
>
> I will get compilation error:
> org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,String]]
> does not take parameters
> [error]  val matchs= Vecs.map(term=>term.map{case (a,b)=>(map2(a),b)})
>
> Seems it's still an RDD, so how to access it by value=map2(key) ? Thanks!
>
> Cheers,
> Dan
>
>
>
> 2015-07-22 2:20 GMT-05:00 Andrew Or :
>
>> Hi Dan,
>>
>> If the map is small enough, you can just broadcast it, can't you? It
>> doesn't have to be an RDD. Here's an example of broadcasting an array and
>> using it on the executors:
>> https://github.com/apache/spark/blob/c03299a18b4e076cabb4b7833a1e7632c5c0dabe/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala
>> .
>>
>> -Andrew
>>
>> 2015-07-21 19:56 GMT-07:00 ayan guha :
>>
>>> Either you have to do rdd.collect and then broadcast or you can do a join
>>> On 22 Jul 2015 07:54, "Dan Dong"  wrote:
>>>
>>>> Hi, All,
>>>>
>>>>
>>>> I am trying to access a Map from RDDs that are on different compute
>>>> nodes, but without success. The Map is like:
>>>>
>>>> val map1 = Map("aa"->1,"bb"->2,"cc"->3,...)
>>>>
>>>> All RDDs will have to check against it to see if the key is in the Map
>>>> or not, so seems I have to make the Map itself global, the problem is that
>>>> if the Map is stored as RDDs and spread across the different nodes, each
>>>> node will only see a piece of the Map and the info will not be complete to
>>>> check against the Map (and then replace the key with the corresponding
>>>> value). E.g.:
>>>>
>>>> val matchs= Vecs.map(term=>term.map{case (a,b)=>(map1(a),b)})
>>>>
>>>> But if the Map is not an RDD, how to share it like sc.broadcast(map1)
>>>>
>>>> Any idea about this? Thanks!
>>>>
>>>>
>>>> Cheers,
>>>> Dan
>>>>
>>>>
>>
>


Re: spark.executor.memory and spark.driver.memory have no effect in yarn-cluster mode (1.4.x)?

2015-07-22 Thread Andrew Or
Hi Michael,

In general, driver related properties should not be set through the
SparkConf. This is because by the time the SparkConf is created, we have
already started the driver JVM, so it's too late to change the memory,
class paths and other properties.

In cluster mode, executor related properties should also not be set through
the SparkConf. This is because the driver is run on the cluster just like
the executors, and the executors are launched independently by whatever the
cluster manager (e.g. YARN) is configured to do.

The recommended way of setting these properties is either through the
conf/spark-defaults.conf properties file, or through the spark-submit
command line, e.g.:

bin/spark-shell --master yarn --executor-memory 2g --driver-memory 5g
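
Equivalently, in conf/spark-defaults.conf (the values here are just illustrative):

spark.driver.memory    5g
spark.executor.memory  2g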

Let me know if that answers your question,
-Andrew


2015-07-22 12:38 GMT-07:00 Michael Misiewicz :

> Hi group,
>
> I seem to have encountered a weird problem with 'spark-submit' and
> manually setting sparkconf values in my applications.
>
> It seems like setting the configuration values spark.executor.memory
> and spark.driver.memory don't have any effect, when they are set from
> within my application (i.e. prior to creating a SparkContext).
>
> In yarn-cluster mode, only the values specified on the command line via
> spark-submit for driver and executor memory are respected, and if not, it
> appears spark falls back to defaults. For example,
>
> Correct behavior noted in Driver's logs on YARN when --executor-memory is
> specified:
>
> 15/07/22 19:25:59 INFO yarn.YarnAllocator: Will request 200 executor 
> containers, each with 1 cores and 13824 MB memory including 1536 MB overhead
> 15/07/22 19:25:59 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
>
>
> But not when spark.executor.memory is specified prior to spark context 
> initialization:
>
> 15/07/22 19:22:22 INFO yarn.YarnAllocator: Will request 200 executor 
> containers, each with 1 cores and 2560 MB memory including 1536 MB overhead
> 15/07/22 19:22:22 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
>
>
> In both cases, executor mem should be 10g. Interestingly, I set a parameter 
> spark.yarn.executor.memoryOverhead which appears to be respected whether or 
> not I'm in yarn-cluster or yarn-client mode.
>
>
> Has anyone seen this before? Any idea what might be causing this behavior?
>
>


Re: No event logs in yarn-cluster mode

2015-08-01 Thread Andrew Or
Hi Akmal,

It might be on HDFS, since you provided the path /opt/spark/spark-events to
`spark.eventLog.dir` without a scheme; without a `file://` prefix it is
resolved against the default filesystem, which on a YARN cluster is typically HDFS.
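
If you want the logs on the local filesystem, you can make the scheme explicit (path taken from your config):

spark.eventLog.dir    file:///opt/spark/spark-events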

-Andrew

2015-08-01 9:25 GMT-07:00 Akmal Abbasov :

> Hi, I am trying to configure a history server for application.
> When I running locally(./run-example SparkPi), the event logs are being
> created, and I can start history server.
> But when I am trying
> ./spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-cluster file:///opt/hadoop/spark/examples/src/main/python/pi.py
> I am getting
> 15/08/01 18:18:50 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: 192.168.56.192
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1438445890676
> final status: SUCCEEDED
> tracking URL: http://sp-m1:8088/proxy/application_1438444529840_0009/A
> user: hadoop
> 15/08/01 18:18:50 INFO util.Utils: Shutdown hook called
> 15/08/01 18:18:50 INFO util.Utils: Deleting directory
> /tmp/spark-185f7b83-cb3b-4134-a10c-452366204f74
> So it is succeeded, but there is no event logs for this application.
>
> here are my configs
> spark-defaults.conf
> spark.master yarn-cluster
> spark.eventLog.dir   /opt/spark/spark-events
> spark.eventLog.enabled  true
>
> spark-env.sh
> export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
> -Dspark.deploy.zookeeper.url=“zk1:2181,zk2:2181”
> export
> SPARK_HISTORY_OPTS="-Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
> -Dspark.history.fs.logDirectory=file:/opt/spark/spark-events
> -Dspark.history.fs.cleaner.enabled=true"
>
> Any ideas?
>
> Thank you
>


Re: Spark master driver UI: How to keep it after process finished?

2015-08-08 Thread Andrew Or
Hi Saif,

You need to run your application with `spark.eventLog.enabled` set to true.
Then if you are using standalone mode, you can view the Master UI at port
8080. Otherwise, you may start a history server through
`sbin/start-history-server.sh`, which by default starts the history UI at
port 18080.
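
A minimal sketch of the setup, assuming standalone mode and a local log directory (paths are illustrative):

# conf/spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               file:///tmp/spark-events
spark.history.fs.logDirectory    file:///tmp/spark-events

# then start the history server and browse to http://<host>:18080
sbin/start-history-server.sh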

For more information on how to set this up, visit:
http://spark.apache.org/docs/latest/monitoring.html

-Andrew


2015-08-07 13:16 GMT-07:00 François Pelletier <
newslett...@francoispelletier.org>:

>
> look at
> spark.history.ui.port, if you use standalone
> spark.yarn.historyServer.address, if you use YARN
>
> in your Spark config file
>
> Mine is located at
> /etc/spark/conf/spark-defaults.conf
>
> If you use Apache Ambari you can find this settings in the Spark / Configs
> / Advanced spark-defaults tab
>
> François
>
>
> On 2015-08-07 15:58, saif.a.ell...@wellsfargo.com wrote:
>
> Hello, thank you, but that port is unreachable for me. Can you please
> share where can I find that port equivalent in my environment?
>
>
>
> Thank you
>
> Saif
>
>
>
> *From:* François Pelletier [mailto:newslett...@francoispelletier.org
> ]
> *Sent:* Friday, August 07, 2015 4:38 PM
> *To:* user@spark.apache.org
> *Subject:* Re: Spark master driver UI: How to keep it after process
> finished?
>
>
>
> Hi, all spark processes are saved in the Spark History Server
>
> look at your host on port 18080 instead of 4040
>
> François
>
> On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
>
> Hi,
>
>
>
> A silly question here. The Driver Web UI dies when the spark-submit
> program finishes. I would like some time to analyze after the program ends,
> as the page does not refresh itself, and when I hit F5 I lose all the info.
>
>
>
> Thanks,
>
> Saif
>
>
>
>
>
>
>


Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread Andrew Or
Hi Canan, TestSQLContext is no longer a singleton but now a class. It was
never meant to be a fully public API, but if you wish to use it you can
just instantiate a new one:

val sqlContext = new TestSQLContext

or just create a new SQLContext from a SparkContext.
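
For example, a minimal sketch of the latter (assuming you already have a SparkContext `sc` in scope):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)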

-Andrew

2015-08-15 20:33 GMT-07:00 canan chen :

> I am not sure other people's spark debugging environment ( I mean for the
> master branch) , Anyone can share his experience ?
>
>
> On Sun, Aug 16, 2015 at 10:40 AM, canan chen  wrote:
>
>> I import the spark source code to intellij, and want to run SparkPi in
>> intellij, but met the following weird compilation error. I googled it and
>> sbt clean doesn't work for me. I am not sure whether anyone else has meet
>> this issue also, any help is appreciated
>>
>> Error:scalac:
>>  while compiling:
>> /Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
>> during phase: jvm
>>  library version: version 2.10.4
>> compiler version: version 2.10.4
>>   reconstructed args: -nobootcp -javabootclasspath : -deprecation
>> -feature -classpath
>>
>
>


Re: Why standalone mode don't allow to set num-executor ?

2015-08-18 Thread Andrew Or
Hi Canan,

This is mainly for legacy reasons. The default behavior in standalone
mode is that the application grabs all available resources in the cluster.
This effectively means we want one executor per worker, where each executor
grabs all the available cores and memory on that worker. In this model, it
doesn't really make sense to express number of executors, because that's
equivalent to the number of workers.

In 1.4+, however, we do support multiple executors per worker, but that's
not the default so we decided not to add support for the --num-executors
setting to avoid potential confusion.
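
For example, to end up with the equivalent of 3 executors with 4 cores each on standalone (a sketch, assuming 1.4+ where spark.executor.cores is honored by the standalone scheduler; the numbers are illustrative):

bin/spark-submit --master spark://master:7077 \
  --total-executor-cores 12 \
  --conf spark.executor.cores=4 \
  ...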

-Andrew


2015-08-18 2:35 GMT-07:00 canan chen :

> num-executor only works for yarn mode. In standalone mode, I have to set
> the --total-executor-cores and --executor-cores. Isn't this way so
> intuitive ? Any reason for that ?
>


Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Andrew Or
Hi Axel,

You can try setting `spark.deploy.spreadOut` to false (through your
conf/spark-defaults.conf file). What this does is essentially try to
schedule as many cores on one worker as possible before spilling over to
other workers. Note that you *must* restart the cluster through the sbin
scripts.
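
Concretely, something like this (a sketch):

# conf/spark-defaults.conf on the node running the Master
spark.deploy.spreadOut   false

# then restart the cluster
sbin/stop-all.sh
sbin/start-all.sh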

For more information see:
http://spark.apache.org/docs/latest/spark-standalone.html.

Feel free to let me know whether it works,
-Andrew


2015-08-18 4:49 GMT-07:00 Igor Berman :

> by default standalone creates 1 executor on every worker machine per
> application
> number of overall cores is configured with --total-executor-cores
> so in general if you'll specify --total-executor-cores=1 then there would
> be only 1 core on some executor and you'll get what you want
>
> on the other hand, if your application needs all cores of your cluster and
> only some specific job should run on single executor there are few methods
> to achieve this
> e.g. coallesce(1) or dummyRddWithOnePartitionOnly.foreachPartition
>
>
> On 18 August 2015 at 01:36, Axel Dahl  wrote:
>
>> I have a 4 node cluster and have been playing around with the
>> num-executors parameters, executor-memory and executor-cores
>>
>> I set the following:
>> --executor-memory=10G
>> --num-executors=1
>> --executor-cores=8
>>
>> But when I run the job, I see that each worker, is running one executor
>> which has  2 cores and 2.5G memory.
>>
>> What I'd like to do instead is have Spark just allocate the job to a
>> single worker node?
>>
>> Is that possible in standalone mode or do I need a job/resource scheduler
>> like Yarn to do that?
>>
>> Thanks in advance,
>>
>> -Axel
>>
>>
>>
>


Re: Programmatically create SparkContext on YARN

2015-08-18 Thread Andrew Or
Hi Andreas,

I believe the distinction is not between standalone and YARN mode, but
between client and cluster mode.

In client mode, your Spark submit JVM runs your driver code. In cluster
mode, one of the workers (or NodeManagers if you're using YARN) in the
cluster runs your driver code. In the latter case, it doesn't really make
sense to call `setMaster` in your driver because Spark needs to know which
cluster you're submitting the application to.

Instead, the recommended way is to set the master through the `--master`
flag in the command line, e.g.

$ bin/spark-submit
--master spark://1.2.3.4:7077
--class some.user.Clazz
--name "My app name"
--jars lib1.jar,lib2.jar
--deploy-mode cluster
app.jar

Both YARN and standalone modes support client and cluster modes, and the
spark-submit script is the common interface through which you can launch
your application. In other words, you shouldn't have to do anything more
than providing a different value to `--master` to use YARN.
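
For example, the YARN equivalent of the command above would simply be (client mode shown; use --master yarn-cluster for cluster mode):

$ bin/spark-submit
--master yarn-client
--class some.user.Clazz
--name "My app name"
--jars lib1.jar,lib2.jar
app.jar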

-Andrew

2015-08-17 0:34 GMT-07:00 Andreas Fritzler :

> Hi all,
>
> when runnig the Spark cluster in standalone mode I am able to create the
> Spark context from Java via the following code snippet:
>
> SparkConf conf = new SparkConf()
>>.setAppName("MySparkApp")
>>.setMaster("spark://SPARK_MASTER:7077")
>>.setJars(jars);
>> JavaSparkContext sc = new JavaSparkContext(conf);
>
>
> As soon as I'm done with my processing, I can just close it via
>
>> sc.stop();
>>
> Now my question: Is the same also possible when running Spark on YARN? I
> currently don't see how this should be possible without submitting your
> application as a packaged jar file. Is there a way to get this kind of
> interactivity from within your Scala/Java code?
>
> Regards,
> Andrea
>


Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
Hi Muhammad,

On a high level, in hash-based shuffle each mapper M writes R shuffle
files, one for each reducer where R is the number of reduce partitions.
This results in M * R shuffle files. Since it is not uncommon for M and R
to be O(1000), this quickly becomes expensive. An optimization with
hash-based shuffle is consolidation, where all mappers running on the same core
C write one file per reducer, resulting in C * R files. This is a strict
improvement, but it is still relatively expensive.

Instead, in sort-based shuffle each mapper writes a single partitioned
file. This allows a particular reducer to request a specific portion of
each mapper's single output file. In more detail, the mapper first fills up
an internal buffer in memory and continually spills the contents of the
buffer to disk, then finally merges all the spilled files together to form
one final output file. This places much less stress on the file system and
requires far fewer I/O operations, especially on the read side.
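
To put rough numbers on it (purely illustrative): with M = R = 1000 and C = 64
concurrently running map tasks, hash-based shuffle writes M * R = 1,000,000
files, consolidation reduces that to C * R = 64,000, and sort-based shuffle
writes only M = 1,000 partitioned output files.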

-Andrew



2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <11besemja...@seecs.edu.pk>
:

> I did check it out and although I did get a general understanding of the
> various classes used to implement Sort and Hash shuffles, however these
> slides lack details as to how they are implemented and why sort generally
> has better performance than hash
>
> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran 
> wrote:
>
>> Have a look at this presentation.
>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>> of help to you.
>>
>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>> 11besemja...@seecs.edu.pk> wrote:
>>
>>> What are the major differences between how Sort based and Hash based
>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>> than Hash?
>>> Any talks that discuss both shuffles in detail, how they are implemented
>>> and the performance gains ?
>>>
>>
>>
>


Re: dse spark-submit multiple jars issue

2015-08-18 Thread Andrew Or
Hi Satish,

The problem is that `--jars` accepts a comma-delimited list of jars! E.g.

spark-submit ... --jars lib1.jar,lib2.jar,lib3.jar main.jar

where main.jar is your main application jar (the one that starts a
SparkContext), and lib*.jar refer to additional libraries that your main
application jar uses.

-Andrew

2015-08-13 3:22 GMT-07:00 Javier Domingo Cansino :

> Please notice that 'jars: null'
>
> I don't know why you put ///. but I would propose you just put normal
> absolute paths.
>
> dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
> --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar
> /home/missingmerch/dse.jar
> /home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar
> /home/missingmerch/etl-0.0.1-SNAPSHOT.jar
>
> Hope this is helpful!
>
> [image: Fon] Javier Domingo CansinoResearch &
> Development Engineer+34 946545847Skype: javier.domingo.fonAll information
> in this email is confidential 
>
> On Tue, Aug 11, 2015 at 3:42 PM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> HI,
>>
>> Please find the log details below:
>>
>>
>> dse spark-submit --verbose --master local --class HelloWorld
>> etl-0.0.1-SNAPSHOT.jar --jars
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>> file:/home/missingmerch/dse.jar
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>>
>> Using properties file: /etc/dse/spark/spark-defaults.conf
>>
>> Adding default property:
>> spark.cassandra.connection.factory=com.datastax.bdp.spark.DseCassandraConnectionFactory
>>
>> Adding default property: spark.ssl.keyStore=.keystore
>>
>> Adding default property: spark.ssl.enabled=false
>>
>> Adding default property: spark.ssl.trustStore=.truststore
>>
>> Adding default property:
>> spark.cassandra.auth.conf.factory=com.datastax.bdp.spark.DseAuthConfFactory
>>
>> Adding default property: spark.ssl.keyPassword=cassandra
>>
>> Adding default property: spark.ssl.keyStorePassword=cassandra
>>
>> Adding default property: spark.ssl.protocol=TLS
>>
>> Adding default property: spark.ssl.useNodeLocalConf=true
>>
>> Adding default property: spark.ssl.trustStorePassword=cassandra
>>
>> Adding default property:
>> spark.ssl.enabledAlgorithms=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
>>
>> Parsed arguments:
>>
>>   master  local
>>
>>   deployMode  null
>>
>>   executorMemory  null
>>
>>   executorCores   null
>>
>>   totalExecutorCores  null
>>
>>   propertiesFile  /etc/dse/spark/spark-defaults.conf
>>
>>   driverMemory512M
>>
>>   driverCores null
>>
>>   driverExtraClassPathnull
>>
>>   driverExtraLibraryPath  null
>>
>>   driverExtraJavaOptions  -Dcassandra.username=missingmerch
>> -Dcassandra.password=STMbrjrlb -XX:MaxPermSize=256M
>>
>>   supervise   false
>>
>>   queue   null
>>
>>   numExecutorsnull
>>
>>   files   null
>>
>>   pyFiles null
>>
>>   archivesnull
>>
>>   mainClass   HelloWorld
>>
>>   primaryResource file:/home/missingmerch/etl-0.0.1-SNAPSHOT.jar
>>
>>   nameHelloWorld
>>
>>   childArgs   [--jars
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>> file:/home/missingmerch/dse.jar
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar]
>>
>>   jarsnull
>>
>>   verbose true
>>
>>
>>
>> Spark properties used, including those specified through
>>
>> --conf and those from the properties file
>> /etc/dse/spark/spark-defaults.conf:
>>
>>   spark.cassandra.connection.factory ->
>> com.datastax.bdp.spark.DseCassandraConnectionFactory
>>
>>   spark.ssl.useNodeLocalConf -> true
>>
>>   spark.ssl.enabled -> false
>>
>>   spark.executor.extraJavaOptions -> -XX:MaxPermSize=256M
>>
>>   spark.ssl.keyStore -> .keystore
>>
>>   spark.ssl.trustStore -> .truststore
>>
>>   spark.ssl.trustStorePassword -> cassandra
>>
>>   spark.ssl.enabledAlgorithms ->
>> TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
>>
>>   spark.cassandra.auth.conf.factory ->
>> com.datastax.bdp.spark.DseAuthConfFactory
>>
>>   spark.ssl.protocol -> TLS
>>
>>   spark.ssl.keyPassword -> cassandra
>>
>>   spark.ssl.keyStorePassword -> cassandra
>>
>>
>>
>>
>>
>> Main class:
>>
>> HelloWorld
>>
>> Arguments:
>>
>> --jars
>>
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>>
>> file:/home/missingmerch/dse.jar
>>
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>>
>> System properties:
>>
>> spark.cassandra.connection.factory ->
>> com.datastax.bdp.spark.DseCassandraConnectionFactory
>>
>> spark.driver.memory -> 512M
>>
>> spark.ssl.useNodeLocalConf -> true
>>
>> spark.ssl.enabled -> false
>>
>> SPARK_SUBMIT -> true
>>
>> spark.executor.extraJavaOptions -> -XX:MaxPermSize=256M
>>
>> spark.app.name -> HelloWorld
>>
>> spark.ssl.enable

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
Yes, in other words, a "bucket" is a single file in hash-based shuffle (no
consolidation), but a segment of a partitioned file in sort-based shuffle.

2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed <11besemja...@seecs.edu.pk>:

> Thanks Andrew for a detailed response,
>
> So the reason why key value pairs with same keys are always found in a
> single buckets in Hash based shuffle but not in Sort is because in
> sort-shuffle each mapper writes a single partitioned file, and it is up to
> the reducer to fetch correct partitions from the the files ?
>
> On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or  wrote:
>
>> Hi Muhammad,
>>
>> On a high level, in hash-based shuffle each mapper M writes R shuffle
>> files, one for each reducer where R is the number of reduce partitions.
>> This results in M * R shuffle files. Since it is not uncommon for M and R
>> to be O(1000), this quickly becomes expensive. An optimization with
>> hash-based shuffle is consolidation, where all mappers run in the same core
>> C write one file per reducer, resulting in C * R files. This is a strict
>> improvement, but it is still relatively expensive.
>>
>> Instead, in sort-based shuffle each mapper writes a single partitioned
>> file. This allows a particular reducer to request a specific portion of
>> each mapper's single output file. In more detail, the mapper first fills up
>> an internal buffer in memory and continually spills the contents of the
>> buffer to disk, then finally merges all the spilled files together to form
>> one final output file. This places much less stress on the file system and
>> requires much fewer I/O operations especially on the read side.
>>
>> -Andrew
>>
>>
>>
>> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
>> 11besemja...@seecs.edu.pk>:
>>
>>> I did check it out and although I did get a general understanding of the
>>> various classes used to implement Sort and Hash shuffles, however these
>>> slides lack details as to how they are implemented and why sort generally
>>> has better performance than hash
>>>
>>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran 
>>> wrote:
>>>
>>>> Have a look at this presentation.
>>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>>>> of help to you.
>>>>
>>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>>> 11besemja...@seecs.edu.pk> wrote:
>>>>
>>>>> What are the major differences between how Sort based and Hash based
>>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>>> than Hash?
>>>>> Any talks that discuss both shuffles in detail, how they are
>>>>> implemented and the performance gains ?
>>>>>
>>>>
>>>>
>>>
>>
>


Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Andrew Or
Hi Axel, what spark version are you using? Also, what do your
configurations look like for the following?

- spark.cores.max (also --total-executor-cores)
- spark.executor.cores (also --executor-cores)


2015-08-19 9:27 GMT-07:00 Axel Dahl :

> hmm maybe I spoke too soon.
>
> I have an apache zeppelin instance running and have configured it to use
> 48 cores (each node only has 16 cores), so I figured by setting it to 48,
> would mean that spark would grab 3 nodes.  what happens instead though is
> that spark, reports that 48 cores are being used, but only executes
> everything on 1 node, it looks like it's not grabbing the extra nodes.
>
> On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl  wrote:
>
>> That worked great, thanks Andrew.
>>
>> On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or  wrote:
>>
>>> Hi Axel,
>>>
>>> You can try setting `spark.deploy.spreadOut` to false (through your
>>> conf/spark-defaults.conf file). What this does is essentially try to
>>> schedule as many cores on one worker as possible before spilling over to
>>> other workers. Note that you *must* restart the cluster through the sbin
>>> scripts.
>>>
>>> For more information see:
>>> http://spark.apache.org/docs/latest/spark-standalone.html.
>>>
>>> Feel free to let me know whether it works,
>>> -Andrew
>>>
>>>
>>> 2015-08-18 4:49 GMT-07:00 Igor Berman :
>>>
>>>> by default standalone creates 1 executor on every worker machine per
>>>> application
>>>> number of overall cores is configured with --total-executor-cores
>>>> so in general if you'll specify --total-executor-cores=1 then there
>>>> would be only 1 core on some executor and you'll get what you want
>>>>
>>>> on the other hand, if your application needs all cores of your cluster
>>>> and only some specific job should run on single executor there are few
>>>> methods to achieve this
>>>> e.g. coallesce(1) or dummyRddWithOnePartitionOnly.foreachPartition
>>>>
>>>>
>>>> On 18 August 2015 at 01:36, Axel Dahl  wrote:
>>>>
>>>>> I have a 4 node cluster and have been playing around with the
>>>>> num-executors parameters, executor-memory and executor-cores
>>>>>
>>>>> I set the following:
>>>>> --executor-memory=10G
>>>>> --num-executors=1
>>>>> --executor-cores=8
>>>>>
>>>>> But when I run the job, I see that each worker, is running one
>>>>> executor which has  2 cores and 2.5G memory.
>>>>>
>>>>> What I'd like to do instead is have Spark just allocate the job to a
>>>>> single worker node?
>>>>>
>>>>> Is that possible in standalone mode or do I need a job/resource
>>>>> scheduler like Yarn to do that?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> -Axel
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread Andrew Or
Hi Canan,

The event log dir is a per-application setting whereas the history server
is an independent service that serves history UIs from many applications.
If you run the history server locally, then `spark.history.fs.logDirectory`
will happen to point to the same location as `spark.eventLog.dir`, but the use
case it serves is broader than that.

-Andrew

2015-08-19 5:13 GMT-07:00 canan chen :

> Anyone know about this ? Or do I miss something here ?
>
> On Fri, Aug 7, 2015 at 4:20 PM, canan chen  wrote:
>
>> Is there any reason that historyserver use another property for the event
>> log dir ? Thanks
>>
>
>


Re: DAG related query

2015-08-20 Thread Andrew Or
Hi Bahubali,

Once RDDs are created, they are immutable (in most cases). In your case you
end up with 3 RDDs:

(1) the original rdd1 that reads from the text file
(2) rdd2, that applies a map function on (1), and
(3) the new rdd1 that applies a map function on (2)

There's no cycle because you have 3 distinct RDDs. All you're doing is
reassigning a reference `rdd1`, but the underlying RDD doesn't change.
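
Spelling that out (a sketch in Scala for brevity; the transformations are made up):

val rddA = context.textFile("input.txt")   // what `rdd1` initially refers to
val rddB = rddA.map(line => line.length)   // your `rdd2`
val rddC = rddB.map(len => len * 2)        // what `rdd1` refers to after reassignment

The lineage is a straight line rddA -> rddB -> rddC; reassigning the variable
never adds an edge back to the first RDD.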

-Andrew

2015-08-20 6:21 GMT-07:00 Sean Owen :

> No. The third line creates a third RDD whose reference simply replaces
> the reference to the first RDD in your local driver program. The first
> RDD still exists.
>
> On Thu, Aug 20, 2015 at 2:15 PM, Bahubali Jain  wrote:
> > Hi,
> > How would the DAG look like for the below code
> >
> > JavaRDD rdd1 = context.textFile();
> > JavaRDD rdd2 = rdd1.map();
> > rdd1 =  rdd2.map();
> >
> > Does this lead to any kind of cycle?
> >
> > Thanks,
> > Baahu
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark ec2 lunch problem

2015-08-24 Thread Andrew Or
Hey Garry,

Have you verified that your particular VPC and subnet are open to the
world? In particular, have you verified the route table attached to your
VPC / subnet contains an internet gateway open to the public?

I've run into this issue myself recently and that was the problem for me.

-Andrew

2015-08-24 5:58 GMT-07:00 Robin East :

> spark-ec2 is the way to go however you may need to debug connectivity
> issues. For example do you know that the servers were correctly setup in
> AWS and can you access each node using ssh? If no then you need to work out
> why (it’s not a spark issue). If yes then you will need to work out why ssh
> via the spark-ec2 script is not working.
>
> I’ve used spark-ec2 successfully many times but have never used the
> —vpc-id and —subnet-id options and that may be the source of your problems,
> especially since it appears to be a hostname resolution issue. If you could
> confirm the above questions then maybe someone on the list can help
> diagnose the specific problem.
>
>
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/malak/
>
> On 24 Aug 2015, at 13:45, Garry Chen  wrote:
>
> So what is the best way to deploy spark cluster in EC2 environment any
> suggestions?
>
> Garry
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com
> ]
> *Sent:* Friday, August 21, 2015 4:27 PM
> *To:* Garry Chen 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark ec2 lunch problem
>
>
> It may happen that the version of spark-ec2 script you are using is buggy
> or sometime AWS have problem provisioning machines.
> On Aug 21, 2015 7:56 AM, "Garry Chen"  wrote:
>
> Hi All,
> I am trying to lunch a spark ec2 cluster by running
>  spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc
> --subnet-id=subnet-011 --spark-version=1.4.1 launch spark-cluster but
> getting following message endless.  Please help.
>
>
> Warning: SSH connection error. (This could be temporary.)
> Host:
> SSH return code: 255
> SSH output: ssh: Could not resolve hostname : Name or service not known
>
>
>


Re: Why are executors on slave never used?

2015-09-21 Thread Andrew Or
Hi Joshua,

What cluster manager are you using, standalone or YARN? (Note that
standalone here does not mean local mode).

If standalone, you need to do `setMaster("spark://[CLUSTER_URL]:7077")`,
where CLUSTER_URL is the machine that started the standalone Master. If
YARN, you need to do `setMaster("yarn")`, assuming that all the Hadoop
configuration files such as core-site.xml are already set up properly.

-Andrew


2015-09-21 8:53 GMT-07:00 Hemant Bhanawat :

> When you specify master as local[2], it starts the spark components in a
> single jvm. You need to specify the master correctly.
> I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I
> run a Spark process, it works fine -- but only on the master, as if it were
> standalone.
>
> The web-UI and logging code shows only 1 executor, the localhost.
>
> How can I diagnose this?
>
> (I create *SparkConf, *in Python, with *setMaster('local[2]'). )*
>
> (Strangely, though I don't think that this causes the problem, there is
> almost nothing spark-related on the slave machine:* /usr/lib/spark *has a
> few jars, but that's it:  *datanucleus-api-jdo.jar  datanucleus-core.jar
>  datanucleus-rdbms.jar  spark-yarn-shuffle.jar. *But this is an AWS EMR
> cluster as created by* create-cluster*, so I would assume that the slave
> and master are configured OK out-of the box.)
>
> Joshua
>


Re: PySpark Client

2015-01-20 Thread Andrew Or
Hi Chris,

Short answer is no, not yet.

Longer answer is that PySpark only supports client mode, which means your
driver runs on the same machine as your submission client. By corollary
this means your submission client must currently depend on all of Spark and
its dependencies. There is a patch that supports this for *cluster* mode
(as opposed to client mode), which would be the first step towards what you
want.

-Andrew

2015-01-20 8:36 GMT-08:00 Chris Beavers :

> Hey all,
>
> Is there any notion of a lightweight python client for submitting jobs to
> a Spark cluster remotely? If I essentially install Spark on the client
> machine, and that machine has the same OS, same version of Python, etc.,
> then I'm able to communicate with the cluster just fine. But if Python
> versions differ slightly, then I start to see a lot of opaque errors that
> often bubble up as EOFExceptions. Furthermore, this just seems like a very
> heavy weight way to set up a client.
>
> Does anyone have any suggestions for setting up a thin pyspark client on a
> node which doesn't necessarily conform to the homogeneity of the target
> Spark cluster?
>
> Best,
> Chris
>


Re: spark-submit --py-files remote: "Only local additional python files are supported"

2015-01-20 Thread Andrew Or
Hi Vladimir,

Yes, as the error message suggests, PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your python files must be on your local file system.
Until remote files are supported, the straightforward workaround is to just
copy the files to your local machine.
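
For example, something along these lines (a sketch; the bucket and file names are taken from your message, and yarn-client is used because cluster mode is not yet supported for Python):

aws s3 cp s3://pathtomybucket/mylibrary.py /home/hadoop/
./bin/spark-submit --master yarn-client --py-files /home/hadoop/mylibrary.py main.py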

-Andrew

2015-01-20 7:38 GMT-08:00 Vladimir Grigor :

> Hi all!
>
> I found this problem when I tried running python application on Amazon's
> EMR yarn cluster.
>
> It is possible to run bundled example applications on EMR but I cannot
> figure out how to run a little bit more complex python application which
> depends on some other python scripts. I tried adding those files with
> '--py-files' and it works fine in local mode but it fails and gives me
> following message when run in EMR:
> "Error: Only local python files are supported:
> s3://pathtomybucket/mylibrary.py".
>
> Simplest way to reproduce in local:
> bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
>
> Actual commands to run it in EMR
> #launch cluster
> aws emr create-cluster --name SparkCluster --ami-version 3.3.1
> --instance-type m1.medium --instance-count 2  --ec2-attributes
> KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
> --enable-debugging --use-default-roles  --bootstrap-action
> Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=["-s","
> http://pathtomybucket/bootstrap-actions/spark
> ","-l","WARN","-v","1.2","-b","2014121700","-x"]
> #{
> #   "ClusterId": "j-2Y58DME79MPQJ"
> #}
>
> #run application
> aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
> ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
> #{
> #"StepIds": [
> #"s-2UP4PP75YX0KU"
> #]
> #}
> And in stderr of that step I get "Error: Only local python files are
> supported: s3://pathtomybucket/tasks/demo/main.py".
>
> What is the workaround or correct way to do it? Using hadoop's distcp to
> copy dependency files from s3 to nodes as another pre-step?
>
> Regards, Vladimir
>


Re: Aggregate order semantics when spilling

2015-01-20 Thread Andrew Or
Hi Justin,

I believe the intended semantics of groupByKey or cogroup is that the
ordering *within a key* is not preserved if you spill. In fact, the test
cases for the ExternalAppendOnlyMap only assert that the Set representation
of the results is as expected (see this line).
This is because these Spark primitives literally just group the values by a
key but do not provide any ordering guarantees.

However, if ordering within a key is a requirement for your application,
then you may need to write your own PairRDDFunction that calls
combineByKey. You can model your method after groupByKey, but change the
combiner function slightly to take ordering into account. This may add some
overhead to your application since you need to insert every value in the
appropriate place, but since you're spilling anyway the overhead will
likely be shadowed by disk I/O.
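
If you don't strictly need the streamed order and can carry an explicit sort
key (e.g. your date field) with each value, a simpler way to sidestep the
spill-order issue entirely is to group with combineByKey and sort each key's
values at the end. A minimal sketch (the method name and types are made up,
not an existing Spark API):

import scala.reflect.ClassTag
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Group values by key, then restore the per-key order from the explicit
// sort key carried with each value, so the merge order of spilled
// combiners no longer matters.
def groupByKeyOrdered[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, (Long, V))]): RDD[(K, Seq[(Long, V)])] = {
  rdd.combineByKey(
    (v: (Long, V)) => ArrayBuffer(v),
    (buf: ArrayBuffer[(Long, V)], v: (Long, V)) => buf += v,
    (b1: ArrayBuffer[(Long, V)], b2: ArrayBuffer[(Long, V)]) => b1 ++= b2
  ).mapValues(_.sortBy(_._1).toSeq)
}

This trades the in-order merge for one sort per key, which is likely dwarfed
by the spill I/O anyway.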

Let me know if that works.
-Andrew


2015-01-20 9:18 GMT-08:00 Justin Uang :

> Hi,
>
> I am trying to aggregate a key based on some timestamp, and I believe that
> spilling to disk is changing the order of the data fed into the combiner.
>
> I have some timeseries data that is of the form: ("key", "date", "other
> data")
>
> Partition 1
> ("A", 2, ...)
> ("B", 4, ...)
> ("A", 1, ...)
> ("A", 3, ...)
> ("B", 6, ...)
>
> which I then partition by key, then sort within the partition:
>
> Partition 1
> ("A", 1, ...)
> ("A", 2, ...)
> ("A", 3, ...)
> ("A", 4, ...)
>
> Partition 2
> ("B", 4, ...)
> ("B", 6, ...)
>
> If I run a combineByKey with the same partitioner, then the items for each
> key will be fed into the ExternalAppendOnlyMap in the correct order.
> However, if I spill, then the time slices are spilled to disk as multiple
> partial combiners. When its time to merge the spilled combiners for each
> key, the combiners are combined in the wrong order.
>
> For example, if during a groupByKey, [("A", 1, ...), ("A", 2...)] and
> [("A", 3, ...), ("A", 4, ...)] are spilled separately, it's possible that
> the combiners can be combined in the wrong order, like [("A", 3, ...),
> ("A", 4, ...), ("A", 1, ...), ("A", 2, ...)], which invalidates the
> invariant that all the values for A are passed in order to the combiners.
>
> I'm not an expert, but I suspect that this is because we use a heap
> ordered by key when iterating, which doesn't retain the order the spilled
> combiners. Perhaps we can order our mergeHeap by (hash_key, spill_index),
> where spill_index is incremented each time we spill? This would mean that
> we would pop and merge the combiners of each key in order, resulting in
> [("A", 1, ...), ("A", 2, ...), ("A", 3, ...), ("A", 4, ...)].
>
> Thanks in advance for the help! If there is a way to do this already in
> Spark 1.2, can someone point it out to me?
>
> Best,
>
> Justin
>


Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-11 Thread Andrew Or
Hi Jianshi,

For YARN, there may be an issue with how a recent patch changes the
accessibility of the shuffle files by the external shuffle service:
https://issues.apache.org/jira/browse/SPARK-5655. It is likely that you
will hit this with 1.2.1, actually. For this reason I would have to
recommend that you use 1.2.2 when it is released, but for now you should
use 1.2.0 for this specific use case.

-Andrew

2015-02-10 23:38 GMT-08:00 Reynold Xin :

> I think we made the binary protocol compatible across all versions, so you
> should be fine with using any one of them. 1.2.1 is probably the best since
> it is the most recent stable release.
>
> On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang 
> wrote:
>
>> Hi,
>>
>> I need to use branch-1.2 and sometimes master builds of Spark for my
>> project. However the officially supported Spark version by our Hadoop admin
>> is only 1.2.0.
>>
>> So, my question is which version/build of spark-yarn-shuffle.jar should I
>> use that works for all four versions? (1.2.0, 1.2.1, 1.2.2, 1.3.0)
>>
>> Thanks,
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Andrew Or
>> I asked several people, no one seems to believe that we can do this:
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark

That is because people usually don't package python files into their jars.
For pyspark, however, this will work as long as the jar can be opened and
its contents can be read. In my experience, if I am able to import the
pyspark module by explicitly specifying the PYTHONPATH this way, then I can
run pyspark on YARN without fail.

>> > OK, my colleague found this:
>> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
>> >
>> > And my jar file has 70011 files. Fantastic..

It seems that this problem is not specific to running Java 6 on a Java 7
jar. We definitely need to document and warn against Java 7 jars more
aggressively. For now, please do try building the jar with Java 6.



2014-06-03 4:42 GMT+02:00 Patrick Wendell :

> Yeah we need to add a build warning to the Maven build. Would you be
> able to try compiling Spark with Java 6? It would be good to narrow
> down if you hare hitting this problem or something else.
>
> On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen  wrote:
> > Nope... didn't try java 6. The standard installation guide didn't say
> > anything about java 7 and suggested to do "-DskipTests" for the build..
> > http://spark.apache.org/docs/latest/building-with-maven.html
> >
> > So, I didn't see the warning message...
> >
> >
> > On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell 
> wrote:
> >>
> >> Are you building Spark with Java 6 or Java 7. Java 6 uses the extended
> >> Zip format and Java 7 uses Zip64. I think we've tried to add some
> >> build warnings if Java 7 is used, for this reason:
> >>
> >> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
> >>
> >> Any luck if you use JDK 6 to compile?
> >>
> >>
> >> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen 
> >> wrote:
> >> > OK, my colleague found this:
> >> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
> >> >
> >> > And my jar file has 70011 files. Fantastic..
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen 
> >> > wrote:
> >> >>
> >> >> I asked several people, no one seems to believe that we can do this:
> >> >> $ PYTHONPATH=/path/to/assembly/jar python
> >> >> >>> import pyspark
> >> >>
> >> >> This following pull request did mention something about generating a
> >> >> zip
> >> >> file for all python related modules:
> >> >> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
> >> >>
> >> >> I've tested that zipped modules can as least be imported via
> zipimport.
> >> >>
> >> >> Any ideas?
> >> >>
> >> >> -Simon
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or 
> >> >> wrote:
> >> >>>
> >> >>> Hi Simon,
> >> >>>
> >> >>> You shouldn't have to install pyspark on every worker node. In YARN
> >> >>> mode,
> >> >>> pyspark is packaged into your assembly jar and shipped to your
> >> >>> executors
> >> >>> automatically. This seems like a more general problem. There are a
> few
> >> >>> things to try:
> >> >>>
> >> >>> 1) Run a simple pyspark shell with yarn-client, and do
> >> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
> >> >>> 2) If so, check if your assembly jar is compiled correctly. Run
> >> >>>
> >> >>> $ jar -tf  pyspark
> >> >>> $ jar -tf  py4j
> >> >>>
> >> >>> to see if the files are there. For Py4j, you need both the python
> >> >>> files
> >> >>> and the Java class files.
> >> >>>
> >> >>> 3) If the files are there, try running a simple python shell (not
> >> >>> pyspark
> >> >>> shell) with the assembly jar on the PYTHONPATH:
> >> >>>
> >> >>> $ PYTHONPATH=/path/to/assembly/jar python
> >> >>> >>> import pyspark
> >

Re: Spark Logging

2014-06-10 Thread Andrew Or
You can mix in org.apache.spark.Logging and use logInfo, logWarning, etc.
Besides viewing them from the Web console, the location of the logs can be
found under $SPARK_HOME/logs, on both the driver and executor machines. (If
you are on YARN, these logs are located elsewhere, however.)
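
A minimal sketch (assuming Spark 1.x, where the Logging trait is visible to applications; class and messages are made up):

import org.apache.spark.Logging

class MyJob extends Logging {
  def run(): Unit = {
    logInfo("starting job")
    logWarning("something looks off")
  }
}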


2014-06-10 11:39 GMT-07:00 Robert James :

> How can I write to Spark's logs from my client code?
> What are the options to view those logs?
> Besides the Web console, is there a way to read and grep the file?
>


Re: Can't find pyspark when using PySpark on YARN

2014-06-10 Thread Andrew Or
Hi Qi Ping,

You don't have to distribute these files; they are automatically packaged
in the assembly jar, which is already shipped to the worker nodes.

Other people have run into the same issue. See if the instructions here are
of any help:
http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3ccamjob8mr1+ias-sldz_rfrke_na2uubnmhrac4nukqyqnun...@mail.gmail.com%3e

As described in the link, the last resort is to try building your assembly
jar with JAVA_HOME set to Java 6. This usually fixes the problem (more
details in the link provided).

Cheers,
Andrew


2014-06-10 6:35 GMT-07:00 李奇平 :

> Dear all,
>
> When I submit a pyspark application using this command:
>
> ./bin/spark-submit --master yarn-client
> examples/src/main/python/wordcount.py "hdfs://..."
>
> I get the following exception:
>
> Error from python worker:
> Traceback (most recent call last):
> File "/usr/ali/lib/python2.5/runpy.py", line 85, in run_module
> loader = get_loader(mod_name)
> File "/usr/ali/lib/python2.5/pkgutil.py", line 456, in get_loader
> return find_loader(fullname)
> File "/usr/ali/lib/python2.5/pkgutil.py", line 466, in find_loader
> for importer in iter_importers(fullname):
> File "/usr/ali/lib/python2.5/pkgutil.py", line 422, in iter_importers
> __import__(pkg)
> ImportError: No module named pyspark
> PYTHONPATH was:
>
> /home/xxx/spark/python:/home/xxx/spark_on_yarn/python/lib/py4j-0.8.1-src.zip:/disk11/mapred/tmp/usercache//filecache/11/spark-assembly-1.0.0-hadoop2.0.0-ydh2.0.0.jar
>
> Maybe `pyspark/python` and `py4j-0.8.1-src.zip` is not included in the
> YARN worker,
> How can I distribute these files with my application? Can I use `--pyfiles
> python.zip, py4j-0.8.1-src.zip `?
> Or how can I package modules in pyspark to a .egg file?
>
>
>
>


Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
Can you try file:/root/spark_log?


2014-06-10 19:22 GMT-07:00 zhen :

> I checked the permission on root and it is the following:
>
> drwxr-xr-x 20 root root  4096 Jun 11 01:05 root
>
> So anyway, I changed to use /tmp/spark_log instead and this time I made
> sure
> that all permissions are given to /tmp and /tmp/spark_log like below. But
> it
> still does not work:
>
> drwxrwxrwt  8 root root  4096 Jun 11 02:08 tmp
> drwxrwxrwx 2 root root   4096 Jun 11 02:08 spark_log
>
> Thanks
>
> Zhen
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/problem-starting-the-history-server-on-EC2-tp7361p7370.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
No, I meant pass the path to the history server start script.


2014-06-10 19:33 GMT-07:00 zhen :

> Sure here it is:
>
> drwxrwxrwx  2 1000 root 4096 Jun 11 01:05 spark_logs
>
> Zhen
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/problem-starting-the-history-server-on-EC2-tp7361p7373.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Andrew Or
This is a known issue: https://issues.apache.org/jira/browse/SPARK-1919. We
haven't found a fix yet, but for now, you can work around this by including
your simple class in your application jar.


2014-06-11 10:25 GMT-07:00 Ulanov, Alexander :

>  Hi,
>
>
>
> I am currently using spark 1.0 locally on Windows 7. I would like to use
> classes from external jar in the spark-shell. I followed the instruction
> in:
> http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCALrNVjWWF6k=c_jrhoe9w_qaacjld4+kbduhfv0pitr8h1f...@mail.gmail.com%3E
>
>
>
> I have set ADD_JARS=”my.jar” SPARK_CLASSPATH=”my.jar” in spark-shell.cmd
> but this didn’t work.
>
>
>
> I also tried running “spark-shell.cmd --jars my.jar --driver-class-path
> my.jar --driver-library-path my.jar” and it didn’t work either.
>
>
>
> I cannot load any class from my jar into spark shell. Btw my.jar contains
> a simple Scala class.
>
>
>
> Best regards, Alexander
>


Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Andrew Or
Ah, of course, there are no application jars in spark-shell, so it seems
that there are no workarounds for this at the moment. We will look into a
fix shortly, but for now you will have to create an application and use
spark-submit (or use spark-shell on Linux).


2014-06-11 10:42 GMT-07:00 Ulanov, Alexander :

>  Could you elaborate on this? I don’t have an application, I just use
> spark shell.
>
>
>
> *From:* Andrew Or [mailto:and...@databricks.com]
> *Sent:* Wednesday, June 11, 2014 9:40 PM
>
> *To:* user@spark.apache.org
> *Subject:* Re: Adding external jar to spark-shell classpath in spark 1.0
>
>
>
> This is a known issue: https://issues.apache.org/jira/browse/SPARK-1919.
> We haven't found a fix yet, but for now, you can workaround this by
> including your simple class in your application jar.
>
>
>
> 2014-06-11 10:25 GMT-07:00 Ulanov, Alexander :
>
>  Hi,
>
>
>
> I am currently using spark 1.0 locally on Windows 7. I would like to use
> classes from external jar in the spark-shell. I followed the instruction
> in:
> http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCALrNVjWWF6k=c_jrhoe9w_qaacjld4+kbduhfv0pitr8h1f...@mail.gmail.com%3E
>
>
>
> I have set ADD_JARS=”my.jar” SPARK_CLASSPATH=”my.jar” in spark-shell.cmd
> but this didn’t work.
>
>
>
> I also tried running “spark-shell.cmd --jars my.jar --driver-class-path
> my.jar --driver-library-path my.jar” and it didn’t work either.
>
>
>
> I cannot load any class from my jar into spark shell. Btw my.jar contains
> a simple Scala class.
>
>
>
> Best regards, Alexander
>
>
>


Re: Spark 1.0.0 Standalone AppClient cannot connect Master

2014-06-12 Thread Andrew Or
Hi Wang Hao,

This is not removed. We moved it here:
http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
If you're building with SBT, and you don't specify the
SPARK_HADOOP_VERSION, then it defaults to 1.0.4.
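
For example, to build against Hadoop 2.x with YARN support, something like (check the page above for the exact variables for your version):

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly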

Andrew


2014-06-12 6:24 GMT-07:00 Hao Wang :

> Hi, all
>
> Why does the Spark 1.0.0 official doc remove how to build Spark with
> corresponding Hadoop version?
>
> It means that if I don't need to specify the Hadoop version with I build
> my Spark 1.0.0 with `sbt/sbt assembly`?
>
>
> Regards,
> Wang Hao(王灏)
>
> CloudTeam | School of Software Engineering
> Shanghai Jiao Tong University
> Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
> Email:wh.s...@gmail.com
>


Re: use spark-shell in the source

2014-06-12 Thread Andrew Or
Not sure if this is what you're looking for, but have you looked at Java's
ProcessBuilder? You can do something like

for (line <- lines) {
  val command = line.split(" ")  // you may need to deal with quoted strings
  val builder = new ProcessBuilder(command: _*)  // varargs, so splat the array
  builder.inheritIO()  // redirect the child's output to this JVM's stdout/stderr
  val process = builder.start()
  process.waitFor()  // wait for this command to finish before running the next
}

Or are you trying to launch an interactive REPL in the middle of your
application?


2014-06-11 22:56 GMT-07:00 JaeBoo Jung :

>  Hi all,
>
>
>
> Can I use spark-shell programmatically in my spark application(in java or
> scala)?
>
> Because I want to convert scala lines to string array and run
> automatically in my application.
>
> For example,
>
> for( var line <- lines){
>
> //run this line in spark shell style and get outputs.
>
> run(line);
>
> }
>
> Thanks
>
> _
>
> *JaeBoo, Jung*
> Assistant Engineer / BDA Lab / Samsung SDS
>


Re: spark master UI does not keep detailed application history

2014-06-16 Thread Andrew Or
Are you referring to accessing a SparkUI for an application that has
finished? First you need to enable event logging while the application is
still running. In Spark 1.0, you set this by adding a line to
$SPARK_HOME/conf/spark-defaults.conf:

spark.eventLog.enabled true

Other than that, the content served on the master UI is largely the same as
before 1.0 is introduced.


2014-06-14 16:43 GMT-07:00 wxhsdp :

> hi, zhen
>   i met the same problem in ec2, application details can not be accessed.
> but i can read stdout
>   and stderr. the problem has not been solved yet
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-master-UI-does-not-keep-detailed-application-history-tp7608p7635.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Andrew Or
Standalone-client mode is not officially supported at the moment. For
standalone-cluster and yarn-client modes, however, they should work.

For both modes, are you running spark-submit from within the cluster, or
outside of it? If the latter, could you try running it from within the
cluster and see if it works? (Does your rtgraph.jar exist on the machine
from which you run spark-submit?)


2014-06-17 2:41 GMT-07:00 Jianshi Huang :

> Hi,
>
> I've stuck using either yarn-client or standalone-client mode. Either will
> stuck when I submit jobs, the last messages it printed were:
>
> ...
> 14/06/17 02:37:17 INFO spark.SparkContext: Added JAR
> file:/x/home/jianshuang/tmp/lib/commons-vfs2.jar at
> http://10.196.195.25:56377/jars/commons-vfs2.jar with timestamp
> 1402997837065
> 14/06/17 02:37:17 INFO spark.SparkContext: Added JAR
> file:/x/home/jianshuang/tmp/rtgraph.jar at
> http://10.196.195.25:56377/jars/rtgraph.jar with timestamp 1402997837065
> 14/06/17 02:37:17 INFO cluster.YarnClusterScheduler: Created
> YarnClusterScheduler
> 14/06/17 02:37:17 INFO yarn.ApplicationMaster$$anon$1: Adding shutdown
> hook for context org.apache.spark.SparkContext@6655cf60
>
> I can use yarn-cluster to run my app but it's not very convenient to
> monitor the progress.
>
> Standalone-cluster mode doesn't work, it reports file not found error:
>
> Driver successfully submitted as driver-20140617023956-0003
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20140617023956-0003 is ERROR
> Exception from cluster was: java.io.FileNotFoundException: File
> file:/x/home/jianshuang/tmp/rtgraph.jar does not exist
>
>
> I'm using Spark 1.0.0 and my submit command looks like this:
>
>   ~/spark/spark-1.0.0-hadoop2.4.0/bin/spark-submit --name 'rtgraph'
> --class com.paypal.rtgraph.demo.MapReduceWriter --master spark://
> lvshdc5en0015.lvs.paypal.com:7077 --jars `find lib -type f | tr '\n' ','`
> --executor-memory 20G --total-executor-cores 96 --deploy-mode cluster
> rtgraph.jar
>
> List of jars I put in --jars option are:
>
> accumulo-core.jar
> accumulo-fate.jar
> accumulo-minicluster.jar
> accumulo-trace.jar
> accumulo-tracer.jar
> chill_2.10-0.3.6.jar
> commons-math.jar
> commons-vfs2.jar
> config-1.2.1.jar
> gson.jar
> guava.jar
> joda-convert-1.2.jar
> joda-time-2.3.jar
> kryo-2.21.jar
> libthrift.jar
> quasiquotes_2.10-2.0.0-M8.jar
> scala-async_2.10-0.9.1.jar
> scala-library-2.10.4.jar
> scala-reflect-2.10.4.jar
>
>
> Anyone has hint what went wrong? Really confused.
>
>
> Cheers,
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>


Re: join operation is taking too much time

2014-06-17 Thread Andrew Or
How long does it get stuck for? This is a common sign of the OS thrashing
due to memory pressure. If you keep it running longer, does it
throw an error?

Depending on how large your other RDD is (and your join operation), memory
pressure may or may not be the problem at all. It could be that spilling
your shuffles
to disk is slowing you down (but probably shouldn't hang your application).
For the 5 RDDs case, what happens if you set spark.shuffle.spill to false?
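
For reference, that would look something like this (a sketch; disabling spilling can itself cause OOMs, so treat it as a diagnostic rather than a fix):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("join-job")                 // illustrative name
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)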


2014-06-17 5:59 GMT-07:00 MEETHU MATHEW :

>
>  Hi all,
>
> I want  to do a recursive leftOuterJoin between an RDD (created from
>  file) with 9 million rows(size of the file is 100MB) and 30 other
> RDDs(created from 30 diff files in each iteration of a loop) varying from 1
> to 6 million rows.
> When I run it for 5 RDDs,its running successfully  in 5 minutes.But when I
> increase it to 10 or 30 RDDs its gradually slowing down and finally getting
> stuck without showing any warning or error.
>
> I am running in standalone mode with 2 workers of 4GB each and a total of
> 16 cores .
>
> Any of you facing similar problems with JOIN  or is it a problem with my
> configuration.
>
> Thanks & Regards,
> Meethu M
>


Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Andrew Or
Hi Koert and Lukasz,

The recommended way of not hard-coding configurations in your application
is through conf/spark-defaults.conf as documented here:
http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
However, this is only applicable to
spark-submit, so this may not be useful to you.

Depending on how you launch your Spark applications, you can work around
this by manually specifying these configs as -Dspark.x=y
in your java command to launch Spark. This is actually how SPARK_JAVA_OPTS
used to work before 1.0. Note that spark-submit does
essentially the same thing, but sets these properties programmatically by
reading from the conf/spark-defaults.conf file and calling
System.setProperty("spark.x", "y").
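
Concretely, if you launch the driver JVM yourself, that looks something like this (a sketch; the class name, jar, and property values are illustrative):

java -cp my-app.jar:spark-assembly.jar \
  -Dspark.master=spark://master:7077 \
  -Dspark.akka.timeout=300 \
  com.example.MyApp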

Note that spark.executor.extraJavaOptions is not intended for Spark configuration
(see http://spark.apache.org/docs/latest/configuration.html).
SPARK_DAEMON_JAVA_OPTS, as you pointed out, is for Spark daemons like the
standalone master, worker, and the history server;
it is also not intended for spark configurations to be picked up by Spark
executors and drivers. In general, any reference to "java opts"
in any variable or config refers to java options, as the name implies, not
Spark configuration. Unfortunately, it just so happened that we
used to mix the two in the same environment variable before 1.0.

Is there a reason you're not using spark-submit? Is it for legacy reasons?
As of 1.0, most changes to launching Spark applications
will be done through spark-submit, so you may miss out on relevant new
features or bug fixes.

Andrew



2014-06-19 7:41 GMT-07:00 Koert Kuipers :

> still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark
> standalone.
>
> for example if i have a akka timeout setting that i would like to be
> applied to every piece of the spark framework (so spark master, spark
> workers, spark executor sub-processes, spark-shell, etc.). i used to do
> that with SPARK_JAVA_OPTS. now i am unsure.
>
> SPARK_DAEMON_JAVA_OPTS works for the master and workers, but not for the
> spark-shell i think? i tried using SPARK_DAEMON_JAVA_OPTS, and it does not
> seem that useful. for example for a worker it does not apply the settings
> to the executor sub-processes, while for SPARK_JAVA_OPTS it does do that.
> so seems like SPARK_JAVA_OPTS is my only way to change settings for the
> executors, yet its deprecated?
>
>
> On Wed, Jun 11, 2014 at 10:59 PM, elyast 
> wrote:
>
>> Hi,
>>
>> I tried to use SPARK_JAVA_OPTS in spark-env.sh as well as conf/java-opts
>> file to set additional java system properties. In this case I could
>> connect
>> to tachyon without any problem.
>>
>> However when I tried setting executor and driver extraJavaOptions in
>> spark-defaults.conf it doesn't.
>>
>> I suspect the root cause may be following:
>>
>> SparkSubmit doesn't fork additional JVM to actually run either driver or
>> executor process and additional system properties are set after JVM is
>> created and other classes are loaded. It may happen that Tachyon
>> CommonConf
>> class is already being loaded and since it's a Singleton it won't pick up any
>> changes to system properties.
>>
>> Please let me know what do u think.
>>
>> Can I use conf/java-opts ? since it's not really documented anywhere?
>>
>> Best regards
>> Lukasz
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/little-confused-about-SPARK-JAVA-OPTS-alternatives-tp5798p7448.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Getting started : Spark on YARN issue

2014-06-19 Thread Andrew Or
Hi Praveen,

Yes, the fact that it is trying to use a private IP from outside of the
cluster is suspicious.
My guess is that your HDFS is configured to use internal IPs rather than
external IPs.
This means even though the hadoop confs on your local machine only use
external IPs,
the org.apache.spark.deploy.yarn.Client that is running on your local
machine is trying
to use whatever address your HDFS name node tells it to use, which is
private in this
case.

A potential fix is to update your hdfs-site.xml (and other related configs)
within your
cluster to use public hostnames. Let me know if that does the job.
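
As a rough sketch of the kind of change I mean (whether this particular
property is available depends on your Hadoop version, so treat it as a
starting point rather than a recipe), you can ask the HDFS client to connect
to datanodes by hostname instead of the IP the namenode reports:

    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>

As long as the public hostnames resolve from your local machine, the upload
to .sparkStaging should then go through.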

Andrew


2014-06-19 6:04 GMT-07:00 Praveen Seluka :

> I am trying to run Spark on YARN. I have a hadoop 2.2 cluster (YARN  +
> HDFS) in EC2. Then, I compiled Spark using Maven with 2.2 hadoop profiles.
> Now am trying to run the example Spark job . (In Yarn-cluster mode).
>
> From my *local machine. *I have setup HADOOP_CONF_DIR environment
> variable correctly.
>
> ➜  spark git:(master) ✗ /bin/bash -c "./bin/spark-submit --class
> org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 2
> --driver-memory 2g --executor-memory 2g --executor-cores 1
> examples/target/scala-2.10/spark-examples_*.jar 10"
> 14/06/19 14:59:39 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/06/19 14:59:39 INFO client.RMProxy: Connecting to ResourceManager at
> ec2-54-242-244-250.compute-1.amazonaws.com/54.242.244.250:8050
> 14/06/19 14:59:41 INFO yarn.Client: Got Cluster metric info from
> ApplicationsManager (ASM), number of NodeManagers: 1
> 14/06/19 14:59:41 INFO yarn.Client: Queue info ... queueName: default,
> queueCurrentCapacity: 0.0, queueMaxCapacity: 1.0,
>   queueApplicationCount = 0, queueChildQueueCount = 0
> 14/06/19 14:59:41 INFO yarn.Client: Max mem capabililty of a single
> resource in this cluster 12288
> 14/06/19 14:59:41 INFO yarn.Client: Preparing Local resources
> 14/06/19 14:59:42 WARN hdfs.BlockReaderLocal: The short-circuit local
> reads feature cannot be used because libhadoop cannot be loaded.
> 14/06/19 14:59:43 INFO yarn.Client: Uploading
> file:/home/rgupta/awesome/spark/examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar
> to hdfs://
> ec2-54-242-244-250.compute-1.amazonaws.com:8020/user/rgupta/.sparkStaging/application_1403176373037_0009/spark-examples_2.10-1.0.0-SNAPSHOT.jar
> 14/06/19 15:00:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/
> 10.180.150.66:50010]
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1305)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1128)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/06/19 15:00:45 INFO hdfs.DFSClient: Abandoning
> BP-1714253233-10.180.215.105-1403176367942:blk_1073741833_1009
> 14/06/19 15:00:46 INFO hdfs.DFSClient: Excluding datanode
> 10.180.150.66:50010
> 14/06/19 15:00:46 WARN hdfs.DFSClient: DataStreamer Exception
>
> Its able to talk to Resource Manager
> Then it puts the example.jar file to HDFS and it fails. Its trying to
> write to datanode. I verified that 50010 port is accessible through local
> machine. Any idea whats the issue here ?
> One thing thats suspicious is */10.180.150.66:50010
>  - it looks like its trying to connect using
> private IP. If so, how can I resolve this to use public IP.*
>
> Thanks
> Praveen
>


Re: Getting started : Spark on YARN issue

2014-06-19 Thread Andrew Or
(Also, an easier workaround is to simply submit the application from within
your
cluster, thus saving you all the manual labor of reconfiguring everything
to use
public hostnames. This may or may not be applicable to your use case.)


2014-06-19 14:04 GMT-07:00 Andrew Or :

> Hi Praveen,
>
> Yes, the fact that it is trying to use a private IP from outside of the
> cluster is suspicious.
> My guess is that your HDFS is configured to use internal IPs rather than
> external IPs.
> This means even though the hadoop confs on your local machine only use
> external IPs,
> the org.apache.spark.deploy.yarn.Client that is running on your local
> machine is trying
> to use whatever address your HDFS name node tells it to use, which is
> private in this
> case.
>
> A potential fix is to update your hdfs-site.xml (and other related
> configs) within your
> cluster to use public hostnames. Let me know if that does the job.
>
> Andrew
>
>
> 2014-06-19 6:04 GMT-07:00 Praveen Seluka :
>
> I am trying to run Spark on YARN. I have a hadoop 2.2 cluster (YARN  +
>> HDFS) in EC2. Then, I compiled Spark using Maven with 2.2 hadoop profiles.
>> Now am trying to run the example Spark job . (In Yarn-cluster mode).
>>
>> From my *local machine. *I have setup HADOOP_CONF_DIR environment
>> variable correctly.
>>
>> ➜  spark git:(master) ✗ /bin/bash -c "./bin/spark-submit --class
>> org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 2
>> --driver-memory 2g --executor-memory 2g --executor-cores 1
>> examples/target/scala-2.10/spark-examples_*.jar 10"
>> 14/06/19 14:59:39 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> 14/06/19 14:59:39 INFO client.RMProxy: Connecting to ResourceManager at
>> ec2-54-242-244-250.compute-1.amazonaws.com/54.242.244.250:8050
>> 14/06/19 14:59:41 INFO yarn.Client: Got Cluster metric info from
>> ApplicationsManager (ASM), number of NodeManagers: 1
>> 14/06/19 14:59:41 INFO yarn.Client: Queue info ... queueName: default,
>> queueCurrentCapacity: 0.0, queueMaxCapacity: 1.0,
>>   queueApplicationCount = 0, queueChildQueueCount = 0
>> 14/06/19 14:59:41 INFO yarn.Client: Max mem capabililty of a single
>> resource in this cluster 12288
>> 14/06/19 14:59:41 INFO yarn.Client: Preparing Local resources
>> 14/06/19 14:59:42 WARN hdfs.BlockReaderLocal: The short-circuit local
>> reads feature cannot be used because libhadoop cannot be loaded.
>> 14/06/19 14:59:43 INFO yarn.Client: Uploading
>> file:/home/rgupta/awesome/spark/examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar
>> to hdfs://
>> ec2-54-242-244-250.compute-1.amazonaws.com:8020/user/rgupta/.sparkStaging/application_1403176373037_0009/spark-examples_2.10-1.0.0-SNAPSHOT.jar
>> 14/06/19 15:00:45 INFO hdfs.DFSClient: Exception in
>> createBlockOutputStream
>> org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while
>> waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending remote=/
>> 10.180.150.66:50010]
>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1305)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1128)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
>> 14/06/19 15:00:45 INFO hdfs.DFSClient: Abandoning
>> BP-1714253233-10.180.215.105-1403176367942:blk_1073741833_1009
>> 14/06/19 15:00:46 INFO hdfs.DFSClient: Excluding datanode
>> 10.180.150.66:50010
>> 14/06/19 15:00:46 WARN hdfs.DFSClient: DataStreamer Exception
>>
>> Its able to talk to Resource Manager
>> Then it puts the example.jar file to HDFS and it fails. Its trying to
>> write to datanode. I verified that 50010 port is accessible through local
>> machine. Any idea whats the issue here ?
>> One thing thats suspicious is */10.180.150.66:50010
>> <http://10.180.150.66:50010> - it looks like its trying to connect using
>> private IP. If so, how can I resolve this to use public IP.*
>>
>> Thanks
>> Praveen
>>
>
>


Re: options set in spark-env.sh is not reflecting on actual execution

2014-06-20 Thread Andrew Or
Hi Meethu,

Are you using Spark 1.0? If so, you should use spark-submit (
http://spark.apache.org/docs/latest/submitting-applications.html), which
has --executor-memory. If you don't want to specify this every time you
submit an application, you can also specify spark.executor.memory in
$SPARK_HOME/conf/spark-defaults.conf (
http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
).

SPARK_WORKER_MEMORY is for the worker daemon, not your individual
application. A worker can launch many executors, and the value of
SPARK_WORKER_MEMORY is shared across all executors running on that worker.
SPARK_EXECUTOR_MEMORY is deprecated and replaced by
"spark.executor.memory". This is the value you should set.
SPARK_DAEMON_JAVA_OPTS should not be used for setting spark configs, but
instead is intended for java options for worker and master instances (not
for Spark applications). Similarly, you shouldn't be setting
SPARK_MASTER_OPTS or SPARK_WORKER_OPTS to configure your application.

The recommended way for setting spark.* configurations is to do it
programmatically by creating a new SparkConf, set these configurations in
the conf, and pass this conf to the SparkContext (see
http://spark.apache.org/docs/latest/configuration.html#spark-properties).
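
For example, a minimal sketch (the app name and values are just placeholders
to adapt):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")                     // placeholder
      .set("spark.executor.memory", "6g")       // instead of SPARK_EXECUTOR_MEMORY=6g
      .set("spark.akka.frameSize", "100")       // spark.* configs go here, not in daemon opts
    val sc = new SparkContext(conf)

The same properties can also live in conf/spark-defaults.conf if you submit
with spark-submit.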

Andrew



2014-06-18 22:21 GMT-07:00 MEETHU MATHEW :

> Hi all,
>
> I have a doubt regarding the options in spark-env.sh. I set the following
> values in the file in master and 2 workers
>
> SPARK_WORKER_MEMORY=7g
> SPARK_EXECUTOR_MEMORY=6g
> SPARK_DAEMON_JAVA_OPTS+="- Dspark.akka.timeout=30
> -Dspark.akka.frameSize=1 -Dspark.blockManagerHeartBeatMs=80
> -Dspark.shuffle.spill=false
>
> But SPARK_EXECUTOR_MEMORY is showing 4g in the web UI. Do I need to change it
> anywhere else to make it 4g and to reflect it in web UI.
>
> A warning is coming that blockManagerHeartBeatMs is exceeding 45 while
> executing a process even though I set it to 80.
>
> So I doubt whether it should be set  as SPARK_MASTER_OPTS
> or SPARK_WORKER_OPTS..
>
> Thanks & Regards,
> Meethu M
>


Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-20 Thread Andrew Or
Well, even before spark-submit the standard way of setting spark
configurations is to create a new SparkConf, set the values in the conf,
and pass this to the SparkContext in your application. It's true that this
involves "hard-coding" these configurations in your application, but these
configurations are intended to be application-level settings anyway, rather
than cluster-wide settings. Environment variables are not really ideal for
this purpose, though it's an easy way to change these settings quickly.


2014-06-20 14:03 GMT-07:00 Koert Kuipers :

> thanks for the detailed answer andrew. thats helpful.
>
> i think the main thing thats bugging me is that there is no simple way for
> an admin to always set something on the executors for a production
> environment (an akka timeout comes to mind). yes i could use
> spark-defaults  for that, although that means everything must be submitted
> through spark-submit, which is fairly new and i am not sure how much we
> will use that yet. i will look into that some more.
>
>
> On Thu, Jun 19, 2014 at 6:56 PM, Koert Kuipers  wrote:
>
>> for a jvm application its not very appealing to me to use spark
>> submit. my application uses hadoop, so i should use "hadoop jar", and my
>> application uses spark, so it should use "spark-submit". if i add a piece
>> of code that uses some other system there will be yet another suggested way
>> to launch it. thats not very scalable, since i can only launch it one way
>> in the end...
>>
>>
>> On Thu, Jun 19, 2014 at 4:58 PM, Andrew Or  wrote:
>>
>>> Hi Koert and Lukasz,
>>>
>>> The recommended way of not hard-coding configurations in your
>>> application is through conf/spark-defaults.conf as documented here:
>>> http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
>>> However, this is only applicable to
>>> spark-submit, so this may not be useful to you.
>>>
>>> Depending on how you launch your Spark applications, you can workaround
>>> this by manually specifying these configs as -Dspark.x=y
>>> in your java command to launch Spark. This is actually how
>>> SPARK_JAVA_OPTS used to work before 1.0. Note that spark-submit does
>>> essentially the same thing, but sets these properties programmatically
>>> by reading from the conf/spark-defaults.conf file and calling
>>> System.setProperty("spark.x", "y").
>>>
>>> Note that spark.executor.extraJavaOpts not intended for spark
>>> configuration (see
>>> http://spark.apache.org/docs/latest/configuration.html).
>>>  SPARK_DAEMON_JAVA_OPTS, as you pointed out, is for Spark daemons like
>>> the standalone master, worker, and the history server;
>>> it is also not intended for spark configurations to be picked up by
>>> Spark executors and drivers. In general, any reference to "java opts"
>>> in any variable or config refers to java options, as the name implies,
>>> not Spark configuration. Unfortunately, it just so happened that we
>>> used to mix the two in the same environment variable before 1.0.
>>>
>>> Is there a reason you're not using spark-submit? Is it for legacy
>>> reasons? As of 1.0, most changes to launching Spark applications
>>> will be done through spark-submit, so you may miss out on relevant new
>>> features or bug fixes.
>>>
>>> Andrew
>>>
>>>
>>>
>>> 2014-06-19 7:41 GMT-07:00 Koert Kuipers :
>>>
>>> still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark
>>>> standalone.
>>>>
>>>> for example if i have a akka timeout setting that i would like to be
>>>> applied to every piece of the spark framework (so spark master, spark
>>>> workers, spark executor sub-processes, spark-shell, etc.). i used to do
>>>> that with SPARK_JAVA_OPTS. now i am unsure.
>>>>
>>>> SPARK_DAEMON_JAVA_OPTS works for the master and workers, but not for
>>>> the spark-shell i think? i tried using SPARK_DAEMON_JAVA_OPTS, and it does
>>>> not seem that useful. for example for a worker it does not apply the
>>>> settings to the executor sub-processes, while for SPARK_JAVA_OPTS it does
>>>> do that. so seems like SPARK_JAVA_OPTS is my only way to change settings
>>>> for the executors, yet its deprecated?
>>>>
>>>>
>>>> On Wed, Jun 11, 2014 at 10:59 PM, elyast 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>

Re: hi

2014-06-23 Thread Andrew Or
Hm, spark://localhost:7077 should work, because the standalone master binds
to 0.0.0.0. Are you sure you ran `sbin/start-master.sh`?


2014-06-22 22:50 GMT-07:00 Akhil Das :

> Open your webUI in the browser and see the spark url in the top left
> corner of the page and use it while starting your spark shell instead of
> localhost:7077.
>
> Thanks
> Best Regards
>
>
> On Mon, Jun 23, 2014 at 10:56 AM, rapelly kartheek <
> kartheek.m...@gmail.com> wrote:
>
>> Hi
>>   Can someone help me with the following error that I faced while setting
>> up single node spark framework.
>>
>> karthik@karthik-OptiPlex-9020:~/spark-1.0.0$
>> MASTER=spark://localhost:7077 sbin/spark-shell
>> bash: sbin/spark-shell: No such file or directory
>> karthik@karthik-OptiPlex-9020:~/spark-1.0.0$
>> MASTER=spark://localhost:7077 bin/spark-shell
>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>> MaxPermSize=128m; support was removed in 8.0
>> 14/06/23 10:44:53 INFO spark.SecurityManager: Changing view acls to:
>> karthik
>> 14/06/23 10:44:53 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set(karthik)
>> 14/06/23 10:44:53 INFO spark.HttpServer: Starting HTTP Server
>> 14/06/23 10:44:53 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 14/06/23 10:44:53 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:39588
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 1.0.0
>>   /_/
>>
>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.8.0_05)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> 14/06/23 10:44:55 INFO spark.SecurityManager: Changing view acls to:
>> karthik
>> 14/06/23 10:44:55 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set(karthik)
>> 14/06/23 10:44:55 INFO slf4j.Slf4jLogger: Slf4jLogger started
>> 14/06/23 10:44:55 INFO Remoting: Starting remoting
>> 14/06/23 10:44:55 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://spark@karthik-OptiPlex-9020:50294]
>> 14/06/23 10:44:55 INFO Remoting: Remoting now listens on addresses:
>> [akka.tcp://spark@karthik-OptiPlex-9020:50294]
>> 14/06/23 10:44:55 INFO spark.SparkEnv: Registering MapOutputTracker
>> 14/06/23 10:44:55 INFO spark.SparkEnv: Registering BlockManagerMaster
>> 14/06/23 10:44:55 INFO storage.DiskBlockManager: Created local directory
>> at /tmp/spark-local-20140623104455-3297
>> 14/06/23 10:44:55 INFO storage.MemoryStore: MemoryStore started with
>> capacity 294.6 MB.
>> 14/06/23 10:44:55 INFO network.ConnectionManager: Bound socket to port
>> 60264 with id = ConnectionManagerId(karthik-OptiPlex-9020,60264)
>> 14/06/23 10:44:55 INFO storage.BlockManagerMaster: Trying to register
>> BlockManager
>> 14/06/23 10:44:55 INFO storage.BlockManagerInfo: Registering block
>> manager karthik-OptiPlex-9020:60264 with 294.6 MB RAM
>> 14/06/23 10:44:55 INFO storage.BlockManagerMaster: Registered BlockManager
>> 14/06/23 10:44:55 INFO spark.HttpServer: Starting HTTP Server
>> 14/06/23 10:44:55 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 14/06/23 10:44:55 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:38307
>> 14/06/23 10:44:55 INFO broadcast.HttpBroadcast: Broadcast server started
>> at http://10.0.1.61:38307
>> 14/06/23 10:44:55 INFO spark.HttpFileServer: HTTP File server directory
>> is /tmp/spark-082a44f6-e877-48cc-8ab7-1bcbcf8136b0
>> 14/06/23 10:44:55 INFO spark.HttpServer: Starting HTTP Server
>> 14/06/23 10:44:55 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 14/06/23 10:44:55 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:58745
>> 14/06/23 10:44:56 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 14/06/23 10:44:56 INFO server.AbstractConnector: Started
>> SelectChannelConnector@0.0.0.0:4040
>> 14/06/23 10:44:56 INFO ui.SparkUI: Started SparkUI at
>> http://karthik-OptiPlex-9020:4040
>> 14/06/23 10:44:56 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> 14/06/23 10:44:56 INFO client.AppClient$ClientActor: Connecting to master
>> spark://localhost:7077...
>> 14/06/23 10:44:56 INFO repl.SparkILoop: Created spark context..
>> 14/06/23 10:44:56 WARN client.AppClient$ClientActor: Could not connect to
>> akka.tcp://sparkMaster@localhost:7077:
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://sparkMaster@localhost:7077]
>> Spark context available as sc.
>>
>> scala> 14/06/23 10:44:56 WARN client.AppClient$ClientActor: Could not
>> connect to akka.tcp://sparkMaster@localhost:7077:
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://sparkMaster@localhost:7077]
>> 14/06/23 10:44:56 WARN client.AppClient$ClientActor: Could not connect to
>> akka.tcp://spark

Re: Error in run spark.ContextCleaner under Spark 1.0.0

2014-06-23 Thread Andrew Or
Hi Haoming,

You can safely disregard this error. This is printed at the end of the
execution when we clean up and kill the daemon context cleaning thread. In
the future it would be good to silence this particular message, as it may
be confusing to users.

Andrew


2014-06-23 12:13 GMT-07:00 Haoming Zhang :

> Hi all,
>
> I tried to run a simple Spark Streaming program with sbt. The compile
> process was correct, but when I run the program I will get an error:
>
> "ERROR spark.ContextCleaner: Error in cleaning thread"
>
> I'm not sure this is a bug or something, because I can get the running
> result as I expected, only an error will be reported.
>
> The following is the full log:
> [info] Set current project to Simple Streaming Project (in build
> file:/home/feicun/workspace/tempStream/)
> [info] Running SimpleStream
> Before // This is a word that I printed by "println"
> 14/06/23 12:03:24 INFO spark.SecurityManager: Changing view acls to: feicun
> 14/06/23 12:03:24 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(feicun)
> 14/06/23 12:03:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 14/06/23 12:03:24 INFO Remoting: Starting remoting
> 14/06/23 12:03:24 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@manjaro:37906]
> 14/06/23 12:03:24 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@manjaro:37906]
> 14/06/23 12:03:24 INFO spark.SparkEnv: Registering MapOutputTracker
> 14/06/23 12:03:24 INFO spark.SparkEnv: Registering BlockManagerMaster
> 14/06/23 12:03:24 INFO storage.DiskBlockManager: Created local directory
> at /tmp/spark-local-20140623120324-3cf5
> 14/06/23 12:03:24 INFO storage.MemoryStore: MemoryStore started with
> capacity 819.3 MB.
> 14/06/23 12:03:24 INFO network.ConnectionManager: Bound socket to port
> 39964 with id = ConnectionManagerId(manjaro,39964)
> 14/06/23 12:03:24 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 14/06/23 12:03:24 INFO storage.BlockManagerInfo: Registering block manager
> manjaro:39964 with 819.3 MB RAM
> 14/06/23 12:03:24 INFO storage.BlockManagerMaster: Registered BlockManager
> 14/06/23 12:03:24 INFO spark.HttpServer: Starting HTTP Server
> 14/06/23 12:03:24 INFO server.Server: jetty-8.1.14.v20131031
> 14/06/23 12:03:24 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:38377
> 14/06/23 12:03:24 INFO broadcast.HttpBroadcast: Broadcast server started
> at http://10.154.17.101:38377
> 14/06/23 12:03:24 INFO spark.HttpFileServer: HTTP File server directory is
> /tmp/spark-f3a10cb8-bdfa-4838-97d1-11bde412f10c
> 14/06/23 12:03:24 INFO spark.HttpServer: Starting HTTP Server
> 14/06/23 12:03:24 INFO server.Server: jetty-8.1.14.v20131031
> 14/06/23 12:03:24 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:51366
> 14/06/23 12:03:24 INFO server.Server: jetty-8.1.14.v20131031
> 14/06/23 12:03:24 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 14/06/23 12:03:24 INFO ui.SparkUI: Started SparkUI at http://manjaro:4040
> fileStreamorg.apache.spark.streaming.dstream.MappedDStream@64084936 //
> This is a DStream that I expected
> 14/06/23 12:03:25 INFO network.ConnectionManager: Selector thread was
> interrupted!
> 14/06/23 12:03:25 ERROR spark.ContextCleaner: Error in cleaning thread
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:117)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:115)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:115)
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
> at org.apache.spark.ContextCleaner.org
> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:114)
> at
> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
> 14/06/23 12:03:25 ERROR util.Utils: Uncaught exception in thread
> SparkListenerBus
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:996)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> at
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:48)
> at
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
> at
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$an

Re: hi

2014-06-23 Thread Andrew Or
Ah never mind. The 0.0.0.0 is for the UI, not for Master, which uses the
output of the "hostname" command. But yes, long answer short, go to the web
UI and use that URL.


2014-06-23 11:13 GMT-07:00 Andrew Or :

> Hm, spark://localhost:7077 should work, because the standalone master
> binds to 0.0.0.0. Are you sure you ran `sbin/start-master.sh`?
>
>
> 2014-06-22 22:50 GMT-07:00 Akhil Das :
>
> Open your webUI in the browser and see the spark url in the top left
>> corner of the page and use it while starting your spark shell instead of
>> localhost:7077.
>>
>> Thanks
>> Best Regards
>>
>>
>> On Mon, Jun 23, 2014 at 10:56 AM, rapelly kartheek <
>> kartheek.m...@gmail.com> wrote:
>>
>>> Hi
>>>   Can someone help me with the following error that I faced while
>>> setting up single node spark framework.
>>>
>>> karthik@karthik-OptiPlex-9020:~/spark-1.0.0$
>>> MASTER=spark://localhost:7077 sbin/spark-shell
>>> bash: sbin/spark-shell: No such file or directory
>>> karthik@karthik-OptiPlex-9020:~/spark-1.0.0$
>>> MASTER=spark://localhost:7077 bin/spark-shell
>>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>>> MaxPermSize=128m; support was removed in 8.0
>>> 14/06/23 10:44:53 INFO spark.SecurityManager: Changing view acls to:
>>> karthik
>>> 14/06/23 10:44:53 INFO spark.SecurityManager: SecurityManager:
>>> authentication disabled; ui acls disabled; users with view permissions:
>>> Set(karthik)
>>> 14/06/23 10:44:53 INFO spark.HttpServer: Starting HTTP Server
>>> 14/06/23 10:44:53 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/06/23 10:44:53 INFO server.AbstractConnector: Started
>>> SocketConnector@0.0.0.0:39588
>>> Welcome to
>>>     __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/___/ .__/\_,_/_/ /_/\_\   version 1.0.0
>>>   /_/
>>>
>>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>>> 1.8.0_05)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>> 14/06/23 10:44:55 INFO spark.SecurityManager: Changing view acls to:
>>> karthik
>>> 14/06/23 10:44:55 INFO spark.SecurityManager: SecurityManager:
>>> authentication disabled; ui acls disabled; users with view permissions:
>>> Set(karthik)
>>> 14/06/23 10:44:55 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>> 14/06/23 10:44:55 INFO Remoting: Starting remoting
>>> 14/06/23 10:44:55 INFO Remoting: Remoting started; listening on
>>> addresses :[akka.tcp://spark@karthik-OptiPlex-9020:50294]
>>> 14/06/23 10:44:55 INFO Remoting: Remoting now listens on addresses:
>>> [akka.tcp://spark@karthik-OptiPlex-9020:50294]
>>> 14/06/23 10:44:55 INFO spark.SparkEnv: Registering MapOutputTracker
>>> 14/06/23 10:44:55 INFO spark.SparkEnv: Registering BlockManagerMaster
>>> 14/06/23 10:44:55 INFO storage.DiskBlockManager: Created local directory
>>> at /tmp/spark-local-20140623104455-3297
>>> 14/06/23 10:44:55 INFO storage.MemoryStore: MemoryStore started with
>>> capacity 294.6 MB.
>>> 14/06/23 10:44:55 INFO network.ConnectionManager: Bound socket to port
>>> 60264 with id = ConnectionManagerId(karthik-OptiPlex-9020,60264)
>>> 14/06/23 10:44:55 INFO storage.BlockManagerMaster: Trying to register
>>> BlockManager
>>> 14/06/23 10:44:55 INFO storage.BlockManagerInfo: Registering block
>>> manager karthik-OptiPlex-9020:60264 with 294.6 MB RAM
>>> 14/06/23 10:44:55 INFO storage.BlockManagerMaster: Registered
>>> BlockManager
>>> 14/06/23 10:44:55 INFO spark.HttpServer: Starting HTTP Server
>>> 14/06/23 10:44:55 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/06/23 10:44:55 INFO server.AbstractConnector: Started
>>> SocketConnector@0.0.0.0:38307
>>> 14/06/23 10:44:55 INFO broadcast.HttpBroadcast: Broadcast server started
>>> at http://10.0.1.61:38307
>>> 14/06/23 10:44:55 INFO spark.HttpFileServer: HTTP File server directory
>>> is /tmp/spark-082a44f6-e877-48cc-8ab7-1bcbcf8136b0
>>> 14/06/23 10:44:55 INFO spark.HttpServer: Starting HTTP Server
>>> 14/06/23 10:44:55 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/06/23 10:44:55 INFO server.AbstractConnector: Started
>>> SocketConnector@0.0.0.0:58745
>>> 14/06/23 10:44:56 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/06/23 10:44:56 INFO server.AbstractConnector: Started
>>> SelectChannelC

Re: problem about cluster mode of spark 1.0.0

2014-06-24 Thread Andrew Or
Hi Randy and Gino,

The issue is that standalone-cluster mode is not officially supported.
Please use standalone-client mode instead, i.e. specify --deploy-mode
client in spark-submit, or simply leave out this config because it defaults
to client mode.
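
For example (the master URL, class name and jar below are placeholders):

    ./bin/spark-submit \
      --master spark://your-master-host:7077 \
      --deploy-mode client \
      --class com.example.MyApp \
      my-app.jar

Leaving out --deploy-mode entirely has the same effect, since client is the
default.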

Unfortunately, this is not currently documented anywhere, and the existing
explanation for the distinction between cluster and client modes is highly
misleading. In general, cluster mode means the driver runs on one of the
worker nodes, just like the executors. The corollary is that the output of
the application is not forwarded to the command that launched the application
(spark-submit in this case), but is accessible instead through the worker
logs. In contrast, client mode means the command that launches the
application also launches the driver, while the executors still run on the
worker nodes. This means the spark-submit command also returns the output
of the application. For instance, it doesn't make sense to run the Spark
shell in cluster mode, because the stdin / stdout / stderr will not be
redirected to the spark-submit command.

If you are hosting your own cluster and can launch applications from within
the cluster, then there is little benefit for launching your application in
cluster mode, which is primarily intended to cut down the latency between
the driver and the executors in the first place. However, if you are still
intent on using standalone-cluster mode after all, you can use the
deprecated way of launching org.apache.spark.deploy.Client directly through
bin/spark-class. Note that this is not recommended and only serves as a
temporary workaround until we fix standalone-cluster mode through
spark-submit.

I have filed the relevant issues:
https://issues.apache.org/jira/browse/SPARK-2259 and
https://issues.apache.org/jira/browse/SPARK-2260. Thanks for pointing this
out, and we will get to fixing these shortly.

Best,
Andrew


2014-06-20 6:06 GMT-07:00 Gino Bustelo :

> I've found that the jar will be copied to the worker from hdfs fine, but
> it is not added to the spark context for you. You have to know that the jar
> will end up in the driver's working dir, and so you just add the file
> name of the jar to the context in your program.
>
> In your example below, just add to the context "test.jar".
>
> Btw, the context will not have the master URL either, so add that while
> you are at it.
>
> This is a big issue. I've posted about it a week ago and no replies.
> Hopefully it gets more attention as more people start hitting this.
> Basically, spark-submit on standalone cluster with cluster deploy is broken.
>
> Gino B.
>
> > On Jun 20, 2014, at 2:46 AM, randylu  wrote:
> >
> > in addition, jar file can be copied to driver node automatically.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-cluster-mode-of-spark-1-0-0-tp7982p7984.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-24 Thread Andrew Or
Hi all,

The short answer is that standalone-cluster mode through spark-submit is
broken (and in fact not officially supported). Please use standalone-client
mode instead.
The long answer is provided here:
http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3cCAMJOb8m6gF9B3W=p12hi88mexkoon15-1jkvs8pblmuwh9r...@mail.gmail.com%3e

Andrew


2014-06-19 12:00 GMT-07:00 lannyripple :

> Gino,
>
> I can confirm that your solution of assembling with spark-streaming-kafka
> but excluding spark-core and spark-streaming has me working with basic
> spark-submit.  As mentioned you must specify the assembly jar in the
> SparkConfig as well as to spark-submit.
>
> When I see the error you are now experiencing I just restart my cluster
> (sbin/stop-all.sh; sleep 6; sbin/start-all.sh).  My thought is a resource
> leak somewhere but I haven't tried to chase it down since restarting is
> nice
> and quick.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-SPARK-CLASSPATH-tp7356p7941.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Does PUBLIC_DNS environment parameter really works?

2014-06-24 Thread Andrew Or
Hi Peng,

What you're looking for is SPARK_MASTER_IP, which defaults to the output of
the command "hostname" (see sbin/start-master.sh).

What SPARK_PUBLIC_DNS does is change what the Master or the Worker
advertise to others. If this is set, the links on the Master and Worker web
UI will use public addresses instead of private ones, for example. This is
useful if you're browsing through these UIs locally from your machine
outside of the cluster.
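
Concretely, something along these lines in conf/spark-env.sh on the master
(the hostname is a placeholder for your EC2 public DNS):

    SPARK_MASTER_IP=ec2-xx-xx-xx-xx.compute-1.amazonaws.com   # address the spark:// master URL is built from
    SPARK_PUBLIC_DNS=ec2-xx-xx-xx-xx.compute-1.amazonaws.com  # address shown in the web UI links

SPARK_MASTER_IP is what changes the spark://host:7077 URL itself;
SPARK_PUBLIC_DNS only affects what the UIs advertise.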

Best,
Andrew


2014-06-24 21:01 GMT-07:00 Peng Cheng :

> I'm deploying a cluster to Amazon EC2, trying to override its internal ip
> addresses with public dns
>
> I start a cluster with environment parameter: SPARK_PUBLIC_DNS=[my EC2
> public DNS]
>
> But it doesn't change anything on the web UI, it still shows internal ip
> address
>
> Spark Master at spark://ip-172-31-32-12:7077
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-PUBLIC-DNS-environment-parameter-really-works-tp8237.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark 1.0.0 on yarn cluster problem

2014-06-25 Thread Andrew Or
Hi Sophia, did you ever resolve this?

A common cause for not giving resources to the job is that the RM cannot
communicate with the workers.
This itself has many possible causes. Do you have a full stack trace from
the logs?

Andrew


2014-06-13 0:46 GMT-07:00 Sophia :

> With the yarn-client mode,I submit a job from client to yarn,and the spark
> file spark-env.sh:
> export HADOOP_HOME=/usr/lib/hadoop
> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
> SPARK_EXECUTOR_INSTANCES=4
> SPARK_EXECUTOR_CORES=1
> SPARK_EXECUTOR_MEMORY=1G
> SPARK_DRIVER_MEMORY=2G
> SPARK_YARN_APP_NAME="Spark 1.0.0"
>
> the command line and the result:
>  $export JAVA_HOME=/usr/java/jdk1.7.0_45/
> $export PATH=$JAVA_HOME/bin:$PATH
> $  ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-client
> ./bin/spark-submit: line 44: /usr/lib/spark/bin/spark-class: Success
> How can I do with it? The yarn only accept the job but it cannot give
> memory
> to the job.Why?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-on-yarn-cluster-problem-tp7560.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: About StorageLevel

2014-06-26 Thread Andrew Or
Hi Kang,

You raise a good point. Spark does not automatically cache all your RDDs.
Why? Simply because the application may create many RDDs, and not all of
them are to be reused. After all, there is only so much memory available to
each executor, and caching an RDD adds some overhead especially if we have
to kick out old blocks with LRU. As an example, say you run the following
chain:

sc.textFile(...).map(...).filter(...).flatMap(...).map(...).reduceByKey(...).count()

You might be interested in reusing only the final result, but each step of
the chain actually creates an RDD. If we automatically cache all RDDs, then
we'll end up doing extra work for the RDDs we don't care about. The effect
can be much worse if our RDDs are big and there are many of them, in which
case there may be a lot of churn in the cache as we constantly evict RDDs
we reuse. After all, the users know best what RDDs they are most interested
in, so it makes sense to give them control over caching behavior.
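
For example (the input path is hypothetical):

    import org.apache.spark.SparkContext._   // for reduceByKey on pair RDDs in Spark 1.x

    val counts = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()                 // only the final RDD is marked for reuse

    counts.count()   // first action computes the chain and caches the result
    counts.count()   // second action reads from the cache instead of recomputing

Without the cache() call, the second count() would recompute the whole chain
from the text file.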

Best,
Andrew



2014-06-26 5:36 GMT-07:00 tomsheep...@gmail.com :

> Hi all,
>
> I have a newbie question about StorageLevel of spark. I came up with
> these sentences in spark documents:
>
> If your RDDs fit comfortably with the default storage level (MEMORY_ONLY),
> leave them that way. This is the most CPU-efficient option, allowing
> operations on the RDDs to run as fast as possible.
>
> And
>
> Spark automatically monitors cache usage on each node and drops out old
> data partitions in a least-recently-used (LRU) fashion. If you would like
> to manually remove an RDD instead of waiting for it to fall out of the
> cache, use the RDD.unpersist() method.
> http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
>
>
> But I found the default storageLevel is NONE in source code, and if I
> never call 'persist(somelevel)', that value will always be NONE. The
> 'iterator' method goes to
>
> final def iterator(split: Partition, context: TaskContext): Iterator[T] =
> {
> if (storageLevel != StorageLevel.NONE) {
> SparkEnv.get.cacheManager.getOrCompute(this, split, context,
> storageLevel)
> } else {
> computeOrReadCheckpoint(split, context)
> }
> }
> Is that to say, the rdds are cached in memory (or somewhere else) if and
> only if the 'persist' or 'cache' method is called explicitly,
> otherwise they will be re-computed every time even in an iterative
> situation?
> It made me confused because I had a first impression that spark is
> super-fast because it prefers to store intermediate results in memory
> automatically.
>
> Forgive me if I asked a stupid question.
>
> Regards,
> Kang Liu
>


Re: Run spark unit test on Windows 7

2014-07-02 Thread Andrew Or
Hi Konstatin,

We use hadoop as a library in a few places in Spark. I wonder why the path
includes "null" though.

Could you provide the full stack trace?

Andrew


2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com>:

> Hi all,
>
> I'm trying to run some transformation on *Spark*, it works fine on
> cluster (YARN, linux machines). However, when I'm trying to run it on local
> machine (*Windows 7*) under unit test, I got errors:
>
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
> at org.apache.hadoop.util.Shell.(Shell.java:326)
> at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
> at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>
>
> My code is following:
>
> @Test
> def testETL() = {
> val conf = new SparkConf()
> val sc = new SparkContext("local", "test", conf)
> try {
> val etl = new IxtoolsDailyAgg() // empty constructor
>
> val data = sc.parallelize(List("in1", "in2", "in3"))
>
> etl.etl(data) // rdd transformation, no access to SparkContext or 
> Hadoop
> Assert.assertTrue(true)
> } finally {
> if(sc != null)
> sc.stop()
> }
> }
>
>
> Why is it trying to access hadoop at all? and how can I fix it? Thank you
> in advance
>
> Thank you,
> Konstantin Kudryavtsev
>


Re: Help alleviating OOM errors

2014-07-02 Thread Andrew Or
Hi Yana,

In 0.9.1, spark.shuffle.spill is set to true by default so you shouldn't
need to manually set it.

Here are a few common causes of OOMs in Spark:

- Too few partitions: if one partition is too big, it may cause an OOM if
there is not enough space to unroll the entire partition in memory. This
will be fixed in Spark 1.1, but until then you can try to increase the
number of partitions so each core on each executor has less data to handle.

- Your application is using a lot of memory on its own: Spark by default
assumes that it has 90% of the runtime memory in your JVM. If your
application is super memory-intensive (e.g. creates large data structures),
then I would either try to reduce the memory footprint of your application
itself if possible, or reduce the amount of memory Spark thinks it owns.
For the latter, I would reduce spark.storage.memoryFraction.
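
A rough sketch of both knobs (the path and the numbers are placeholders to
tune for your own job):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.4")   // give cached RDDs less of the heap
    val sc = new SparkContext(conf)

    // More, smaller partitions so no single partition has to be unrolled at once
    val data = sc.textFile("hdfs:///tmp/input", 400)  // hypothetical path, minSplits hint
    val finer = data.repartition(400)                 // or repartition an existing RDD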

To answer your questions:

- Yes, Spark will spill your RDD partitions to disk if they exceed 60% of
your runtime memory, but ONLY if you persisted them with MEMORY_AND_DISK.
The default is MEMORY_ONLY, which means the RDDs will just be kicked out
of the cache if your cache is full.

- In general, if an OOM occurs it would be good to re-run your application
after making the changes to your application suggested above. If one
executor dies because of an OOM, another executor might also run into it.
Even if the original task that caused an OOM is re-scheduled on a different
executor, it is likely that the other executor will also die of the same
problem.

Best,
Andrew


2014-07-02 6:22 GMT-07:00 Yana Kadiyska :

> Can you elaborate why "You need to configure the  spark.shuffle.spill
> true again in the config" -- the default for spark.shuffle.spill is
> set to true according to the
> doc(https://spark.apache.org/docs/0.9.1/configuration.html)?
>
> On OOM the tasks were process_local, which I understand is "as good as
> it gets" but still going on 32+ hours.
>
> On Wed, Jul 2, 2014 at 2:40 AM, Mayur Rustagi 
> wrote:
> >
> >
> > Mayur Rustagi
> > Ph: +1 (760) 203 3257
> > http://www.sigmoidanalytics.com
> > @mayur_rustagi
> >
> >
> >
> > On Mon, Jun 30, 2014 at 8:09 PM, Yana Kadiyska 
> > wrote:
> >>
> >> Hi,
> >>
> >> our cluster seems to have a really hard time with OOM errors on the
> >> executor. Periodically we'd see a task that gets sent to a few
> >> executors, one would OOM, and then the job just stays active for hours
> >> (sometimes 30+ whereas normally it completes sub-minute).
> >>
> >> So I have a few questions:
> >>
> >> 1. Why am I seeing OOMs to begin with?
> >>
> >> I'm running with defaults for
> >> spark.storage.memoryFraction
> >> spark.shuffle.memoryFraction
> >>
> >> so my understanding is that if Spark exceeds 60% of available memory,
> >> data will be spilled to disk? Am I misunderstanding this? In the
> >> attached screenshot, I see a single stage with 2 tasks on the same
> >> executor -- no disk spills but OOM.
> >
> > You need to configure the  spark.shuffle.spill true again in the config,
> > What is causing you to OOM, it could be that you are trying to just
> simply
> > sortbykey & keys are bigger memory of executor causing the OOM, can you
> put
> > the stack.
> >>
> >>
> >> 2. How can I reduce the likelyhood of seeing OOMs -- I am a bit
> >> concerned that I don't see a spill at all so not sure if decreasing
> >> spark.storage.memoryFraction is what needs to be done
> >
> >
> >>
> >>
> >> 3. Why does an OOM seem to break the executor so hopelessly? I am
> >> seeing times upwards of 30hrs once an OOM occurs. Why is that -- the
> >> task *should* take under a minute, so even if the whole RDD was
> >> recomputed from scratch, 30hrs is very mysterious to me. Hadoop can
> >> process this in about 10-15 minutes, so I imagine even if the whole
> >> job went to disk it should still not take more than an hour
> >
> > When OOM occurs it could cause the RDD to spill to disk, the repeat task
> may
> > be forced to read data from disk & cause the overall slowdown, not to
> > mention the RDD may be send to different executor to be processed, are
> you
> > seeing the slow tasks as process_local  or node_local atleast?
> >>
> >>
> >> Any insight into this would be much appreciated.
> >> Running Spark 0.9.1
> >
> >
>


Re: NullPointerException on ExternalAppendOnlyMap

2014-07-02 Thread Andrew Or
Hi Konstantin,

Thanks for reporting this. This happens because there are null keys in your
data. In general, Spark should not throw null pointer exceptions, so this
is a bug. I have fixed this here: https://github.com/apache/spark/pull/1288.

For now, you can work around this by special-handling your null keys before
passing your key-value pairs to a combine operator (e.g. groupByKey,
reduceByKey). For instance, rdd.map { case (k, v) => if (k == null)
(SPECIAL_VALUE, v) else (k, v) }.
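
To spell that out a little (the sentinel value and sample data here are
arbitrary):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // for reduceByKey on pair RDDs in Spark 1.x

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("null-keys"))
    val rdd = sc.parallelize(Seq(("a", 1), (null: String, 2), ("a", 3)))

    val NULL_KEY = "__null__"   // pick a sentinel that cannot collide with real keys
    val counts = rdd
      .map { case (k, v) => if (k == null) (NULL_KEY, v) else (k, v) }
      .reduceByKey(_ + _)      // combine operators now never see a null key

If you need the null key back afterwards, map NULL_KEY back to null once the
combine step is done.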

Best,
Andrew




2014-07-02 10:22 GMT-07:00 Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com>:

> Hi all,
>
> I catch very confusing exception running Spark 1.0 on HDP2.1
> During save rdd as text file I got:
>
>
> 14/07/02 10:11:12 WARN TaskSetManager: Loss was due to 
> java.lang.NullPointerException
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$getMorePairs(ExternalAppendOnlyMap.scala:254)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:237)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:236)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.(ExternalAppendOnlyMap.scala:236)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:218)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:162)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
>
>
> Do you have any idea what is it? how can I debug this issue or perhaps
> access another log?
>
>
> Thank you,
> Konstantin Kudryavtsev
>


Re: write event logs with YARN

2014-07-03 Thread Andrew Or
Hi Christophe, another Andrew speaking.

Your configuration looks fine to me. From the stack trace it seems that we
are in fact closing the file system prematurely elsewhere in the system,
such that when it tries to write the APPLICATION_COMPLETE file it throws
the exception you see. This does look like a potential bug in Spark.
Tracing the source of this may take a little while, but we will start looking
into it.

I'm assuming if you manually create your own APPLICATION_COMPLETE file then
the entries should show up. Unfortunately I don't see another workaround
for this, but we'll fix this as soon as possible.
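
In the meantime, something along these lines should make an existing run show
up, using the application directory from your listing:

    hdfs dfs -touchz spark-events/kelkoo.searchkeywordreport-1404376178470/APPLICATION_COMPLETE

touchz just creates the empty marker file that the history server looks for.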

Andrew


2014-07-03 1:44 GMT-07:00 Christophe Préaud :

>  Hi Andrew,
>
> This does not work (the application failed), I have the following error
> when I put 3 slashes in the hdfs scheme:
> (...)
> Caused by: java.lang.IllegalArgumentException: Pathname /
> dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
> from hdfs:/
> dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
> is not a valid DFS filename.
> (...)
>
> Besides, I do not think that there is an issue with the hdfs path name
> since only the empty APPLICATION_COMPLETE file is missing (with
> "spark.eventLog.dir=hdfs://:9000/user//spark-events"),
> all other files are correctly created, e.g.:
> hdfs dfs -ls spark-events/kelkoo.searchkeywordreport-1404376178470
> Found 3 items
> -rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29
> spark-events/kelkoo.searchkeywordreport-1404376178470/COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
> -rwxrwx---   1 kookel supergroup 137948 2014-07-03 08:32
> spark-events/kelkoo.searchkeywordreport-1404376178470/EVENT_LOG_2
> -rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29
> spark-events/kelkoo.searchkeywordreport-1404376178470/SPARK_VERSION_1.0.0
>
> You help is appreciated though, do not hesitate if you have any other idea
> on how to fix this.
>
> Thanks,
> Christophe.
>
>
> On 03/07/2014 01:49, Andrew Lee wrote:
>
> Hi Christophe,
>
>  Make sure you have 3 slashes in the hdfs scheme.
>
>  e.g.
>
>  hdfs:*///*:9000/user//spark-events
>
>  and in the spark-defaults.conf as well.
> spark.eventLog.dir=hdfs:*///*
> :9000/user//spark-events
>
>
> > Date: Thu, 19 Jun 2014 11:18:51 +0200
> > From: christophe.pre...@kelkoo.com
> > To: user@spark.apache.org
> > Subject: write event logs with YARN
> >
> > Hi,
> >
> > I am trying to use the new Spark history server in 1.0.0 to view
> finished applications (launched on YARN), without success so far.
> >
> > Here are the relevant configuration properties in my spark-defaults.conf:
> >
> > spark.yarn.historyServer.address=:18080
> > spark.ui.killEnabled=false
> > spark.eventLog.enabled=true
> > spark.eventLog.compress=true
> >
> spark.eventLog.dir=hdfs://:9000/user//spark-events
> >
> > And the history server has been launched with the command below:
> >
> > /opt/spark/sbin/start-history-server.sh
> hdfs://:9000/user//spark-events
> >
> >
> > However, the finished application do not appear in the history server UI
> (though the UI itself works correctly).
> > Apparently, the problem is that the APPLICATION_COMPLETE file is not
> created:
> >
> > hdfs dfs -stat %n spark-events/-1403166516102/*
> > COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
> > EVENT_LOG_2
> > SPARK_VERSION_1.0.0
> >
> > Indeed, if I manually create an empty APPLICATION_COMPLETE file in the
> above directory, the application can now be viewed normally in the history
> server.
> >
> > Finally, here is the relevant part of the YARN application log, which
> seems to imply that
> > the DFS Filesystem is already closed when the APPLICATION_COMPLETE file
> is created:
> >
> > (...)
> > 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with
> SUCCEEDED
> > 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be
> successfully unregistered.
> > 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
> > 14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory
> .sparkStaging/application_1397477394591_0798
> > 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from
> shutdown hook
> > 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at
> http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
> > 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
> > 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all
> executors
> > 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each
> executor to shut down
> > 14/06/19 08:29:30 INFO MapOutputTrackerMasterActor:
> MapOutputTrackerActor stopped!
> > 14/06/19 08:29:30 INFO ConnectionManager: Selector thread was
> interrupted!
> > 14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped
> > 14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared
> > 14/06/19 08:29:30 INFO BlockManager: 

Re: tiers of caching

2014-07-07 Thread Andrew Or
Others have also asked for this on the mailing list, and hence there's a
related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur
brings up a good point in that any current implementation of in-memory
shuffles will compete with application RDD blocks. I think we should
definitely add this at some point. In terms of a timeline, we already have
many features lined up for 1.1, however, so it will likely be after that.


2014-07-07 10:13 GMT-07:00 Ankur Dave :

> I think tiers/priorities for caching are a very good idea and I'd be
> interested to see what others think. In addition to letting libraries cache
> RDDs liberally, it could also unify memory management across other parts of
> Spark. For example, small shuffles benefit from explicitly keeping the
> shuffle outputs in memory rather than writing it to disk, possibly due to
> filesystem overhead. To prevent in-memory shuffle outputs from competing
> with application RDDs, Spark could mark them as lower-priority and specify
> that they should be dropped to disk when memory runs low.
>
> Ankur 
>
>


Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
I will assume that you are running in yarn-cluster mode. Because the driver
is launched in one of the containers, it doesn't make sense to expose port
4040 for the node that contains the container. (Imagine if multiple driver
containers are launched on the same node. This will cause a port
collision). If you're launching Spark from a gateway node that is
physically near your worker nodes, then you can just launch your
application in yarn-client mode, in which case the SparkUI will always be
started on port 4040 on the node that you ran spark-submit on. The reason
why sometimes you see the red text is because it appears only on the driver
containers, not the executor containers. This is because SparkUI belongs to
the SparkContext, which only exists on the driver.

Andrew


2014-07-07 11:20 GMT-07:00 Yan Fang :

> Hi guys,
>
> Not sure if you  have similar issues. Did not find relevant tickets in
> JIRA. When I deploy the Spark Streaming to YARN, I have following two
> issues:
>
> 1. The UI port is random. It is not default 4040. I have to look at the
> container's log to check the UI port. Is this suppose to be this way?
>
> 2. Most of the time, the UI does not work. The difference between logs are
> (I ran the same program):
>
>
>
>
>
>
> *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server
> 14/07/03 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 14/07/03 11:38:50 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
> 14/07/03 11:38:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
> 14/07/03 11:38:51 INFO executor.Executor: Running task ID 0...*
>
> 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 14/07/02 16:55:32 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:14211
>
>
>
>
> *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 14/07/02 16:55:32 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
> 14/07/02 16:55:32 INFO ui.SparkUI: Started SparkUI at http://myNodeName:21867
> 14/07/02 16:55:32 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler*
>
> When the red part comes, the UI works sometime. Any ideas? Thank you.
>
> Best,
>
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
>


Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
@Yan, the UI should still work. As long as you look into the container that
launches the driver, you will find the SparkUI address and port. Note that
in yarn-cluster mode the Spark driver doesn't actually run in the
Application Manager; just like the executors, it runs in a container that
is launched by the Resource Manager after the Application Master requests
the container resources. In contrast, in yarn-client mode, your driver is
not launched in a container, but in the client process that launched your
application (i.e. spark-submit), so the stdout of this program directly
contains the SparkUI messages.
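
If YARN log aggregation is enabled, one way to pull the address out of the
driver container's log is something like the following sketch (substitute your
own application id):

  yarn logs -applicationId application_1404443455764_0010 | grep "Started SparkUI"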

@Chester, I'm not sure what has gone wrong as there are many factors at
play here. When you go the Resource Manager UI, does the "application URL"
link point you to the same SparkUI address as indicated in the logs? If so,
this is the correct behavior. However, I believe the redirect error has
little to do with Spark itself, but more to do with how you set up the
cluster. I have actually run into this myself, but I haven't found a
workaround. Let me know if you find anything.
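
One cluster-side thing that may be worth checking (a sketch only; whether it is
the culprit depends on the setup) is how the web app proxy address resolves on
the node running the ResourceManager, since the redirect above points at
"localhost":

  grep -A 2 "yarn.web-proxy.address" $HADOOP_CONF_DIR/yarn-site.xml
  hostname -f    # should resolve to a name reachable from your browser, not "localhost"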




2014-07-07 12:07 GMT-07:00 Chester Chen :

> As Andrew explained, the port is random rather than 4040, since the Spark
> driver is started in the Application Master and the port is randomly selected.
>
>
> But I have a similar UI issue. I am running yarn-cluster mode against my
> local CDH5 cluster.
>
> The log states
> "14/07/07 11:59:29 INFO ui.SparkUI: Started SparkUI at
> http://10.0.0.63:58750"
>
> but when I click the Spark UI link (ApplicationMaster or
> http://10.0.0.63:58750), I get a 404 with the redirect URI
>
> http://localhost/proxy/application_1404443455764_0010/
>
>
>
> Looking at the Spark code, I notice that the "proxy" is really a variable that
> resolves to the proxy http address from yarn-site.xml. But even when I specified
> the value in yarn-site.xml, it still doesn't work for me.
>
>
>
> Oddly enough, it works for my co-worker on a Pivotal HD cluster, so I am
> still looking into what the difference is in terms of cluster setup or
> something else.
>
>
> Chester
>
>
>
>
>
> On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or  wrote:
>
>> I will assume that you are running in yarn-cluster mode. Because the
>> driver is launched in one of the containers, it doesn't make sense to
>> expose port 4040 for the node that contains the container. (Imagine if
>> multiple driver containers are launched on the same node. This will cause a
>> port collision). If you're launching Spark from a gateway node that is
>> physically near your worker nodes, then you can just launch your
>> application in yarn-client mode, in which case the SparkUI will always be
>> started on port 4040 on the node that you ran spark-submit on. The reason
>> why sometimes you see the red text is because it appears only on the driver
>> containers, not the executor containers. This is because SparkUI belongs to
>> the SparkContext, which only exists on the driver.
>>
>> Andrew
>>
>>
>> 2014-07-07 11:20 GMT-07:00 Yan Fang :
>>
>> Hi guys,
>>>
>>> Not sure if you have similar issues. I did not find relevant tickets in
>>> JIRA. When I deploy Spark Streaming to YARN, I have the following two
>>> issues:
>>>
>>> 1. The UI port is random. It is not the default 4040. I have to look at the
>>> container's log to check the UI port. Is this supposed to be this way?
>>>
>>> 2. Most of the time, the UI does not work. The difference between the logs
>>> is as follows (I ran the same program):
>>>
>>> *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server
>>> 14/07/03 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/07/03 11:38:50 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
>>> 14/07/03 11:38:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
>>> 14/07/03 11:38:51 INFO executor.Executor: Running task ID 0...*
>>>
>>> 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
>>> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/07/02 16:55:32 INFO server.AbstractConnector: Started
>>> SocketConnector@0.0.0.0:14211
>>>
>>> *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/07/02 16:55:32 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
>>> 14/07/02 16:55:32 INFO ui.SparkUI: Started SparkUI at http://myNodeName:21867
>>> 14/07/02 16:55:32 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler*
>>>
>>> When the red part appears, the UI sometimes works. Any ideas? Thank you.
>>>
>>> Best,
>>>
>>> Fang, Yan
>>> yanfang...@gmail.com
>>> +1 (206) 849-4108
>>>
>>
>>
>


Re: Scheduling in spark

2014-07-08 Thread Andrew Or
Here's the most updated version of the same page:
http://spark.apache.org/docs/latest/job-scheduling
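
As one concrete example of what that page covers, fair scheduling between jobs
within a single application can be sketched like this (the pool name is
arbitrary):

  # conf/spark-defaults.conf
  spark.scheduler.mode  FAIR

  scala> sc.setLocalProperty("spark.scheduler.pool", "pool1")   // jobs from this thread go to pool1
  scala> sc.parallelize(1 to 100).count()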


2014-07-08 12:44 GMT-07:00 Sujeet Varakhedi :

> This is a good start:
>
> http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html
>
>
> On Tue, Jul 8, 2014 at 9:11 AM, rapelly kartheek 
> wrote:
>
>> Hi,
>>   I am a postgraduate student, new to Spark. I want to understand how the
>> Spark scheduler works. I only have a theoretical understanding of the DAG
>> scheduler and the underlying task scheduler.
>>
>> I want to know: once a job is submitted to the framework, how does
>> scheduling happen after the DAG scheduler phase?
>>
>> Can someone help me with how to proceed along these lines? I have some
>> exposure to Hadoop schedulers and tools like the Mumak simulator for
>> experiments.
>>
>> Can someone please tell me how to perform simulations on Spark with
>> respect to schedulers?
>>
>>
>> Thanks in advance
>>
>
>


Re: Spark: All masters are unresponsive!

2014-07-08 Thread Andrew Or
It seems that your driver (which I'm assuming you launched on the master
node) can now connect to the Master, but your executors cannot. Did you
make sure that all nodes have the same conf/spark-defaults.conf,
conf/spark-env.sh, and conf/slaves? It would be good if you can post the
stderr of the executor logs here. They are located on the worker node under
$SPARK_HOME/work.
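
For example, something like this on each worker node would surface them (a
sketch; the application and executor directory names will differ):

  ls $SPARK_HOME/work/
  tail -n 100 $SPARK_HOME/work/app-*/*/stderr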

(As of Spark 1.0, we recommend that you use the spark-submit arguments, e.g.

bin/spark-submit --master spark://pzxnvm2018.x.y.name.org:7077
--executor-memory 4g --executor-cores 3 --class   )


2014-07-08 10:12 GMT-07:00 Sameer Tilak :

> Hi Akhil et al.,
> I made the following changes:
>
> In spark-env.sh I added the following three entries (standalone mode)
>
> export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org
> export SPARK_WORKER_MEMORY=4G
> export SPARK_WORKER_CORES=3
>
> I then use start-master and start-slaves commands to start the services.
> Another thing that I have noticed is that the number of cores that I
> specified is not used: 2022 shows up with only 1 core and 2023 and 2024
> show up with 4 cores.
>
> In the Web UI:
> URL: spark://pzxnvm2018.x.y.name.org:7077
>
> I run the spark shell command from pzxnvm2018.
>
> /etc/hosts on my master node has following entry:
> master-ip  pzxnvm2018.x.y.name.org pzxnvm2018
>
> /etc/hosts on my a worker node has following entry:
> worker-ip  pzxnvm2023.x.y.name.org pzxnvm2023
>
>
> However, on my master node log file I still see this:
>
> ERROR EndpointWriter: AssociationError [akka.tcp://
> sparkmas...@pzxnvm2018.x.y.name.org:7077] -> 
> [akka.tcp://spark@localhost:43569]:
> Error [Association failed with [akka.tcp://spark@localhost:43569]]
>
> My spark-shell has the following o/p
>
>
> scala> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Connected to
> Spark cluster with app ID app-20140708100139-
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-/0 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-/0 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-/1 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-/1 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-/2 on
> worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (
> pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-/2 on hostPort pzxnvm2022.x.y.name.org:41826 with
> 1 cores, 512.0 MB RAM
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/0 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/1 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/2 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/0 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-/0 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-/3 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-/3 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/3 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/1 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-/1 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-/4 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-/4 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/4 is now RUNNING
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-/2 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:43 INFO SparkDeploySchedulerBa

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
Hi Mikhail,

It looks like the documentation is a little outdated. Neither statement is true
anymore. In general, we try to shift away from short options ("-em", "-dm"
etc.) in favor of more explicit ones ("--executor-memory",
"--driver-memory"). These options, and "--cores", refer to the arguments
passed in to org.apache.spark.deploy.Client, the now deprecated way of
launching an application in standalone clusters.

SPARK_MASTER_IP/PORT are only used for binding the Master, not to configure
which master the driver connects to. The proper way to specify this is
through "spark.master" in your config or the "--master" parameter to
spark-submit.
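
For example (a sketch using the master URL from your own cluster):

  # conf/spark-defaults.conf
  spark.master  spark://10.2.1.5:7077

  # or equivalently on the command line
  ./bin/spark-shell --master spark://10.2.1.5:7077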

We will update the documentation shortly. Thanks for letting us know.
Andrew



2014-07-08 16:29 GMT-07:00 Mikhail Strebkov :

> Hi! I've been using Spark compiled from the 1.0 branch at some point (~2 months
> ago). The setup is a standalone cluster with 4 worker machines and 1 master
> machine. I used to run spark shell like this:
>
>   ./bin/spark-shell -c 30 -em 20g -dm 10g
>
> Today I've finally updated to Spark 1.0 release. Now I can only run spark
> shell like this:
>
>   ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores
> 30
> --executor-memory 20g --driver-memory 10g
>
> The documentation at
> http://spark.apache.org/docs/latest/spark-standalone.html says:
>
> "You can also pass an option --cores  to control the number of
> cores that spark-shell uses on the cluster."
> This doesn't work, you need to pass "--total-executor-cores "
> instead.
>
> "Note that if you are running spark-shell from one of the spark cluster
> machines, the bin/spark-shell script will automatically set MASTER from the
> SPARK_MASTER_IP and SPARK_MASTER_PORT variables in conf/spark-env.sh."
> This is not working for me either. I run the shell from the master machine,
> and
> I do have SPARK_MASTER_IP set up in conf/spark-env.sh like this:
> export SPARK_MASTER_IP='10.2.1.5'
> But if I omit "--master spark://10.2.1.5:7077" then the console starts
> but I
> can't see it in the UI at http://10.2.1.5:8080. But when I go to
> http://10.2.1.5:4040 (the application UI) I see that the app is using only
> master as a worker.
>
> My question is: are those just bugs in the documentation? That there is
> no --cores option and that SPARK_MASTER_IP is not used anymore when I run
> the Spark shell from the master?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
>> "The proper way to specify this is through "spark.master" in your config
or the "--master" parameter to spark-submit."

By "this" I mean configuring which master the driver connects to (not which
port and address the standalone Master binds to).


2014-07-08 16:43 GMT-07:00 Andrew Or :

> Hi Mikhail,
>
> It looks like the documentation is a little outdated. Neither statement is true
> anymore. In general, we try to shift away from short options ("-em", "-dm"
> etc.) in favor of more explicit ones ("--executor-memory",
> "--driver-memory"). These options, and "--cores", refer to the arguments
> passed in to org.apache.spark.deploy.Client, the now deprecated way of
> launching an application in standalone clusters.
>
> SPARK_MASTER_IP/PORT are only used for binding the Master, not to
> configure which master the driver connects to. The proper way to specify
> this is through "spark.master" in your config or the "--master" parameter
> to spark-submit.
>
> We will update the documentation shortly. Thanks for letting us know.
> Andrew
>
>
>
> 2014-07-08 16:29 GMT-07:00 Mikhail Strebkov :
>
> Hi! I've been using Spark compiled from the 1.0 branch at some point (~2 months
>> ago). The setup is a standalone cluster with 4 worker machines and 1
>> master
>> machine. I used to run spark shell like this:
>>
>>   ./bin/spark-shell -c 30 -em 20g -dm 10g
>>
>> Today I've finally updated to Spark 1.0 release. Now I can only run spark
>> shell like this:
>>
>>   ./bin/spark-shell --master spark://10.2.1.5:7077
>> --total-executor-cores 30
>> --executor-memory 20g --driver-memory 10g
>>
>> The documentation at
>> http://spark.apache.org/docs/latest/spark-standalone.html says:
>>
>> "You can also pass an option --cores  to control the number of
>> cores that spark-shell uses on the cluster."
>> This doesn't work, you need to pass "--total-executor-cores "
>> instead.
>>
>> "Note that if you are running spark-shell from one of the spark cluster
>> machines, the bin/spark-shell script will automatically set MASTER from
>> the
>> SPARK_MASTER_IP and SPARK_MASTER_PORT variables in conf/spark-env.sh."
>> This is not working for me either. I run the shell from the master machine,
>> and
>> I do have SPARK_MASTER_IP set up in conf/spark-env.sh like this:
>> export SPARK_MASTER_IP='10.2.1.5'
>> But if I omit "--master spark://10.2.1.5:7077" then the console starts
>> but I
>> can't see it in the UI at http://10.2.1.5:8080. But when I go to
>> http://10.2.1.5:4040 (the application UI) I see that the app is using
>> only
>> master as a worker.
>>
>> My question is: are those just bugs in the documentation? That there
>> is
>> no --cores option and that SPARK_MASTER_IP is not used anymore when I run
>> the Spark shell from the master?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>

