Is Spark History Server supported for Mesos?

2015-12-09 Thread Kelvin Chu
Spark on YARN can use the History Server by setting the configuration spark.yarn.historyServer.address. But I can't find a similar config for Mesos. Is the History Server supported by Spark on Mesos? Thanks. Kelvin
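For reference, the History Server itself just replays event logs from a shared directory and is not tied to YARN; a minimal sketch of the relevant settings (the HDFS path is a placeholder, not from this thread):

    # spark-defaults.conf
    spark.eventLog.enabled true
    spark.eventLog.dir hdfs:///spark-events

    # Start the History Server against the same directory
    # (spark.history.fs.logDirectory=hdfs:///spark-events):
    ./sbin/start-history-server.sh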

Re: Combining Many RDDs

2015-03-26 Thread Kelvin Chu
Hi, I used union() before and yes, it may be slow sometimes. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If yes, I tried the approach below to operate on one RDD only during the whole computation (yes, I also saw that too many RDDs hurt performance). …
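A sketch of the one-RDD approach using SparkContext.union (the names data and compute are assumptions carried over from the quoted question):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("union-many"))

    // Stand-in for the poster's compute(): anything that returns an RDD.
    def compute(x: Int): RDD[Int] = sc.parallelize(Seq(x, x * 2))

    val data = Seq(1, 2, 3)
    // One SparkContext.union over the whole collection, instead of folding
    // with rdd1.union(rdd2), which nests a new UnionRDD per element:
    val combined: RDD[Int] = sc.union(data.map(compute))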

Re: job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-27 Thread Kelvin Chu
Hi Darin, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem. Please take a look at this report: https://issues.apache.org/jira/browse/SPARK-4996 On Fri, Feb 27, 2015 at 12:38 AM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote: > Can you share what error you…
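For reference, a sketch of raising the overhead at submit time (the value and class name are placeholders, not from this thread):

    # On Spark 1.x the overhead is given in megabytes:
    spark-submit \
      --master yarn-cluster \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      --class com.example.MyJob myjob.jar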

Re: Running out of space (when there's no shortage)

2015-02-27 Thread Kelvin Chu
Hi Joe, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem. Please take a look at this report: https://issues.apache.org/jira/browse/SPARK-4996 Hope this helps. On Tue, Feb 24, 2015 at 2:05 PM, Yiannis Gkoufas wrote: > No problem, Joe. There you go https://is

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
0.2), and the rest is for basic Spark bookkeeping and anything the user does inside UDFs. -Sandy On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu <2dot7kel...@gmail.com> wrote: > Hi Sandy, I am also doing memory…

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I am also doing memory tuning on YARN. Just want to confirm, is it correct to say: spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I can actually use in my JVM application? If it is not, what is the correct relationship? Any other variables or config parameters I…
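For what it's worth, a sketch of the relationship as I understand it for Spark 1.x on YARN (an assumption, not confirmed in this thread): the overhead is added on top of the heap rather than carved out of it.

    YARN container request = spark.executor.memory + spark.yarn.executor.memoryOverhead
    JVM heap (-Xmx)        = spark.executor.memory
    # e.g. 4096m heap + the 384m minimum overhead -> a ~4480m container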

Re: Setting the number of executors in standalone mode

2015-02-20 Thread Kelvin Chu
Hi, currently there is only one executor per worker. There is a JIRA ticket to relax this: https://issues.apache.org/jira/browse/SPARK-1706 But if you want to use more cores, maybe you can try increasing SPARK_WORKER_INSTANCES. It increases the number of workers per machine. Take a look here: h
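A sketch of conf/spark-env.sh on each machine (all values are placeholders):

    # Run two worker daemons per machine, each offering 4 cores / 8g,
    # so one standalone app can get two executors per machine:
    export SPARK_WORKER_INSTANCES=2
    export SPARK_WORKER_CORES=4
    export SPARK_WORKER_MEMORY=8g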

Re: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-19 Thread Kelvin Chu
Hi Mohammed, did you use --jars to specify your JDBC driver when you submitted your job? Take a look at this link: http://spark.apache.org/docs/1.2.0/submitting-applications.html Hope this helps! Kelvin On Thu, Feb 19, 2015 at 7:24 PM, Mohammed Guller wrote: > Hi – I am trying to use Bone
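A sketch of shipping the driver (and the pool library) with the job; the jar and class names are placeholders:

    # --jars copies the listed jars to the driver and every executor:
    spark-submit \
      --class com.example.WriterJob \
      --jars mysql-connector-java-5.1.34.jar,bonecp-0.8.0.jar \
      app.jar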

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Kelvin Chu
I had a similar use case before. I found: 1. textFile() produced one partition per file, which can result in many partitions. I found that calling coalesce() without shuffle helped. 2. If you use persist(), count() will do the I/O and put the result into the cache. Transformations later did computation out…
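A sketch of both points (the path and partition count are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("small-files"))

    // 1. One partition per small file; merge them without a shuffle:
    val lines = sc.textFile("hdfs:///input/small-files/*")
                  .coalesce(32, shuffle = false)

    // 2. Materialize the cache up front so later transformations
    //    read from memory instead of re-reading the files:
    lines.persist(StorageLevel.MEMORY_ONLY)
    lines.count() // triggers the I/O and fills the cache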

Re: Can spark job server be used to visualize streaming data?

2015-02-10 Thread Kelvin Chu
Hi Su, out of the box, no. But I know people who integrate it with Spark Streaming to do real-time visualization. It will take some work, though. Kelvin On Mon, Feb 9, 2015 at 5:04 PM, Su She wrote: > Hello Everyone, I was reading this blog post: http://homes.esat.kuleuven.be/~bioiuser/blog/

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Kelvin Chu
Since the stacktrace shows Kryo is being used, maybe you could also try increasing spark.kryoserializer.buffer.max.mb. Hope this helps. Kelvin On Tue, Feb 10, 2015 at 1:26 AM, Akhil Das wrote: > You could try increasing the driver memory. Also, can you be more specific about the data volume?
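A sketch of setting it on the SparkConf (Spark 1.x key name and a placeholder size; later releases renamed the key to spark.kryoserializer.buffer.max):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-buffer")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Value is in MB on Spark 1.x:
      .set("spark.kryoserializer.buffer.max.mb", "512")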

Re: no space left at worker node

2015-02-08 Thread Kelvin Chu
Maybe try "local:" under the heading of Advanced Dependency Management here: https://spark.apache.org/docs/1.1.0/submitting-applications.html It seems this is what you want. Hope this helps. Kelvin On Sun, Feb 8, 2015 at 9:13 PM, ey-chih chow wrote: > Is there any way we can disable Spark…
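A sketch of the scheme (the path and class name are placeholders): "local:" tells Spark the jar already exists at that absolute path on every node, so nothing is copied into each worker's work directory.

    spark-submit \
      --class com.example.MyJob \
      local:/opt/jars/myjob.jar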

Re: no space left at worker node

2015-02-08 Thread Kelvin Chu
I guess you may set the parameters below to clean the directories: spark.worker.cleanup.enabled, spark.worker.cleanup.interval, and spark.worker.cleanup.appDataTtl. They are described here: http://spark.apache.org/docs/1.2.0/spark-standalone.html Kelvin On Sun, Feb 8, 2015 at 5:15 PM, ey-chih chow wrote:
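A sketch for spark-defaults.conf (standalone mode only; both time values are in seconds and are placeholders):

    spark.worker.cleanup.enabled true
    # Check every 30 minutes; remove app dirs older than one day:
    spark.worker.cleanup.interval 1800
    spark.worker.cleanup.appDataTtl 86400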

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Kelvin Chu
Joe, I also use S3 and gzip. So far the I/O is not a problem. In my case, the operation is SQLContext.jsonFile() and I can see from Ganglia that the whole cluster is CPU bound (99% saturated). I have 160 cores and I can see the network can sustain about 150 Mbit/s. Kelvin On Wed, Feb 4, 2015 at 10
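A sketch of the Spark 1.x call (bucket and path are placeholders); JSON parsing plus gzip decompression is CPU-heavy, which fits the Ganglia picture:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("json-from-s3"))
    val sqlContext = new SQLContext(sc)

    // Each .gz file is read as a single (non-splittable) partition:
    val df = sqlContext.jsonFile("s3n://my-bucket/logs/*.json.gz")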

Re: Interactive interface tool for spark

2014-10-08 Thread Kelvin Chu
Hi Andy, it sounds great! Quick questions: I have been using IPython + PySpark. I crunch the data with PySpark and then visualize it with Python libraries like matplotlib and basemap. Could I still use these Python libraries in the Scala Notebook? If not, what are the suggested approaches for visualization…