Re: How to pass config variables to workers

2014-05-16 Thread Theodore Wong
I found that the easiest way was to pass variables in the Spark configuration object. The only catch is that all of your property keys must begin with "spark." in order for Spark to propagate the values. So, for example, in the driver: SparkConf conf = new SparkConf(); conf.set("spark.myapp.mypr
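A minimal sketch of the pattern described above, using plain Python stand-ins so it runs without a cluster (the property name `spark.myapp.myprop` is hypothetical, and a dict plays the role of SparkConf):

```python
# Stand-in for SparkConf: Spark only propagates keys prefixed with "spark."
# to executors, so application settings are namespaced under that prefix.
driver_conf = {}
driver_conf["spark.myapp.myprop"] = "hello"   # hypothetical property name

# Simulate propagation: only "spark."-prefixed entries reach the workers.
propagated = {k: v for k, v in driver_conf.items() if k.startswith("spark.")}

def worker_task(record):
    # On a real cluster the worker would read this from its SparkConf copy.
    return record + ":" + propagated["spark.myapp.myprop"]

results = list(map(worker_task, ["a", "b"]))
print(results)  # ['a:hello', 'b:hello']
```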

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-16 Thread Martin Weindel
Andrew, thanks for your response. When using the coarse mode, the jobs run fine. My problem is the fine-grained mode. Here the parallel jobs nearly always end in a deadlock. It seems to have something to do with resource allocation, as Mesos shows neither used nor idle CPU resources in this

help me: Out of memory when spark streaming

2014-05-16 Thread Francis . Hu
Hi all, I encountered an OOM when streaming. I send data to Spark Streaming through ZeroMQ at a speed of 600 records per second, but Spark Streaming only handles 10 records per 5 seconds (set in the streaming program). My two workers each have a 4-core CPU and 1G RAM. These workers always run Out

Debugging Spark AWS S3

2014-05-16 Thread Robert James
I have Spark code which runs beautifully when MASTER=local. When I run it with MASTER set to a spark ec2 cluster, the workers seem to run, but the results, which are supposed to be put to AWS S3, don't appear on S3. I'm at a loss for how to debug this. I don't see any S3 exceptions anywhere. Ca

Re: Using String Dataset for Logistic Regression

2014-05-16 Thread praveshjain1991
Thank you for your reply. So I take it that there's no direct way of using String datasets while using LR in Spark. -Pravesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-String-Dataset-for-Logistic-Regression-tp5523p5810.html Sent from the Apache

Re: Express VMs - good idea?

2014-05-16 Thread Matei Zaharia
Hey Marco, if you’re primarily interested in trying Spark, you can also just get a binary build from Apache: http://spark.apache.org/downloads.html. You only need Java on your machine to run it. To see it work with the rest of the Hadoop ecosystem components it is probably better to use a VM. M

Re: Proper way to create standalone app with custom Spark version

2014-05-16 Thread Soumya Simanta
Install your custom Spark jar into your local Maven or Ivy repo. Use this custom jar in your pom/sbt file. > On May 15, 2014, at 3:28 AM, Andrei wrote: > > (Sorry if you have already seen this message - it seems like there were some > issues delivering messages to the list yesterday) > > We
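For the local-repo step Soumya mentions, a custom build can be registered with Maven's standard `install:install-file` goal; a sketch (the jar name, groupId, artifactId, and version below are hypothetical and should match your build):

```
mvn install:install-file \
  -Dfile=spark-core-custom.jar \
  -DgroupId=org.apache.spark \
  -DartifactId=spark-core_2.10 \
  -Dversion=1.0.0-custom \
  -Dpackaging=jar
```

After this, the pom/sbt dependency simply references the chosen version string.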

Re: How to pass config variables to workers

2014-05-16 Thread Andrew Or
Not a hack, this is documented here: http://spark.apache.org/docs/0.9.1/configuration.html, and is in fact the proper way of setting per-application Spark configurations. Additionally, you can specify default Spark configurations so you don't need to manually set it for all applications. If you ar
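The defaults mechanism Andrew mentions can be sketched as a `conf/spark-defaults.conf` fragment (this file is read by spark-submit in Spark 1.0+; the property values here are hypothetical):

```
# conf/spark-defaults.conf -- applied to every application unless overridden
spark.master            spark://master:7077
spark.executor.memory   2g
spark.myapp.myprop      some-default
```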

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
I do not think the current solution will work. I tried writing a version of ChildExecutorURLClassLoader that does have a proper parent and has a modified loadClass to reverse the order of parent and child in finding classes, and that seems to work, but now classes like SparkEnv are loaded by the ch

Re: Debugging Spark AWS S3

2014-05-16 Thread Ian Ferreira
Did you check the executor stderr logs? On 5/16/14, 2:37 PM, "Robert James" wrote: >I have Spark code which runs beautifully when MASTER=local. When I >run it with MASTER set to a spark ec2 cluster, the workers seem to >run, but the results, which are supposed to be put to AWS S3, don't >appear

Re: Is there any problem on the spark mailing list?

2014-05-16 Thread ssimanta
Same here. I've posted a bunch of questions in the last few days and they don't show up here and I'm also not getting email to my (gmail.com) account. I came here to post directly on the mailing list but saw this thread instead. At least, I'm not alone. -- View this message in context: http://

Re: Historical Data as Stream

2014-05-16 Thread Soumya Simanta
A file is just a stream with a fixed length. Usually streams don't end, but in this case it would. On the other hand, if you read your file as a stream you may not be able to use the entire data in the file for your analysis. Spark (given enough memory) can process large amounts of data quickly. > On M

Re: Understanding epsilon in KMeans

2014-05-16 Thread Krishna Sankar
Stuti, - The two numbers are used in different contexts, but finally end up on the two sides of an && operator. - A parallel K-Means consists of multiple iterations, which in turn consist of moving centroids around. A centroid would be deemed stabilized when the root square distance between suc

Re: Understanding epsilon in KMeans

2014-05-16 Thread Brian Gawalt
Hi Stuti, I think you're right. The epsilon parameter is indeed used as a threshold for deciding when KMeans has converged. If you look at line 201 of mllib's KMeans.scala: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L201 you ca
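The convergence check Brian points to compares each centroid's movement between iterations against epsilon; a plain-Python sketch of that test (simplified from the Scala, with hypothetical data):

```python
def converged(old_centers, new_centers, epsilon=1e-4):
    # KMeans is deemed converged when every centroid moved less than
    # epsilon (Euclidean distance) since the previous iteration; the
    # comparison is done on squared distance against epsilon squared.
    for old, new in zip(old_centers, new_centers):
        sq_dist = sum((o - n) ** 2 for o, n in zip(old, new))
        if sq_dist > epsilon ** 2:
            return False
    return True

print(converged([(0.0, 0.0)], [(0.0, 0.00005)]))  # True: moved less than epsilon
print(converged([(0.0, 0.0)], [(1.0, 1.0)]))      # False: centroid still moving
```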

Re: different in spark on yarn mode and standalone mode

2014-05-16 Thread Sandy Ryza
We made several stabilization changes to Spark on YARN that made it into Spark 0.9.1 and CDH5.0. 1.0 significantly simplifies submitting a Spark app to a YARN cluster (wildly different invocations are no longer needed for yarn-client and yarn-cluster mode). I'm not sure about who is running it in

Re: SparkContext startup time out

2014-05-16 Thread Sophia
How did you finally deal with this problem? I also ran into it. Best regards, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-tp1753p5739.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: A new resource for getting examples of Spark RDD API calls

2014-05-16 Thread zhen
Thanks for the suggestion. I will look into this. Zhen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529p5532.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Benchmarking Spark with YCSB

2014-05-16 Thread Jay Vyas
I'm not sure what you mean... YCSB is for transactional systems. Spark isn't really in that category - it's an analytics platform. RDDs by their very nature are not transactional. On Fri, May 16, 2014 at 6:37 AM, bhusted wrote: > Can anyone comment on what it would take to run Spark with YCSB

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Xiangrui Meng
Hi Andrew, This is the JIRA I created: https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully someone wants to work on it. Best, Xiangrui On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng wrote: > Hi Andre, > > I could reproduce the bug with Hadoop 2.2.0. Some older version of > Hadoop d

Re: Schema view of HadoopRDD

2014-05-16 Thread Mayur Rustagi
so you can use an input/output format & read it whichever way you write... You can additionally provide variables in the Hadoop configuration to configure it. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, May 8, 2014 at

Re: Passing runtime config to workers?

2014-05-16 Thread DB Tsai
Since the env variables in the driver will not be passed to workers, the easiest thing you can do is refer to the variables directly in the workers from the driver. For example, val variableYouWantToUse = System.getenv("something defined in env") rdd.map( you can access `variableYouWantToUse` here ) Si
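The same closure-capture pattern in Python, runnable without a cluster (the env variable name `MYAPP_SETTING` is hypothetical, and plain `map()` stands in for `rdd.map`):

```python
import os

# Simulate a variable defined in the driver's environment.
os.environ["MYAPP_SETTING"] = "fast-mode"

# Read it in the driver; the local variable is captured by the closure
# below, so workers see the value even though their own environments
# would not contain MYAPP_SETTING.
variable_you_want_to_use = os.environ["MYAPP_SETTING"]

def task(x):
    return (x, variable_you_want_to_use)

# rdd.map(task) on a real cluster; plain map() here so the sketch runs locally.
print(list(map(task, [1, 2])))  # [(1, 'fast-mode'), (2, 'fast-mode')]
```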

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-16 Thread Xiangrui Meng
It doesn't work if you put the netlib-native jar inside an assembly jar. Try to mark it "provided" in the dependencies, and use --jars to include them with spark-submit. -Xiangrui On Wed, May 14, 2014 at 6:12 PM, wxhsdp wrote: > Hi, DB > > i tried including breeze library by using spark 1.0, it

Re: Standalone client failing with docker deployed cluster

2014-05-16 Thread Bharath Ravi Kumar
(Trying to bubble up the issue again...) Any insights (based on the enclosed logs) into why standalone client invocation might fail while issuing jobs through the spark console succeeded? Thanks, Bharath On Thu, May 15, 2014 at 5:08 PM, Bharath Ravi Kumar wrote: > Hi, > > I'm running the spark

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Xiangrui Meng
Hi Andre, I could reproduce the bug with Hadoop 2.2.0. Some older versions of Hadoop do not support splittable compression, so you ended up with sequential reads. It is easy to reproduce the bug with the following setup: 1) Workers are configured with multiple cores. 2) BZip2 files are big enough

Re: Express VMs - good idea?

2014-05-16 Thread Mayur Rustagi
Frankly, if you can give enough CPU performance to the VM it should be good... but for development, setting up locally is better: 1. debuggable in an IDE 2. faster 3. samples like run-example etc. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: What is the difference between a Spark Worker and a Spark Slave?

2014-05-16 Thread Andrew Ash
They are different terminology for the same thing and should be interchangeable. On Fri, May 16, 2014 at 2:02 PM, Robert James wrote: > What is the difference between a Spark Worker and a Spark Slave? >

Re: How to pass config variables to workers

2014-05-16 Thread Theodore Wong
Sorry, yes, you are right, the documentation does indeed explain that setting spark.* options is the way to pass Spark configuration options to workers. Additionally, we've used the same mechanism to pass application-specific configuration options to workers; the "hack" part is naming our applicatio

Re: How to use spark-submit

2014-05-16 Thread Andrew Or
What kind of cluster mode are you running on? You may need to specify the jar through --jar, though we're working on making spark-submit automatically add the provided jar on the class path so we don't run into ClassNotFoundException as you have. What is the command that you ran? On Tue, May 6,
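For reference, a typical spark-submit invocation from this era looks like the following (the class, master URL, and jar names are hypothetical; `--jars` distributes extra dependencies to the cluster):

```
bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --jars lib/extra-dep.jar \
  myapp-assembly.jar arg1 arg2
```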

Historical Data as Stream

2014-05-16 Thread Laeeq Ahmed
Hi, I have data in a file. Can I read it as a stream in Spark? I know it seems odd to read a file as a stream, but it has practical applications in real life if I can read it as a stream. Are there any other tools which can give this file as a stream to Spark, or do I have to make batches manually, which is not
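One way to approximate the manual batching Laeeq mentions is to chunk the file into fixed-size micro-batches; a self-contained sketch (the batch size and sample records are arbitrary):

```python
import io

def micro_batches(stream, batch_size):
    """Yield successive lists of batch_size lines, imitating how a
    streaming system would consume a historical file batch by batch."""
    batch = []
    for line in stream:
        batch.append(line.rstrip("\n"))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

data = io.StringIO("r1\nr2\nr3\nr4\nr5\n")
batches = list(micro_batches(data, 2))
print(batches)  # [['r1', 'r2'], ['r3', 'r4'], ['r5']]
```

Each yielded batch could then be fed to Spark as one unit of work, emulating a stream over historical data.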

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Xiangrui Meng
Hi Andrew, I submitted a patch and verified it solves the problem. You can download the patch from https://issues.apache.org/jira/browse/HADOOP-10614 . Best, Xiangrui On Fri, May 16, 2014 at 6:48 PM, Xiangrui Meng wrote: > Hi Andrew, > > This is the JIRA I created: > https://issues.apache.org/j

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread DB Tsai
After reading the spark code more carefully, spark does `Thread.currentThread().setContextClassLoader` to the custom classloader. However, the classes have to be used via reflection with this approach. See, http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-with

Re: run spark0.9.1 on yarn with hadoop CDH4

2014-05-16 Thread Sandy Ryza
Hi Sophia, Unfortunately, Spark doesn't work against YARN in CDH4. The YARN APIs changed quite a bit before finally being stabilized in Hadoop 2.2 and CDH5. Spark on YARN supports Hadoop 0.23.* and Hadoop 2.2+ / CDH5.0+, but does not support CDH4, which is somewhere in between. -Sandy On Fri,
