Re: DeepLearning and Spark ?

2015-01-10 Thread Jaonary Rabarisoa
Can someone explain what the difference is between a parameter server and Spark? There's already an issue on this topic: https://issues.apache.org/jira/browse/SPARK-4590 Another example of DL in Spark, essentially based on Downpour SGD: http://deepdist.com On Sat, Jan 10, 2015 at 2:27 AM, Peng Che
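For context, Spark's native training loop is synchronous: each iteration is a job that aggregates gradients back to the driver, which holds the only copy of the model, while a parameter server (as in Downpour SGD) lets workers push and pull updates asynchronously. A minimal sketch of the synchronous pattern with a hypothetical least-squares gradient, not DeepDist's actual mechanism:

    import org.apache.spark.rdd.RDD

    // Hypothetical least-squares gradient for one point (layout: features ++ label).
    def gradient(w: Array[Double], p: Array[Double]): Array[Double] = {
      val (x, y) = (p.init, p.last)
      val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
      x.map(_ * err)
    }

    // Spark's synchronous pattern: every iteration is a full job whose
    // gradients are summed on the driver, the single copy of the model.
    def syncSGD(data: RDD[Array[Double]], iters: Int, lr: Double): Array[Double] = {
      var w = new Array[Double](data.first().length - 1)
      for (_ <- 1 to iters) {
        val g = data.map(gradient(w, _)).reduce((a, b) => (a, b).zipped.map(_ + _))
        w = (w, g).zipped.map((wi, gi) => wi - lr * gi)
      }
      w
    }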

ALS.trainImplicit running out of mem when using higher rank

2015-01-10 Thread Antony Mayi
the memory requirements seem to be rapidly growing when using higher rank... I am unable to get over 20 without running out of memory. Is this expected? Thanks, Antony. 

Re: Data locality running Spark on Mesos

2015-01-10 Thread Timothy Chen
Hi Michael, I see you capped the cores at 60. I wonder what settings you used for the standalone mode you compared with? I can try to run an MLlib workload on both to compare. Tim > On Jan 9, 2015, at 6:42 AM, Michael V Le wrote: > > Hi Tim, > > Thanks for your response. > > The ben

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-10 Thread Antony Mayi
the actual case looks like this:
* spark 1.1.0 on yarn (cdh 5.2.1)
* ~8-10 executors, 36GB phys RAM per host
* input RDD is roughly 3GB containing ~150-200M items (and this RDD is made persistent using .cache())
* using pyspark
yarn is configured with the limit yarn.nodemanager.resource.memory-mb of 
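For reference, memory does scale with rank: every user and item factor is a rank-length vector, and the normal-equation solves materialize rank x rank matrices per block. One knob worth trying is a higher block count, which shrinks each task's working set at the cost of extra shuffle. A sketch against the Spark 1.1 API, assuming a spark-shell sc; the path and parameter values are illustrative:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("hdfs:///ratings.csv").map { line =>
      val Array(u, i, r) = line.split(',')
      Rating(u.toInt, i.toInt, r.toDouble)
    }.cache()

    // Signature as of Spark 1.1: (ratings, rank, iterations, lambda, blocks, alpha).
    // More blocks means smaller per-task factor partitions.
    val model = ALS.trainImplicit(ratings, 30, 10, 0.01, 200, 40.0)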

Re: Parquet compression codecs not applied

2015-01-10 Thread Ayoub Benali
It worked, thanks. This doc page recommends using "spark.sql.parquet.compression.codec" to set the compression codec, and I thought this setting would be forwarded to the hive context given that HiveContext extends SQLContext, but it w
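For anyone landing on this thread, a minimal sketch of setting the codec on the HiveContext itself before writing (snappy is just an example value):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // SQL confs can be set directly on the HiveContext (it extends SQLContext):
    hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")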

status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Kevin Burton
I’m curious what the status is of implementing Hive analytics functions in Spark. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Many of these seem missing. I’m assuming they’re not implemented yet? Is there an ETA on them? Or am I the first to bring this up

Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread YaoPau
I'm looking for ways to reduce the runtime of my Spark job. My code is a single file of scala code and is written in this order:
(1) val lines = Import full dataset using sc.textFile
(2) val ABonly = Parse out all rows that are not of type A or B
(3) val processA = Process only the A rows from AB
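A minimal reconstruction of the pipeline being described, with hypothetical parse helpers; as the replies below note, without a .cache() at step (2), steps (3) and (4) would each recompute (1)+(2):

    val lines = sc.textFile("hdfs:///activity")                        // (1)
    def rowType(l: String) = l.split(',')(0)                           // hypothetical parser
    val ABonly = lines.filter(l => Set("A", "B")(rowType(l))).cache()  // (2) reused by (3) and (4)
    val processA = ABonly.filter(l => rowType(l) == "A")               // (3)
    val processB = ABonly.filter(l => rowType(l) == "B")               // (4)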

Re: Web Service + Spark

2015-01-10 Thread Cui Lin
Thanks, Gaurav and Corey, Probably I didn’t make myself clear. I am looking for a best Spark practice similar to Shiny for R, where the analysis/visualization results can easily be published to a web server and shown in a web browser. Or is there any dashboard for Spark? Best regards, Cui Lin From: gtinside mai

Re: FileNotFoundException in appcache shuffle files

2015-01-10 Thread lucio raimondo
Hey, I am having a "similar" issue; did you manage to find a solution yet? Please check my post below for reference: http://apache-spark-user-list.1001560.n3.nabble.com/IOError-Errno-2-No-such-file-or-directory-tmp-spark-9e23f17e-2e23-4c26-9621-3cb4d8b832da-tmp3i3xno-td21076.html Thank you, Luc

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread Benyi Wang
You may try changing the schedulingMode to FAIR; the default is FIFO. Take a look at this page: https://spark.apache.org/docs/1.1.0/job-scheduling.html#scheduling-within-an-application On Sat, Jan 10, 2015 at 10:24 AM, YaoPau wrote: > I'm looking for ways to reduce the runtime of my Spark job.
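For reference, the mode is a plain conf set before the context is created; note that FAIR governs how concurrently submitted jobs share executors, so a single-threaded driver still submits jobs one after another:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("multi-stage-job")          // hypothetical app name
      .set("spark.scheduler.mode", "FAIR")    // default is FIFO
    val sc = new SparkContext(conf)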

Re: IOError: [Errno 2] No such file or directory: '/tmp/spark-9e23f17e-2e23-4c26-9621-3cb4d8b832da/tmp3i3xno'

2015-01-10 Thread lucio raimondo
Update: I resolved this by increasing the granularity of RDD persistence for complex map-reduce operations, such as the one whose reduceByKey stage was failing. Coolio. Lucio -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/IOError-Errno-2-No-such-file-or-directo

Re: FileNotFoundException in appcache shuffle files

2015-01-10 Thread Aaron Davidson
As Jerry said, this is not related to shuffle file consolidation. The unique thing about this problem is that it's failing to find a file while trying to _write_ to it, in append mode. The simplest explanation for this would be that the file is deleted in between some check for existence and openi

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread Stéphane Verlet
From your pseudo code, it would be sequential and done twice: 1+2+3, then 1+2+4. If you do a .cache() in step 2 then you would have 1+2+3, then 4. I ran several steps in parallel from the same program, but never using the same source RDD, so I do not know the limitations there. I simply started
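A sketch of that threading approach: actions block the calling thread, and SparkContext accepts jobs from multiple threads, so independent actions can run through Futures. This assumes the cached ABonly pipeline and the processA/processB RDDs from the question; the output paths are hypothetical:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Each Future submits an independent Spark job; the cached parent RDD
    // keeps steps (1)+(2) from being recomputed by either branch.
    val fa = Future { processA.saveAsTextFile("hdfs:///out/A") }
    val fb = Future { processB.saveAsTextFile("hdfs:///out/B") }
    Await.result(Future.sequence(Seq(fa, fb)), Duration.Inf)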

Re: status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Will Benton
Hi Kevin, I'm currently working on implementing windowing. If you'd like to see something that's not covered by a JIRA, please file one! best, wb - Original Message - > From: "Kevin Burton" > To: user@spark.apache.org > Sent: Saturday, January 10, 2015 12:12:38 PM > Subject: status o

Re: Job priority

2015-01-10 Thread Mark Hamstra
-dev, +user http://spark.apache.org/docs/latest/job-scheduling.html On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta wrote: > Is it possible to specify a priority level for a job, such that the active > jobs might be scheduled in order of priority? > > Alex >

train many decision tress with a single spark job

2015-01-10 Thread Josh Buffum
I've got a data set of activity by user. For each user, I'd like to train a decision tree model. I currently have the feature creation step implemented in Spark and would naturally like to use MLlib's decision tree model. However, it looks like the decision tree model expects the whole RDD and will
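A brute-force sketch of the per-user loop, since MLlib's DecisionTree (as of 1.x) trains only on a whole RDD. byUser and all the tree parameters are hypothetical, and each filter launches a separate job per model, so this only makes sense for a modest number of users:

    import org.apache.spark.SparkContext._   // pair-RDD functions (needed pre-1.3)
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.rdd.RDD

    def trainPerUser(byUser: RDD[(String, LabeledPoint)], users: Seq[String]) =
      users.map { u =>
        val data = byUser.filter(_._1 == u).values.cache()
        // (input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
        val model = DecisionTree.trainClassifier(data, 2, Map[Int, Int](),
                                                 "gini", 5, 32)
        data.unpersist()
        u -> model
      }.toMap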

How can I measure the time an RDD takes to execute?

2015-01-10 Thread Saiph Kappa
Hi, How can I measure the time an RDD takes to execute? In particular, I want to do it for the following piece of code: «
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val distFile = ssc.textFileStream("/home/myuser/twitter-dump")
    val words = distFile.flatMap(_.split(" ")).filter(_.leng
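One thing that may help: RDDs are lazy, so there is nothing to time until an action runs; with a DStream, the natural place is foreachRDD, timing an action per batch. A minimal sketch against the snippet above:

    // Time one batch's work by forcing its RDD with an action:
    words.foreachRDD { rdd =>
      val t0 = System.nanoTime()
      val n = rdd.count()
      println(s"batch: $n records in ${(System.nanoTime() - t0) / 1e6} ms")
    }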

Re: Discrepancy in PCA values

2015-01-10 Thread Upul Bandara
Hi Xiangrui, Thanks a lot for your answer. So I fixed my Julia code, and also calculated PCA using R. R Code:
    data <- read.csv('/home/upul/Desktop/iris.csv')
    X <- data[,1:4]
    pca <- prcomp(X, center = TRUE, scale = FALSE)
    transformed <- predict(pca, newdata = X)
Julia Code (Fixed)
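For comparison, a sketch of the same decomposition with MLlib's RowMatrix (assuming a spark-shell sc and a headerless, numeric CSV). One thing to watch when reconciling with R: prcomp(center = TRUE) centers both when fitting and when projecting, while multiplying the raw rows by the components does no centering:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.textFile("iris.csv").map { line =>
      Vectors.dense(line.split(',').take(4).map(_.toDouble))
    }
    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(4)  // 4 x 4 local matrix of components
    val transformed = mat.multiply(pc)          // rows projected, NOT centered first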

[no subject]

2015-01-10 Thread Krishna Sankar
Guys, registerTempTable("Employees") gives me the error Exception in thread "main" scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [/Applications/eclipse/plugins/org.scala-lang.scala-library_2.11.4.

Re: Spark Graph Visualizer

2015-01-10 Thread kevinkim
Hi Rajesh, There's a great web-based notebook & visualization tool called Zeppelin. (And it's open source!) Check it out: http://zeppelin.incubator.apache.org Regards, Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Graph-Visualizer-tp21074p21079.h

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-10 Thread Nathan McCarthy
Thanks Cheng & Michael! Makes sense. Appreciate the tips! Idiomatic Scala isn't performant. I’ll definitely start using while loops or tail-recursive methods. I have noticed this in the Spark code base. I might try turning off columnar compression (via spark.sql.inMemoryColumnarStorage.compress
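For reference, the columnar-cache settings are ordinary SQLContext confs (property names as of Spark 1.1/1.2; the compression default changed between versions), e.g.:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // Disable compression of the in-memory columnar cache:
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")
    // Rows per columnar batch (larger batches trade memory for throughput):
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")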

Re: Job priority

2015-01-10 Thread Alessandro Baretta
Mark, Thanks, but I don't see how this documentation solves my problem. You are referring me to documentation on fair scheduling, whereas I am asking about as unfair a scheduling policy as can be: a priority queue. Alex On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra wrote: > -dev, +user > > ht

Re: Play Scala Spark Exmaple

2015-01-10 Thread Akhil Das
Which Spark version is running on the EC2 cluster? From the build file of your Play application, it seems to be using Spark 1.0.1. Thanks Best Regards On Fri, Jan 9, 2015 at 7:17 PM, Eduardo Cusa < eduardo.c...@usmedi

Re: java.io.IOException: Mkdirs failed to create file:/some/path/myapp.csv while using rdd.saveAsTextFile(fileAddress) Spark

2015-01-10 Thread Akhil Das
That's a file on the local disk that's being created (probably your HDFS temp dir); just make sure you have write permission (for the user running the application) on that directory. Thanks Best Regards On Sat, Jan 10, 2015 at 12:58 AM, firemonk9 wrote: > I am facing same except

Re: Job priority

2015-01-10 Thread Cody Koeninger
http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties "Setting a high weight such as 1000 also makes it possible to implement *priority* between pools—in essence, the weight-1000 pool will always get to launch tasks first whenever it has jobs active." On Sat, Jan 10,
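To route jobs into such a weighted pool, pick the pool per thread with a local property; a sketch assuming a "highPriority" pool declared in the XML file referenced by spark.scheduler.allocation.file:

    // All jobs submitted from this thread land in the named pool:
    sc.setLocalProperty("spark.scheduler.pool", "highPriority")
    // ... run the urgent action(s) ...
    sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool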

Re: Issue writing to Cassandra from Spark

2015-01-10 Thread Akhil Das
Just make sure you are not connecting to the old RPC port (9160); the new binary port runs on 9042. What is the rpc_address listed in your cassandra.yaml? Also make sure you have start_native_transport: *true* in the yaml file. Thanks Best Regards On Sat, Jan 10, 2015 at 8:44 AM, Ankur Srivastava
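Assuming the DataStax spark-cassandra-connector is in play here, the contact point is likewise just a conf entry, and the connector speaks the native protocol (default 9042) rather than Thrift on 9160; a minimal sketch with a hypothetical address:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "10.0.0.1")  // hypothetical node address
    val sc = new SparkContext(conf)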

Re: Job priority

2015-01-10 Thread Alessandro Baretta
Cody, Maybe I'm not getting this, but it doesn't look like this page is describing a priority queue scheduling policy. What this section discusses is how resources are shared between queues. A weight-1000 pool will get 1000 times more resources allocated to it than a priority 1 queue. Great, but n

Does DecisionTree model in MLlib deal with missing values?

2015-01-10 Thread Carter
Hi, I am new to MLlib in Spark. Can the DecisionTree model in MLlib deal with missing values? If so, what data structure should I use for the input? Moreover, my data has categorical features, but LabeledPoint requires the "double" data type; in this case, what can I do? Thank you very much.
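On the second question: categorical features are encoded as 0-based doubles and declared through categoricalFeaturesInfo, so the tree treats them as unordered; MLlib's trees (as of 1.x) have no native missing-value handling, so imputation has to happen upstream. A small sketch with a hypothetical 3-category feature 0 and a continuous feature 1:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.5)),  // category 0, continuous 1.5
      LabeledPoint(1.0, Vectors.dense(2.0, 0.3))   // category 2, continuous 0.3
    ))
    val model = DecisionTree.trainClassifier(data, 2,
      Map(0 -> 3),         // feature 0 is categorical with 3 values
      "gini", 4, 32)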

Removing JARs from spark-jobserver

2015-01-10 Thread Sasi
How to remove submitted JARs from spark-jobserver? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Removing-JARs-from-spark-jobserver-tp21081.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Removing JARs from spark-jobserver

2015-01-10 Thread abhishek
There is a path, /tmp/spark-jobserver/file, where all the jars are kept by default. Probably deleting them from there should work. On 11 Jan 2015 12:51, "Sasi [via Apache Spark User List]" < ml-node+s1001560n21081...@n3.nabble.com> wrote: > How to remove submitted JARs from spark-jobserver? > > > >