Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Wei Tan
Thank you all, Madhu, Gerard and Ryan. All your suggestions work. Personally I prefer running Spark locally in Eclipse for debugging purposes. Best regards, Wei - Wei Tan, PhD, Research Staff Member, IBM T. J. Watson Research Center, http://researcher.ibm.com/person/...

Spark Worker Core Allocation

2014-06-07 Thread Subacini B
Hi All, My cluster has 5 workers, each with 4 cores (so 20 cores in total). It is in standalone mode (not using Mesos or YARN). I want two programs to run at the same time, so I have configured "spark.cores.max=3", but when I run the program it allocates three cores, taking one core from each worker...
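For reference, a minimal sketch of capping an application's cores via spark.cores.max (the app name, master URL, and core count below are placeholders, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap this application at 10 of the cluster's 20 cores so a second
    // job can run alongside it. Values here are illustrative only.
    val conf = new SparkConf()
      .setAppName("capped-app")              // placeholder name
      .setMaster("spark://master-host:7077") // placeholder master URL
      .set("spark.cores.max", "10")
    val sc = new SparkContext(conf)

Note that by default the standalone master spreads an application's cores across workers (spark.deploy.spreadOut defaults to true); setting it to false on the master should pack the cores onto fewer workers instead, which may explain the one-core-per-worker behaviour above.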

Re: How to process multiple classification with SVM in MLlib

2014-06-07 Thread Xiangrui Meng
At this time, you need to do one-vs-all manually for multiclass training. For your second question, if the algorithm is implemented in Java/Scala/Python and designed for a single machine, you can broadcast the dataset to each worker and train models on the workers. If the algorithm is implemented in a different...
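A rough, untested sketch of the manual one-vs-all approach with SVMWithSGD (assuming class labels are 0 to numClasses-1; the helper names are ours, not MLlib's):

    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Train one binary SVM per class: class k versus the rest.
    def oneVsAll(data: RDD[LabeledPoint], numClasses: Int,
                 numIterations: Int): Seq[(Int, SVMModel)] = {
      (0 until numClasses).map { k =>
        val binary = data.map(p =>
          LabeledPoint(if (p.label == k) 1.0 else 0.0, p.features)).cache()
        val model = SVMWithSGD.train(binary, numIterations)
        model.clearThreshold() // keep raw margins so scores are comparable
        (k, model)
      }
    }

    // Predict by picking the class whose model returns the largest margin.
    def predict(models: Seq[(Int, SVMModel)], features: Vector): Int =
      models.maxBy { case (_, m) => m.predict(features) }._1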

Re: Gradient Descent with MLBase

2014-06-07 Thread DB Tsai
Hi Aslan, You can check out the unit test code for GradientDescent.runMiniBatchSGD: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala Sincerely, DB Tsai --- My Blog: ...
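For reference, a minimal sketch of calling it directly; the hyperparameter values are placeholders, and parsedData is assumed to be an RDD[LabeledPoint]:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SimpleUpdater}

    // runMiniBatchSGD wants (label, features) pairs rather than LabeledPoints.
    val data = parsedData.map(p => (p.label, p.features))
    val numFeatures = data.first()._2.size

    val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
      data,
      new LogisticGradient(), // gradient of the loss function
      new SimpleUpdater(),    // plain SGD step, no regularization
      1.0,                    // stepSize
      100,                    // numIterations
      0.0,                    // regParam
      1.0,                    // miniBatchFraction (1.0 = use the full batch)
      Vectors.dense(new Array[Double](numFeatures))) // initial weights: zeros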

Dumping Metrics on HDFS

2014-06-07 Thread Rahul Singhal
Hi All, I am running Spark applications in yarn-cluster mode and need to read the application metrics even after the application is over. I was planning to use the CSV sink, but it seems that Codahale's CsvReporter only supports dumping metrics to the local filesystem. Any suggestions to ...
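One possible workaround (an untested sketch, with placeholder paths): let the CSV sink write locally as usual, then copy the directory up to HDFS before the application exits, e.g. with the Hadoop FileSystem API:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Copy the locally dumped CSV metrics (whatever *.sink.csv.directory in
    // metrics.properties points at) up to HDFS before the container exits.
    val fs = FileSystem.get(new Configuration())
    fs.copyFromLocalFile(
      new Path("/tmp/spark-metrics"),     // local CSV directory, assumed
      new Path("/user/me/spark-metrics")) // HDFS destination, assumed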

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Xu (Simon) Chen
Is there a way to start Tachyon on top of a YARN cluster? On Jun 7, 2014 2:11 PM, "Marek Wiewiorka" wrote: > I was also thinking of using Tachyon to store parquet files - maybe > tomorrow I will give it a try as well. > > 2014-06-07 20:01 GMT+02:00 Michael Armbrust: > >> Not a stupid question! ...

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Madhu
For debugging, I run locally inside Eclipse without Maven. I just add the Spark assembly jar to my Eclipse project build path and click 'Run As... Scala Application'. I have done the same with Java and ScalaTest; it's quick and easy. I didn't see any third-party jar dependencies in your code, so ...

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Ryan Compton
Sounds like there are two questions here. First, from the command line, if you "mvn package" and then run the code with "java -cp target/*jar-with-dependencies.jar com.ibm.App", do you still get the error? Second, for quick debugging, I agree that it's a pain to wait for mvn package to finish every time...

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Marek Wiewiorka
I was also thinking of using Tachyon to store parquet files - maybe tomorrow I will give it a try as well. 2014-06-07 20:01 GMT+02:00 Michael Armbrust: > Not a stupid question! I would like to be able to do this. For now, you > might try writing the data to Tachyon ...

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Michael Armbrust
Not a stupid question! I would like to be able to do this. For now, you might try writing the data to Tachyon instead of HDFS. This is untested though, so please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen wrote: > ...
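The untested sketch would look something like this (the Tachyon master address and paths are placeholders, schemaRdd is an existing SchemaRDD, and the Tachyon client must be on the classpath):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Write the parquet file to Tachyon rather than HDFS, then read it back.
    schemaRdd.saveAsParquetFile("tachyon://tachyon-master:19998/tables/foo.parquet")
    val fromTachyon = sqlContext.parquetFile("tachyon://tachyon-master:19998/tables/foo.parquet")
    fromTachyon.registerAsTable("foo") // Spark 1.0 API; later renamed registerTempTable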

[graphx] PageRank with Edge weights

2014-06-07 Thread Lee Becker
Hello, I have been playing around with GraphX and its PageRank capabilities. Something I'm not seeing in the documentation is how to initialize PageRank using edge weights. Is this even possible, or would I need to reimplement the PageRank algorithm so that it can use an Edge property as part of ...
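One way to sketch the reimplementation (untested, and adapted by us from the structure of GraphX's built-in PageRank; it assumes a Graph[VD, Double] whose edge attribute is a positive weight):

    import org.apache.spark.graphx._

    def weightedPageRank[VD](graph: Graph[VD, Double], numIter: Int,
                             resetProb: Double = 0.15): Graph[Double, Double] = {
      // Sum of outgoing edge weights per vertex, used to normalize weights
      // into transition probabilities (the built-in version uses out-degree).
      val outWeightSums: VertexRDD[Double] = graph.mapReduceTriplets[Double](
        t => Iterator((t.srcId, t.attr)), _ + _)

      val ranks: Graph[Double, Double] = graph
        .outerJoinVertices(outWeightSums) { (_, _, sum) => sum.getOrElse(0.0) }
        .mapTriplets(t => t.attr / t.srcAttr) // weight -> transition probability
        .mapVertices((_, _) => 1.0)           // initial rank

      def vprog(id: VertexId, rank: Double, msgSum: Double): Double =
        resetProb + (1.0 - resetProb) * msgSum
      def sendMsg(e: EdgeTriplet[Double, Double]) =
        Iterator((e.dstId, e.srcAttr * e.attr))

      Pregel(ranks, initialMsg = 0.0, maxIterations = numIter,
             activeDirection = EdgeDirection.Out)(vprog, sendMsg, _ + _)
    }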

Re: Using Java functions in Spark

2014-06-07 Thread Oleg Proudnikov
Increasing the number of partitions on the data file solved the problem. On 6 June 2014 18:46, Oleg Proudnikov wrote: > Additional observation - the map and mapValues are pipelined and executed > - as expected - in pairs. This means that there is a simple sequence of > steps - first read from Cassandra ...
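For anyone hitting the same thing, the two usual knobs (the path and partition counts below are illustrative):

    // Ask for more splits when reading the file...
    val lines = sc.textFile("hdfs:///data/events.txt", 64) // 64 min partitions
    // ...or reshuffle an existing RDD into more partitions.
    val wider = lines.repartition(128)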

serialization a model

2014-06-07 Thread filipus
Am I right that I can just use cPickle to serialize a model (see code below), or did I get it wrong and should use PickleSerializer (from pyspark.serializers import PickleSerializer)? ... model = LogisticRegressionWithSGD.train(parsedData) mm = open("mm.txt", "wb") import cPickle cPickle.dump(model, mm) mm.close()

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
The QueueStream example is in the Spark Streaming examples: http://www.boyunjian.com/javasrc/org.spark-project/spark-examples_2.9.3/0.7.2/_/spark/streaming/examples/QueueStream.scala Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: error loading large files in PySpark 0.9.0

2014-06-07 Thread Nick Pentreath
Ah, looking at that InputFormat, it should just work out of the box using sc.newAPIHadoopFile ... I would be interested to hear if it works as expected for you (in Python you'll end up with bytearray values). N — Sent from Mailbox On Fri, Jun 6, 2014 at 9:38 PM, Jeremy Freeman wrote: > Oh cool, ...
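For reference, the Scala form of the same call, shown here with a stock input format rather than the custom one from the thread (the path is a placeholder):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

    // Read a sequence file through the new Hadoop API; any
    // org.apache.hadoop.mapreduce InputFormat can be slotted in here.
    val records = sc.newAPIHadoopFile[LongWritable, BytesWritable,
      SequenceFileInputFormat[LongWritable, BytesWritable]]("hdfs:///data/part-00000")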

How to process multiple classification with SVM in MLlib

2014-06-07 Thread littlebird
Hi All, As we know, in MLlib the SVM is used for binary classification. I wonder how to train an SVM model for multiclass classification in MLlib. In addition, how can I apply a machine learning algorithm in Spark if the algorithm isn't included in MLlib? Thank you.

Re: New user streaming question

2014-06-07 Thread Michael Campbell
Thanks all - I still don't know what the underlying problem is, but I KIND OF got it working by dumping my random-words stuff to a file and pointing Spark Streaming at that. So it's not "Streaming", as such, but I got output. More investigation to follow =) On Sat, Jun 7, 2014 at 8:22 AM, Gino ...

ec2 deployment regions supported

2014-06-07 Thread Joe Mathai
Hi, I am interested in deploying Spark 1.0.0 on EC2 and wanted to know which regions are supported. I was able to deploy the previous version in east, but I had a hard time launching the cluster due to a bad connection: the script provided would fail to ssh into a node after a couple of tries...

Gradient Descent with MLBase

2014-06-07 Thread Aslan Bekirov
Hi All, I have to create a model using SGD in MLbase. I examined MLbase a bit and ran some samples of classification, collaborative filtering, etc., but I could not run gradient descent. I have to run "val model = GradientDescent.runMiniBatchSGD(params)"; of course, the params must be computed first...

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
So you can run a Spark job to get the data to disk/HDFS, then run a DStream from an HDFS folder. As you move your files in, the DStream will kick in. Regards Mayur On 6 Jun 2014 21:13, "Gianluca Privitera" <gianluca.privite...@studio.unibo.it> wrote: > Where are the APIs for QueueStream and RddQueu...
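A minimal sketch of the file-watching DStream (the batch interval and path are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30)) // sc: existing SparkContext
    // Files atomically moved into this directory become one batch each.
    val lines = ssc.textFileStream("hdfs:///incoming")
    lines.print()
    ssc.start()
    ssc.awaitTermination()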

Re: Best practise for 'Streaming' dumps?

2014-06-07 Thread Gino Bustelo
Have you thought of using window? Gino B. > On Jun 6, 2014, at 11:49 PM, Jeremy Lee wrote: > > It's going well enough that this is a "how should I in 1.0.0" rather than > "how do I" question. > > So I've got data coming in via Streaming (twitters) and I want to archive/log > it all. It ...
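A sketch of what that could look like for the archiving case (the durations and path are placeholders, and tweets is assumed to be the incoming DStream):

    import org.apache.spark.streaming.Seconds

    // Re-emit the last 10 minutes of tweets every 30 seconds and dump each
    // non-empty window to storage, keyed by batch time.
    val windowed = tweets.window(Seconds(600), Seconds(30))
    windowed.foreachRDD { (rdd, time) =>
      if (rdd.count() > 0)
        rdd.saveAsTextFile("hdfs:///archive/tweets-" + time.milliseconds)
    }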

Re: New user streaming question

2014-06-07 Thread Gino Bustelo
I would make sure that your workers are running. It is very difficult to tell from the console dribble whether you just have no data or the workers have disassociated from the master. Gino B. > On Jun 6, 2014, at 11:32 PM, Jeremy Lee wrote: > > Yup, when it's running, DStream.print() will print ...

Re: Using Spark on Data size larger than Memory size

2014-06-07 Thread Vibhor Banga
Aaron, thank you for your response and for clarifying things. -Vibhor On Sun, Jun 1, 2014 at 11:40 AM, Aaron Davidson wrote: > There is no fundamental issue if you're running on data that is larger > than the cluster memory size. Many operations can stream data through, and thus > memory usage is independent...
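For cached data specifically, the usual fallback (a sketch; the path is assumed) is to let the overflow spill to disk rather than caching in memory only:

    import org.apache.spark.storage.StorageLevel

    // Partitions that don't fit in memory are written to disk and re-read
    // from there, rather than being recomputed from the source.
    val big = sc.textFile("hdfs:///big/dataset")
      .persist(StorageLevel.MEMORY_AND_DISK)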

Re: Scheduling code for Spark

2014-06-07 Thread Gerard Maas
Hi, The scheduling-related code can be found at: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler The DAG (Directed Acyclic Graph) scheduler is a good starting point: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/...

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Gerard Maas
I think you have two options: - to run your code locally, you can use local mode via the 'local' master, like so: new SparkConf().setMaster("local[4]"), where 4 is the number of cores assigned to local mode (a minimal setup is sketched below); - to run your code remotely, you need to build the jar with dependencies and ...
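A minimal local-mode setup for in-IDE debugging (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[4]" runs the driver and 4 worker threads in a single JVM,
    // which is enough to step through a job in the IDE debugger.
    val conf = new SparkConf().setAppName("debug-local").setMaster("local[4]")
    val sc = new SparkContext(conf)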

Spark with Spark Streaming

2014-06-07 Thread b0c1
Hi! Is there any way to use Spark and Spark Streaming together to create a real-time architecture? How can I merge the Spark and Spark Streaming results in real time (and drop the streaming result once the Spark result is generated)? Thanks

Scheduling code for Spark

2014-06-07 Thread rapelly kartheek
Hi, I am new to the Spark framework, though I understand it to some extent and have some experience with Hadoop as well. The concepts of in-memory computation and RDDs are extremely fascinating. I am trying to understand the scheduler of the Spark framework. Can someone help me out with where to look...