Task recalculation or total failure due to fetch error

2014-03-16 Thread guojc
Hi there, in our experiments with Spark, we found that the same Spark application has large variance in execution time and sometimes even fails totally. In the logs, we find this is usually due to task resubmission after a fetch failure, with log output like the following: 14/03/16 16:40:38 WARN TaskSetManager: Lost TID

Contributing pyspark ports

2014-03-16 Thread Krakna H
Is there any documentation on contributing PySpark ports of additions to Spark? I only see guidelines on Scala contributions (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark). Specifically, I'm interested in porting mllib and graphx contributions.

[Powered by] Yandex Islands powered by Spark

2014-03-16 Thread Egor Pahomov
Hi, the page https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark says I need to write here if I want my project to be added there. At Yandex (www.yandex.com) we are now using Spark for the Yandex Islands project ( http://www.searchenginejournal.com/yandex-islands-markup-issues-implementation/71891/)

Separating classloader management from SparkContexts

2014-03-16 Thread Punya Biswal
Hi all, I'm trying to use Spark to support users who are interactively refining the code that processes their data. As a concrete example, I might create an RDD[String] and then write several versions of a function to map over the RDD until I'm satisfied with the transformation. Right now, once I

Maximum memory limits

2014-03-16 Thread Debasish Das
Hi, I gave my Spark job 16 GB of memory and it is running on 8 executors. The job needs more memory due to ALS requirements (a 20M x 1M matrix). Each node has 96 GB of memory and I am using 16 GB of it. I want to increase the memory but I am not sure what the right way to do that is...
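A minimal sketch of one way to raise per-executor memory in 0.9-era Spark, assuming a standalone cluster; the master URL, app name, and the 32g figure below are all illustrative, and the value must fit within each worker's available memory:

    import org.apache.spark.{SparkConf, SparkContext}

    // Request 32 GB per executor via SparkConf (0.9-era API).
    val conf = new SparkConf()
      .setMaster("spark://master:7077")     // illustrative master URL
      .setAppName("ALSJob")                 // illustrative app name
      .set("spark.executor.memory", "32g")
    val sc = new SparkContext(conf)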

Re: Maximum memory limits

2014-03-16 Thread Sean Owen
Are you using HEAD or 0.9.0? I know there was a memory issue fixed a few weeks ago that made ALS need a lot more memory than necessary. https://github.com/apache/incubator-spark/pull/629 Try the latest code. -- Sean Owen | Director, Data Science | London On Sun, Mar 16, 2014 at 11:40 AM, Debas

Re: Maximum memory limits

2014-03-16 Thread Debasish Das
Thanks Sean... let me get the latest code... do you know which PR it was? But will the executors run fine with, say, 32 GB or 64 GB of memory? Doesn't the JVM show issues when the max memory goes beyond a certain limit? Also, the failure is due to GC limits from jblas... and I was thinking that jblas

How to kill a spark app ?

2014-03-16 Thread Debasish Das
Are these the right options: 1. If there is a Spark script, just do a Ctrl-C from spark-shell and the job will be killed properly. 2. For a Spark application, Ctrl-C will also kill the job properly on the cluster. Somehow the Ctrl-C option did not work for us... A similar option works fine for scald

Re: Maximum memory limits

2014-03-16 Thread Sean Owen
You should simply use a snapshot built from HEAD of github.com/apache/spark if you can. The key change is in MLlib, and with any luck you can just replace that bit. See the PR I referenced. Sure, with enough memory you can get it to run even with the memory issue, but it could be hundreds of GB at yo

Re: How to kill a spark app ?

2014-03-16 Thread Mayur Rustagi
There is no good way to kill jobs in Spark yet. The closest is cancelAllJobs & cancelJobGroup on the SparkContext. I have had bugs using both. I am trying to test them out; typically you would start a different thread and call these functions from it when you wish to cancel a job. Regards Mayur Mayur Rus
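A minimal sketch of the pattern described above, assuming an existing SparkContext sc and the setJobGroup/cancelJobGroup calls as present in 0.9; the group id and the work inside the future are illustrative:

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    // Run the job under a named group on a separate thread...
    Future {
      sc.setJobGroup("my-group", "cancellable job")   // illustrative id and description
      sc.parallelize(1 to 1000000).map(_ * 2).count()
    }

    // ...then cancel that group (or everything) from another thread.
    sc.cancelJobGroup("my-group")
    // or: sc.cancelAllJobs()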

Re: possible bug in Spark's ALS implementation...

2014-03-16 Thread Matei Zaharia
On Mar 14, 2014, at 5:52 PM, Michael Allman wrote: > I also found that the product and user RDDs were being rebuilt many times > over in my tests, even for tiny data sets. By persisting the RDD returned > from updateFeatures() I was able to avoid a raft of duplicate computations. > Is there a rea
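A generic sketch of the caching pattern Michael describes: persist each iteration's result so later iterations do not recompute the full lineage. The updateFeatures parameter here stands in for any iterative update step, not ALS's actual internals:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def iterate(initial: RDD[(Int, Array[Double])],
                updateFeatures: RDD[(Int, Array[Double])] => RDD[(Int, Array[Double])],
                iterations: Int): RDD[(Int, Array[Double])] = {
      var current = initial.persist(StorageLevel.MEMORY_AND_DISK)
      for (_ <- 1 to iterations) {
        val next = updateFeatures(current).persist(StorageLevel.MEMORY_AND_DISK)
        next.count()          // materialize before dropping the parent
        current.unpersist()
        current = next
      }
      current
    }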

Re: Contributing pyspark ports

2014-03-16 Thread Matei Zaharia
Unfortunately there isn’t a guide, but you can read a PySpark internals overview at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. This would be the thing to follow. In terms of MLlib and GraphX, I think MLlib will be easier to expose at first — it’s designed to be easy t

Re: How to kill a spark app ?

2014-03-16 Thread Debasish Das
From http://spark.incubator.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster: does ./bin/spark-class org.apache.spark.deploy.Client kill not work / have bugs? On Sun, Mar 16, 2014 at 1:17 PM, Mayur Rustagi wrote: > There is no good way to kill jobs in Spar
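For reference, the standalone-mode docs of the time give that kill command as taking a master URL and a driver id; the placeholders below are illustrative:

    ./bin/spark-class org.apache.spark.deploy.Client kill spark://master:7077 <driverId>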

Re: How to kill a spark app ?

2014-03-16 Thread Mayur Rustagi
This is meant to kill the whole driver hosted inside the Master (a new feature as of 0.9.0). I assume you are trying to kill a job/task/stage inside Spark rather than the whole application. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: How to kill a spark app ?

2014-03-16 Thread Debasish Das
Thanks Mayur... I need both... but to start with, even an application killer will help a lot... Somehow that command did not work for me... I will try it again from the Spark main folder.. On Sun, Mar 16, 2014 at 1:43 PM, Mayur Rustagi wrote: > This is meant to kill the whole driver hosted inside t

Re: Maximum memory limits

2014-03-16 Thread Patrick Wendell
Sean - was this merged into the 0.9 branch as well? (It seems so based on the message from rxin.) If so, it might make sense to try the head of branch-0.9 as well, unless there are *also* other changes relevant to this in master. - Patrick On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen wrote: > Y

Re: Maximum memory limits

2014-03-16 Thread Sean Owen
Good point -- there's been another optimization for ALS in HEAD (https://github.com/apache/spark/pull/131), but yes, the better place to pick up just the essential changes since 0.9.0, including the previous one, is the 0.9 branch. -- Sean Owen | Director, Data Science | London On Sun, Mar 16, 2014 at

Re: How to kill a spark app ?

2014-03-16 Thread Mayur Rustagi
Are you embedding your driver inside the cluster? If not, then that command will not kill the driver. You can simply kill the application by killing the Scala application. So if it's the Spark shell, simply killing the shell will disconnect the application from the cluster. If the driver is embedded

Running Spark on a single machine

2014-03-16 Thread goi cto
Hi, I know it is probably not the purpose of Spark, but the syntax is easy and cool... I need to run some Spark-like code in memory on a single machine. Any pointers on how to optimize it to run on only one machine? -- Eran | CTO

Machine Learning on streaming data

2014-03-16 Thread Nasir Khan
Hi, I'm working on a project in which I have to take streaming URLs, filter them, and classify them as benign or suspicious. Now, machine learning and streaming are two separate things in Apache Spark (AFAIK). My question is: can we apply online machine learning algorithms to streams? I am at a beginner leve
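One common workaround, sketched below under the assumption of a model trained offline: apply the fixed model to each batch of a DStream. The score function, host, port, and 0.5 threshold are all hypothetical stand-ins; 0.9-era MLlib has no built-in online learners for streams.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical stand-in for a model trained offline: score a URL.
    def score(url: String): Double = if (url.contains("suspect")) 1.0 else 0.0

    val conf = new SparkConf().setAppName("UrlClassifier").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // URLs arrive one per line on a socket; host and port are illustrative.
    val urls = ssc.socketTextStream("localhost", 9999)

    // Flag URLs the model scores above the assumed 0.5 threshold, each batch.
    urls.filter(url => score(url) > 0.5).print()

    ssc.start()
    ssc.awaitTermination()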

Re: slf4j and log4j loop

2014-03-16 Thread Patrick Wendell
This is not released yet, but we're planning to cut a 0.9.1 release very soon (most likely this week). In the meantime you'll have to check out branch-0.9 of Spark and publish it locally, then depend on the snapshot version. Or just wait it out... On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu wr
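A rough sequence for that, assuming the sbt build that branch-0.9 ships with (the exact snapshot version comes from the branch's build files):

    git clone https://github.com/apache/spark.git
    cd spark
    git checkout branch-0.9
    sbt/sbt publish-local    # publishes snapshot artifacts to the local Ivy repo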

Re: Running Spark on a single machine

2014-03-16 Thread Nick Pentreath
Please follow the instructions at http://spark.apache.org/docs/latest/index.html and http://spark.apache.org/docs/latest/quick-start.html to get started on a local machine. — Sent from Mailbox for iPhone On Sun, Mar 16, 2014 at 11:39 PM, goi cto wrote: > Hi, > I know it is probably not th

Re: How to kill a spark app ?

2014-03-16 Thread Matei Zaharia
If it’s a driver on the cluster, please open a JIRA issue about this — this kill command is indeed intended to work. Matei On Mar 16, 2014, at 2:35 PM, Mayur Rustagi wrote: > Are you embedding your driver inside the cluster? > If not then that command will not kill the driver. You can simply k

Re: [Powered by] Yandex Islands powered by Spark

2014-03-16 Thread Matei Zaharia
Thanks, I’ve added you: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Let me know if you want to change any wording. Matei On Mar 16, 2014, at 6:48 AM, Egor Pahomov wrote: > Hi, page https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark > says I need write

Re: Running Spark on a single machine

2014-03-16 Thread goi cto
Sorry, I did not explain myself correctly. I know how to run Spark; the question is how to instruct Spark to do all of the computation on a single machine. I was trying to convert the code to plain Scala, but I miss some of Spark's methods, like reduceByKey. Eran On Mon, Mar 17, 2014 at 7:25 AM, Nic

Re: Running Spark on a single machine

2014-03-16 Thread Ewen Cheslack-Postava
Those pages include instructions for running locally: "Note that all of the sample programs take a parameter specifying the cluster URL to connect to. This can be a URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should st
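To make that concrete: a minimal sketch of running entirely in-process with the local[N] master while keeping pair-RDD methods such as reduceByKey, which in pre-1.0 Spark come in via the SparkContext._ implicits; the thread count and data are illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD implicits (reduceByKey etc., pre-1.0)

    val sc = new SparkContext("local[4]", "SingleMachineApp")   // 4 local threads
    val counts = sc.parallelize(Seq("a", "b", "a"))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()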