Re: Using String Dataset for Logistic Regression

2014-06-02 Thread Xiangrui Meng
Yes. MLlib 1.0 supports sparse input data for linear methods. -Xiangrui On Mon, Jun 2, 2014 at 11:36 PM, praveshjain1991 wrote: > I am not sure. I have just been using some numerical datasets. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Using-
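For readers who want a concrete starting point, here is a minimal sketch of feeding sparse input to MLlib's logistic regression from PySpark; the feature size, indices and values below are made-up toy data, not from this thread:

    from pyspark import SparkContext
    from pyspark.mllib.linalg import SparseVector
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="SparseLRSketch")

    # Two toy examples with 10 features each; only a few indices are non-zero.
    data = sc.parallelize([
        LabeledPoint(1.0, SparseVector(10, {0: 1.0, 3: 5.5})),
        LabeledPoint(0.0, SparseVector(10, {1: 2.0, 7: 0.5})),
    ])

    model = LogisticRegressionWithSGD.train(data, iterations=10)
    print(model.predict(SparseVector(10, {0: 1.0})))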

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Andrew Or
>> I asked several people, no one seems to believe that we can do this: >> $ PYTHONPATH=/path/to/assembly/jar python >> >>> import pyspark That is because people usually don't package python files into their jars. For pyspark, however, this will work as long as the jar can be opened and its conten
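As a quick sanity check of that claim, you can point Python at the assembly jar yourself and confirm the pyspark package is importable from it; a small sketch, assuming the jar path below is replaced with your own build:

    import sys

    # Python's zipimport can load packages straight out of a jar/zip,
    # provided the pyspark/ directory sits at the root of the archive.
    sys.path.insert(0, "/path/to/spark-assembly.jar")

    import pyspark
    print(pyspark.__file__)  # should point inside the jar if packaging worked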

Re: Using String Dataset for Logistic Regression

2014-06-02 Thread praveshjain1991
I am not sure. I have just been using some numerical datasets. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-String-Dataset-for-Logistic-Regression-tp5523p6784.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: EC2 Simple Cluster

2014-06-02 Thread Akhil Das
Hi Gianluca, I believe your cluster setup wasn't complete. Do check the ec2 script console for more details. Also, micro instances have only about 600 MB of memory. Thanks Best Regards On Tue, Jun 3, 2014 at 1:59 AM, Gianluca Privitera < gianluca.privite...@studio.unibo.it> wrote: > Hi everyone


Re: using Log4j to log INFO level messages on workers

2014-06-02 Thread Alex Gaudio
Hi, I had the same problem with pyspark. Here's how I resolved it: What I've found in python (not sure about scala) is that if the function being serialized was written in the same python module as the main function, then logging fails. If the serialized function is in a separate module, then
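A sketch of that arrangement, using hypothetical module names (worker_utils.py / driver.py) rather than anything from the original thread:

    # --- worker_utils.py (the function shipped to executors lives here) ---
    import logging

    log = logging.getLogger("worker")

    def process(record):
        # Handlers are not serialized, so each executor sets up its own logging.
        if not log.handlers:
            logging.basicConfig(level=logging.INFO)
        log.info("processing %s", record)
        return record

    # --- driver.py (the main program only imports the function) ---
    from pyspark import SparkContext
    from worker_utils import process

    sc = SparkContext(appName="WorkerLogging")
    sc.addPyFile("worker_utils.py")  # make the module importable on executors
    print(sc.parallelize(range(5)).map(process).collect())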

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Good catch! Yes I meant 1.0 and later. On Mon, Jun 2, 2014 at 8:33 PM, Kexin Xie wrote: > +1 on Option (B) with flag to allow semantics in (A) for back compatibility. > > Kexin > > > > On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas > wrote: >> >> On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendel

Re: Window slide duration

2014-06-02 Thread Vadim Chekan
Thanks for looking into this Tathagata. Are you looking for traces of ReceiveInputDStream.clearMetadata call? Here is the log: http://wepaste.com/vchekan Vadim. On Mon, Jun 2, 2014 at 5:58 PM, Tathagata Das wrote: > Can you give all the logs? Would like to see what is clearing the key " > 14

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Kexin Xie
+1 on Option (B) with flag to allow semantics in (A) for back compatibility. Kexin On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas wrote: > On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell > wrote: > >> (B) Semantics in Spark 1.0 and earlier: > > > Do you mean 1.0 and later? > > Option (B) w

Re: Re: how to construct a ClassTag object as a method parameter in Java

2014-06-02 Thread bluejoe2008
Spark 0.9.1. textInput is a JavaRDD object. I am programming in Java. 2014-06-03 bluejoe2008 From: Michael Armbrust Date: 2014-06-03 10:09 To: user Subject: Re: how to construct a ClassTag object as a method parameter in Java What version of Spark are you using? Also are you sure the type of tex

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell wrote: > (B) Semantics in Spark 1.0 and earlier: Do you mean 1.0 and later? Option (B) with the exception-on-clobber sounds fine to me, btw. My use pattern is probably common but not universal, and deleting user files is indeed scary. Nick

A single build.sbt file to start Spark REPL?

2014-06-02 Thread Alexy Khrabrov
The usual way to use Spark with SBT is to package a Spark project using sbt package (e.g. per Quick Start) and submit it to Spark using the bin/ scripts from the Spark distribution. For a plain Scala project, you don’t need to download anything; you can just get a build.sbt file with dependencies and

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
I remember that in the earlier version of that PR I deleted files by calling the HDFS API; we discussed and concluded that it’s a bit scary to have something directly deleting user’s files in Spark. Best, -- Nan Zhu On Monday, June 2, 2014 at 10:39 PM, Patrick Wendell wrote: > (A) Semantics

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Yeah we need to add a build warning to the Maven build. Would you be able to try compiling Spark with Java 6? It would be good to narrow down if you are hitting this problem or something else. On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen wrote: > Nope... didn't try java 6. The standard instal

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's output format check and overwrite files in the destination directory. But it won't clobber the directory entirely. I.e. if the directory already had "part1" "part2" "part3" "part4" and you write a new job outputting only two files ("p
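For anyone who just needs the old behaviour today, a common workaround is to delete the output directory yourself before saving. A rough PySpark sketch that goes through the JVM gateway to Hadoop's FileSystem; it relies on the private sc._jvm / sc._jsc handles and a placeholder path, so treat it as illustrative only:

    def delete_path(sc, path):
        # Recursively remove the output directory, if it exists, before writing.
        jvm = sc._jvm
        conf = sc._jsc.hadoopConfiguration()
        p = jvm.org.apache.hadoop.fs.Path(path)
        fs = p.getFileSystem(conf)
        if fs.exists(p):
            fs.delete(p, True)

    out = "hdfs:///tmp/output"   # placeholder path
    delete_path(sc, out)
    rdd.saveAsTextFile(out)      # rdd is whatever RDD you are about to write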

Re: how to construct a ClassTag object as a method parameter in Java

2014-06-02 Thread Michael Armbrust
What version of Spark are you using? Also are you sure the type of textInput is a JavaRDD and not an RDD? It looks like the 1.0 Java API

Re: Failed to remove RDD error

2014-06-02 Thread Tathagata Das
spark.streaming.unpersist was an experimental feature introduced with Spark 0.9 (but kept disabled), which actively clears off RDDs that are not useful any more. In Spark 1.0 it has been enabled by default. It is possible that this is an unintended side-effect of that. If spark.cleaner.ttl works

how to construct a ClassTag object as a method parameter in Java

2014-06-02 Thread bluejoe2008
Hi all, I am programming with Spark in Java, and now I have a question: when I make a method call on a JavaRDD such as: textInput.mapPartitionsWithIndex( new Function2, Iterator>() {...}, false, PARAM3 ); what value should I pass as the PARAM3 parameter? It is required as a ClassTag value, then ho

Re: Window slide duration

2014-06-02 Thread Tathagata Das
Can you give all the logs? Would like to see what is clearing the key " 1401754908000 ms" TD On Mon, Jun 2, 2014 at 5:38 PM, Vadim Chekan wrote: > Ok, it seems like "Time ... is invalid" is part of normal workflow, when > window DStream will ignore RDDs at moments in time when they do not matc

Re: NoSuchElementException: key not found

2014-06-02 Thread Tathagata Das
Do you have the info level logs of the application? Can you grep the value "32855" to find any references to it? Also what version of Spark are you using (so that I can match the stack trace; it does not seem to match Spark 1.0)? TD On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang wrote: >

Re: Window slide duration

2014-06-02 Thread Vadim Chekan
Ok, it seems like "Time ... is invalid" is part of the normal workflow, where the window DStream will ignore RDDs at moments in time that do not match the window sliding interval. But why I am getting the exception is still unclear. Here is the full stack: 14/06/02 17:21:48 INFO WindowedDStream: Time 1

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Matei Zaharia
You can just use the Maven build for now, even for Spark 1.0.0. Matei On Jun 2, 2014, at 5:30 PM, Mohit Nayak wrote: > Hey, > Yup that fixed it. Thanks so much! > > Is this the only solution, or could this be resolved in future versions of > Spark ? > > > On Mon, Jun 2, 2014 at 5:14 PM, Se

Re: Window slide duration

2014-06-02 Thread Tathagata Das
I am assuming that you are referring to the "OneForOneStrategy: key not found: 1401753992000 ms" error, and not to the previous "Time 1401753992000 ms is invalid ...". Those two seem a little unrelated to me. Can you give us the stacktrace associated with the key-not-found error? TD On Mon, Jun

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Mohit Nayak
Hey, Yup that fixed it. Thanks so much! Is this the only solution, or could this be resolved in future versions of Spark ? On Mon, Jun 2, 2014 at 5:14 PM, Sean Owen wrote: > If it's the SBT build, I suspect you are hitting > https://issues.apache.org/jira/browse/SPARK-1949 > > Can you try to a

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
Phoofff.. (Mind blown)... Thank you sir. This is awesome On Mon, Jun 2, 2014 at 5:23 PM, Marcelo Vanzin wrote: > The idea is simple. If you want to run something on a collection of > files, do (in pseudo-python): > > def processSingleFile(path): > # Your code to process a file > > files = [ "

Re: Processing audio/video/images

2014-06-02 Thread Marcelo Vanzin
The idea is simple. If you want to run something on a collection of files, do (in pseudo-python):

    def processSingleFile(path):
        # Your code to process a file

    files = [ "file1", "file2" ]
    sc.parallelize(files).foreach(processSingleFile)

On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha wrote: > Hi M
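Expanded into something runnable, under the assumption that the paths are reachable from every worker (e.g. a shared filesystem) and with placeholder file names:

    from pyspark import SparkContext

    sc = SparkContext(appName="ProcessBinaryFiles")

    def process_single_file(path):
        # Placeholder body: open the file on the worker and do the real work,
        # e.g. decode an mp3 or run an image filter over a jpeg.
        with open(path, "rb") as f:
            data = f.read()
        return (path, len(data))

    files = ["/shared/data/file1.jpg", "/shared/data/file2.jpg"]  # hypothetical
    for path, size in sc.parallelize(files).map(process_single_file).collect():
        print(path, size)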

Window slide duration

2014-06-02 Thread Vadim Chekan
Hi all, I am getting an error: 14/06/02 17:06:32 INFO WindowedDStream: Time 1401753992000 ms is invalid as zeroTime is 1401753986000 ms and slideDuration is 4000 ms and difference is 6000 ms 14/06/02 17:06:32 ERROR OneForOneStrategy: key not found: 1401753992000 ms ===
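For reference, a batch time is only considered valid for a windowed DStream when its offset from the stream's start is a multiple of the slide duration, and the window and slide durations themselves must be multiples of the parent DStream's batch interval. A sketch of durations that satisfy that constraint, shown with the Python streaming API for brevity (the original thread is Scala; the source and port below are made up):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=2)       # 2-second batches
    lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source

    # window = 8s and slide = 4s are both multiples of the 2s batch interval
    lines.window(windowDuration=8, slideDuration=4).count().pprint()

    ssc.start()
    ssc.awaitTermination()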

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
Hi Marcelo, Thanks for the response.. I am not sure I understand. Can you elaborate a bit? So, for example, let's take a look at this example: http://pythonvision.org/basic-tutorial import mahotas dna = mahotas.imread('dna.jpeg') dnaf = ndimage.gaussian_filter(dna, 8) But instead of just dna.jpeg, let's say

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
Thanks. Let me go thru it. On Mon, Jun 2, 2014 at 5:15 PM, Philip Ogren wrote: > I asked a question related to Marcelo's answer a few months ago. The > discussion there may be useful: > > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html > > > > On 06/02/2014 06:09 PM, Mar

Re: Interactive modification of DStreams

2014-06-02 Thread Tathagata Das
Currently Spark Streaming does not support addition/deletion/modification of DStream after the streaming context has been started. Nor can you restart a stopped streaming context. Also, multiple spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.

Re: Processing audio/video/images

2014-06-02 Thread Philip Ogren
I asked a question related to Marcelo's answer a few months ago. The discussion there may be useful: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html On 06/02/2014 06:09 PM, Marcelo Vanzin wrote: Hi Jamal, If what you want is to process lots of files in parallel, the b

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Sean Owen
If it's the SBT build, I suspect you are hitting https://issues.apache.org/jira/browse/SPARK-1949 Can you try to apply the excludes you see at https://github.com/apache/spark/pull/906/files to your build to see if it resolves it? If so I think this could be helpful to commit. On Tue, Jun 3, 2014

Re: Processing audio/video/images

2014-06-02 Thread Marcelo Vanzin
Hi Jamal, If what you want is to process lots of files in parallel, the best approach is probably to load all file names into an array and parallelize that. Then each task will take a path as input and can process it however it wants. Or you could write the file list to a file, and then use sc.te

Processing audio/video/images

2014-06-02 Thread jamal sasha
Hi, How does one process data sources other than text? Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to process them. How does one go about it? I have never been able to figure this out.. Let's say I have this library in python which works like the following: import audi

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Mohit Nayak
Hey, Thanks for the reply. I am using SBT. Here is a list of my dependencies: val sparkCore= "org.apache.spark" % "spark-core_2.10" % V.spark val hadoopCore = "org.apache.hadoop" % "hadoop-core" % V.hadoop% "provided" val jodaTime = "com.github.nscala-time" %% "

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Sean Owen
Is there a third way? Unless I miss something. Hadoop's OutputFormat wants the target dir to not exist no matter what, so it's just a question of whether Spark deletes it for you or errors. On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell wrote: > We can just add back a flag to make it backwards

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Sean Owen
This ultimately means you have a couple copies of the servlet APIs in the build. What is your build like (SBT? Maven?) and what exactly are you depending on? On Tue, Jun 3, 2014 at 12:21 AM, Mohit Nayak wrote: > Hi, > I've upgraded to Spark 1.0.0. I'm not able to run any tests. They throw a > > j

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
We can just add back a flag to make it backwards compatible - it was just missed during the original PR. Adding a *third* set of "clobber" semantics, I'm slightly -1 on that for the following reasons: 1. It's scary to have Spark recursively deleting user files, could easily lead to users deleting

Fwd: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Mohit Nayak
Hi, I've upgraded to Spark 1.0.0. I'm not able to run any tests. They throw a *java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package* I'm using Hadoop-core 1.0.4 and running this locally.

using Log4j to log INFO level messages on workers

2014-06-02 Thread Shivani Rao
Hello Spark fans, I am trying to log messages from my Spark application. When the main() function attempts to log using log.info(), it works great, but when I try the same command from the code that probably runs on the worker, I initially got a serialization error. To solve that, I created a new

NoSuchElementException: key not found

2014-06-02 Thread Michael Chang
Hi all, Seeing a random exception kill my spark streaming job. Here's a stack trace: java.util.NoSuchElementException: key not found: 32855 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collectio

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
Ah yes, this was indeed intended to have been taken care of : add some new APIs with a flag for users to define whether he/she wants to > overwrite the directory: if the flag is set to true, *then the output > direct

Interactive modification of DStreams

2014-06-02 Thread lbustelo
This is a general question about whether Spark Streaming can be interactive like batch Spark jobs. I've read plenty of threads and done my fair bit of experimentation and I'm thinking the answer is NO, but it does not hurt to ask. More specifically, I would like to be able to do: 1. Add/Remove st

Re: How to create RDDs from another RDD?

2014-06-02 Thread Andrew Ash
Hi Gerard, Usually when I want to split one RDD into several, I'm better off re-thinking the algorithm to do all the computation at once. Example: Suppose you had a dataset that was the tuple (URL, webserver, pageSizeBytes), and you wanted to find out the average page size that each webserver (e
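A sketch of the single-pass aggregation Andrew describes, in PySpark with made-up numbers:

    # (url, webserver, pageSizeBytes) tuples -- toy data
    pages = sc.parallelize([
        ("http://a/x", "nginx",  1000),
        ("http://a/y", "nginx",  3000),
        ("http://b/z", "apache", 2000),
    ])

    avg_size = (pages
        .map(lambda t: (t[1], (t[2], 1)))                      # (server, (bytes, 1))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum bytes and counts
        .mapValues(lambda s: s[0] / float(s[1])))              # average per server

    print(avg_size.collect())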

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
I made the PR; the problem is… after many rounds of review, that configuration part was missed… sorry about that. I will fix it. Best, -- Nan Zhu On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote: > I'm a bit confused because the PR mentioned by Patrick seems to address all > t

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
Fair enough. That rationale makes sense. I would prefer that a Spark clobber option also delete the destination files, but as long as it's a non-default option I can see the "caller beware" side of that argument as well. Nick On Monday, June 2, 2014, Sean Owen wrote: > I assume the idea is for Spa

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Sean Owen
I assume the idea is for Spark to "rm -r dir/", which would clean out everything that was there before. It's just doing this instead of the caller. Hadoop still won't let you write into a location that already exists regardless, and part of that is for this reason that you might end up with files m

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
I'm a bit confused because the PR mentioned by Patrick seems to address all these issues: https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1 Was it not accepted? Or is the description of this PR not completely implemented? Message sent from a mobile device - excuse t

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
OK, thanks for confirming. Is there something we can do about that leftover part- files problem in Spark, or is that for the Hadoop team? On Monday, June 2, 2014, Aaron Davidson wrote: > Yes. > > > On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > So in summ

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
Nothing appears to be running on hivecluster2:8080. 'sudo jps' does show [hivedata@hivecluster2 ~]$ sudo jps 9953 PepAgent 13797 JournalNode 7618 NameNode 6574 Jps 12716 Worker 16671 RunJar 18675 Main 18177 JobTracker 10918 Master 18139 TaskTracker 7674 DataNode I kill all processes listed. I r

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Aaron Davidson
Yes. On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas wrote: > So in summary: > >- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by >default. >- There is an open JIRA issue to add an option to allow clobbering. >- Even when clobbering, part- files may be left over f

EC2 Simple Cluster

2014-06-02 Thread Gianluca Privitera
Hi everyone, I would like to setup a very simple cluster (specifically using 2 micro instances only) of Spark on EC2 and make it run a simple Spark Streaming application I created. Someone actually managed to do that? Because after launching the scripts from this page: http://spark.apache.org/

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
So in summary:
- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
- There is an open JIRA issue to add an option to allow clobbering.
- Even when clobbering, part- files may be left over from previous saves, which is dangerous.
Is this correct? On Mon, Jun 2, 2

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Aaron Davidson
You may have to do "sudo jps", because it should definitely list your processes. What does hivecluster2:8080 look like? My guess is it says there are 2 applications registered, and one has taken all the executors. There must be two applications running, as those are the only things that keep open

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Aaron Davidson
+1 please re-add this feature On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell wrote: > Thanks for pointing that out. I've assigned you to SPARK-1677 (I think > I accidentally assigned myself way back when I created it). This > should be an easy fix. > > On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Nope... didn't try java 6. The standard installation guide didn't say anything about java 7 and suggested to do "-DskipTests" for the build.. http://spark.apache.org/docs/latest/building-with-maven.html So, I didn't see the warning message... On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell wrot

How to create RDDs from another RDD?

2014-06-02 Thread Gerard Maas
The RDD API has functions to join multiple RDDs, such as PairRDD.join or PairRDD.cogroup, which take another RDD as input, e.g. firstRDD.join(secondRDD). I'm looking for ways to do the opposite: split an existing RDD. What is the right way to create derivative RDDs from an existing RDD? e.g. imagine

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Are you building Spark with Java 6 or Java 7? Java 6 uses the standard Zip format and Java 7 uses Zip64. I think we've tried to add some build warnings if Java 7 is used, for this reason: https://github.com/apache/spark/blob/master/make-distribution.sh#L102 Any luck if you use JDK 6 to compile?

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix. On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu wrote: > Hi, Patrick, > > I think https://issues.apache.org/jira/browse/SPARK-1677 is talking abo

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
Hi, Patrick, I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about the same thing? How about assigning it to me? I think I missed the configuration part in my previous commit, though I declared that in the PR description…. Best, -- Nan Zhu On Monday, June 2, 20

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files. Fantastic.. On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen wrote: > I asked several people, no one seems to believe that we can do this: > $ PYTHONPATH=/path/to/a
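A quick way to count the entries in an assembly jar and see whether it has crossed the 65535-entry limit the linked thread describes (the jar path is a placeholder):

    import zipfile

    jar = "/path/to/spark-assembly.jar"
    with zipfile.ZipFile(jar) as z:
        n = len(z.namelist())
    print(n)  # above 65535, older Python zipimport implementations choke on it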

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Hey There, The issue was that the old behavior could cause users to silently overwrite data, which is pretty bad, so to be conservative we decided to enforce the same checks that Hadoop does. This was documented by this JIRA: https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apa

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
I built my new package like this: "mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package" Spark-shell is working now, but pyspark is still broken. I reported the problem on a different thread. Please take a look if you can... Desperately need ideas.. Thanks. -Simon O

Re: spark 1.0.0 on yarn

2014-06-02 Thread Patrick Wendell
Okay, I'm guessing that our upstream "Hadoop2" package isn't new enough to work with CDH5. We should probably clarify this in our downloads. Thanks for reporting this. What was the exact string you used when building? Also which CDH-5 version are you building against? On Mon, Jun 2, 2014 at 8:11

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
I asked several people, no one seems to believe that we can do this: $ PYTHONPATH=/path/to/assembly/jar python >>> import pyspark This following pull request did mention something about generating a zip file for all python related modules: https://www.mail-archive.com/reviews@spark.apache.org/msg0

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
If it matters, I have servers running at http://hivecluster2:4040/stages/ and http://hivecluster2:4041/stages/ When I run rdd.first, I see an item at http://hivecluster2:4041/stages/ but no tasks are running. Stage ID 1, first at :46, Tasks: Succeeded/Total 0/16. On Mon, Jun 2, 2014 at 10:09 AM,

Re: [Spark Streaming] Distribute custom receivers evenly across executors

2014-06-02 Thread Guang Gao
The receivers are submitted as tasks. They are supposed to be assigned to the executors in a round-robin manner by TaskSchedulerImpl.resourceOffers(). However, sometimes not all the executors are registered when the receivers are submitted. That's why the receivers fill up the registered executors

Is Hadoop MR now comparable with Spark?

2014-06-02 Thread Ian Ferreira
http://hortonworks.com/blog/ddm/#.U4yn3gJgfts.twitter

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
Looks like just worker and master processes are running: [hivedata@hivecluster2 ~]$ jps 10425 Jps [hivedata@hivecluster2 ~]$ ps aux|grep spark hivedata 10424 0.0 0.0 103248 820 pts/3S+ 10:05 0:00 grep spark root 10918 0.5 1.4 4752880 230512 ? Sl May27 41:43 java -cp :

Re: Failed to remove RDD error

2014-06-02 Thread Michael Chang
Hey Mayur, Thanks for the suggestion, I didn't realize that was configurable. I don't think I'm running out of memory, though it does seem like these errors go away when i turn off the spark.streaming.unpersist configuration and use spark.cleaner.ttl instead. Do you know if there are known issue

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
Indeed, the behavior has changed for good or for bad. I mean, I agree with the danger you mention but I'm not sure it's happening like that. Isn't there a mechanism for overwrite in Hadoop that automatically removes part files, then writes a _temporary folder and then only the part files along w

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
So, I did specify SPARK_JAR in my pyspark prog. I also checked the workers, it seems that the jar file is distributed and included in classpath correctly. I think the problem is likely at step 3.. I build my jar file with maven, like this: "mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
1) yes, that sc.parallelize(range(10)).count() has the same error. 2) the files seem to be correct 3) I have trouble at this step, "ImportError: No module named pyspark" but I seem to have files in the jar file: """ $ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python >>> import py

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Andrew Or
Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try: 1) Run a simple pyspark shell with yarn-client, and do

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
What I’ve found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is that files get overwritten automatically. There is one danger to this though. If I save to a directory that already has 20 part- files, but this time around I’m only saving 15 part- files, then there will be 5 leftover part

Re: Trouble with EC2

2014-06-02 Thread Stefan van Wouw
Dear PJ$, If you are familiar with Puppet, you could try using the puppet module I wrote (currently for Spark 0.9.0, I custom compiled it since no Debian package was available at the time I started with a project I required it for). https://github.com/stefanvanwouw/puppet-spark --- Kind regard

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-06-02 Thread Andrei
Thanks! This is even closer to what I am looking for. I'm in a trip now, so I'm going to give it a try when I come back. On Mon, Jun 2, 2014 at 5:12 AM, Ngoc Dao wrote: > Alternative solution: > https://github.com/xitrum-framework/xitrum-package > > It collects all dependency .jar files in your

pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created and indeed I see spark UI running on my client

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen wrote: > That helped a bit... Now I have a different failure: the start up process > is stuck in an infinite loop outputting the following message: > > 14/06/02 01:34:56

Re: Using String Dataset for Logistic Regression

2014-06-02 Thread Wush Wu
Dear all, Does Spark support sparse matrix/vector for LR now? Best, Wush On 2014/6/2 at 3:19 PM, "praveshjain1991" wrote: > Thank you for your replies. I've now been using integer datasets but ran > into > another issue. > > > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-not-proce

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre B
Hi Michaël, Thanks for this. We could indeed do that. But I guess the question is more about the change of behaviour from 0.9.1 to 1.0.0. We never had to care about that in previous versions. Does that mean we have to manually remove existing files or is there a way to "automatically" overwrite wh

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Michael Cutler
The function saveAsTextFile is a wrapper around saveAsHadoopFile

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
+1 Same question here... Message sent from a mobile device - excuse typos and abbreviations > On 2 June 2014 at 10:08, Kexin Xie wrote: > > Hi, > > Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw > org.apache.hadoop.mapred.FileAlreadyExistsException when file already

How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Kexin Xie
Hi, Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw org.apache.hadoop.mapred.FileAlreadyExistsException when file already exists. Is there a way I can allow Spark to overwrite the existing file? Cheers, Kexin

Re: Using String Dataset for Logistic Regression

2014-06-02 Thread praveshjain1991
Thank you for your replies. I've now been using integer datasets but ran into another issue. http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-not-processing-file-with-particular-number-of-entries-td6694.html Any ideas? -- Thanks -- View this message in context: http://apac

Spark Streaming not processing file with particular number of entries

2014-06-02 Thread praveshjain1991
Hi, I am using a spark-streaming application to process some data over a 3 node cluster. It is, however, not processing any file that contains 0.4 million entries. Files with any other number of entries are processed fine. When running in local mode, even the 0.4 million entries file is processed fi