Re: worker keeps getting disassociated upon a failed job spark version 0.90

2014-03-17 Thread yukang chen
I have met the same problem on spark 0.9. The master lost all of its workers because the workers' heartbeats timed out. The master shows "Registering worker 10.2.6.134:56158 with 24 cores, 32.0 GB RAM", but it did not add the restarted worker ID to its worker set. On Thu, Feb 27, 2014 at 8:14 AM, Shirish wr

Re: combining operations elegantly

2014-03-17 Thread Richard Siebeling
Patrick, Koert, I'm also very interested in these examples, could you please post them if you find them? thanks in advance, Richard On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers wrote: > not that long ago there was a nice example on here about how to combine > multiple operations on a single

Spark shell exits after 1 min

2014-03-17 Thread Sai Prasanna
Hi everyone !! I installed scala 2.9.3, spark 0.8.1, oracle java 7... I launched the master and logged on to the interactive Spark shell with MASTER=spark://localhost:7077 ./spark-shell. But after one minute it automatically exits from the interactive shell... Is there something I am missing... Do I need

Re: Spark shell exits after 1 min

2014-03-17 Thread Sai Prasanna
Solved... but I don't know what the difference is... just running ./spark-shell fixes it all... but I don't know why !! On Mon, Mar 17, 2014 at 1:32 PM, Sai Prasanna wrote: > Hi everyone !! > > I installed scala 2.9.3, spark 0.8.1, oracle java 7... > > I launched master and logged on to the interactive sp

Re: Spark shell exits after 1 min

2014-03-17 Thread Sourav Chandra
Hi Sai, A plain ./spark-shell starts the Spark application in "local" mode, and that will always work (hopefully :)). When you specify a Spark URL, i.e. "spark://<>", it means you are running against a Spark standalone cluster, which implies that a Spark master must be running at the given location (spark://localhost:7
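
A minimal Scala sketch of the distinction being described, assuming the 0.8/0.9-era SparkContext constructors (the host, port, and app names here are placeholders):

```scala
import org.apache.spark.SparkContext

object MasterModes {
  def main(args: Array[String]) {
    // "local[2]" runs the job inside this JVM with 2 worker threads;
    // no master or worker daemons need to be running.
    val localSc = new SparkContext("local[2]", "LocalDemo")
    println(localSc.parallelize(1 to 100).count())
    localSc.stop()

    // "spark://host:port" submits to a standalone cluster, so a master
    // must actually be listening at that address or the app disconnects.
    val clusterSc = new SparkContext("spark://localhost:7077", "ClusterDemo")
    println(clusterSc.parallelize(1 to 100).count())
    clusterSc.stop()
  }
}
```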

Question about RDD creations in Spark

2014-03-17 Thread 王永春
Hello. I have a question about RDD creation in Spark. When will a new RDD be created? I got an initial RDD from the hadoopRDD method of SparkContext and ran a count action on it. After that I could examine the RDD on the driver program's web UI page. Then I did a flatMap transformation on the in

Log analyzer and other Spark tools

2014-03-17 Thread Roman Pastukhov
Hi. We're thinking about writing a tool that would read Spark logs and output cache contents at some point in time (e.g. if you want to see what data fills the cache and whether some of it may be unpersisted to improve performance). Are there similar projects that already exist? Is there a list o

example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
Has anyone got a working example of a Spark application that analyzes data in a non-line-oriented format, such as XML or JSON? I'd like to do this without re-inventing the wheel...anyone care to share? Thanks! Diana

Re: example of non-line oriented input data?

2014-03-17 Thread Nicholas Chammas
There was a previous discussion about this here: http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html How big are the XML or JSON files you're looking to deal with? It may not be practical to deserialize the entire document at once. In that case an obviou

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
I don't actually have any data. I'm writing a course that teaches students how to do this sort of thing and am interested in looking at a variety of real life examples of people doing things like that. I'd love to see some working code implementing the "obvious work-around" you mention...do you h

Efficiently map external array data (OpenCV) to spark array

2014-03-17 Thread Jaonary Rabarisoa
Hi all, I'm currently trying to use Spark's KMeans|| to cluster keypoint descriptors extracted using OpenCV. OpenCV extracts and stores descriptors inside a Mat object. I convert the Mat object into Array[Array[Double]] with the following code : (0 until descriptors.rows).map { i => val row = new A
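
The snippet is cut off in the preview; here is a hedged sketch of one way the conversion could look, assuming the OpenCV Java bindings (org.opencv.core.Mat) and a single-channel CV_32F descriptor matrix, which is what most OpenCV feature extractors produce:

```scala
import org.opencv.core.Mat

// Mat.get(row, col, buf) copies buf.length values starting at (row, col) in
// row-major order, so reading `cols` floats from (i, 0) yields exactly row i.
def matToArrays(descriptors: Mat): Array[Array[Double]] =
  (0 until descriptors.rows).map { i =>
    val row = new Array[Float](descriptors.cols)
    descriptors.get(i, 0, row)
    row.map(_.toDouble)
  }.toArray

// The rows can then be handed to Spark, e.g. (sc is assumed to exist):
// val data = sc.parallelize(matToArrays(descriptors))
```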

Re: example of non-line oriented input data?

2014-03-17 Thread Krakna H
Katrina, Not sure if this is what you had in mind, but here's some simple pyspark code that I recently wrote to deal with JSON files. from pyspark import SparkContext, SparkConf from operator import add import json import random import numpy as np def concatenate_paragraphs(sentence_array):

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Xiangrui Meng
The factor matrix Y is used twice in the implicit ALS computation: once to compute the global Y^T Y, and once to compute the local Y_i^T C_i Y_i. -Xiangrui On Sun, Mar 16, 2014 at 1:18 PM, Matei Zaharia wrote: > On Mar 14, 2014, at 5:52 PM, Michael Allman wrote: > > I also found that the product and user
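
For context, the per-user solve in the implicit-feedback ALS formulation (Hu, Koren & Volinsky) is roughly the following; splitting $Y^\top C^u Y$ is what allows the dense $Y^\top Y$ term to be computed once globally while only the sparse correction varies per user (notation follows the paper rather than Spark's source):

$$
x_u = \left(Y^\top C^u Y + \lambda I\right)^{-1} Y^\top C^u\, p(u),
\qquad
Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I)\, Y .
$$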

Re: Incrementally add/remove vertices in GraphX

2014-03-17 Thread Alessandro Lulli
Hi All, Is anybody looking into this? I think it is related to the discussion "Are there any plans to develop Graphx Streaming?". Using union / subtract on a VertexRDD or EdgeRDD leads to the creation of a new RDD but NOT to the modification of the RDD inside the graph. Is creating a new graph th

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
Thanks, Krakna, very helpful. The way I read the code, it looks like you are assuming that each line in foo.log contains a complete json object? (That is, that the data doesn't contain any records that are split into multiple lines.) If so, is that because you know that to be true of your data?

Running spark examples

2014-03-17 Thread Chengi Liu
Hi, I compiled the spark examples and I see that there are a couple of jars: spark-examples_2.10-0.9.0-incubating-sources.jar and spark-examples_2.10-0.9.0-incubating.jar. If I want to run an example using these jars, which one should I run and how do I run it? Thanks

Re: Running spark examples

2014-03-17 Thread Matei Zaharia
Look at the “running the examples” section of http://spark.incubator.apache.org/docs/latest/index.html, there’s a script to do it. On Mar 17, 2014, at 9:55 AM, Chengi Liu wrote: > Hi, > I compiled the spark examples and I see that there are couple of jars > spark-examples_2.10-0.9.0-incubat

Re: example of non-line oriented input data?

2014-03-17 Thread Krakna H
Diana, that's correct (and I apologize for calling you Katrina mistakenly in my earlier e-mail) -- I had to do some kind of pre-processing to split up the original JSON object, although this is not that hard. Especially if your JSON data is coming from something like Mongodb where you can just spew

Re: Running spark examples

2014-03-17 Thread Chengi Liu
Hi, Thanks for the quick response. Is there a simple way to write and deploy apps on Spark? import org.apache.spark.SparkContext; import org.apache.spark.SparkContext._; object HelloWorld { def main(args: Array[String]) { println("Hello, world!") val sc = new SparkContext("local
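
The snippet is truncated in the preview; a minimal, self-contained sketch of a standalone app against the 0.9-era API, assuming it is packaged with sbt and run with the Spark assembly on the classpath (the input path is a placeholder):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings in reduceByKey and friends

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
    // "local" keeps everything in this JVM; swap in "spark://master:7077"
    // to run against a standalone cluster instead.
    val sc = new SparkContext("local", "HelloWorld")
    val counts = sc.textFile("README.md")
                   .flatMap(_.split("\\s+"))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
    counts.take(5).foreach(println)
    sc.stop()
  }
}
```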

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
Hi Diana, Non-text input formats are only supported in Java and Scala right now, where you can use sparkContext.hadoopFile or .hadoopDataset to load data with any InputFormat that Hadoop MapReduce supports. In Python, you unfortunately only have textFile, which gives you one record per line. Fo
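
A hedged sketch of the Scala side of that, meant for the spark-shell where sc already exists; TextInputFormat is only a stand-in here, since the point is that any Hadoop InputFormat (an XML or Avro format from a third-party library, for instance) slots into the same call:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// hadoopFile is parameterized on the key, value, and InputFormat types; an
// InputFormat that understands your records gives one record per value
// instead of one record per line.
val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/data", 4)

// Hadoop reuses Writable objects, so convert to plain strings right away.
val values = records.map { case (_, text) => text.toString }
println(values.count())
```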

Re: example of non-line oriented input data?

2014-03-17 Thread Nicholas Chammas
Hmm, so I lucked out with my data source in that it comes to me as line-delimited JSON, so I didn't have to write code to massage it into that format. If you are prepared to make several assumptions about your data (let's say it's JSON), it should be straightforward to write some kind of pre-proce
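
A hedged Scala equivalent of the line-delimited approach, assuming one complete JSON object per line and using the JSON parser bundled with the Scala 2.10 standard library (the file path and the "category" field name are placeholders):

```scala
import org.apache.spark.SparkContext._      // reduceByKey
import scala.util.parsing.json.JSON

// Each input line is assumed to hold one complete JSON object;
// lines that fail to parse are silently dropped by the flatMap.
val records = sc.textFile("input.json")
  .flatMap(line => JSON.parseFull(line))
  .map(_.asInstanceOf[Map[String, Any]])

// Count occurrences of one (placeholder) field across all records.
val counts = records
  .flatMap(_.get("category"))
  .map(value => (value.toString, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```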

sbt assembly fails

2014-03-17 Thread Chengi Liu
Hi, I am trying to compile the spark project using sbt/sbt assembly.. And i see this error: [info] Resolving io.netty#netty-all;4.0.13.Final ... [error] Server access Error: Connection timed out url= https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13.Final/netty-all

java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread anny9699
Hi, I ran into this exception when computing a new RDD from an existing RDD or using .count on some RDDs. The following is the situation: val DD1=D.map(d => { (d._1,D.map(x => math.sqrt(x._2*d._2)).toArray) }) DD is in the format RDD[(Int,Double)] and the error message is: org.apache.spark.SparkExcept

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
Thanks Matei. That makes sense. I have here a dataset of many, many smallish XML files, so using mapPartitions that way would make sense. I'd love to see a code example though... It's not as obvious to me how to do that as it probably should be. Thanks, Diana On Mon, Mar 17, 2014 at 1:02 PM, Ma

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
Here's an example of getting together all lines in a file as one string: $ cat dir/a.txt Hello world! $ cat dir/b.txt What's up?? $ bin/pyspark >>> files = sc.textFile("dir") >>> files.collect() [u'Hello', u'world!', u"What's", u'up??'] # one element per line, not what we want >>> files.g

Re: sbt assembly fails

2014-03-17 Thread Mayur Rustagi
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3ccaaqhkj48japuzqc476es67c+rrfime87uprambdoofhcl0k...@mail.gmail.com%3E You also have to specify git proxy as code may be copied off git also. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
"There's also mapPartitions, which gives you an iterator for each partition instead of an array. You can then return an iterator or list of objects to produce from that." I confess, I was hoping for an example of just that, because i've not yet been able to figure out how to use mapPartitions. No

is collect exactly-once?

2014-03-17 Thread Adrian Mocanu
Hi, Quick question here: I know that .foreach is not idempotent. I am wondering if collect() is idempotent? Meaning that once I've collect()-ed, if a Spark node crashes, I can't get the same values from the stream ever again. Thanks -Adrian

Re: sbt assembly fails

2014-03-17 Thread Chengi Liu
I have set it up.. still it fails.. Question: https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13 is not there? Instead 4.0.18 is there?? Is this a bug

Problem when execute spark-shell

2014-03-17 Thread Yexi Jiang
Hi, I am a beginner of Spark. Currently I am trying to install spark on my laptop. I followed the tutorial at http://spark.apache.org/screencasts/1-first-steps-with-spark.html (The only difference is that I installed scala-2.10.1 instead of 2.9.2). I packaged spark successfully with "sbt package

Re: links for the old versions are broken

2014-03-17 Thread Walrus theCat
On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson wrote: > Looks like everything from 0.8.0 and before errors similarly (though > "Spark 0.3 for Scala 2.9" has a malformed link as well). > > > On Thu, Mar 13, 2014 at 10:52 AM, Walrus theCat wrote: > >> Sup, >> >> Where can I get Spark 0.7.3? I

Re: is collect exactly-once?

2014-03-17 Thread Matei Zaharia
Yup, it only returns each value once. Matei On Mar 17, 2014, at 1:14 PM, Adrian Mocanu wrote: > Hi > Quick question here, > I know that .foreach is not idempotent. I am wondering if collect() is > idempotent? Meaning that once I’ve collect()-ed if spark node crashes I can’t > get the same va

Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread Andrew Ash
It looks like you're trying to access an RDD ("D") from inside a closure (the parameter to the first map), which isn't possible with the current implementation of Spark. Can you rephrase it to not access D from inside the map call? On Mon, Mar 17, 2014 at 10:36 AM, anny9699 wrote: > Hi, > > I me
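
A hedged sketch of one common workaround, assuming D (an RDD[(Int, Double)] per the original post) is small enough to bring back to the driver; the inner RDD reference is replaced by a broadcast plain array, which closures may use freely:

```scala
// Referencing D inside D.map(...) nests one RDD inside another, which Spark
// does not support. If D fits in driver memory, collect and broadcast it.
val dLocal = D.collect()                  // Array[(Int, Double)]
val dBroadcast = sc.broadcast(dLocal)

val DD1 = D.map { d =>
  // dBroadcast.value is an ordinary array, safe to use inside the closure.
  (d._1, dBroadcast.value.map(x => math.sqrt(x._2 * d._2)))
}
```

If D is too large to collect, D.cartesian(D) followed by a grouping step is the usual alternative, at the price of a full shuffle.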

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
Oh, I see, the problem is that the function you pass to mapPartitions must itself return an iterator or a collection. This is used so that you can return multiple output records for each input record. You can implement most of the existing map-like operations in Spark, such as map, filter, flatM
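
A small Scala illustration of the same point: the function handed to mapPartitions consumes an iterator of input records and must itself return an iterator (or a collection), which is also what lets one partition produce one, many, or zero output records (paths are placeholders):

```scala
// Emit a single record per partition: all of its lines glued into one string.
// The function's return value is an Iterator, as mapPartitions requires.
val wholePartitions = sc.textFile("data/")
  .mapPartitions(lines => Iterator(lines.mkString("\n")))

// flatMap-like behavior falls out naturally: zero or more outputs per input.
val nonEmptyUpper = sc.textFile("data/")
  .mapPartitions(lines => lines.filter(_.nonEmpty).map(_.toUpperCase))
```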

Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread anny9699
Hi Andrew, Thanks for the reply. However, I did almost the same thing in another closure: val simi=dataByRow.map(point => { val corrs=dataByRow.map(x => arrCorr(point._2,x._2)) (point._1,corrs) }) Here dataByRow is of the format RDD[(Int,Array[Double])] and arrCorr is a function that I wrote to compu

inexplicable exceptions in Spark 0.7.3

2014-03-17 Thread Walrus theCat
Hi, I'm getting this stack trace, using Spark 0.7.3. No references to anything in my code, never experienced anything like this before. Any ideas what is going on? java.lang.ClassCastException: spark.SparkContext$$anonfun$9 cannot be cast to scala.Function2 at spark.scheduler.ResultTask$.de

Re: inexplicable exceptions in Spark 0.7.3

2014-03-17 Thread Andrew Ash
Are you running from the spark shell or from a standalone job? On Mon, Mar 17, 2014 at 4:17 PM, Walrus theCat wrote: > Hi, > > I'm getting this stack trace, using Spark 0.7.3. No references to > anything in my code, never experienced anything like this before. Any > ideas what is going on? > >

Re: Incrementally add/remove vertices in GraphX

2014-03-17 Thread Adam Novak
I would assume that, regardless of the efficiency of such an operation, any method of adding or removing vertices would need to result in a new graph, since graphs in GraphX are supposed to be immutable. It sounds like what you probably want is an efficient union/subtract/whatever that operates on
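
A hedged sketch of the rebuild-a-new-graph approach being discussed, against GraphX's 0.9-era Graph.apply; whether this is efficient enough is exactly the open question in the thread (the String/Int attribute types are placeholders):

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Graphs are immutable, so "adding" vertices and edges means constructing a
// new Graph from the unioned RDDs; the original graph is left untouched.
def addToGraph(g: Graph[String, Int],
               newVertices: RDD[(VertexId, String)],
               newEdges: RDD[Edge[Int]]): Graph[String, Int] =
  Graph(g.vertices.union(newVertices), g.edges.union(newEdges))
```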

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Michael Allman
You are correct, in the long run it doesn't matter which matrix you begin the iterative process with. I was thinking in terms of doing a side-by-side comparison to Oryx. I've posted a bug report as SPARK-1262. I described the problem I found and the mitigation strategy I've used. I think that this

Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread Ian O'Connell
I'm guessing the other result was wrong, or just never evaluated here. The RDD transforms being lazy may have let it be expressed, but it wouldn't work. Nested RDDs are not supported. On Mon, Mar 17, 2014 at 4:01 PM, anny9699 wrote: > Hi Andrew, > > Thanks for the reply. However I did almost t

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Michael Allman
I've created https://spark-project.atlassian.net/browse/SPARK-1263 to address the issue of the factor matrix recomputation. I'm planning to submit a related pull request shortly.

Re: links for the old versions are broken

2014-03-17 Thread Matei Zaharia
Thanks for reporting this, looking into it. On Mar 17, 2014, at 2:44 PM, Walrus theCat wrote: > > > > On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson wrote: > Looks like everything from 0.8.0 and before errors similarly (though "Spark > 0.3 for Scala 2.9" has a malformed link as well). >

Re: Log analyzer and other Spark tools

2014-03-17 Thread Matei Zaharia
Take a look at the SparkListener API included in Spark, you can use it to capture various events. There’s also this pull request: https://github.com/apache/spark/pull/42 that will persist application logs and let you rebuild the web UI after the app runs. It uses the same API to log events. Ma

Re: sbt assembly fails

2014-03-17 Thread Sean Owen
It's in the main Maven repo: http://central.maven.org/maven2/io/netty/netty-all/ I assume you're seeing errors accessing all repos? The last few you quote are not where the artifacts are intended to be; you're just seeing it fail through all of them. I think it remains a connectivity problem from your env

Re: Problem when execute spark-shell

2014-03-17 Thread Shivani Rao
I am new and I don't know much either, but this is what helped me. a) Check if the compiled jar is in /spark-0.9.0-incubating/assembly/target/scala-2.10.1/ b) Try the sbt package command c) spark-shell will only run from the root of the spark-0.9.0-incubating directory. I think the path of the shell s

Trouble getting hadoop and spark run along side on my vm

2014-03-17 Thread Shivani Rao
From what I understand, getting Spark to run alongside a Hadoop cluster requires the following: a) a working Hadoop cluster, b) a compiled Spark, and c) configuration parameters that point Spark to the right Hadoop conf files. i) Can you let me know the specific steps to take after Spark was compiled (via sbt a

Re: Problem when execute spark-shell

2014-03-17 Thread Debasish Das
You need the Spark assembly jar to run the spark shell. Please do sbt assembly to generate the jar. On Mar 17, 2014 2:11 PM, "Yexi Jiang" wrote: > Hi, > > I am a beginner of Spark. > Currently I am trying to install spark on my laptop. > > I followed the tutorial at > http://spark.apache.org/scr

Re: Problem when execute spark-shell

2014-03-17 Thread Yexi Jiang
Thanks all! I figured it out... I thought sbt package was enough... 2014-03-17 21:46 GMT-04:00 Debasish Das : > You need the spark assembly jar to run the spark shell. Please do sbt > assembly to generate the jar > On Mar 17, 2014 2:11 PM, "Yexi Jiang" wrote: > >> Hi, >> >> I am a beginner

Spark 0.9.0-incubation + Apache Hadoop 2.2.0 + YARN encounter Compression codec com.hadoop.compression.lzo.LzoCodec not found

2014-03-17 Thread Andrew Lee
Hi All, I have been contemplating this problem and couldn't figure out what is missing in the configuration. I traced the script and tried to look for CLASSPATH to see what is included; however, I couldn't find any place that honors/inherits HADOOP_CLASSPATH (or pulls in any map-reduce

Apache Spark 0.9.0 Build Error

2014-03-17 Thread wapisani
Good morning! I'm attempting to build Apache Spark 0.9.0 on Windows 8. I've installed all prerequisites (except Hadoop) and ran "sbt/sbt assembly" while in the root directory. I'm getting an error after the line "Set current project to root". The error is: [error] Not a valid command: / [error] /s

Re: Apache Spark 0.9.0 Build Error

2014-03-17 Thread Mark Hamstra
Try ./sbt/sbt assembly On Mon, Mar 17, 2014 at 9:06 PM, wapisani wrote: > Good morning! I'm attempting to build Apache Spark 0.9.0 on Windows 8. I've > installed all prerequisites (except Hadoop) and run "sbt/sbt assembly" > while > in the root directory. I'm getting an error after the line "Se

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Xiangrui Meng
Hi Michael, I made a couple of changes to implicit ALS. One gives faster construction of YtY (https://github.com/apache/spark/pull/161), which was merged into master. The other caches intermediate matrix factors properly (https://github.com/apache/spark/pull/165). They should give you the same result a

Re: sbt assembly fails

2014-03-17 Thread Chengi Liu
Hi Sean, Yeah, I am seeing errors across all repos, and yes, this error is mainly because of a connectivity issue... How do I set up the proxy? I did set it up as suggested by Mayur: export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username -Dhttp.pr

Re: sbt assembly fails

2014-03-17 Thread Mayur Rustagi
Is it translating to sbt? Are you also setting the command-line proxy HTTP_PROXY? The easiest thing is to build some small code & just test it out by building on the command line. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, Mar 18, 2

Re: sbt assembly fails

2014-03-17 Thread Chengi Liu
Yeah.. The http_proxy is set up, and so is https_proxy. Basically, my Maven projects, git pulls, etc. are all working fine... except this. Here is another question which might help me bypass this issue: if I create a jar using Eclipse, how do I run that jar in code? Like in Hadoop, I

Running spark examples/scala scripts

2014-03-17 Thread Pariksheet Barapatre
Hello all, I am trying to run the examples shipped with Spark, i.e. the ones in the examples directory. [cloudera@aster2 examples]$ ls bagel ExceptionHandlingTest.scala HdfsTest2.scala LocalKMeans.scala MultiBroadcastTest.scala SparkHdfsLR.scala SparkPi.scala BroadcastTest.scala

Re: sbt assembly fails

2014-03-17 Thread Mayur Rustagi
You need to assemble the code to get Spark working (unless you are using hadoop 1.0.4). To run the code you can follow any of the standalone guides here: https://spark.apache.org/docs/0.9.0/quick-start.html#a-standalone-app-in-scala (you would still need sbt though). Mayur Rustagi Ph: +1 (760) 203