Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Jaonary Rabarisoa
Dear All, I'm trying to cluster data from native library code with Spark KMeans||. In my native library the data are represented as a matrix (rows = number of data points, cols = dimension). For efficiency reasons, they are copied into a one-dimensional Scala Array in row-major order, so after the computation

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-18 Thread dmpour23
On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote: > Is there a reason for spark using the older akka? > > > > > On Sun, Mar 2, 2014 at 1:53 PM, 1esha wrote: > > The problem is in akka remote. It contains files compiled with 2.4.*. When > > you run it with 2.5.* in classpath i

Connect Exception Error in spark interactive shell...

2014-03-18 Thread Sai Prasanna
Hi ALL !! In the interactive Spark shell I get the following error. I just followed the steps of the video "First steps with Spark - Spark screen cast #1" by Andy Konwinski... Any thoughts? scala> val textfile = sc.textFile("README.md") textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1

Re: Connect Exception Error in spark interactive shell...

2014-03-18 Thread Sourav Chandra
Not sure whether this is related to https://github.com/amplab/docker-scripts/issues/24 On Tue, Mar 18, 2014 at 3:29 PM, Sai Prasanna wrote: > Hi ALL !! > > In the interactive spark shell i get the following error. > I just followed the steps of the video "First steps with spark - spark > screen

Re: Apache Spark 0.9.0 Build Error

2014-03-18 Thread wapisani
I tried that command on Fedora and I got a lot of random downloads (around 250) and it appeared that something was trying to start BitTorrent. That command "./sbt/sbt assembly" doesn't work on Windows. I installed sbt separately. Is there a way to determine if I'm using the sbt that'

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-18 Thread Ognen Duzlevski
On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote: On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote: Is there a reason for spark using the older akka? On Sun, Mar 2, 2014 at 1:53 PM, 1esha wrote: The problem is in akka remote. It contains files compiled with 2.4.*. When you r

Re: example of non-line oriented input data?

2014-03-18 Thread Diana Carroll
Thanks, Matei. In the context of this discussion, it would seem mapPartitions is essential, because it's the only way I'm going to be able to process each file as a whole, in our example of a large number of small XML files which need to be parsed as a whole file because records are not required to

KryoSerializer return null when deserialize Task obj in Executor

2014-03-18 Thread 林武康
Hi all, I changed spark.closure.serializer to Kryo; when I try a count action in the Spark shell, the Task obj deserialized in the Executor returns null. The src line is: override def run() { .. task = ser.deserialize[Task[Any]](...) .. } where task is null. Can anyone help me? Thank you!
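For reference, a sketch (not from the thread) of how that setting would be applied in a standalone app via SparkConf; the property name is as documented for Spark 0.9, and in the shell it would instead come from system properties:

    import org.apache.spark.SparkConf

    // Switch the closure serializer from the default Java serializer to Kryo,
    // as in the report above.
    val conf = new SparkConf()
      .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")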

Re: example of non-line oriented input data?

2014-03-18 Thread Diana Carroll
Well, if anyone is still following this, I've gotten the following code working which in theory should allow me to parse whole XML files: (the problem was that I can't return the tree iterator directly. I have to call iter(). Why?) import xml.etree.ElementTree as ET # two source files, format
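Diana's code is Python; for comparison, here is a hedged Scala sketch of the same mapPartitions pattern, assuming each XML file is smaller than an HDFS block so that every partition holds the lines of exactly one file (the "record" tag is a made-up example):

    import scala.xml.XML

    // sc is the SparkContext, as in the shell.
    val records = sc.textFile("hdfs:///path/to/xml-dir")
      .mapPartitions { lines =>
        // Reassemble the whole file and parse it as one XML document.
        val doc = XML.loadString(lines.mkString("\n"))
        // Emit one output element per <record> node.
        (doc \\ "record").iterator
      }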

Re: Apache Spark 0.9.0 Build Error

2014-03-18 Thread Robin Cjc
Hi, if you run that under Windows, you should use "\" in place of "/". sbt/sbt means the sbt script under the sbt folder. On Mar 18, 2014 8:42 PM, "wapisani" wrote: > I tried that command on Fedora and I got a lot of random downloads (around > 250 downloads) and it appeared that something was trying

Re: Apache Spark 0.9.0 Build Error

2014-03-18 Thread wapisani
Hi Chen, I tried "sbt\sbt assembly" and I got an error of " 'sbt\sbt' is not recognized as an internal or external command, operable program or batch file." On Tue, Mar 18, 2014 at 11:18 AM, Chen Jingci [via Apache Spark User List] < ml-node+s1001560n2811...@n3.nabble.com> wrote: > hi, if you

Re: Connect Exception Error in spark interactive shell...

2014-03-18 Thread Mayur Rustagi
Your HDFS is down; you probably forgot to format the namenode. Check whether the namenode is running: ps -aef | grep Namenode. If it is not, and the data in HDFS is not critical: hadoop namenode -format, then restart HDFS. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Separating classloader management from SparkContexts

2014-03-18 Thread Punya Biswal
Hi Spark people, Sorry to bug everyone again about this, but do people have any thoughts on whether sub-contexts would be a good way to solve this problem? I'm thinking of something like class SparkContext { // ... stuff ... def inSubContext[T](fn: SparkContext => T): T } this way, I could d
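For illustration only, a sketch of what Punya's proposed API might look like; inSubContext is hypothetical and does not exist in Spark, and the placeholder body just passes the parent context through, whereas a real implementation would swap in a per-job classloader:

    import org.apache.spark.SparkContext

    object SubContextSketch {
      implicit class SubContextOps(sc: SparkContext) {
        // Placeholder: a real version would isolate classloading per sub-context.
        def inSubContext[T](fn: SparkContext => T): T = fn(sc)
      }
    }
    // Intended usage: val n = sc.inSubContext { sub => sub.textFile("data.txt").count() }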

[spark] New article on spark & scalaz-stream (& a bit of ML)

2014-03-18 Thread Pascal Voitot Dev
Hi, I wrote this new article after studying more deeply how to adapt scalaz-stream to Spark DStreams. I re-explain a few Spark (& scalaz-stream) concepts (in my "own" words) in it, and I went further using the new scalaz-stream NIO API, which is quite interesting IMHO. The result is a long blog triptych start

Re: Running spark examples/scala scripts

2014-03-18 Thread Mayur Rustagi
print out the last line & run it outside on the shell :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, Mar 18, 2014 at 2:37 AM, Pariksheet Barapatre wrote: > Hello all, > > I am trying to run shipped in example wi

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Michael Allman
Hi Xiangrui, I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can you explain? Also, thanks for addressing the issue with factor matrix persistence in PR 165. I was probably not going to get to that for a while. I will try to test your changes today for speed improvements

Re: inexplicable exceptions in Spark 0.7.3

2014-03-18 Thread Walrus theCat
Hi Andrew, Thanks for your interest. This is a standalone job. On Mon, Mar 17, 2014 at 4:30 PM, Andrew Ash wrote: > Are you running from the spark shell or from a standalone job? > > > On Mon, Mar 17, 2014 at 4:17 PM, Walrus theCat wrote: > >> Hi, >> >> I'm getting this stack trace, using Spa

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Sorry, the link was wrong. Should be https://github.com/apache/spark/pull/131 -Xiangrui On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman wrote: > Hi Xiangrui, > > I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can > you explain? > > Also, thanks for addressing the issue

Re: Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Xiangrui Meng
Hi Jaonary, With the current implementation, you need to call Array.slice to make each row an Array[Double] and cache the result RDD. There is a plan to support block-wise input data and I will keep you informed. Best, Xiangrui On Tue, Mar 18, 2014 at 2:46 AM, Jaonary Rabarisoa wrote: > Dear Al
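A minimal Scala sketch of the slicing Xiangrui describes, assuming the Spark 0.9 MLlib API in which KMeans.train takes an RDD[Array[Double]]; the sizes, k, and iteration count are made-up stand-ins:

    import org.apache.spark.mllib.clustering.KMeans

    val nRows = 1000
    val nCols = 64
    val flat = new Array[Double](nRows * nCols) // filled row-major by the native code
    // Slice the flat array into one Array[Double] per row, then cache the RDD.
    val rows = (0 until nRows).map(i => flat.slice(i * nCols, (i + 1) * nCols))
    val data = sc.parallelize(rows).cache()
    val model = KMeans.train(data, 10, 20) // k = 10, maxIterations = 20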

Re: spark-shell fails

2014-03-18 Thread psteckler
Although "sbt assembly" reports success, I re-ran that step, and see errors like: Error extracting zip entry 'scala/tools/nsc/transformUnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply (omitting rest of super-long path) (File name too long) Is this a problem with the 'zip' tool on my sys

Re: spark-shell fails

2014-03-18 Thread psteckler
OK, the problem was that the directory where I had installed Spark is encrypted. The particular encryption system appears to limit the length of file names. I re-installed on a vanilla partition, and spark-shell runs fine. -- View this message in context: http://apache-spark-user-list.1001560.n3.n

Maven repo for Spark pre-built with CDH4?

2014-03-18 Thread Punya Biswal
Hi all, The Maven central repo contains an artifact for spark 0.9.0 built with unmodified Hadoop, and the Cloudera repo contains an artifact for spark 0.9.0 built with CDH 5 beta. Is there a repo that contains spark-core built against a non-beta version of CDH (such as 4.4.0)? Punya

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Michael Allman
I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2823.html Sent from the Apach

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Glad to hear about the speed-up. I hope we can improve the implementation further in the future. -Xiangrui On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman wrote: > I just ran a runtime performance comparison between 0.9.0-incubating and your > als branch. I saw a 1.5x improvement in performance.

Regarding Successive operation on elements and recursively

2014-03-18 Thread yh18190
Hi, I am new to the Spark/Scala environment. Currently I am working on discrete wavelet transformation algorithms on time-series data. I have to perform recursive additions on successive elements in RDDs. For example, given a list of elements (RDD) a1 a2 a3 a4, the level-1 transformation is a1+a2 a3+a4 a
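A hedged Scala sketch of one such transformation level, assuming a Spark version that provides RDD.zipWithIndex (it was added after 0.9, so on older releases the index would have to be attached another way):

    import org.apache.spark.SparkContext._ // pair-RDD operations

    val level0 = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0)) // a1 a2 a3 a4
    val level1 = level0.zipWithIndex()
      .map { case (v, i) => (i / 2, v) } // a1,a2 share key 0; a3,a4 share key 1
      .reduceByKey(_ + _)                // a1+a2, a3+a4
      .sortByKey()
      .values
    // Repeating this until one element remains gives the full recursion.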

Re: Apache Spark 0.9.0 Build Error

2014-03-18 Thread x
I tried to build 0.9.0 on Windows & Cygwin yesterday and it passed. Did you launch it on Cygwin? -xj On Wed, Mar 19, 2014 at 12:42 AM, wapisani wrote: > Hi Chen, > > I tried "sbt\sbt assembly" and I got an error of " 'sbt\sbt' is not > recognized as an internal or external command, operable progr

Re: example of non-line oriented input data?

2014-03-18 Thread Matei Zaharia
Hi Diana, This seems to work without the iter() in front if you just return treeiterator. What happened when you didn’t include that? Treeiterator should return an iterator. Anyway, this is a good example of mapPartitions. It’s one where you want to view the whole file as one object (one XML h

Re: Incrementally add/remove vertices in GraphX

2014-03-18 Thread Matei Zaharia
I just meant that you call union() before creating the RDDs that you pass to new Graph(). If you call it after it will produce other RDDs. The Graph() constructor actually shuffles and “indexes” the data to make graph operations efficient, so it’s not too easy to add elements after. You could a
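A short sketch of the ordering Matei describes, with made-up vertex and edge RDDs; the unions happen before Graph() so the constructor indexes the combined data once:

    import org.apache.spark.graphx._

    val baseVerts  = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    val extraVerts = sc.parallelize(Seq((3L, "c")))
    val baseEdges  = sc.parallelize(Seq(Edge(1L, 2L, 1)))
    val extraEdges = sc.parallelize(Seq(Edge(2L, 3L, 1)))
    // Union first, then build the graph.
    val graph = Graph(baseVerts.union(extraVerts), baseEdges.union(extraEdges))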

Re: Maven repo for Spark pre-built with CDH4?

2014-03-18 Thread Rob Povey
FWIW, after searching for the same library, I had to build Spark myself to get it to work with HDFS on a Cloudera install. I downloaded the CDH version from the Spark site and still had to build it to get it to work. This is the command I used: SPARK_HADOOP_VERSION=2.0.0-cdh4.6.0 sbt/sbt assembly SPARK

Access original filename in a map function

2014-03-18 Thread Uri Laserson
Hi spark-folk, I have a directory full of files that I want to process using PySpark. There is some necessary metadata in the filename that I would love to attach to each record in that file. Using Java MapReduce, I would access (FileSplit) context.getInputSplit()).getPath().getName() in the s
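On the Scala side, a hedged sketch assuming a Spark version with SparkContext.wholeTextFiles (added after this thread was written); it yields (filename, content) pairs, so the filename metadata stays attached to each record:

    val filesWithNames = sc.wholeTextFiles("hdfs:///path/to/dir")
    val records = filesWithNames.flatMap { case (name, content) =>
      // Tag every line of the file with the file's name.
      content.split("\n").map(line => (name, line))
    }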

Re: example of non-line oriented input data?

2014-03-18 Thread Matei Zaharia
BTW one other thing — in your experience, Diana, which non-text InputFormats would be most useful to support in Python first? Would it be Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or something else? I think a per-file text input format that does the stuff we did here

Pyspark worker memory

2014-03-18 Thread Jim Blomo
Hello, I'm using the GitHub snapshot of PySpark and having trouble setting the worker memory correctly. I've set spark.executor.memory to 5g, but somewhere along the way Xmx is getting capped to 512M. This was not occurring with the same setup on 0.9.0. How many places do I need to configure the m

Re: Incrementally add/remove vertices in GraphX

2014-03-18 Thread Ankur Dave
As Matei said, there's currently no support for incrementally adding vertices or edges to their respective partitions. Doing this efficiently would require extensive modifications to GraphX, so for now, the only options are to rebuild the indices on every graph modification, or to use the subgraph

Re: There is an error in Graphx

2014-03-18 Thread ankurdave
This problem occurs because graph.triplets generates an iterator that reuses the same EdgeTriplet object for every triplet in the partition. The workaround is to force a copy using graph.triplets.map(_.copy()). The solution in the AMPCamp tutorial is mistaken -- I'm not sure if that ever worked.

Re: There is an error in Graphx

2014-03-18 Thread ankurdave
> The workaround is to force a copy using graph.triplets.map(_.copy()). Sorry, this actually won't copy the entire triplet, only the attributes defined in Edge. The right workaround is to copy the EdgeTriplet explicitly: graph.triplets.map { et => val et2 = new EdgeTriplet[VD, ED] // Replace
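A sketch of that explicit copy, written against a concrete Graph[String, Int] for illustration (the thread's VD and ED are generic); the fields copied are those of EdgeTriplet and its Edge superclass:

    import org.apache.spark.graphx._

    def copyTriplets(graph: Graph[String, Int]) =
      graph.triplets.map { et =>
        val et2 = new EdgeTriplet[String, Int]
        et2.srcId   = et.srcId
        et2.dstId   = et.dstId
        et2.attr    = et.attr
        et2.srcAttr = et.srcAttr
        et2.dstAttr = et.dstAttr
        et2 // a fresh object, safe to retain across iterator steps
      }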

Re: Are there any plans to develop Graphx Streaming?

2014-03-18 Thread ankurdave
Yes, Joey Gonzalez and I are working on a streaming version of GraphX. It's not usable yet, but we will announce when an alpha is ready, likely in a few months. Ankur -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Are-there-any-plans-to-develop-Graphx-Stre

Re: sample data for pagerank?

2014-03-18 Thread ankurdave
The examples in graphx/data are meant to show the input data format, but if you want to play around with larger and more interesting datasets, we've been using the following ones, among others: - SNAP's web-Google dataset (5M edges): https://snap.stanford.edu/data/web-Google.html - SNAP's soc-Live
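A hedged sketch of loading one of those SNAP edge lists and running PageRank, assuming the GraphX API of the time (GraphLoader expects one "srcId dstId" pair per line; the path is a placeholder):

    import org.apache.spark.graphx._

    val graph = GraphLoader.edgeListFile(sc, "data/web-Google.txt")
    val ranks = graph.pageRank(0.0001).vertices // (vertexId, rank) pairs
    ranks.map(_.swap).top(5).foreach(println)   // five highest-ranked vertices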

Spark enables us to process Big Data on an ARM cluster !!

2014-03-18 Thread Chanwit Kaewkasi
Hi all, We are a small team doing research on low-power (and low-cost) ARM clusters. We built a 20-node ARM cluster that is able to run Hadoop. But as all of you know, Hadoop performs on-disk operations, so it's not suitable for a constrained machine powered by ARM. We then switched t

Re: Log analyzer and other Spark tools

2014-03-18 Thread Patrick Wendell
Hey Roman, Yeah, definitely check out pull request 42 - one cool thing is this patch now includes information about in-memory storage in the listener interface, so you can see directly which blocks are cached/on-disk etc. - Patrick On Mon, Mar 17, 2014 at 5:34 PM, Matei Zaharia wrote: > Take a look

Re: Separating classloader management from SparkContexts

2014-03-18 Thread Andrew Ash
Hi Punya, This seems like a problem that the recently-announced job-server would likely have run into at one point. I haven't tested it yet, but I'd be interested to see what happens when two jobs in the job server have conflicting classes. Does the server correctly segregate each job's classes

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Nick Pentreath
Great work Xiangrui, thanks for the enhancement! — Sent from Mailbox for iPhone On Wed, Mar 19, 2014 at 12:08 AM, Xiangrui Meng wrote: > Glad to hear the speed-up. Wish we can improve the implementation > further in the future. -Xiangrui > On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman wrote: >>

Re: Pyspark worker memory

2014-03-18 Thread Matei Zaharia
Try checking spark-env.sh on the workers as well. Maybe code there is somehow overriding the spark.executor.memory setting. Matei On Mar 18, 2014, at 6:17 PM, Jim Blomo wrote: > Hello, I'm using the Github snapshot of PySpark and having trouble setting > the worker memory correctly. I've set
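For reference, a minimal sketch of setting executor memory programmatically through SparkConf (available since 0.9); as Matei notes, spark-env.sh on the workers can still interact with this, and the master URL here is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("MemoryCheck")
      .set("spark.executor.memory", "5g")
    val sc = new SparkContext(conf)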

Re: Running spark examples/scala scripts

2014-03-18 Thread Pariksheet Barapatre
:-) Thanks for the suggestion. I was actually asking how to run Spark scripts as a standalone app; I am able to run Java code and Python code as standalone apps. One more doubt: the documentation says that to read an HDFS file, we need to add the dependency org.apache.hadoop:hadoop-client:1.0.1. How to know HDFS
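For an sbt-based standalone app, the dependency the documentation refers to would look like the following build.sbt line; match the hadoop-client version to your cluster's HDFS version (1.0.1 is just the version quoted above):

    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.0.1"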