Re: How to index each map operation????

2014-04-01 Thread Shixiong Zhu
Hi Thierry, Your code does not work if @yh18190 wants a global counter. An RDD may have more than one partition. For each partition, cnt will be reset to -1. You can try the following code: scala> val rdd = sc.parallelize( (1, 'a') :: (2, 'b') :: (3, 'c') :: (4, 'd') :: Nil) rdd: org.apache.spark.
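Shixiong's full reply is truncated above; as an illustration of the same point (not necessarily his exact code), a partition-safe index can be built with mapPartitionsWithIndex in the spark-shell, pairing each element with (partition index, position within partition) instead of relying on a driver-side var that resets per partition:

    // Sketch only: the (partIdx, i) pair is unique across the whole RDD,
    // even when the data is split over several partitions.
    val rdd = sc.parallelize((1, 'a') :: (2, 'b') :: (3, 'c') :: (4, 'd') :: Nil, 2)
    val indexed = rdd.mapPartitionsWithIndex { (partIdx, iter) =>
      iter.zipWithIndex.map { case (elem, i) => ((partIdx, i), elem) }
    }
    indexed.collect().foreach(println)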

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
It's this: mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean package On Tue, Apr 1, 2014 at 5:15 PM, Vipul Pandey wrote: > how do you recommend building that - it says > ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:assembly > (default-cli) on pro

Re: Issue with zip and partitions

2014-04-01 Thread Xiangrui Meng
From API docs: "Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the
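A minimal illustration of the contract Xiangrui quotes, assuming the spark-shell's sc: zip is safe when one RDD was derived from the other by a 1-to-1 map, because the partitioning and per-partition counts are preserved:

    val a = sc.parallelize(1 to 6, 3)   // 3 partitions, 2 elements each
    val b = a.map(_ * 10)               // map is 1-to-1, so partition sizes match
    a.zip(b).collect()                  // Array((1,10), (2,20), ..., (6,60))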

Re: Status of MLI?

2014-04-01 Thread Evan R. Sparks
Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so! - Evan On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote: >

Re: How to index each map operation????

2014-04-01 Thread Thierry Herrmann
I'm new to Spark, but isn't this a pure Scala question? The following seems to work with the spark shell: $ spark-shell scala> val rdd = sc.makeRDD(List(10,20,30)) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at makeRDD at :12 scala> var cnt = -1 cnt: Int = -1 scala> val rdd2

Re: Status of MLI?

2014-04-01 Thread Nan Zhu
Ah, I see. I'm sorry, I didn't read your email carefully. I have no idea about the progress on MLBase. Best, -- Nan Zhu On Tuesday, April 1, 2014 at 11:05 PM, Krakna H wrote: > Hi Nan, > > I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being > actively de

Re: Status of MLI?

2014-04-01 Thread Krakna H
Hi Nan, I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being actively developed? I'm familiar with mllib and have been looking at its documentation. Thanks! On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] < ml-node+s1001560n3611...@n3.nabble.com>

Re: Status of MLI?

2014-04-01 Thread Nan Zhu
MLlib has been part of the Spark distribution (under the mllib directory); also check http://spark.apache.org/docs/latest/mllib-guide.html. As for JIRA, because of the recent migration to Apache JIRA, I think all MLlib-related issues should be under the Spark umbrella, https://issues.apache.org/jira/b

Status of MLI?

2014-04-01 Thread Krakna H
What is the current development status of MLI/MLBase? I see that the github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no activity in the last 30 days ( https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). Is

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mayur Rustagi
You can get detailed information through the Spark listener interface regarding each stage. Multiple jobs may be compressed into a single stage, so job-wise information would be the same as Spark's. Regards, Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
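Mayur's reply is truncated; as a rough sketch of the listener approach he mentions (event and method names have shifted between Spark versions, so treat this as illustrative rather than exact):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Count finished tasks and stages to get a coarse notion of progress.
    class ProgressListener extends SparkListener {
      @volatile var tasksEnded = 0
      @volatile var stagesCompleted = 0
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { tasksEnded += 1 }
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = { stagesCompleted += 1 }
    }

    val listener = new ProgressListener
    sc.addSparkListener(listener)   // poll listener.tasksEnded / stagesCompleted from the driver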

Issue with zip and partitions

2014-04-01 Thread Patrick_Nicolas
I got an exception "Can't zip RDDs with unequal numbers of partitions" when I apply any action (reduce, collect) to a dataset created by zipping two datasets of 10 million entries each. The problem occurs independently of the number of partitions or when I let Sp

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
how do you recommend building that - it says ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:assembly (default-cli) on project spark-0.9.0-incubating: Error reading assemblies: No assembly descriptors found. -> [Help 1] upon running mvn -Dhadoop.version

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition(). Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0. The approach I'm going with to partition my MappedRDD is to key it by a random int, and then partition it. So something like: rdd = sc.textFile('s3

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright! Thanks for that link. I did a little research based on it, and it looks like Snappy or LZO plus some container format would be better alternatives to gzip. I confirmed that gzip was cramping my style by trying sc.textFile() on an uncompressed version of the text file. With the uncompressed file, sett

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Aaron Davidson
Hm, yeah, the docs are not clear on this one. The function you're looking for to change the number of partitions on any ol' RDD is "repartition()", which is available in master but for some reason doesn't seem to show up in the latest docs. Sorry about that, I also didn't realize partitionBy() had

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with maven? On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey wrote: > SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly > > That's all I do. > > On Apr 1, 2014, at 11:41 AM, Patrick Wendell wrote: > > Vidal - could you show exactly what flags/commands y

Re: Cannot Access Web UI

2014-04-01 Thread Nicholas Chammas
Make that lynx localhost:8080 to isolate any network-related problems. On Tue, Apr 1, 2014 at 6:58 PM, Nicholas Chammas wrote: > Are you trying to access the UI from another machine? If so, first confirm > that you don't have a network issue by opening the UI from the master node > itself. >

Re: Cannot Access Web UI

2014-04-01 Thread Nicholas Chammas
Are you trying to access the UI from another machine? If so, first confirm that you don't have a network issue by opening the UI from the master node itself. For example: yum -y install lynx lynx ip_address:8080 If this succeeds, then you likely have something blocking you from accessing the web

Cannot Access Web UI

2014-04-01 Thread yxzhao
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging As the above shows: " Monitoring and Logging Spark’s standalone mode offers a web-based user interface to monitor the cluster. The master and each worker has its own web UI that shows cluster and job statisti

PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Just an FYI, it's not obvious from the docs that the following code should fail: a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2) a._jrdd.splits().size() a.count() b = a.partitionBy(5) b._jrdd.splits().si

Re: Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
That worked, thank you both! Thanks also Aaron for the list of things I need to read up on - I hadn't heard of ClassTag before. On Tue, Apr 1, 2014 at 5:10 PM, Aaron Davidson wrote: > Koert's answer is very likely correct. This implicit definition which > converts an RDD[(K, V)] to provide Pair

Re: Generic types and pair RDDs

2014-04-01 Thread Aaron Davidson
Koert's answer is very likely correct. This implicit definition which converts an RDD[(K, V)] to provide PairRDDFunctions requires a ClassTag is available for K: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124 To fully understand what's goi

Re: Generic types and pair RDDs

2014-04-01 Thread Koert Kuipers
import org.apache.spark.SparkContext._ import org.apache.spark.rdd.RDD import scala.reflect.ClassTag def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)] = { rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) } } On Tue, Apr 1, 2014 at 4:55 PM, Daniel

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Aaron Davidson
Looks like you're right that gzip files are not easily splittable [1], and also about everything else you said. [1] http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=rgkjacjaw3tto...@mail.gmail.com%3E On Tue, Apr 1, 2014 at 1:51 PM, Nicholas

Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
When my tuple type includes a generic type parameter, the pair RDD functions aren't available. Take for example the following (a join on two RDDs, taking the sum of the values): def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) : RDD[(String, Int)] = { rddA.join(rddB).map { case

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright, so I've upped the minSplits parameter on my call to textFile, but the resulting RDD still has only 1 partition, which I assume means it was read in on a single process. I am checking the number of partitions in pyspark by using the rdd._jrdd.splits().size() trick I picked up on this list.
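Since gzip is not splittable, minSplits cannot produce more than one partition here; a common workaround (a Scala sketch with a hypothetical path) is to spread the data out explicitly after loading:

    val raw = sc.textFile("s3n://my-bucket/big-file.gz")   // gzipped input arrives as a single partition
    val spread = raw.repartition(16)                        // shuffle into 16 partitions for downstream work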

Protobuf 2.5 Mesos

2014-04-01 Thread Ian Ferreira
From what I can tell, I need to use Mesos 0.17 to support Protobuf 2.5, which is required for Hadoop 2.3.0. However, I still run into the JVM error, which appears to be related to protobuf compatibility. Any recommendations?

Re: custom receiver in java

2014-04-01 Thread Tathagata Das
Unfortunately, there isn't a good Java-friendly way to define custom receivers. However, I am currently refactoring the receiver interface to make it more Java-friendly, and I hope to get that into the Spark 1.0 release. In the meantime, I would encourage you to define the custom receiver in Scala. If you ar

Re: possible bug in Spark's ALS implementation...

2014-04-01 Thread Nick Pentreath
Hi Michael Would you mind setting out exactly what differences you did find between the Spark and Oryx implementations? Would be good to be clear on them, and also see if there are further tricks/enhancements from the Oryx one that can be ported (such as the lambda * numRatings adjustment). N O
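For context, the "lambda * numRatings adjustment" refers to the weighted-lambda regularization used in ALS-WR, where each factor's penalty is scaled by how many ratings the corresponding user or item has, roughly:

    \min_{X,Y} \sum_{(u,i) \in R} (r_{ui} - x_u^\top y_i)^2 + \lambda \Big( \sum_u n_u \|x_u\|^2 + \sum_i m_i \|y_i\|^2 \Big)

with n_u the number of ratings by user u and m_i the number of ratings of item i.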

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly That's all I do. On Apr 1, 2014, at 11:41 AM, Patrick Wendell wrote: > Vidal - could you show exactly what flags/commands you are using when you > build spark to produce this assembly? > > > On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Vidal - could you show exactly what flags/commands you are using when you build Spark to produce this assembly? On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey wrote: > Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be > getting pulled in unless you are directly using akk

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
I've removed the dependency on akka in a separate project but still running into the same error. In the POM Dependency Hierarchy I do see 2.4.1 - shaded and 2.5.0 being included. If there is a conflict with project dependency I would think I should be getting the same error in my local setup as wel

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Kevin Markey
The discussion there hits on the distinction of jobs and stages. When looking at one application, there are hundreds of stages, sometimes thousands. Depends on the data and the task. And the UI seems to track stages. And one could independently track them for such a j

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
Yes I'm using akka as well. But if that is the problem then I should have been facing this issue in my local setup as well. I'm only running into this error on using the spark standalone cluster. But will try out your suggestion and let you know. Thanks Kanwal -- View this message in context:

Re: Mllib in pyspark for 0.8.1

2014-04-01 Thread Matei Zaharia
You could probably port it back, but it required some changes on the Java side as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues with 0.9. Matei On Apr 1, 2014, at 8:53 AM, Ian Ferreira wrote: > > Hi there, > > For some reason the distribution and build for 0.8

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mark Hamstra
Some related discussion: https://github.com/apache/spark/pull/246 On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren wrote: > Hi DB, > > Just wondering if you ever got an answer to your question about monitoring > progress - either offline or through your own investigation. Any findings > would be ap

Mllib in pyspark for 0.8.1

2014-04-01 Thread Ian Ferreira
Hi there, For some reason the distribution and build for 0.8.1 does not include the MLlib libraries for pyspark, i.e. import from mllib fails. Seems to be addressed in 0.9.0, but that has other issues running on Mesos in standalone mode :) Any pointers? Cheers - Ian

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Philip Ogren
Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would lik

Use combineByKey and StatCount

2014-04-01 Thread Jaonary Rabarisoa
Hi all, Can someone give me some tips on how to compute the mean of an RDD by key, maybe with combineByKey and StatCount. Cheers, Jaonary
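One way to do it, as a sketch assuming the spark-shell's sc (note Spark's built-in running-statistics helper is actually called StatCounter): accumulate a (sum, count) pair per key with combineByKey and divide at the end:

    import org.apache.spark.SparkContext._   // pair-RDD functions (needed on older Spark versions)

    val data = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 4.0)))
    val means = data.combineByKey(
      (v: Double) => (v, 1L),                                               // createCombiner: first value for a key
      (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),        // mergeValue: fold in another value
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge partial sums
    ).mapValues { case (sum, count) => sum / count }
    means.collect()   // Array((a,2.0), (b,4.0))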

custom receiver in java

2014-04-01 Thread eric perler
I would like to write a custom receiver to receive data from a Tibco RV subject. I found this Scala example: http://spark.incubator.apache.org/docs/0.8.0/streaming-custom-receivers.html but I can't seem to find a Java example. Does anybody know of a good Java example for creating a custom receiver th

Re: Unable to submit an application to standalone cluster which on hdfs.

2014-04-01 Thread haikal.pribadi
How do you remove the validation blocker from the compilation? Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-submit-an-application-to-standalone-cluster-which-on-hdfs-tp1730p3574.html Sent from the Apache Spark User List mailing list a

foreach not working

2014-04-01 Thread eric perler
Hello, I am on my second day with Spark, and I'm having trouble getting the foreach function to work with the network wordcount example. I can see that the "flatMap" and "map" methods are being invoked, but I don't seem to be getting into the foreach method... not sure if what I am doing even m
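Eric's message is truncated; one common gotcha with this example is that the closure passed to rdd.foreach runs on the executors, so println output lands in the worker logs rather than the driver console. A sketch of driver-side printing, assuming the wordCounts DStream from the word count example (foreachRDD was called foreach on DStream in older releases):

    wordCounts.foreachRDD { rdd =>
      // take() pulls a few records back to the driver, so println is visible there
      rdd.take(10).foreach(println)
    }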

Sliding Subwindows

2014-04-01 Thread aecc
Hello, I would like to have a kind of sub-windows. The idea is to have 3 windows laid out between future and past: w1, w2, w3. So I can do some processing with
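The question is truncated above; for reference, overlapping windows of different lengths over the same DStream can be expressed with window() (a sketch assuming a 10-second batch interval and a hypothetical socket source), though truly nested sub-windows would need custom logic on top:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    val w1 = lines.window(Seconds(10))               // most recent batch only
    val w2 = lines.window(Seconds(30), Seconds(10))  // last 30s, sliding every 10s
    val w3 = lines.window(Seconds(60), Seconds(10))  // last 60s, sliding every 10s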

SSH problem

2014-04-01 Thread Sai Prasanna
Hi All, I have a five-node Spark cluster: Master, s1, s2, s3, s4. I have passwordless SSH to all slaves from the master and vice-versa. But on one machine, s2, what happens is that after 2-3 minutes of my connection from master to slave, the write pipe is broken. So if I try to connect again from the master I ge

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
btw, this is where it fails 14/04/01 00:59:32 INFO storage.MemoryStore: ensureFreeSpace(84106) called with curMem=0, maxMem=4939225497 14/04/01 00:59:32 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 82.1 KB, free 4.6 GB) java.lang.UnsupportedOperation

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
> Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be > getting pulled in unless you are directly using akka yourself. Are you? No, I'm not. Although I see that protobuf libraries are directly pulled into the 0.9.0 assembly jar - I do see the shaded version as well. e.g. b

Re: Configuring distributed caching with Spark and YARN

2014-04-01 Thread santhoma
I think with addJar() there is no 'caching', in the sense that files will be copied every time per job. Whereas with the Hadoop distributed cache, files will be copied only once, and a symlink will be created to the cached file for subsequent runs: https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fi