Re: Combining RDD's columns

2014-04-18 Thread Jeremy Freeman
Hi Ian, If I understand what you're after, you might find "zip" useful. From the docs: Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same numb
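
A minimal sketch of the zip approach applied to Ian's use case below, runnable in spark-shell (where sc is predefined); the column values are invented for illustration:

```scala
// Cached single-column RDDs; zip requires the same number of partitions
// and the same number of elements per partition in each RDD.
val names     = sc.parallelize(Seq("Ann", "Bob"), 2).cache()
val starSigns = sc.parallelize(Seq("Leo", "Aries"), 2).cache()
val ages      = sc.parallelize(Seq(34, 28), 2).cache()

// Build the "virtual" multi-column RDD by zipping and flattening the pairs.
val rows = names.zip(starSigns).zip(ages).map { case ((n, s), a) => (n, s, a) }
rows.collect() // Array((Ann,Leo,34), (Bob,Aries,28))
```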

Combining RDD's columns

2014-04-18 Thread Ian Ferreira
This may seem contrived, but suppose I wanted to create a collection of "single column" RDD's that contain calculated values, so I want to cache these to avoid re-calc. i.e. rdd1 = {Names} rdd2 = {Star Sign} rdd3 = {Age} Then I want to create a new virtual RDD that is a collection of thes

Re: Spark-ec2 asks for password

2014-04-18 Thread Aureliano Buendia
Frank, Thanks for the prompt reply. Unfortunately I've been experiencing this for the past few weeks on the N. Virginia farm; note that the latency might also depend on the instance type. I'll try to amend the ec2 script as you suggested, but that will mean waiting even longer for the cluster to come

Re: Spark-ec2 asks for password

2014-04-18 Thread Patrick Wendell
Unfortunately - I think a lot of this is due to generally increased latency on ec2 itself. I've noticed that it's way more common than it used to be for instances to come online past the "wait" timeout in the ec2 script. On Fri, Apr 18, 2014 at 9:11 PM, FRANK AUSTIN NOTHAFT wrote: > Aureliano,

Re: Calliope Frame size larger than max length

2014-04-18 Thread Rohit Rai
Hello Eric, This happens when the data being fetched from Cassandra in a single split is greater than the maximum framesize allowed in thrift (yes, it still uses thrift underneath, until the next release when we will start using Native CQL). Generally, we do set the Cassandra framesize in Ca
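
For reference, a hedged sketch of the server-side knob as it appears in Cassandra 1.2-era cassandra.yaml (the key name is an assumption for that generation): the 15728640 bytes in Eric's error below is exactly the 15 MB default, so raising it on every node, and mirroring any client-side framed-transport setting, is the usual workaround.

```yaml
# cassandra.yaml (assumed key name for this Cassandra generation; restart required)
thrift_framed_transport_size_in_mb: 32   # default 15; a ~20 MB frame needs more headroom
```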

Re: Spark-ec2 asks for password

2014-04-18 Thread FRANK AUSTIN NOTHAFT
Aureliano, I've been noticing this error recently as well: ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: Connection refused Error 255 while executing remote command, retrying after 30 seconds However, this isn't an issue with the spark-ec2 scripts. After the scripts fail,

Spark-ec2 asks for password

2014-04-18 Thread Aureliano Buendia
Hi, Since 0.9.0, spark-ec2 has become unstable. During launch it throws many errors like: ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: Connection refused Error 255 while executing remote command, retrying after 30 seconds .. and recently, it prompts for passwords!: Warning:

Re: ui broken in latest 1.0.0

2014-04-18 Thread Andrew Or
Hi Koert, I've tracked down what the bug is. The caveat is that each StageInfo only keeps around the RDDInfo of the last RDD associated with the Stage. More concretely, if you have something like sc.parallelize(1 to 1000).persist.map(i => (i, i)).count() This creates two RDDs within one Stage, a
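
Spelling that out as a sketch (the comments restate Andrew's diagnosis; the variable names are invented):

```scala
// One stage, two RDDs: only the last RDD's RDDInfo is kept by the StageInfo.
val parallel = sc.parallelize(1 to 1000).persist() // first RDD, the persisted one
val mapped   = parallel.map(i => (i, i))           // second RDD, same stage (no shuffle)
mapped.count()
// The StageInfo records only `mapped`, so the UI loses sight of the
// persisted `parallel`, which is the breakage Andrew describes.
```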

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Aureliano Buendia
Good catch, Daniel. Looks like this is a Scala bug, not a Spark one. Still, Spark users have to be careful not to use NumericRange. On Fri, Apr 18, 2014 at 9:05 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > To make up for mocking Scala, I've filed a bug ( > https://issues.scala-lan

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-18 Thread Marcelo Vanzin
Hi Sung, On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung wrote: > while (true) { > rdd.map((row : Array[Double]) => { > row(numCols - 1) = computeSomething(row) > }).reduce(...) > } > > If it fails at some point, I'd imagine that the intermediate info being > stored in row(numCols - 1) w
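
A hedged sketch of the safer pattern under Spark's lineage-based recovery; computeSomething is a hypothetical stand-in for the function from the thread, and the sample data is invented:

```scala
// If a partition is lost, Spark recomputes it from the original input, so
// values written into `row` across iterations are silently reset. Returning
// a fresh row keeps each iteration a pure function of its input, which
// replay-based recovery can repeat safely.
def computeSomething(row: Array[Double]): Double = row.sum // hypothetical stand-in

val rdd = sc.parallelize(Seq(Array(1.0, 2.0, 0.0), Array(3.0, 4.0, 0.0)))
val updated = rdd.map { row =>
  val out = row.clone()                       // don't mutate the input row
  out(out.length - 1) = computeSomething(row) // write into the copy instead
  out
}
```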

Re: Valid spark streaming use case?

2014-04-18 Thread Tathagata Das
Regarding memory usage, you can configure Spark's memory fraction such that persisted state RDDs spill to disk rather than crash the JVM. Also, the state RDDs are periodically checkpointed to HDFS for better recoverability. But this seems like a pretty involved use case that needs keeping around all t
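
A rough sketch of those two knobs in 0.9-era terms (the property name is from that era; the app name, path, and batch duration are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("stateful-stream")
  .set("spark.storage.memoryFraction", "0.4") // smaller cache fraction: state spills instead of OOM-ing

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/stateful-stream") // periodic state checkpoints for recovery
```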

Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-18 Thread Sung Hwan Chung
Are there scenarios where the developers have to be aware of how Spark's fault tolerance works to implement correct programs? It seems that if we want to maintain any sort of mutable state in each worker through iterations, it can have some unintended effect once a machine goes down. E.g., while

Re: Anyone using value classes in RDDs?

2014-04-18 Thread Koert Kuipers
Aren't value classes for primitives (AnyVal) only? That doesn't apply to String, which is an object (AnyRef). On Fri, Apr 18, 2014 at 2:51 PM, kamatsuoka wrote: > I'm wondering if anyone has tried using value classes in RDDs? My use case > is that I have a number of RDDs containing strings, e.g.

Re: Spark on YARN performance

2014-04-18 Thread Nishkam Ravi
Spark-on-YARN takes 10-30 seconds of setup time for workloads like WordCount and PageRank on a small-sized cluster and thereafter performs as well as Spark standalone, as has been noted by Tom and Patrick. However, a certain amount of configuration/tuning effort is required to match peak performance.

Fwd: BFS implemented

2014-04-18 Thread Ghufran Malik
Ahh nvm I found the solution :) triplet.srcAttr != Double.PositiveInfinity && triplet.dstAttr == Double.PositiveInfinity as my new if condition. -- Forwarded message -- From: Ghufran Malik Date: 18 April 2014 23:15 Subject: BFS implemented To: user@spark.apache.org Hi I have s

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Sorry, that was incomplete information. I think Spark's compression helped (not sure how much though), so the actual memory requirement may have been smaller. On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung wrote: > I would argue that memory in clusters is still a limited resource and it's > s

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
I would argue that memory in clusters is still a limited resource and it's still beneficial to use memory as economically as possible. Let's say that you are training a gradient boosted model in Spark, which could conceivably take several hours to build hundreds to thousands of trees. You do not wa

BFS implemented

2014-04-18 Thread Ghufran Malik
Hi, I have successfully implemented the Breadth First Search algorithm using the Pregel operator in GraphX as follows: val graph = GraphLoader.edgeListFile(sc, "graphx/data/test_graph.txt") val root: VertexId = 1 val initialGraph = graph.mapVertices((id, _) => if (id == root) 0.0 else Double.Positi
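
Putting this together with the sendMsg condition from Ghufran's forwarded follow-up (above in this digest), a sketch of the complete Pregel call:

```scala
import org.apache.spark.graphx._

val graph = GraphLoader.edgeListFile(sc, "graphx/data/test_graph.txt")
val root: VertexId = 1
val initialGraph = graph.mapVertices((id, _) =>
  if (id == root) 0.0 else Double.PositiveInfinity)

val bfs = initialGraph.pregel(Double.PositiveInfinity)(
  (id, attr, msg) => math.min(attr, msg),   // vertex program: keep the smallest depth seen
  triplet =>                                // send messages only across the frontier
    if (triplet.srcAttr != Double.PositiveInfinity &&
        triplet.dstAttr == Double.PositiveInfinity)
      Iterator((triplet.dstId, triplet.srcAttr + 1.0))
    else
      Iterator.empty,
  (a, b) => math.min(a, b))                 // merge concurrent messages
```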

Calliope Frame size larger than max length

2014-04-18 Thread ericjohnston1989
Hey all, I'm working with Calliope to run jobs on a Cassandra cluster in standalone mode. On some larger jobs I run into the following error: java.lang.RuntimeException: Frame size (20667866) larger than max length (15728640)! at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowI

Thrift client to write directly to rdd

2014-04-18 Thread bhardwaj_rajesh
Hello, Does Spark provide a thrift remote client by which we can write directly into a Spark RDD, run some computation and save the result back to Hive or some HDFS file? If yes, can someone provide me a link to the thrift interface files. regards -- View this message in context: http://apache-sp

Re: Random Forest on Spark

2014-04-18 Thread Sandy Ryza
I don't think the YARN default of max 8GB container size is a good justification for limiting memory per worker. This is a sort of arbitrary number that came from an era where MapReduce was the main YARN application and machines generally had less memory. I expect to see this get configured as

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Daniel Darabos
To make up for mocking Scala, I've filed a bug ( https://issues.scala-lang.org/browse/SI-8518) and will try to patch this. On Fri, Apr 18, 2014 at 9:24 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Looks like NumericRange in Scala is just a joke. > > scala> val x = 0.0 to 1.0 b

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Daniel Darabos
Looks like NumericRange in Scala is just a joke. scala> val x = 0.0 to 1.0 by 0.1 x: scala.collection.immutable.NumericRange[Double] = NumericRange(0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999) scala> x.take(3) res1: scala.coll

Anyone using value classes in RDDs?

2014-04-18 Thread kamatsuoka
I'm wondering if anyone has tried using value classes in RDDs? My use case is that I have a number of RDDs containing strings, e.g. val r1: RDD[(String, (String, Int))] = ... val r2: RDD[(String, (String, Int))] = ... and it might be clearer if I wrote case class ID(val id: String) extends AnyVa
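
A hedged sketch of that idea, with Koert's caveat above in mind; the sample data is invented:

```scala
// A value class must extend AnyVal but may wrap an AnyRef such as String.
// Caveat: value classes are boxed whenever used generically, e.g. inside
// tuples and RDD[...], so the allocation savings may not materialize here;
// the gain is mostly type safety (no mixing up key strings).
case class ID(id: String) extends AnyVal

val r1: org.apache.spark.rdd.RDD[(ID, (String, Int))] =
  sc.parallelize(Seq((ID("a"), ("x", 1)), (ID("b"), ("y", 2))))
```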

Re: Random Forest on Spark

2014-04-18 Thread Sean Owen
On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung wrote: > Debasish, > > Unfortunately, we are bound to YARN, at least for the time being, because > that's what most of our customers would be using (unless, all the Hadoop > vendors start supporting standalone Spark - I think Cloudera might do > tha

Re: Random Forest on Spark

2014-04-18 Thread Manish Amde
Sorry for arriving late to the party! Evan has clearly explained the current implementation, our future plans and key differences with the PLANET paper. I don't think I can add more to his comments. :-) I apologize for not creating the corresponding JIRA tickets for the tree improvements (multicla

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Debasish, Unfortunately, we are bound to YARN, at least for the time being, because that's what most of our customers would be using (unless all the Hadoop vendors start supporting standalone Spark - I think Cloudera might do that?). On Fri, Apr 18, 2014 at 11:12 AM, Debasish Das wrote: > Sp

Re: Random Forest on Spark

2014-04-18 Thread Debasish Das
Spark on YARN is a big pain due to the strict memory requirement per container... If you are stress testing it, could you use a standalone cluster and see at which feature upper bound the per-worker RAM requirement reaches 16 GB or more...it is possible to get 16 GB instances on EC2 these days wi

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Mark Hamstra
Please file an issue: Spark Project JIRA On Fri, Apr 18, 2014 at 10:25 AM, Aureliano Buendia wrote: > Hi, > > I just notices that sc.makeRDD() does not make all values given with input > type of NumericRange, try this in spark shell: > > > $ MASTER=l

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Thanks for the info on mem requirement. I think that a lot of businesses would probably prefer to use Spark on top of YARN, since that's what they invest in - a large Hadoop cluster. And the default setting for YARN seems to cap memory per container at 8 GB - so ideally, we would like to use a lot

sc.makeRDD bug with NumericRange

2014-04-18 Thread Aureliano Buendia
Hi, I just noticed that sc.makeRDD() does not keep all values when given input of type NumericRange; try this in spark shell: $ MASTER=local[4] bin/spark-shell scala> sc.makeRDD(0.0 to 1 by 0.1).collect().length *8* The expected length is 11. This works correctly when launching Spark with onl
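
A sketch of the reported behavior plus one workaround: forcing the range into a concrete collection before parallelizing sidesteps the NumericRange slicing (the apparent culprit, per the SI-8518 discussion in this thread) that makeRDD uses to build partitions.

```scala
val broken = sc.makeRDD(0.0 to 1 by 0.1).collect().length           // 8 with local[4], not 11
val fixed  = sc.makeRDD((0.0 to 1 by 0.1).toArray).collect().length // 11 as expected
```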

strange StreamCorruptedException

2014-04-18 Thread Lukas Nalezenec
Hi all, I am running an algorithm similar to wordcount and I am not sure why it fails at the end; there are only 200 words so the result of the computation should be small. I have got a SIMR command line with Spark 0.8.1, 50 workers each with ~512M RAM. The dataset is 100 GB tab-separated text HadoopRD

Re: Valid spark streaming use case?

2014-04-18 Thread xargsgrep
Thanks, I played around with that example and had some followup questions. 1. The only way I was able to accumulate data per-key was to actually store all the data in the state, not just the timestamp (see example below). Otherwise I don't have access to data older than the batchDuration of the St
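
A hedged sketch of point 1, written as a helper so it is self-contained; the DStream[(String, String)] input and the function name are invented for illustration:

```scala
import org.apache.spark.streaming.dstream.DStream

// Accumulate every value per key in the state itself: data older than one
// batch is only reachable through the state, so the state must carry the
// history you need. Note this grows without bound unless you prune it.
def accumulate(pairs: DStream[(String, String)]): DStream[(String, Seq[String])] =
  pairs.updateStateByKey[Seq[String]] {
    (newValues: Seq[String], state: Option[Seq[String]]) =>
      Some(state.getOrElse(Seq.empty) ++ newValues)
  }
```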

Re: Random Forest on Spark

2014-04-18 Thread Evan R. Sparks
Interesting, and thanks for the thoughts. I think we're on the same page with 100s of millions of records. We've tested the tree implementation in mllib on 1b rows and up to 100 features - though this isn't hitting the 1000s of features you mention. Obviously multi class support isn't there yet,

Re: Using google cloud storage for spark big data

2014-04-18 Thread Aureliano Buendia
Thanks, Andras. What approach did you use to setup a spark cluster on google compute engine? Currently, there is no production-ready official support for an equivalent of spark-ec2 on gce. Did you roll your own? On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth < andras.nem...@lynxanalytics.com> wr

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-18 Thread Arpit Tak
Hi all, If the cluster is running and I want to add slaves to the existing cluster, which is the best way of doing it: 1.) As Matei said, select a slave and launch more of these 2.) Create an AMI of it and launch more instances like these. The plus point of the first is that it's faster, but I have to rsync every

Re: reduceByKey issue in example wordcount (scala)

2014-04-18 Thread Ian Bonnycastle
I just wanted to let you know, Marcelo, and others who may run into this error in the future... I figured it out! When I first started to work on my scripts, I was using "sbt/sbt package" followed by an "sbt/sbt run". But, when I saw "sbt/sbt run" show that it was compiling the script, I gave up o

how to split one big RDD (about 100G) into several small ones?

2014-04-18 Thread Joe L
I want to split a single big RDD into small RDDs without reading too much from disk (HDFS). What is the best way to do that? This is my current code: subclass_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subClassOf']).map(lambda (s, p, o): (s, o)) subproperty_pairs = s
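
One way to do this, sketched in Scala rather than the question's PySpark (the function shape and property strings mirror the snippet above): persist the parent once so each derived RDD filters from memory or local disk instead of re-reading HDFS.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def splitTriples(schemaTriples: RDD[(String, String, String)]) = {
  val triples = schemaTriples.persist(StorageLevel.MEMORY_AND_DISK) // read HDFS once
  val subclassPairs = triples
    .filter { case (_, p, _) => p == "subClassOf" }
    .map    { case (s, _, o) => (s, o) }
  val subpropertyPairs = triples
    .filter { case (_, p, _) => p == "subPropertyOf" }
    .map    { case (s, _, o) => (s, o) }
  (subclassPairs, subpropertyPairs)
}
```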

Re: Re: Random Forest on Spark

2014-04-18 Thread Sebastian Schelter
Hi, Stratosphere does not have a real RF implementation yet, there is only a prototype that has been developed by students in a university course which is far from production usage at this stage. --sebastian On 04/18/2014 10:31 AM, Sean Owen wrote: Mahout RDF is fairly old code. If you try

Re: Random Forest on Spark

2014-04-18 Thread Eustache DIEMERT
sorry I mismatched the link, it should be https://gist.github.com/wpm/6454814 and the algorithm is not ExtraTrees but a basic ensemble of boosted trees. 2014-04-18 10:31 GMT+02:00 Eustache DIEMERT : > Another option is to use ExtraTrees as provided by scikit-learn with > pyspark: > > https://g

Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
Indeed, serialization is always tricky when you want to work on objects that are more sophisticated than simple POJOs. And you can sometimes have unexpected behaviour when using the deserialized objects. In my case I had troubles when serializing/deserializing Avro specific records with lists. The impleme

Re: RDD collect help

2014-04-18 Thread Flavio Pompermaier
Ok thanks. However it turns out that there's a problem with that and it's not so safe to use kryo serialization with Spark: Exception in thread "Executor task launch worker-0" java.lang.NullPointerException at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1$$anonfun$6.apply(Executor.

Re: Random Forest on Spark

2014-04-18 Thread Eustache DIEMERT
Is there a PR or issue where GBT / RF progress in MLLib is tracked ? 2014-04-17 21:11 GMT+02:00 Evan R. Sparks : > Sorry - I meant to say that "Multiclass classification, Gradient > Boosting, and Random Forest support based on the recent Decision Tree > implementation in MLlib is planned and com

Re: Random Forest on Spark

2014-04-18 Thread Eustache DIEMERT
Another option is to use ExtraTrees as provided by scikit-learn with pyspark: https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59 This is a proof of concept right now and should be hacked to fit your needs, but the core decision tree implementation is highly optimized and c

Re: Random Forest on Spark

2014-04-18 Thread Sean Owen
Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT, as you will almost certainly need this patch to make it run reasonably fast: https://issues.apache.org/jira/browse/MAHOUT-1419 I have not tried Stratosphere here. Since we are on the subject of RDF on Hadoop, possibly on M/R,

Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
Because it happens to reference something outside the closure's scope that will reference some other objects (that you don't need) and so on, resulting in serializing a lot of things with your task that you don't want. But sure, it is debatable and it's more my personal opinion. 2014-04-17 23:28
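
A hedged illustration of the capture chain Eugen describes (class and field names are invented):

```scala
import org.apache.spark.rdd.RDD

// Referring to a field inside a closure captures `this`, which drags every
// other field of the enclosing object into the serialized task.
class Heavy extends Serializable {
  val big    = Array.fill(1 << 20)(0.0) // roughly 8 MB of dead weight
  val factor = 2

  def bad(rdd: RDD[Int]): RDD[Int] =
    rdd.map(_ * factor) // captures `this`, so `big` ships with every task

  def good(rdd: RDD[Int]): RDD[Int] = {
    val f = factor      // copy the field to a local first
    rdd.map(_ * f)      // the closure now captures only an Int
  }
}
```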

Re: AmpCamp exercise in a local environment

2014-04-18 Thread Arpit Tak
Download Cloudera VM from here. https://drive.google.com/file/d/0B7zn-Mmft-XcdTZPLXltUjJyeUE/edit?usp=sharing Regards, Arpit Tak On Fri, Apr 18, 2014 at 1:20 PM, Arpit Tak wrote: > HI Nabeel, > > I have a cloudera VM , It has both spark and shark installed in it. > You can download and

Re: Random Forest on Spark

2014-04-18 Thread Laeeq Ahmed
Has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments? Regards, Laeeq On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung wrote: Yes, it should be data specific and perhaps we're biased toward the data sets that we are playing with. To put things in perspective, we're highly

Re: AmpCamp exercise in a local environment

2014-04-18 Thread Arpit Tak
Hi Nabeel, I have a Cloudera VM; it has both Spark and Shark installed. You can download it and play around with it. I also have some sample data in HDFS and some tables. You can try out those examples. How to use it (instructions are in the docs...). https://drive.google.com/file/d/0B0Q4

Re: writing booleans w Calliope

2014-04-18 Thread Rohit Rai
Hello Adrian, Calliope relies on transformers to convert from a given type to ByteBuffer which is the format that is required by Cassandra. RichByteBuffer's incompleteness is at fault here. We are working on increasing the types we support out of the box, and will support all types supported in
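
A rough sketch of what such a transformer could look like; the implicit-conversion style mirrors RichByteBuffer, but the exact hook Calliope expects may differ:

```scala
import java.nio.ByteBuffer
import scala.language.implicitConversions

// Cassandra encodes a boolean column as a single byte: 0x01 = true, 0x00 = false.
implicit def booleanToByteBuffer(b: Boolean): ByteBuffer =
  ByteBuffer.wrap(Array[Byte](if (b) 1 else 0))
```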

SPARK Shell RDD reuse

2014-04-18 Thread Sai Prasanna
Hi All, In the interactive shell the Spark context remains the same. So if I run a query multiple times, the RDDs created by previous runs will be reused in the subsequent runs and not recomputed until I exit and restart the shell again, right? Or is there a way to force reuse/recomputation in the presen
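
For what it's worth, a sketch of the explicit knobs (the path is illustrative): RDDs are not reused across actions unless you persist them, and unpersist forces recomputation.

```scala
val words = sc.textFile("hdfs:///data/input") // illustrative path
  .flatMap(_.split("\\s+"))
  .cache()

words.count()     // computed from HDFS, then cached
words.count()     // served from the cache
words.unpersist() // the next action recomputes from HDFS
```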
