Re: Replicating RDD elements

2014-03-28 Thread David Thomas
That helps! Thank you. On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal wrote: > Hi David, > > I am sorry but your question is not clear to me. Are you talking about > taking some value and sharing it across your cluster so that it is present > on all the nodes? You can look at Spark's broadcastin

Re: Announcing Spark SQL

2014-03-28 Thread Rohit Rai
Thanks Patrick, I was thinking about that... Upon analysis I realized (on date) it would be something similar to the way Hive Context using CustomCatalog stuff. I will review it again, on the lines of implementing SchemaRDD with Cassandra. Thanks for the pointer. Upon discussion with couple of ou

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if you tried it, it would (mostly) work, and my uncertainty with respect to guarantees on the semantics. Definitely there would be no fault tolerance if the mutations depend on state that is not captured in the RDD lineage. DD

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Sung Hwan Chung
Thanks Chris, I'm not exactly sure what you mean with MutablePair, but are you saying that we could create RDD[MutablePair] and modify individual rows? If so, will that play nicely with RDD's lineage and fault tolerance? As for the alternatives, I don't think 1 is something we want to do, since

Re: function state lost when next RDD is processed

2014-03-28 Thread Mayur Rustagi
Are you referring to Spark Streaming? Can you save the sum as a RDD & keep joining the two rdd together? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Mar 28, 2014 at 10:47 AM, Adrian Mocanu wrote:

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread Sonal Goyal
What does your saveRDD contain? If you are using custom objects, they should be serializable. Best Regards, Sonal Nube Technologies On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 wrote: > Hi Aureliano, > > I followed this thread to

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Matei Zaharia
Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this? Matei On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa wrote: > I forgot to mention that I don't really use all of my data. Instead I use a > sample extracted with randomSample. > > > On Fri,

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
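
A minimal sketch of the transform-to-a-new-RDD approach being described here; the data, the tag type, and the tag-update rule below are assumptions for illustration, not code from this thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object TaggingSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tagging-sketch").setMaster("local[*]"))
        val rows: RDD[String] = sc.parallelize(Seq("a", "b", "c"))

        // "Mutating" the tag really means producing a new RDD of (row, tag) pairs each
        // iteration, so lineage and fault tolerance are preserved.
        var tagged: RDD[(String, Int)] = rows.map(row => (row, 0))
        for (i <- 1 to 3) {
          tagged = tagged.map { case (row, tag) => (row, tag + i) } // hypothetical tag update
        }
        tagged.collect().foreach(println)
        sc.stop()
      }
    }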

Mutable tagging RDD rows ?

2014-03-28 Thread Sung Hwan Chung
Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDD (I suppose this is sort of like mutable RDD, but it's more) ? If not, what would be the best way to do something like this? Basically, we need to keep mutable i

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi Adrian, Of course you can sortByKey, but after that, when you perform mapPartitions it doesn't guarantee that the 1st partition has all those elements in the order of the original sequence. I think we need a partitioner such that it partitions the sequence while maintaining order... Could anyone help me in defining

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
Not sure how to change your code because you'd need to generate the keys where you get the data. Sorry about that. I can tell you where to put the code to remap and sort though. import org.apache.spark.rdd.OrderedRDDFunctions val res2 = reduced_hccg.map(_._2).map(x => (newkey, x)).sortByKey(true)

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi Adrian, Thanks for the suggestion. Could you please modify the part of my code where I need to do so? I apologise for the inconvenience; because I am new to Spark I couldn't apply it appropriately. I would be thankful to you. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S

Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Syed A. Hashmi
From the gist of it, it seems like you need to override the default partitioner to control how your data is distributed among partitions. Take a look at the different Partitioners available (Default, Range, Hash); if none of these get you the desired result, you might want to provide your own. On Fri, Ma
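
For illustration, a sketch of what providing your own Partitioner could look like for this thread's goal of keeping contiguous runs of the original sequence together; the chunking rule, the zipWithIndex keying, and all names are assumptions, not code from this thread:

    import org.apache.spark.{Partitioner, SparkConf, SparkContext}

    // Sends contiguous ranges of positions to the same partition, so each partition
    // holds exactly one slice of the original sequence. Assumes the RDD is keyed by
    // its original position (e.g. via zipWithIndex).
    class RangeChunkPartitioner(override val numPartitions: Int, totalElements: Long) extends Partitioner {
      private val chunkSize = math.max(1L, math.ceil(totalElements.toDouble / numPartitions).toLong)
      override def getPartition(key: Any): Int = {
        val index = key.asInstanceOf[Long]
        math.min((index / chunkSize).toInt, numPartitions - 1)
      }
    }

    object PartitionerSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partitioner-sketch").setMaster("local[*]"))
        val data = sc.parallelize(1 to 24)
        val indexed = data.zipWithIndex().map { case (value, index) => (index, value) } // key by original position
        val partitioned = indexed
          .partitionBy(new RangeChunkPartitioner(2, data.count()))
          .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator, preservesPartitioning = true) // restore order inside each slice
        partitioned.glom().collect().foreach(part => println(part.map(_._2).mkString(",")))
        sc.stop()
      }
    }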

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I say you need to remap so you have a key for each tuple that you can sort on. Then call rdd.sortByKey(true) like this mystream.transform(rdd => rdd.sortByKey(true)) For this fn to be available you need to import org.apache.spark.rdd.OrderedRDDFunctions -Original Message- From: yh18190 [
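
A minimal sketch of this remap-then-sort pattern on a stream, assuming a placeholder key (here the line length) and a local socket source:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._   // older releases need this import for the sortByKey implicit
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SortEachRddSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("sort-each-rdd").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        val lines = ssc.socketTextStream("localhost", 9999)
        val sorted = lines
          .map(line => (line.length, line))                    // remap so every tuple has a sortable key
          .transform(rdd => rdd.sortByKey(ascending = true))   // sort each RDD of the stream
        sorted.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }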

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi, Here is my code for the given scenario. Could you please let me know where to sort? I mean, on what basis do we have to sort, so that the elements maintain order within a partition as in the original sequence? val res2=reduced_hccg.map(_._2)// which gives RDD of numbers res2.foreach(println) val result= res2.ma

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I think you should sort each RDD -Original Message- From: yh18190 [mailto:yh18...@gmail.com] Sent: March-28-14 4:44 PM To: u...@spark.incubator.apache.org Subject: Re: Splitting RDD and Grouping together to perform computation Hi, Thanks Nanzhu.I tried to implement your suggestion on fol

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-28 Thread Tathagata Das
The cleaner ttl was introduced as a "brute force" method to clean all old data and metadata in the system, so that the system can run 24/7. The cleaner ttl should be set to a large value, so that RDDs older than that are not used. Though there are some cases where you may want to use an RDD again a
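
For reference, a sketch of setting the TTL on the driver configuration; the one-day value is just an assumed example:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // spark.cleaner.ttl is in seconds; set it larger than the age of any RDD or
    // metadata the job still needs, so only genuinely old state is dropped.
    val conf = new SparkConf()
      .setAppName("long-running-streaming")
      .set("spark.cleaner.ttl", "86400") // one day, an assumed value
    val ssc = new StreamingContext(conf, Seconds(10))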

Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi, Thanks Nan Zhu. I tried to implement your suggestion on the following scenario. I have an RDD of, say, 24 elements. When I partitioned it into two groups of 12 elements each, there is a loss of order of elements within the partitions. Elements are partitioned randomly. I need to preserve the order such that the first 1

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Thanks a lot Ognen! It's not a fancy class that I wrote, and now I realized I neither extend Serializable nor register with Kryo, and that's why it is not working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
Thanks! Ya that's what I'm doing so far, but I wanted to see if it's possible to keep the tuples inside Spark for fault tolerance purposes. -A From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: March-28-14 10:45 AM To: user@spark.apache.org Subject: Re: function state lost when next RDD i

2 weeks until the deadline - Spark Summit call for submissions.

2014-03-28 Thread Scott walent
The second Spark Summit, an event to bring the Apache Spark community together, will be in San Francisco on June 30, 2014. The call for submissions is currently open, but will close on April 11. The summit is looking for talks that will cover topics including applications built on Spark, deployme

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread Ognen Duzlevski
There is also this quote from the Tuning guide (http://spark.incubator.apache.org/docs/latest/tuning.html): " Finally, if you don't register your classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful." It implies that you don't really
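
A sketch of what explicit registration looks like; MyRecord and MyRegistrator are hypothetical names:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class MyRecord(id: Int, name: String)

    // Registering the class lets Kryo write a small numeric id instead of the full
    // class name with every serialized object.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyRecord])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator") // fully qualified name if MyRegistrator is in a package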

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread pradeeps8
Hi Aureliano, I followed this thread to create a custom saveAsObjectFile. The following is the code. new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray)).map(x => (NullWritable.get(), new BytesWritable(seria
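
For context, a sketch of the overall pattern that snippet follows (batch the records, serialize each batch, save as a SequenceFile); the serialize helper below is a hypothetical stand-in for whatever serializer the thread actually uses, and the elements must themselves be serializable:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    import scala.reflect.ClassTag
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.spark.SparkContext._   // brings in saveAsSequenceFile for (Writable, Writable) RDDs
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: plain Java serialization of a batch of records.
    def serialize(obj: AnyRef): Array[Byte] = {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(obj)
      out.close()
      bytes.toByteArray
    }

    // Group records into batches of 10, serialize each batch, and write the result
    // as a SequenceFile of (NullWritable, BytesWritable) pairs.
    def saveAsCustomObjectFile[T: ClassTag](saveRDD: RDD[T], path: String): Unit = {
      saveRDD
        .mapPartitions(iter => iter.grouped(10).map(_.toArray))
        .map(batch => (NullWritable.get(), new BytesWritable(serialize(batch))))
        .saveAsSequenceFile(path)
    }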

Re: Not getting it

2014-03-28 Thread lannyripple
Ok. Based on Sonal's message I dived more into memory and partitioning and got it to work. For the CSV file I used 1024 partitions [textFile(path, 1024)] which cut the partition size down to 8MB (based on standard HDFS 64MB splits). For the key file I also adjusted partitions to use about 8MB.

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
I'd like to resurrect this thread since I don't have an answer yet. From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-27-14 10:04 AM To: u...@spark.incubator.apache.org Subject: function state lost when next RDD is processed Is there a way to pass a custom function to spark to ru

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-28 Thread Evgeniy Shishkin
One more question: we are using MEMORY_AND_DISK_SER_2 and I am worried about when those RDDs on disk will be removed. http://i.imgur.com/dbq5T6i.png unpersist is set to true, and RDDs get purged from memory, but disk space just keeps growing. On 28 Mar 2014, at 01:32, Tathagata Das wrote: > Yes, no on

Re: function state lost when next RDD is processed

2014-03-28 Thread Mark Hamstra
As long as the amount of state being passed is relatively small, it's probably easiest to send it back to the driver and to introduce it into RDD transformations as the zero value of a fold. On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu wrote: > I'd like to resurrect this thread since I don't
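
A small sketch of that idea: bring a small piece of state back to the driver, then feed it into the next batch as the zero value of a fold. Max is used here because an idempotent combine stays correct even though Spark applies the zero value once per partition and again in the final merge; the batches are arbitrary example data:

    import org.apache.spark.{SparkConf, SparkContext}

    object FoldStateSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("fold-state-sketch").setMaster("local[*]"))

        val batch1 = sc.parallelize(Seq(3, 7, 2))
        val runningMax = batch1.fold(Int.MinValue)(math.max)   // small state comes back to the driver

        val batch2 = sc.parallelize(Seq(5, 6, 4))
        // The previous state is reintroduced as the zero value of the next fold. For a
        // non-idempotent combine (e.g. a sum), keep the zero at the identity instead.
        val overallMax = batch2.fold(runningMax)(math.max)
        println(overallMax)   // 7
        sc.stop()
      }
    }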

Re: Run spark on mesos remotely

2014-03-28 Thread Wush Wu
Dear all, After studying the source code and my environment, I guess the problem is that the hostPort is wrong. On my machine, the hostname will be exported into `blockManager.hostPort` such as wush-home:45678, but the slaves could not resolve the hostname to ip correctly. I am trying to solve the

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread Debasish Das
Classes are serialized and sent to all the workers as Akka messages. For singletons and case classes I am not sure if they are Java-serialized or Kryo-serialized by default, but your own classes, if serialized by Kryo, will definitely be much more efficient. There is a comparison that Matei did for all

Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Hi, I am sorry if this has been asked before. I found that if I wrapped up some methods in a class with parameters, spark will throw "Task Nonserializable" exception; however if wrapped up in an object or case class without parameters, it will work fine. Is it true that all classes involving RDD o
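
A sketch of the difference being described; the Scaler class is a hypothetical example:

    import org.apache.spark.{SparkConf, SparkContext}

    // A class with constructor parameters gets pulled into the task closure, so it must
    // be serializable to be shipped to the workers; removing "extends Serializable" here
    // reproduces the "Task not serializable" exception under the default Java serializer.
    class Scaler(factor: Double) extends Serializable {
      def apply(x: Double): Double = x * factor
    }

    object SerializableSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("serializable-sketch").setMaster("local[*]"))
        val scaler = new Scaler(2.5)
        val scaled = sc.parallelize(Seq(1.0, 2.0, 3.0)).map(x => scaler(x)) // closure captures scaler
        println(scaled.collect().mkString(", "))
        sc.stop()
      }
    }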

Re: Not getting it

2014-03-28 Thread lannyripple
I've played around with it. The CSV file looks like it gives 130 partitions. I'm assuming that's the standard 64MB split size for HDFS files. I have increased number of partitions and number of tasks for things like groupByKey and such. Usually I start blowing up on GC Overlimit or sometimes He

streaming: code to simulate a network socket data source

2014-03-28 Thread Diana Carroll
If you are learning about Spark Streaming, as I am, you've probably used netcat "nc" as mentioned in the Spark Streaming programming guide. I wanted something a little more useful, so I modified the ClickStreamGenerator code to make a very simple script that simply reads a file off disk and passes
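
Along the same lines, a minimal sketch of such a generator (the file name, port, and delay are assumed values, not the actual script):

    import java.io.PrintWriter
    import java.net.ServerSocket
    import scala.io.Source

    object FileSocketSource {
      def main(args: Array[String]): Unit = {
        val server = new ServerSocket(9999)
        println("Listening on port 9999...")
        while (true) {
          val socket = server.accept()                     // wait for the streaming job to connect
          val out = new PrintWriter(socket.getOutputStream, true)
          for (line <- Source.fromFile("weblog.txt").getLines()) {
            out.println(line)
            Thread.sleep(100)                              // throttle to simulate a live stream
          }
          socket.close()
        }
      }
    }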

Re: Exception on simple pyspark script

2014-03-28 Thread idanzalz
I sorted it out. Turns out that if the client uses Python 2.7 and the server is Python 2.6, you get some weird errors, like this and others. So you would probably want not to do that... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pys

Re: spark streaming and the spark shell

2014-03-28 Thread Diana Carroll
Thanks, Tathagata. This and your other reply on awaitTermination are very helpful. Diana On Thu, Mar 27, 2014 at 4:40 PM, Tathagata Das wrote: > Very good questions! Responses inline. > > TD > > On Thu, Mar 27, 2014 at 8:02 AM, Diana Carroll > wrote: > > I'm working with spark streaming using s

Re: Running a task once on each executor

2014-03-28 Thread dmpour23
Is it possible to do this: JavaRDD parttionedRdds = input.map(new Split()).sortByKey().partitionBy(new HashPartitioner(k)).values(); parttionedRdds.saveAsTextFile(args[2]); // Then run my SingletonFunction (my app depends on the saved files) parttionedRdds.map(new SingletonFunc()); The partti

Guidelines for Spark Cluster Sizing

2014-03-28 Thread Sonal Goyal
Hi, I am looking for any guidelines for Spark Cluster Sizing - are there any best practices or links for estimating the cluster specifications based on input data size, transformations etc? Thanks in advance for helping out. Best Regards, Sonal Nube Technologies

groupByKey is taking more time

2014-03-28 Thread mohit.goyal
Hi, I have two RDD RDD1=K1,V1 RDD2=K1,V1 e.g-(1,List("A","B","C")),(1,List("D","E","F")) RDD1.groupByKey(RDD2) Where K1=Integer V1=List of String If I keep size of V1=3(list of three strings). The groupByKey operation takes 2.6 m and If I keep size of V1=20(list of 20 Strings). The groupByKe

Re: How does Spark handle executor down? RDD in this executor will be recomputed automatically?

2014-03-28 Thread Sonal Goyal
Each handle to the RDD holds its lineage information, which means it knows how it was computed starting from data in a reliable storage or from other RDDs. RDDs hence can be reconstructed when the node fails. Best Regards, Sonal Nube Technologies
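
A small illustration of that lineage, assuming a hypothetical input path; toDebugString prints the chain of transformations Spark would replay to rebuild lost partitions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits on older releases

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))
        val counts = sc.textFile("hdfs:///data/input.txt")   // assumed path
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        // Each RDD remembers how it was derived from its parents; if an executor is lost,
        // only the missing partitions are recomputed from this graph.
        println(counts.toDebugString)
        sc.stop()
      }
    }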

Re: GC overhead limit exceeded

2014-03-28 Thread Syed A. Hashmi
Default is MEMORY_ONLY ... if you explicitly persist a RDD, you have to explicitly unpersist it if you want to free memory during the job. On Thu, Mar 27, 2014 at 11:17 PM, Sai Prasanna wrote: > Oh sorry, that was a mistake, the default level is MEMORY_ONLY !! > My doubt was, between two differe
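
A short sketch of the persist/unpersist lifecycle; the data and storage level are just examples:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object PersistSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))
        val data = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
        println(data.count())                      // first action materializes and caches the blocks
        println(data.filter(_ % 2 == 0).count())   // reuses the cached blocks
        data.unpersist()                           // explicitly free the memory when no longer needed
        sc.stop()
      }
    }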

How does Spark handle executor down? RDD in this executor will be recomputed automatically?

2014-03-28 Thread colt_colt
I am curious about the Spark failover scenario: if some executor goes down, that means the JVM crashed. The AM will restart the executor, but what about the RDD data in that JVM? If I didn't persist the RDD, will Spark recompute the lost RDD or just let it go? There is some description on the Spark site: "Each RDD rem

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample. On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa wrote: > Hi all, > > I notice that RDD.cartesian has a strange behavior with cached and > uncached data. More precisely, I have a se

Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
Hi all, I notice that RDD.cartesian has a strange behavior with cached and uncached data. More precisely, I have a set of data that I load with objectFile: val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data") Then I split it into two sets depending on some criteria: val part1 = data

Re:

2014-03-28 Thread Hahn Jiang
I understand. thanks On Fri, Mar 28, 2014 at 4:10 AM, Mayur Rustagi wrote: > You have to raise the global limit as root. Also you have to do that on > the whole cluster. > Regards > Mayur > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi

Re: Not getting it

2014-03-28 Thread Sonal Goyal
Have you tried setting the partitioning ? Best Regards, Sonal Nube Technologies On Thu, Mar 27, 2014 at 10:04 AM, lannyripple wrote: > Hi all, > > I've got something which I think should be straightforward but it's not so > I'm

spark.akka.frameSize setting problem

2014-03-28 Thread lihu
Hi, I just ran a simple example to generate some data for the ALS algorithm. My Spark version is 0.9, in local mode, and the memory of my node is 108G, but when I set conf.set("spark.akka.frameSize", "4096"), the following problem occurred, and when I do not set this, it runs well

Exception on simple pyspark script

2014-03-28 Thread idanzalz
Hi, I am a newbie with Spark. I tried installing 2 virtual machines, one as a client and one as standalone mode worker+master. Everything seems to run and connect fine, but when I try to run a simple script, I get weird errors. Here is the traceback, notice my program is just a one-liner: vagran

Re: Replicating RDD elements

2014-03-28 Thread Sonal Goyal
Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look at Spark's broadcasting in that case. On the other hand, if you want to take one item and create an RDD of 100
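
A sketch of the two options being contrasted here; the values are arbitrary examples:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShareOrReplicateSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("share-or-replicate").setMaster("local[*]"))

        // Option 1: broadcast one value so every task on every node can read it.
        val threshold = sc.broadcast(42)
        val above = sc.parallelize(1 to 100).filter(_ > threshold.value)
        println(above.count())   // 58

        // Option 2: build an RDD that simply contains n copies of a single item.
        val replicated = sc.parallelize(Seq.fill(100)("item"))
        println(replicated.count())   // 100
        sc.stop()
      }
    }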

Re: ArrayIndexOutOfBoundsException in ALS.implicit

2014-03-28 Thread Xiangrui Meng
Hi bearrito, This is a known issue (https://spark-project.atlassian.net/browse/SPARK-1281) and it should be easy to fix by switching to a hash partitioner. CC'ed dev list in case someone volunteers to work on it. Best, Xiangrui On Thu, Mar 27, 2014 at 8:38 PM, bearrito wrote: > Usage of negati