Re: getting different results from same line of code repeated

2015-11-20 Thread Walrus theCat
ess if this method doesn't work.

getting different results from same line of code repeated

2015-11-18 Thread Walrus theCat
Hi, I'm launching a Spark cluster with the spark-ec2 script and playing around in spark-shell. I'm running the same line of code over and over again, and getting different results, and sometimes exceptions. Towards the end, after I cache the first RDD, it gives me the correct result multiple time
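
A minimal sketch of the behavior the thread describes (names are illustrative, not the original job): an RDD is only a recipe that is re-evaluated on every action, so any nondeterminism upstream can make repeated runs of the same line disagree until the RDD is pinned down with cache().

import scala.util.Random

val raw = sc.parallelize(1 to 1000)
// Nondeterministic transformation: re-executed from scratch on every action.
val sampled = raw.filter(_ => Random.nextDouble() < 0.5)
sampled.count()          // may differ from the next call
sampled.count()

val pinned = sampled.cache()
pinned.count()           // first action materializes the cached copy
pinned.count()           // subsequent actions now agree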

Re: send transformed RDD to s3 from slaves

2015-11-16 Thread Walrus theCat
Update: You can now answer this on stackoverflow for 100 bounty: http://stackoverflow.com/questions/33704073/how-to-send-transformed-data-from-partitions-to-s3 On Fri, Nov 13, 2015 at 4:56 PM, Walrus theCat wrote: > Hi, > > I have an RDD which crashes the driver when being collected

send transformed RDD to s3 from slaves

2015-11-13 Thread Walrus theCat
Hi, I have an RDD which crashes the driver when being collected. I want to send the data on its partitions out to S3 without bringing it back to the driver. I try calling rdd.foreachPartition, but the data that gets sent has not gone through the chain of transformations that I need. It's the dat

Re: SparkSQL 1.2.0 sources API error

2015-01-17 Thread Walrus theCat
I'm getting this also, with Scala 2.11 and Scala 2.10: 15/01/18 07:34:51 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/01/18 07:34:51 INFO Remoting: Starting remoting 15/01/18 07:34:51 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher

maven doesn't build dependencies with Scala 2.11

2015-01-17 Thread Walrus theCat
Hi, When I run this: dev/change-version-to-2.11.sh mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package as per here , maven doesn't build Spark's dependencies. Only when I run: dev/change-version-to-2.11

Re: using multiple dstreams together (spark streaming)

2014-07-17 Thread Walrus theCat
> 2s-window-stream.transformWith(1s-window-stream, (rdd1: RDD[...], rdd2: RDD[...]) => { ... // return a new RDD }) > And streamingContext.transform() extends it to N DStreams. :) > Hope this helps! > TD > On Wed,
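
Fleshing out the quoted suggestion as a runnable sketch (the socket source and port 9999 are stand-ins for whatever input the job actually uses):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "MultiWindow", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

val oneSec = lines.window(Seconds(1))
val twoSec = lines.window(Seconds(2))

// transformWith pairs the two windowed views batch by batch; here the "delta"
// is simply the records present in the 2s view but not in the 1s view.
val delta = twoSec.transformWith(oneSec,
  (w2: RDD[String], w1: RDD[String]) => w2.subtract(w1))

delta.print()
ssc.start()
ssc.awaitTermination()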

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread Walrus theCat
I did not! On Wed, Jul 16, 2014 at 12:31 PM, aaronjosephs wrote: > The only other thing to keep in mind is that window duration and slide > duration have to be multiples of batch duration, IDK if you made that fully > clear

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread Walrus theCat
Here's what I understand: batchDuration: How often should the streaming context update? how many seconds of data should each dstream contain? windowDuration: What size windows are you looking for from this dstream? slideDuration: Once I've given you that slice, how many units forward do you
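
A small sketch of how the three durations show up in code (the socket source and the chosen values are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// batchDuration: how often the context cuts a new micro-batch from the input.
val ssc = new StreamingContext("local[2]", "Durations", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// windowDuration (30s): how much history each windowed RDD covers.
// slideDuration (10s): how far the window advances between computations.
// Both must be multiples of the 1s batch duration, as noted above.
val windowedCounts = lines.window(Seconds(30), Seconds(10)).count()
windowedCounts.print()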

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
com> wrote: > hum... maybe consuming all streams at the same time with an actor that > would act as a new DStream source... but this is just a random idea... I > don't really know if that would be a good idea or even possible. > > > 2014-07-16 18:30 GMT+01:00 Walrus theCat :

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
problem. On Wed, Jul 16, 2014 at 10:30 AM, Walrus theCat wrote: > Yeah -- I tried the .union operation and it didn't work for that reason. > Surely there has to be a way to do this, as I imagine this is a commonly > desired goal in streaming applications? > > > On Wed, J

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
> I'm joining several kafka dstreams using the join operation but you have > the limitation that the duration of the batch has to be same, i.e. 1 second > window for all dstreams... so it would not work for you. > > 2014-07-16 18:08 GMT+01:00 Walrus theCat : > Hi, >

using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Hi, My application has multiple dstreams on the same inputstream: dstream1 // 1 second window dstream2 // 2 second window dstream3 // 5 minute window I want to write logic that deals with all three windows (e.g. when the 1 second window differs from the 2 second window by some delta ...) I've

Re: Spark 1.0.1 akka connection refused

2014-07-15 Thread Walrus theCat
I'm getting similar errors on spark streaming -- but at this point in my project I don't need a cluster and can develop locally. Will write it up later, though, if it persists. On Tue, Jul 15, 2014 at 7:44 PM, Kevin Jung wrote: > Hi, > I recently upgrade my spark 1.0.0 cluster to 1.0.1. > But

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Tathagata determined that the reason it was failing was the accidental creation of multiple input streams. Thanks! On Tue, Jul 15, 2014 at 1:09 PM, Walrus theCat wrote: > Will do. > > > On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das < > tathagata.das1...@gmail.com> wr

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Will do. On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das wrote: > This sounds really really weird. Can you give me a piece of code that I > can run to reproduce this issue myself? > > TD > > > On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat > wrote: > >> Th

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
This is (obviously) spark streaming, by the way. On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat wrote: > Hi, > > I've got a socketTextStream through which I'm reading input. I have three > Dstreams, all of which are the same window operation over that > socketTextSt

truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-14 Thread Walrus theCat
Hi, I've got a socketTextStream through which I'm reading input. I have three Dstreams, all of which are the same window operation over that socketTextStream. I have a four core machine. As we've been covering lately, I have to give a "cores" parameter to my StreamingSparkContext: ssc = new St

Re: not getting output from socket connection

2014-07-13 Thread Walrus theCat
> 1) in your context setup too, (if > you're running locally), or you won't get output. > > > On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat > wrote: > >> Thanks! >> >> I thought it would get "passed through" netcat, but given your email, I

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
t("local[2]" /**TODO change once a cluster is up **/, "AppName", Seconds(1)) I found something that tipped me off that this might work by digging through this mailing list. On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat wrote: > More strange behavior: > >

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
More strange behavior: lines.foreachRDD(x => println(x.first)) // works lines.foreachRDD(x => println((x.count,x.first))) // no output is printed to driver console On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat wrote: > > Thanks for your interest. > > lines.foreachRDD(x

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
I got no count. Thanks On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das wrote: > Try doing DStream.foreachRDD and then printing the RDD count and further > inspecting the RDD. > On Jul 13, 2014 1:03 AM, "Walrus theCat" wrote: > >> Hi, >> >> I have a DStrea

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Updated info of block input-0-1405276661400 Any insight? Thanks On Sun, Jul 13, 2014 at 1:03 AM, Walrus theCat wrote: > Hi, > > I have a DStream that works just fine when I say: > > dstream.print > > If I say: > > dstream.map(_,1).print > > that works, too. H

can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Hi, I have a DStream that works just fine when I say: dstream.print If I say: dstream.map(_,1).print that works, too. However, if I do the following: dstream.reduce{case(x,y) => x}.print I don't get anything on my console. What's going on? Thanks
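
Per the follow-ups above, the problem turned out to be the master setting rather than reduce itself: with local[1] the only thread is taken by the receiver, so reduced and windowed results are never computed. A minimal sketch with a spare thread for processing (the socket source and port are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "ReducePrint", Seconds(1))
val dstream = ssc.socketTextStream("localhost", 9999)
dstream.reduce { case (x, y) => x }.print()
ssc.start()
ssc.awaitTermination()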

Re: not getting output from socket connection

2014-07-12 Thread Walrus theCat
somehow get that data. > nc is only echoing input from the console. > > On Fri, Jul 11, 2014 at 9:25 PM, Walrus theCat > wrote: > > Hi, > > > > I have a java application that is outputting a string every second. I'm > > running the wordcount example that c

Re: not getting output from socket connection

2014-07-11 Thread Walrus theCat
I forgot to add that I get the same behavior if I tail -f | nc localhost on a log file. On Fri, Jul 11, 2014 at 1:25 PM, Walrus theCat wrote: > Hi, > > I have a java application that is outputting a string every second. I'm > running the wordcount example that comes wi

not getting output from socket connection

2014-07-11 Thread Walrus theCat
Hi, I have a java application that is outputting a string every second. I'm running the wordcount example that comes with Spark 1.0, and running nc -lk . When I type words into the terminal running netcat, I get counts. However, when I write the String onto a socket on port , I don't get
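
socketTextStream connects out to the given host:port as a client (which is why nc -lk works: it listens and echoes console input), so the producing application would itself need to listen on the port and write newline-terminated lines to whoever connects. A minimal sketch of such a producer, in Scala for brevity (port 9999 is a stand-in for the elided port):

import java.io.PrintWriter
import java.net.ServerSocket

val server = new ServerSocket(9999)
val client = server.accept()                       // wait for Spark's receiver to connect
val out = new PrintWriter(client.getOutputStream, true)
(1 to 100).foreach { i =>
  out.println(s"message " + i)                     // one line = one streaming record
  Thread.sleep(1000)
}
out.close(); client.close(); server.close()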

Re: How can adding a random count() change the behavior of my program?

2014-05-11 Thread Walrus theCat
Nick, I have encountered strange things like this before (usually when programming with mutable structures and side-effects), and for me, the answer was that, until .count (or .first, or similar), is called, your variable 'a' refers to a set of instructions that only get executed to form the objec
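
A tiny illustration of the laziness point (in the shell, with made-up data): nothing in the lineage runs when the transformation is defined, and it runs again on every subsequent action unless the RDD is cached.

val a = sc.parallelize(1 to 5).map { x => println("computing " + x); x * 2 }
// No "computing ..." output yet: 'a' is only a description of the work.
a.count()   // the closures execute now (in local mode the printlns hit this console)
a.count()   // ...and execute again, because the lineage is re-evaluated each time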

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
Thank you! On Tue, Apr 15, 2014 at 11:29 AM, Aaron Davidson wrote: > This is probably related to the Scala bug that :cp does not work: > https://issues.scala-lang.org/browse/SI-6502 > > > On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat wrote: > >> Actually altering t

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
or@0.0.0.0:48978 Replaying: sc.parallelize(List(1,2,3)) :8: error: not found: value sc sc.parallelize(List(1,2,3)) On Mon, Apr 14, 2014 at 7:51 PM, Walrus theCat wrote: > Nevermind -- I'm like 90% sure the problem is that I'm importing stuff > that declares a SparkCon

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-14 Thread Walrus theCat
Nevermind -- I'm like 90% sure the problem is that I'm importing stuff that declares a SparkContext as sc. If it's not, I'll report back. On Mon, Apr 14, 2014 at 2:55 PM, Walrus theCat wrote: > Hi, > > Using the spark-shell, I can't sc.parallelize to get an RD

can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-14 Thread Walrus theCat
Hi, Using the spark-shell, I can't sc.parallelize to get an RDD. Looks like a bug. scala> sc.parallelize(Array("a","s","d")) java.lang.NullPointerException at (:17) at (:22) at (:24) at (:26) at (:28) at (:30) at (:32) at (:34) at (:36) at .(:40) at .(

RDD[Array] question

2014-03-27 Thread Walrus theCat
Sup y'all, If I have an RDD[Array] and I do some operation on the RDD, then each Array is going to get instantiated on some individual machine, correct (or does it spread it out)? Thanks

Re: coalescing RDD into equally sized partitions

2014-03-26 Thread Walrus theCat
For the record, I tried this, and it worked. On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat wrote: > Oh so if I had something more reasonable, like RDD's full of tuples of > say, (Int,Set,Set), I could expect a more uniform distribution? > > Thanks > > > On Mon

Re: interleave partitions?

2014-03-26 Thread Walrus theCat
Answering my own question here. This may not be efficient, but this is what I came up with: rdd1.coalesce(N).glom.zip(rdd2.coalesce(N).glom).map { case(x,y) => x++y} On Wed, Mar 26, 2014 at 11:11 AM, Walrus theCat wrote: > Hi, > > I want to do something like this: > > rdd3
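
Spelled out as a self-contained sketch (N and the input RDDs are placeholders): glom turns each partition into a single Array, zip pairs partitions positionally (both sides need the same number of partitions, which is why both are coalesced to N first), and ++ concatenates the paired arrays.

val N = 3
val rdd1 = sc.parallelize(1 to 9, 6)
val rdd2 = sc.parallelize(101 to 109, 6)

val interleaved = rdd1.coalesce(N).glom()
  .zip(rdd2.coalesce(N).glom())
  .map { case (x, y) => x ++ y }

interleaved.collect().foreach(a => println(a.mkString(",")))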

interleave partitions?

2014-03-26 Thread Walrus theCat
Hi, I want to do something like this: rdd3 = rdd1.coalesce(N).partitions.zip(rdd2.coalesce(N).partitions) I realize the above will get me something like Array[(partition,partition)]. I hope you see what I'm going for here -- any tips on how to accomplish this? Thanks

Re: coalescing RDD into equally sized partitions

2014-03-26 Thread Walrus theCat
> default hashCode implementation for integers, which will map them all to 0. > There's no API method that will look at the resulting partition sizes and > rebalance them, but you could use another hash function. > > Matei > > On Mar 24, 2014, at 5:20 PM, Walrus theCat wrote

coalescing RDD into equally sized partitions

2014-03-24 Thread Walrus theCat
Hi, sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0 ).coalesce(5,true).glom.collect yields Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array()) How do I get something more like: Array(Array(0), Array(20), Array(40), Array(60), Array(80)) Thank
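
Following the suggestion quoted in the reply above (use another hash function), one option is to key the values and supply a custom Partitioner whose hashing spreads the multiples of 20 instead of sending all of them to partition 0. A sketch, not the only way to do it:

import org.apache.spark.Partitioner

class SpreadPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    (k / 20) % parts    // 0, 20, 40, 60, 80 now land in distinct partitions
  }
}

val data = sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0)
val spread = data.map(x => (x, x)).partitionBy(new SpreadPartitioner(5)).values

spread.glom().collect().foreach(a => println(a.mkString(",")))  // Array(0), Array(20), ...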

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Walrus theCat
I'm also interested in this. On Mon, Mar 24, 2014 at 4:59 PM, yh18190 wrote: > Hi, I have large data set of numbers ie RDD and wanted to perform a > computation only on groupof two values at a time. For example > 1,2,3,4,5,6,7... is an RDD Can i group the RDD into (1,2),(3,4),(5,6)...?? > and p

Re: N-Fold validation and RDD partitions

2014-03-24 Thread Walrus theCat
If someone wanted / needed to implement this themselves, are partitions the correct way to go? Any tips on how to get started (say, dividing an RDD into 5 parts)? On Fri, Mar 21, 2014 at 9:51 AM, Jaonary Rabarisoa wrote: > Thank you Hai-Anh. Are the files CrossValidation.scala and > RandomS
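
If the goal is just k roughly equal random pieces of an RDD, newer Spark releases have RDD.randomSplit, which avoids manual partition surgery entirely. A sketch for 5 folds (the data and equal weights are illustrative):

val data = sc.parallelize(1 to 1000)
val folds = data.randomSplit(Array.fill(5)(1.0), 42L)

// For each fold, test on the held-out piece and train on the union of the rest.
folds.zipWithIndex.foreach { case (test, i) =>
  val train = sc.union(folds.zipWithIndex.collect { case (f, j) if j != i => f })
  println(s"fold $i: train=${train.count()}, test=${test.count()}")
}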

Re: question about partitions

2014-03-24 Thread Walrus theCat
> your current partitions (i.e. you are shrinking partitions, do not set > shuffle=true, otherwise it will cause additional unnecessary shuffle > overhead. > > > On Mon, Mar 24, 2014 at 2:32 PM, Walrus theCat wrote: > >> For instance, I need to work with an RDD in terms of N

Re: question about partitions

2014-03-24 Thread Walrus theCat
For instance, I need to work with an RDD in terms of N parts. Will calling RDD.coalesce(N) possibly cause processing bottlenecks? On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote: > Hi, > > Quick question about partitions. If my RDD is partitioned into 5 > partitions, does th

question about partitions

2014-03-24 Thread Walrus theCat
Hi, Quick question about partitions. If my RDD is partitioned into 5 partitions, does that mean that I am constraining it to exist on at most 5 machines? Thanks

Re: inexplicable exceptions in Spark 0.7.3

2014-03-18 Thread Walrus theCat
Hi Andrew, Thanks for your interest. This is a standalone job. On Mon, Mar 17, 2014 at 4:30 PM, Andrew Ash wrote: > Are you running from the spark shell or from a standalone job? > > > On Mon, Mar 17, 2014 at 4:17 PM, Walrus theCat wrote: > >> Hi, >> >> I&

inexplicable exceptions in Spark 0.7.3

2014-03-17 Thread Walrus theCat
Hi, I'm getting this stack trace, using Spark 0.7.3. No references to anything in my code, never experienced anything like this before. Any ideas what is going on? java.lang.ClassCastException: spark.SparkContext$$anonfun$9 cannot be cast to scala.Function2 at spark.scheduler.ResultTask$.de

Re: links for the old versions are broken

2014-03-17 Thread Walrus theCat
On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson wrote: > Looks like everything from 0.8.0 and before errors similarly (though > "Spark 0.3 for Scala 2.9" has a malformed link as well). > > > On Thu, Mar 13, 2014 at 10:52 AM, Walrus theCat wrote: > >> Sup,

links for the old versions are broken

2014-03-13 Thread Walrus theCat
Sup, Where can I get Spark 0.7.3? It's 404 here: http://spark.apache.org/downloads.html Thanks