Largest input data set observed for Spark.

2014-03-20 Thread Usman Ghani
All, What is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?

Re: Machine Learning on streaming data

2014-03-20 Thread Pascal Voitot Dev
Hi, I tried a few things on that in my last blog post: http://mandubian.com/2014/03/10/zpark-ml-nio-3/ (last part of a triptych about Spark & scalaz-stream). I built a collaborative filtering model and then used it on each RDD of the DStream using a transform { rdd => model.predict(rdd)... }. It work
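A minimal sketch of the approach Pascal describes, hedged: the ratings path, socket source, and ALS parameters below are illustrative assumptions, not from the post. It trains a collaborative filtering model offline, then scores each RDD of the DStream via transform.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "streaming-predict")
    val ssc = new StreamingContext(sc, Seconds(10))

    // Train offline on historical ratings (hypothetical path and CSV layout).
    val training = sc.textFile("hdfs:///ratings").map { line =>
      val Array(user, product, rating) = line.split(",")
      Rating(user.toInt, product.toInt, rating.toDouble)
    }
    val model = ALS.train(training, 10, 10)   // rank = 10, iterations = 10

    // Score each incoming batch of "user,product" lines.
    val requests = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(user, product) = line.split(",")
      (user.toInt, product.toInt)
    }
    val predictions = requests.transform(rdd => model.predict(rdd))
    predictions.print()

    ssc.start()
    ssc.awaitTermination()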

Re: Relation between DStream and RDDs

2014-03-20 Thread Pascal Voitot Dev
If I may add my contribution to this discussion, if I understand your question correctly... A DStream is a discretized stream: it discretizes the data stream over windows of time (according to the project code I've read, and the paper too). So when you write: JavaStreamingContext stcObj = new JavaStreamingConte

Hadoop streaming like feature for Spark

2014-03-20 Thread Jaonary Rabarisoa
Dear all, Does Spark have a kind of Hadoop streaming feature to run an external process that manipulates data from an RDD sent through stdin and stdout? Best, Jaonary

Re: Relation between DStream and RDDs

2014-03-20 Thread Sanjay Awatramani
@TD: I do not need multiple RDDs in a DStream in every batch. On the contrary my logic would work fine if there is only 1 RDD. But then the description for functions like reduce & count (Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStr

Cassandra CQL read/write from spark using Java - [Remoting] Remoting error: [Startup timed out]

2014-03-20 Thread sonyjv
Hi, In standalone mode I am trying to perform some Cassandra CQL read/write operations. The following are my Maven dependencies: org.apache.spark spark-core_2.10 0.9.0-incubating

Re: Relation between DStream and RDDs

2014-03-20 Thread Pascal Voitot Dev
Actually it's quite simple... DStream[T] is a stream of RDD[T]. So applying count on DStream is just applying count on each RDD of this DStream. So at the end of count, you have a DStream[Int] containing the same number of RDDs as before but each RDD just contains one element being the count resul
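A minimal sketch of this behaviour (the socket source is a hypothetical input): count() turns each batch's RDD into a single-element RDD holding that batch's count.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[2]", "dstream-count", Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
    val counts = lines.count()                            // one single-element RDD per 5s batch
    counts.print()
    ssc.start()
    ssc.awaitTermination()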

Re: Relation between DStream and RDDs

2014-03-20 Thread andy petrella
also consider creating pairs and using *byKey* operators, and then the key will be the structure used to consolidate or deduplicate your data. my2c On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev < pascal.voitot@gmail.com> wrote: > Actually it's quite simple... > > DStream[T] i
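A short sketch of the *byKey* suggestion; the Event type and its id field are assumptions standing in for whatever record carries the deduplication key.

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operators
    import org.apache.spark.streaming.dstream.DStream

    case class Event(id: String, payload: String)

    // Deduplicate events within each batch by id, keeping one representative per key.
    def dedupPerBatch(events: DStream[Event]): DStream[Event] =
      events
        .map(e => (e.id, e))        // pair by the consolidation key
        .reduceByKey((a, b) => a)   // one record per key per RDD/batch
        .map(_._2)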

Re: Relation between DStream and RDDs

2014-03-20 Thread Pascal Voitot Dev
On Thu, Mar 20, 2014 at 11:57 AM, andy petrella wrote: > also consider creating pairs and use *byKey* operators, and then the key > will be the structure that will be used to consolidate or deduplicate your > data > my2c > > One thing I wonder: imagine I want to sub-divide RDDs in a DStream into s

Re: Cassandra CQL read/write from spark using Java - [Remoting] Remoting error: [Startup timed out]

2014-03-20 Thread sonyjv
Hi, I managed to solve the issue. The problem was related to Netty. (Ref. https://spark-project.atlassian.net/browse/SPARK-1138) I changed the dependencies with Netty exclusions and included Netty 3.6.6 as a dependency. org.apache.spark spark-

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-20 Thread Chester
I am curious to see if you have tried it on a Parallela supercomputer (16 or 64 cores) cluster; running Spark on that should be fun. Chester Sent from my iPad On Mar 19, 2014, at 9:18 AM, Chanwit Kaewkasi wrote: > Hi Koert, > > There's some NAND flash built-in each node. We mount the NAND flash as >

Error while reading from HDFS Simple application

2014-03-20 Thread Laeeq Ahmed
VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CreateSnapshotRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; What can be the cause of this error? Regards, Laeeq Ahmed, PhD Student, HPCViz, KTH. http://laeeprofile.

Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
Hi, My team is trying to replicate an existing Map/Reduce process in Spark. Basically, we are creating Bloom Filters for quick set membership tests within our processing pipeline. We have a single column (call it group_id) that we use to partition into sets. As you would expect, in the map phas

Re: Spark worker threads waiting

2014-03-20 Thread sparrow
This is what the web UI looks like: [image: Inline image 1] I also tail all the worker logs and these are the last entries before the waiting begins: 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, minRequest: 10066329 14/03/20 13:29:10 INFO Blo

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-20 Thread Chanwit Kaewkasi
Hi Chester, It is on our todo-list but it doesn't work at the moment. The Parallela cores can not be utilized by the JVM. So, Spark will just use its ARM cores. We'll be looking at Parallela again when the JVM supports it. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On

graphx samples in Java

2014-03-20 Thread David Soroko
Hi Where can I find the equivalent of the graphx example (http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html#examples ) in Java? For example, how does the following translate to Java? val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array((3L, ("rxin", "student")

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-20 Thread Eustache DIEMERT
Hey, do you have a blog post or URL I can share? This is quite a cool experiment! E/ 2014-03-20 15:01 GMT+01:00 Chanwit Kaewkasi : > Hi Chester, > > It is on our todo-list but it doesn't work at the moment. The > Parallela cores can not be utilized by the JVM. So, Spark will just > use its A

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-20 Thread Chanwit Kaewkasi
Thanks, Eustache. There's the link in the second reply to an article I wrote for DZone. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Thu, Mar 20, 2014 at 9:39 PM, Eustache DIEMERT wrote: > Hey, do you have a blog post or url I can share ? > > This is a quite cool exp

Reload RDD saved with saveAsObjectFile

2014-03-20 Thread Jaonary Rabarisoa
Hi all, I have an RDD[(String,(String,Array[Byte]))] that I save into a sequence file with "saveAsObjectFile". At the end, Spark writes a sequence file of (NullWritable, BytesWritable). How do I get back my tuples of (String,(String,Array[Byte]))? Any ideas would be helpful. Cheers, Jaonary
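A hedged sketch of one way to read it back (the path is hypothetical): since saveAsObjectFile writes serialized objects into a SequenceFile, SparkContext.objectFile with an explicit element type should restore the original tuples.

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "reload-objectfile")
    // Supply the element type that was written; deserialization gives the tuples back.
    val restored = sc.objectFile[(String, (String, Array[Byte]))]("hdfs:///out/records")
    println(restored.count())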

Re: PySpark worker fails with IOError Broken Pipe

2014-03-20 Thread Jim Blomo
I think I've encountered the same problem and filed https://spark-project.atlassian.net/plugins/servlet/mobile#issue/SPARK-1284 For me it hung the worker, though. Can you add reproducible steps and what version you're running? On Mar 19, 2014 10:13 PM, "Nicholas Chammas" wrote: > So I have the p

Re: Transitive dependency incompatibility

2014-03-20 Thread Jaka Jančar
Could the executor use isolated classloaders, in order not to pollute the environment with its own stuff? On Wednesday, March 19, 2014 at 8:30 AM, Jaka Jančar wrote: > Hi, > > I'm getting the following error: > > java.lang.NoSuchMethodError: > org.apache.http.impl.conn.DefaultClientCon

Re: Hadoop streaming like feature for Spark

2014-03-20 Thread Ewen Cheslack-Postava
Take a look at RDD.pipe(). You could also accomplish the same thing using RDD.mapPartitions, to which you pass a function that processes the iterator for each partition rather than processing each element individually. This lets you only start up as many processes as there are partitions, pipe the
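A minimal sketch of both approaches; the external command path is hypothetical, and the mapPartitions body merely stands in for talking to an external process.

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "pipe-example")
    val data = sc.parallelize(Seq("a", "b", "c"), 2)

    // Hadoop-streaming style: one external process per partition, lines in via stdin, lines out via stdout.
    val piped = data.pipe("/usr/local/bin/transform")

    // Same shape with mapPartitions: handle a whole partition's iterator at once.
    val mapped = data.mapPartitions { iter =>
      iter.map(_.toUpperCase)   // stand-in for feeding the iterator to an external process
    }

    println(piped.collect().mkString(","))
    println(mapped.collect().mkString(","))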

Re: Spark Java example using external Jars

2014-03-20 Thread dmpour23
Thanks for the example. However my main problem is that what I would like to do is: Create a SparkApp that will Sort and Partition the initial file (k) times based on a key. JavaSparkContext ctx = new JavaSparkContext("spark://dmpour:7077", "BasicFileSplit", System.getenv("SPARK_HOME"), J

is sorting necessary after join of sorted RDD

2014-03-20 Thread Adrian Mocanu
Is sorting necessary after a join on 2 sorted RDD? Example: val stream1= mystream.transform(rdd=>rdd.sortByKey(true)) val stream2= mystream.transform(rdd=>rdd.sortByKey(true)) val stream3= stream1.join(stream2) Will stream3 be ordered? -Adrian

Re: Relation between DStream and RDDs

2014-03-20 Thread andy petrella
I don't see an example, but conceptually it looks like you'll need a suitable structure like a Monoid. I mean, because if it's not tied to a window, it's an overall computation that has to be accumulated over time (otherwise it would land in the batch world, see below) and that will be the purpose of

Re: PySpark worker fails with IOError Broken Pipe

2014-03-20 Thread Nicholas Chammas
I'm using Spark 0.9.0 on EC2, deployed via spark-ec2. The few times it's happened to me so far, the shell will just be idle for a few minutes and then BAM I get that error, but the shell still seems to work. If I find a pattern to the issue I will report it here. On Thu, Mar 20, 2014 at 8:10 AM

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
We ended up going with: map() - set the group_id as the key in a Tuple reduceByKey() - end up with (K,Seq[V]) map() - create the bloom filter and loop through the Seq and persist the Bloom filter This seems to be fine. I guess Spark cannot optimize the reduceByKey and map steps to occur together
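A sketch of the three steps described above, with a plain Set standing in for a real Bloom filter implementation; the Record type and field names are assumptions.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD operators

    case class Record(groupId: String, value: String)

    val sc = new SparkContext("local[2]", "bloom-build")
    val records = sc.parallelize(Seq(Record("g1", "a"), Record("g1", "b"), Record("g2", "c")))

    val filters = records
      .map(r => (r.groupId, Seq(r.value)))   // map(): key by group_id
      .reduceByKey(_ ++ _)                   // reduceByKey(): (K, Seq[V]) per group
      .map { case (groupId, values) =>       // map(): build the per-group filter
        (groupId, values.toSet)              // a real Bloom filter would be built here
      }
    filters.collect().foreach { case (g, f) => println(s"$g -> ${f.size} entries") }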

Re: Largest input data set observed for Spark.

2014-03-20 Thread Surendranauth Hiraman
Reynold, How complex was that job (I guess in terms of number of transforms and actions) and how long did that take to process? -Suren On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin wrote: > Actually we just ran a job with 70TB+ compressed data on 28 worker nodes - > I didn't count the size of

Re: Transitive dependency incompatibility

2014-03-20 Thread Matei Zaharia
Hi Jaka, I’d recommend rebuilding Spark with a new version of the HTTPClient dependency. In the future we want to add a “put the user’s classpath first” option to let users overwrite dependencies. Matei On Mar 20, 2014, at 8:42 AM, Jaka Jančar wrote: > Could the executor use isolated classlo

Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
I'm not really at liberty to discuss details of the job. It involves some expensive aggregated statistics, and took 10 hours to complete (mostly bottlenecked by network & io). On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > Reynold, > > How complex w

Re: Largest input data set observed for Spark.

2014-03-20 Thread Andrew Ash
Understood of course. Did the data fit comfortably in memory or did you experience memory pressure? I've had to do a fair amount of tuning when under memory pressure in the past (0.7.x) and was hoping that the handling of this scenario is improved in later Spark versions. On Thu, Mar 20, 2014 a

Re: Accessing the reduce key

2014-03-20 Thread Mayur Rustagi
Why are you trying to reduceByKey? Are you looking to work on the data sequentially? If I understand correctly, you are looking to filter your data using the bloom filter & each bloom filter is tied to the key that instantiated it. Following are some of the options: *partition* your data by key & u

sort order after reduceByKey / groupByKey

2014-03-20 Thread Ameet Kini
val rdd2 = rdd.partitionBy(my partitioner).reduceByKey(some function) I see that rdd2's partitions are not internally sorted. Can someone confirm that this is expected behavior? And if so, the only way to get partitions internally sorted is to follow it with something like this val rdd2 = rdd.par

Re: sort order after reduceByKey / groupByKey

2014-03-20 Thread Mayur Rustagi
That's expected. I think sortByKey is an option too & probably a better one. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 20, 2014 at 3:20 PM, Ameet Kini wrote: > > val rdd2 = rdd.partitionBy(my partitioner).red

Re: sort order after reduceByKey / groupByKey

2014-03-20 Thread Ameet Kini
I saw that but I don't need a global sort, only intra-partition sort. Ameet On Thu, Mar 20, 2014 at 3:26 PM, Mayur Rustagi wrote: > Thats expected. I think sortByKey is option too & probably a better one. > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rust
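A sketch of an intra-partition sort (key and value types are assumptions): sort each partition's iterator after the shuffle and keep the existing partitioner, with no global sort.

    import org.apache.spark.rdd.RDD

    def sortWithinPartitions(rdd: RDD[(Int, String)]): RDD[(Int, String)] =
      rdd.mapPartitions(
        iter => iter.toArray.sortBy(_._1).iterator,   // sort only within this partition
        preservesPartitioning = true)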

Re: Spark worker threads waiting

2014-03-20 Thread Mayur Rustagi
I would have preferred the stage window details & aggregate task details (above the task list). Basically, if you run a job, it translates to multiple stages, and each stage translates to multiple tasks (each run on a worker core). So some breakup like: my job is taking 16 min, 3 stages, stage 1: 5 min St

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
Mayur, Thanks. This step is for creating the Bloom Filter, not using it to filter data, actually. But your answer still stands. Partitioning by key, having the bloom filters as a broadcast variable and then doing mapPartitions makes sense. Are there performance implications for this approach, suc
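A sketch of the filtering side being discussed; the types are assumptions and a Set again stands in for a real Bloom filter. The per-group filters are broadcast and each partition is filtered against them.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def filterByGroup(sc: SparkContext,
                      data: RDD[(String, String)],               // (group_id, value)
                      filters: Map[String, Set[String]]): RDD[(String, String)] = {
      val bc = sc.broadcast(filters)                             // ship filters to workers once
      data.mapPartitions { iter =>
        iter.filter { case (group, value) =>
          bc.value.get(group).exists(_.contains(value))          // membership test per record
        }
      }
    }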

Flume Corrupted Stream Error

2014-03-20 Thread bbuild11
I have setup a Flume-NG 1.4.0-cdh4.5.0 spooldir source agent to pickup a CSV file from a directory and pass it to Spark Streaming Flume Stream 0.9.0-incubating. When the file has 2 rows in it, there is no error. Once I add another row, then I get the following error. Mind you, each row has 75 colum

Re: in SF until Friday

2014-03-20 Thread Mayur Rustagi
Would love to .. but I am in NY till Friday :( Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Mar 19, 2014 at 7:34 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I'm in San Francisco until Friday for a

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
Mayur, To be a little clearer, for creating the Bloom Filters, I don't think broadcast variables are the way to go, though definitely that would work for using the Bloom Filters to filter data. The reason why is that the creation needs to happen in a single thread. Otherwise, some type of locking

DStream spark paper

2014-03-20 Thread Adrian Mocanu
I looked over the specs on page 9 from http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf The first paragraph mentions the window size is 30 seconds "Word-Count, which performs a sliding window count over 30s; and TopKCount, which finds the k most frequent words over the past 30s.

Re: Hadoop streaming like feature for Spark

2014-03-20 Thread Jaonary Rabarisoa
Thank you Ewen. RDD.pipe is what I need and it works like a charm. On the other side RDD.mapPartitions seems to be interesting but I can't figure out how to make it work. Jaonary On Thu, Mar 20, 2014 at 4:54 PM, Ewen Cheslack-Postava wrote: > Take a look at RDD.pipe(). > > You could also accomp

Re: Largest input data set observed for Spark.

2014-03-20 Thread Soila Pertet Kavulya
Hi Reynold, Nice! What Spark configuration parameters did you use to get your job to run successfully on a large dataset? My job is failing on 1TB of input data (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory errors, just lost executors. Thanks, Soila On Mar 20, 2014 11:

Re: Machine Learning on streaming data

2014-03-20 Thread Jeremy Freeman
Thanks TD, happy to share my experience with MLLib + Spark Streaming integration. Here's a gist with two examples I have working, one for StreamingLinearRegression and another for StreamingKMeans. https://gist.github.com/freeman-lab/9672685 The goal in each case was to implement a streaming ve

Re: Accessing the reduce key

2014-03-20 Thread Mayur Rustagi
You are using the data grouped (sorted?) to create the bloom filter? On Mar 20, 2014 4:35 PM, "Surendranauth Hiraman" wrote: > Mayur, > > To be a little clearer, for creating the Bloom Filters, I don't think > broadcast variables are the way to go, though definitely that would work > for using t

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
Grouped by the group_id but not sorted. -Suren On Thu, Mar 20, 2014 at 5:52 PM, Mayur Rustagi wrote: > You are using the data grouped (sorted?) To create the bloom filter ? > On Mar 20, 2014 4:35 PM, "Surendranauth Hiraman" > wrote: > >> Mayur, >> >> To be a little clearer, for creating the B

Re: Pyspark worker memory

2014-03-20 Thread Andrew Ash
Jim, I'm starting to document the heap size settings all in one place, which has been a confusion for a lot of my peers. Maybe you can take a look at this ticket? https://spark-project.atlassian.net/browse/SPARK-1264 On Wed, Mar 19, 2014 at 12:53 AM, Jim Blomo wrote: > To document this, it wo

Re: Pyspark worker memory

2014-03-20 Thread Matei Zaharia
Yeah, this is definitely confusing. The motivation for this was that different users of the same cluster may want to set different memory sizes for their apps, so we decided to put this setting in the driver. However, if you put SPARK_JAVA_OPTS in spark-env.sh, it also applies to executors, whic
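A hedged sketch of the per-application alternative (values are illustrative): set spark.executor.memory in the driver's SparkConf rather than cluster-wide in spark-env.sh, so apps sharing a cluster can request different sizes.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("my-app")
      .set("spark.executor.memory", "4g")   // per-app executor heap, overrides the cluster default
    val sc = new SparkContext(conf)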

Re: DStream spark paper

2014-03-20 Thread Matei Zaharia
Hi Adrian, On every timestep of execution, we receive new data, then report updated word counts for that new data plus the past 30 seconds. The latency here is about how quickly you get these updated counts once the new batch of data comes in. It’s true that the count reflects some data from 30
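A minimal sketch of the sliding WordCount under discussion; the socket source and checkpoint path are hypothetical. Counts are emitted every batch, but each covers the last 30 seconds of data.

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operators

    val ssc = new StreamingContext("local[2]", "sliding-wordcount", Seconds(1))
    ssc.checkpoint("/tmp/spark-checkpoint")   // required by the invertible windowed reduce

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map(w => (w, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(1))   // 30s window, 1s slide
    counts.print()
    ssc.start()
    ssc.awaitTermination()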

Spark Job stuck

2014-03-20 Thread mohit.goyal
Hi, I have run a Spark application to process input data of size ~14GB with executor memory 10GB. The job got stuck with the below message: 14/03/21 05:02:07 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, guavus-0102bf, 49347, 0) with no recent heart beats: 85563ms exc