Re: Spark Example Project, runnable on EMR, open sourced

2014-04-17 Thread Alex Dean
Hi Parviz, Yes certainly - by all means! Thanks for all your hard work making it easy to run Spark on EMR... Cheers, Alex On Fri, Apr 18, 2014 at 5:27 AM, Pdeyhim wrote: > Awesome! Ok if we mention this on aws blog? > > Sent from my iPad > > On Apr 17, 2014, at 7:10 AM, Alex Dean wrote: > >

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-17 Thread gaganbm
It happens at a normal data rate, i.e., let's say 20 records per second. Apart from that, I am also seeing some more strange behavior. Let me explain. I establish two SSCs (StreamingContexts) and start them one after another. In the SSCs I get the streams from Kafka sources and do some manipulations. Like, adding some "Re

join with inputs co-partitioned?

2014-04-17 Thread Joe L
I am trying to implement a join with co-partitioned inputs. As described in the documentation, we can avoid shuffling by partitioning elements with the same hash code onto the same machine. >>> links = sc.parallelize([('a','b'),('a','c'),('b','c'),('c','a')]).groupByKey(3) >>> links.glom().co
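The thread's snippet is PySpark; here is a minimal Scala sketch of the same idea (spark-shell style, values illustrative): give both sides the same partitioner so equal keys already live in the same partition and the join needs no extra shuffle.

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(3)
    // Both RDDs are partitioned by the same partitioner up front.
    val links = sc.parallelize(Seq(("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")))
      .groupByKey(part)
    val ranks = sc.parallelize(Seq(("a", 1.0), ("b", 1.0), ("c", 1.0)))
      .partitionBy(part)
    val joined = links.join(ranks) // co-partitioned inputs: no re-shuffle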

Re: Ooyala Server - plans to merge it into Apache ?

2014-04-17 Thread Azuryy Yu
Hi, Good to know. Can the Ooyala Spark job server run on YARN? Is there a job scheduler? On Fri, Apr 18, 2014 at 12:12 PM, All In A Days Work < allinadays...@gmail.com> wrote: > Hi, > > In 2013 spark summit, Ooyala had presented their spark Job server and > indicated that they wanted

Re: Spark Example Project, runnable on EMR, open sourced

2014-04-17 Thread Pdeyhim
Awesome! OK if we mention this on the AWS blog? Sent from my iPad > On Apr 17, 2014, at 7:10 AM, Alex Dean wrote: > > Hi all, > > Just a quick email to share a new GitHub project we've just released at > Snowplow: > > https://github.com/snowplow/spark-example-project > > It's an example Scala &

Ooyala Server - plans to merge it into Apache ?

2014-04-17 Thread All In A Days Work
Hi, At the 2013 Spark Summit, Ooyala presented their Spark job server and indicated that they wanted to open source the work. Is there any plan to merge this functionality into Spark itself, rather than offering it only as Ooyala's open-sourced version? Thanks,

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
got it, thank you. On Fri, Apr 18, 2014 at 9:55 AM, Cheng Lian wrote: > Ah, I’m not saying println is bad, it’s just that you need to go to the > right place to locate the output, e.g. you can check stdout of any executor > from the Web UI. > > > On Fri, Apr 18, 2014 at 9:48 AM, 诺铁 wrote: > >>

Re: distinct on huge dataset

2014-04-17 Thread Mayur Rustagi
Preferably increase the ulimit on your machines. Spark needs to access a lot of small files, so the number of open file handles is hard to keep down. — Sent from Mailbox On Fri, Apr 18, 2014 at 3:59 AM, Ryan Compton wrote: > Btw, I've got System.setProperty("spark.shuffle.consolidate.files", > "true") and use ex
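For reference, a hedged Scala sketch of the mitigations discussed in this thread (`records` is an illustrative RDD; the property name is the 0.8.x-era one quoted below):

    // Consolidate shuffle files so fewer file handles are open at once.
    System.setProperty("spark.shuffle.consolidate.files", "true")

    // A larger partition count for the wide operation also keeps individual
    // shuffle files smaller; distinct takes an explicit numPartitions.
    val deduped = records.distinct(2048)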

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
Ah, I’m not saying println is bad, it’s just that you need to go to the right place to locate the output, e.g. you can check stdout of any executor from the Web UI. On Fri, Apr 18, 2014 at 9:48 AM, 诺铁 wrote: > hi,Cheng, > > thank you for let me know this. so what do you think is better way to

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
Hi Cheng, thank you for letting me know this. So what do you think is a better way to debug? On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian wrote: > A tip: using println is only convenient when you are working with local > mode. When running Spark in clustering mode (standalone/YARN/Mesos), output >

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
A tip: using println is only convenient when you are working with local mode. When running Spark in clustering mode (standalone/YARN/Mesos), output of println goes to executor stdout. On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 wrote: > yeah, I got it.! > using println to debug is great for me to explo

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Yes, it should be data specific and perhaps we're biased toward the data sets that we are playing with. To put things in perspective, we're highly interested in (and I believe, our customers are): 1. large (hundreds of millions of rows) 2. multi-class classification - nowadays, dozens of target ca

Re: Valid spark streaming use case?

2014-04-17 Thread Tathagata Das
This is a good use case for DStream.updateStateByKey! This allows you to maintain arbitrary per-key state. Check out this example: https://github.com/tdas/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala Also take a look at the docume
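A minimal Scala sketch of the pattern, assuming a StreamingContext `ssc` and a DStream[String] `words` (both illustrative, not from the thread):

    // Fold this batch's values for a key into its running count.
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + values.sum)

    ssc.checkpoint("hdfs:///tmp/checkpoints") // updateStateByKey requires a checkpoint dir
    val counts = words.map(w => (w, 1)).updateStateByKey[Int](updateFunc)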

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
What kind of data are you training on? These effects are *highly* data dependent, and while saying "the depth of 10 is simply not adequate to build high-accuracy models" may be accurate for the particular problem you're modeling, it is not true in general. From a statistical perspective, I consider

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
Yeah, I got it! Using println to debug is great for me to explore Spark. Thank you very much for your kind help. On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Here's a way to debug something like this: > > scala> d5.keyBy(_.split(" ")(0)).reduc

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") and use ext3 (CentOS...) On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton wrote: > Does this continue in newer versions? (I'm on 0.8.0 now) > > When I use .distinct() on moderately large datasets (224GB, 8.5B rows, > I'm gue

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with: 14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException java.io.File

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
I believe that they show one example comparing depth 1 ensemble vs depth 3 ensemble but it is based on boosting, not bagging. On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das wrote: > Evan, > > Was not mllib decision tree implemented using ideas from Google's PLANET > paper...do the paper also prop

Re: RDD collect help

2014-04-17 Thread Flavio Pompermaier
Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo serialization for closures? Is there any problem with that? On Apr 17, 2014 11:10 PM, "Eugen Cepoi" wrote: > You have two kind of ser : data and closures. They both use java ser. This > means that in your function you reference an objec

Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
Evan, Wasn't the MLlib decision tree implemented using ideas from Google's PLANET paper... does the paper also propose growing a shallow tree? Thanks. Deb On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung wrote: > Additionally, the 'random features per node' (or mtry in R) is a very > important feat

Re: RDD collect help

2014-04-17 Thread Eugen Cepoi
You have two kinds of serialization: data and closures. They both use Java serialization by default. This means that if your function references an object outside of it, that object gets serialized with your task. To enable Kryo serialization for closures, set the spark.closure.serializer property. But usually I don't, as it allows me to detect such
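A sketch of the two settings being contrasted, using the property names from the Spark 0.9-era configuration (worth verifying against your version's docs):

    // Data serialization: switching to Kryo is generally safe and faster.
    System.setProperty("spark.serializer",
      "org.apache.spark.serializer.KryoSerializer")

    // Closure serialization defaults to Java; leaving it there, as Eugen
    // suggests, surfaces accidental references to non-serializable objects
    // early instead of masking them.
    // System.setProperty("spark.closure.serializer",
    //   "org.apache.spark.serializer.KryoSerializer")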

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Additionally, the 'random features per node' (or mtry in R) is a very important feature for Random Forest. The variance reduction comes if the trees are decorrelated from each other and often the random features per node does more than bootstrap samples. And this is something that would have to be

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Well, if you read the original paper, http://oz.berkeley.edu/~breiman/randomforest2001.pdf "Grow the tree using CART methodology to maximum size and do not prune." Now, the elements of statistical learning book on page 598 says that you could potentially overfit fully-grown regression random fores

Valid spark streaming use case?

2014-04-17 Thread xargsgrep
Hi, I'm completely new to Spark streaming (and Spark) and have been reading up on it and trying out various examples the past few days. I have a particular use case which I think it would work well for, but I wanted to put it out there and get some feedback on whether or not it actually would. The

Re: RDD collect help

2014-04-17 Thread Flavio Pompermaier
Now I have another problem... I have to pass one of these non-serializable objects to a PairFunction and I receive another non-serializable exception... it seems that Kryo doesn't work within functions. Am I wrong, or is this a limit of Spark? On Apr 15, 2014 1:36 PM, "Flavio Pompermaier" wrote: > Ok th

writing booleans w Calliope

2014-04-17 Thread Adrian Mocanu
Has anyone managed to write Booleans to Cassandra from an RDD with Calliope? My Booleans give compile-time errors: expression of type List[Any] does not conform to expected type Types.CQLRowValues. CQLColumnValue is defined as ByteBuffer: type CQLColumnValue = ByteBuffer For now I convert them to
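One possible workaround, sketched under the thread's definition of CQLColumnValue as a ByteBuffer (the single-byte encoding here is an assumption, not Calliope's documented wire format):

    import java.nio.ByteBuffer

    // Encode a Boolean as one byte so it conforms to the ByteBuffer alias.
    def booleanToByteBuffer(b: Boolean): ByteBuffer =
      ByteBuffer.wrap(Array[Byte](if (b) 1 else 0))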

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Sean Owen
Oh dear I read this as a build problem. I can build with the latest Java 7, including those versions of Spark and Mesos, no problem. I did not deploy them. Mesos does have some native libraries, so it might well be some kind of compatibility issue at that level. Anything more in the error log that

RE: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Steven Cox
Sure. Here it is. Pretty sure it's something else. Any suggestions on other avenues to investigate from folks who've seen this?

    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x7f543716cce9, pid=8260, tid=13226316544
    #
    # JRE version: J

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
If you can test it quickly, an option would be to try the exact same version that Sean used (1.7.0_51)? Maybe it was a bug fixed in 51 and a regression was introduced in 55 :-D Andy On Thu, Apr 17, 2014 at 9:36 PM, Steven Cox wrote: > FYI, I've tried older versions (jdk6.x), openjdk

RE: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Steven Cox
FYI, I've tried older versions (jdk6.x), openjdk. Also here's a fresh core dump on jdk7u55-b13:

    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x7f7c6b718d39, pid=7708, tid=140171900581632
    #
    # JRE version: Java(TM) SE Runtime Environment (7.0_

Re: Continuously running non-streaming jobs

2014-04-17 Thread Daniel Darabos
I'm quite new myself (just subscribed to the mailing list today :)), but this happens to be something we've had success with. So let me know if you hit any problems with this sort of usage. On Thu, Apr 17, 2014 at 9:11 PM, Jim Carroll wrote: > Daniel, > > I'm new to Spark but I thought that thr

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Hmm... can you provide some pointers to examples where deep trees are helpful? Typically with Decision Trees you limit depth (either directly or indirectly with minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reducti

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
No, of course, but I was guessing that some native libs imported into the project (to communicate with Mesos)... could miserably crash the JVM. Anyway, so you're telling us that with this Oracle version you don't have any issues using Spark on Mesos 0.18.0; that's interesting 'cause AFAIR, my last t

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Evan, I actually haven't heard of 'shallow' random forest. I think that the only scenarios where shallow trees are useful are boosting scenarios. AFAIK, Random Forest is a variance reducing technique and doesn't do much about bias (although some people claim that it does have some bias reducing e

Re: Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
Daniel, I'm new to Spark but I thought that thread hinted at the right answer. Thanks, Jim

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Sorry - I meant to say that "Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib is planned and coming soon." On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote: > Multiclass classification, Gradient Boosting, and

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Sean Owen
I don't know if it's anything you or the project is missing... that's just a JDK bug. FWIW I am on 1.7.0_51 and have not seen anything like that. I don't think it's a protobuf issue -- you don't crash the JVM with simple version incompatibilities :) -- Sean Owen | Director, Data Science | London

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
Hyea, I still have to try it myself (I'm trying to create GCE images with Spark on Mesos 0.18.0), but I think your change is one of the required ones; however, my gut feeling is that others will be needed to get this working. Actually, in my understanding, this core dump is due to protobuf incom

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Multiclass classification, Gradient Boosting, and Random Forest support for based on the recent Decision Tree implementation in MLlib. Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that pe

Re: Continuously running non-streaming jobs

2014-04-17 Thread Daniel Darabos
The linked thread does a good job answering your question. You should create a SparkContext at startup and re-use it for all of your queries. For example we create a SparkContext in a web server at startup, and are then able to use the Spark cluster for serving Ajax queries with latency of a second
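A minimal sketch of that setup, assuming a hypothetical master URL and request handler: the SparkContext is created once and every query reuses it, so requests pay only task-scheduling latency rather than context start-up time.

    import org.apache.spark.SparkContext

    object SharedSpark {
      // Created lazily on first use, then shared by all incoming requests.
      lazy val sc = new SparkContext("spark://master:7077", "query-server")
    }

    // Each request handler just calls into the long-lived context.
    def handleQuery(path: String): Long =
      SharedSpark.sc.textFile(path).count()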

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker. With RF, it's very easy to reach depth 100+ with even only 100,000+ rows (because trees

Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Steven Cox
So I tried a fix found on the list... "The issue was due to mesos version mismatch as I am using latest mesos 0.17.0, but spark uses 0.13.0. Fixed by updating the SparkBuild.scala to latest version." I changed this line in SparkBuild.scala "org.apache.mesos" % "mesos"
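The change being described is the Mesos dependency version in project/SparkBuild.scala; a sketch of the bumped line (the 0.18.0 string comes from this thread's subject, not verified here):

    // Match the Mesos artifact to the cluster's Mesos version.
    "org.apache.mesos" % "mesos" % "0.18.0",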

Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
Is there a way to create continuously-running, or at least continuously-loaded, jobs that can be 'invoked' rather than 'sent', to avoid the job-creation overhead of a couple of seconds? I read through the following: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-

Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
MLlib has a decision tree... there is a RF PR which is not active now... take that and swap the tree builder with the fast tree builder that's in MLlib... search for the Spark JIRA... the code is based on the Google PLANET paper... I am sure people on the dev list are already working on it... send an email to

Re: confused by reduceByKey usage

2014-04-17 Thread Daniel Darabos
Here's a way to debug something like this:

    scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1, v2) => {
      println("v1: " + v1)
      println("v2: " + v2)
      (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString
    }).collect

You get:

    v1: 1 2 3 4 5
    v2: 1 2 3 4 5
    v1: 4 v

confused by reduceByKey usage

2014-04-17 Thread 诺铁
Hi, I am new to Spark. When trying to write some simple tests in the Spark shell, I met the following problem. I create a very small text file named 5.txt:

    1 2 3 4 5
    1 2 3 4 5
    1 2 3 4 5

and experiment in the Spark shell:

    scala> val d5 = sc.textFile("5.txt").cache()
    d5: org.apache.spark.rdd.RDD[String] = Ma

Re: Spark program thows OutOfMemoryError

2014-04-17 Thread yypvsxf19870706
How many tasks are there in your job? Sent from my iPhone On 2014-4-17, at 16:24, Qin Wei wrote: > Hi Andre, thanks a lot for your reply, but I still get the same exception; > the complete exception message is as below: > > Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task > 1.0:9

Spark Example Project, runnable on EMR, open sourced

2014-04-17 Thread Alex Dean
Hi all, Just a quick email to share a new GitHub project we've just released at Snowplow: https://github.com/snowplow/spark-example-project It's an example Scala & SBT project which can assemble a fat jar ready for running on Amazon Elastic MapReduce. It includes Specs2 tests too. The blog post

Re: Shark: ClassNotFoundException org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

2014-04-17 Thread Gerd Koenig
Hi Arpit, I didn't build it; I am using the prebuilt version described here: http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html including adding e.g. the mentioned jar. br...Gerd... On 17 April 2014 15:49, Arpit Tak wrote: > Just for curiosity , as you are using Cloudera-Mana

Re: Shark: ClassNotFoundException org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

2014-04-17 Thread Arpit Tak
Just out of curiosity, as you are using Cloudera Manager Hadoop and Spark: how did you build Shark for it? Are you able to read any file from HDFS... did you try that out? Regards, Arpit Tak On Thu, Apr 17, 2014 at 7:07 PM, ge ko wrote: > Hi, > > the error java.lang.ClassNotFoundE

Re: Shark: ClassNotFoundException org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

2014-04-17 Thread ge ko
Hi, the error java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat has been resolved by adding parquet-hive-bundle-1.4.1.jar to Shark's lib folder. Now the Hive metastore can be read successfully (including the parquet-based table). But if I want to select fr

Re: Spark on Yarn or Mesos?

2014-04-17 Thread Arpit Tak
Hi Wei, Take a look at this post... http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-td2016.html Regards, Arpit Tak On Thu, Apr 17, 2014 at 3:42 PM, Wei Wang wrote: > Hi, there > > I would like to know is there any differences

Spark on Yarn or Mesos?

2014-04-17 Thread Wei Wang
Hi there, I would like to know whether there are any differences between Spark on YARN and Spark on Mesos. Is there any comparison between them? What are the advantages and disadvantages of each? Is there any criterion for choosing between YARN and Mesos? BTW, we need MPI in our framework, and I

Shark: ClassNotFoundException org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

2014-04-17 Thread ge ko
Hi, I want to select from a parquet-based table in Shark, but receive the error:

    shark> select * from wl_parquet;
    14/04/17 11:33:49 INFO shark.SharkCliDriver: Execution Mode: shark
    14/04/17 11:33:49 INFO ql.Driver:
    14/04/17 11:33:49 INFO ql.Driver:
    14/04/17 11:33:49 INFO ql.Driver:
    14/04/17 11

Re: what is the difference between element and partition?

2014-04-17 Thread wxhsdp
What do you mean by "element"?

Re: Shark: class java.io.IOException: Cannot run program "/bin/java"

2014-04-17 Thread Gerd Koenig
thanks Arpit, gotcha ;) On 16 April 2014 20:08, Arpit Tak wrote: > just set your java class path properly > > export JAVA_HOME=/usr/lib/jvm/java-7-. (somewhat like this...whatever > version you having) > > it will work > > Regards, > Arpit > > > On Wed, Apr 16, 2014 at 1:24 AM, ge ko w

Re: groupByKey(None) returns partitions according to the keys?

2014-04-17 Thread wxhsdp
No, the partition number is determined by the parameter you pass to groupByKey; see http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions for details. I suggest reading some docs before asking questions. Joe L wrote > I was wondering if groupByKey returns 2 partition
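A quick Scala illustration of the point (`pairs` is illustrative):

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val grouped = pairs.groupByKey(2)   // explicit numPartitions
    println(grouped.partitions.length)  // 2, regardless of how many distinct keys exist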

Random Forest on Spark

2014-04-17 Thread Laeeq Ahmed
Hi, For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation of RF. What other open-source RF implementations would be good to use with Spark in terms of speed? Regards, Laeeq Ahmed, KTH, Sweden.

Re: PySpark still reading only text?

2014-04-17 Thread Bertrand Dechoux
According to the Spark SQL documentation, indeed, this project allows Python to be used while reading/writing tables, i.e., data which is not necessarily in text format. Thanks a lot! Bertrand Dechoux On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux wrote: > Thanks for the JIRA reference. I really ne

Re: using saveAsNewAPIHadoopFile with OrcOutputFormat

2014-04-17 Thread Nick Pentreath
ES formats are pretty easy to use. Reading:

    val conf = new Configuration()
    conf.set("es.resource", "index/type")
    conf.set("es.query", "?q=*")
    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[EsInputFormat[NullWritable, LinkedMapWritable]],
      classOf[NullWritable],
      classOf[LinkedMapWritable]
    )

The only g

Re: Using google cloud storage for spark big data

2014-04-17 Thread Andras Nemeth
Hello! On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia wrote: > Hi, > > Google has published a new connector for Hadoop: Google Cloud Storage, > which is an equivalent of Amazon S3: > > > googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html > This

Re: Re: Spark program thows OutOfMemoryError

2014-04-17 Thread Qin Wei
Hi Andre, thanks a lot for your reply, but I still get the same exception; the complete exception message is as below: Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 1.0:9 failed 4 times (most recent failure: Exception failure: java.lang.OutOfMemoryError: Jav

Re: Create cache fails on first time

2014-04-17 Thread Andre Bois-Crettez
It could be a GC issue: the first time, it triggers a full GC that takes too much time? Make sure you have Xms and Xmx set to the same value, try -XX:+UseConcMarkSweepGC, and analyse the GC logs. André Bois-Crettez On 2014-04-16 16:44, Arpit Tak wrote: I am loading some data(25GB) in shark from hdfs : sp

Re: PySpark still reading only text?

2014-04-17 Thread Bertrand Dechoux
Thanks for the JIRA reference. I really need to look at Spark SQL. Am I right to understand that due to Spark SQL, Hive data can be read (and it does not need to be in a text format) and then 'classical' Spark can work on this extraction? It seems logical, but I haven't worked with Spark SQL as of now