Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Nick Pentreath
It shouldn't be too tricky to use the Spark job server to create a job where the SQL statement is an input argument, which is executed and the result returned. This gives remote server execution but no metastore layer. On Mon, Mar 31, 2014 at 6:56 AM, Manoj Samel wr

batching the output

2014-03-30 Thread Vipul Pandey
Hi, I need to batch the values in my final RDD before writing out to HDFS. The idea is to batch multiple "rows" into a protobuf and write those batches out, mostly to save some space as a lot of the metadata is the same. E.g. given 1,2,3,4,5,6, just batch them as (1,2), (3,4), (5,6) and save three records ins
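One hedged way to sketch this (not from the thread): group each partition's elements into fixed-size batches with mapPartitions before writing. The batch size of 2, the toy input, and the plain saveAsObjectFile call are stand-ins for the real data and the protobuf serialization.

  // Assumes a spark-shell session where `sc` is the SparkContext.
  val rows = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))

  // Group every 2 elements within a partition: (1,2), (3,4), (5,6).
  val batched = rows.mapPartitions(_.grouped(2).map(_.toList))

  // Stand-in for building one protobuf per batch and writing it out.
  batched.saveAsObjectFile("hdfs:///tmp/batched-rows")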

Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi Aaron, unionAll is a workaround ... * unionAll preserves duplicates, whereas union does not. * SQL union and unionAll produce the same output format, i.e. another SQL result set, vs. different RDD types here. * Understand the existing union contract issue. This may be a class hierarchy discussion for SchemaRDD,

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-30 Thread Vipul Pandey
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue. Any word on this one? On Mar 27, 2014, at 6:41 PM, Kanwaldeep wrote: > We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with > a Kafka stream setup. I have Protocol Buffer 2.5 as part of the uber jar

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Manoj Samel
Thanks Matei. Any thoughts on providing a standalone SharkServer equivalent on SparkSQL? Manoj On Sun, Mar 30, 2014 at 7:35 PM, Matei Zaharia wrote: > Hi Manoj, > > At the current time, for a drop-in replacement of Hive, it will be best to > stick with Shark. Over time, Shark will use the Spark SQ

groupBy RDD does not have grouping column ?

2014-03-30 Thread Manoj Samel
Hi, If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the resulting RDD should have 'a, 'foo and 'bar. The result RDD just shows 'foo and 'bar and is missing 'a. Thoughts? Thanks, Manoj
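For reference, a hedged sketch of the usual workaround with this language-integrated DSL: the output columns of groupBy are exactly the aggregate expressions passed in the second argument list, so repeating 'a there keeps it in the result. The table name below is made up, and `import sqlContext._` (as in the SQL programming guide) is assumed.

  // `lines` stands for whatever SchemaRDD the query runs against.
  val grouped = lines.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)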

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Matei Zaharia
Hi Manoj, At the current time, for a drop-in replacement of Hive, it will be best to stick with Shark. Over time, Shark will use the Spark SQL backend, but it should remain deployable the way it is today (including launching the SharkServer, using the Hive CLI, etc.). Spark SQL is better for accessin

Re: SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, Would the same issue be present for other Java types like Date? Converting the person/teenager example on Patrick's page reproduces the problem ... Thanks, scala> import scala.math import scala.math scala> case class Person(name: String, age: BigDecimal) defined class Person scala> val pe

Re: SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Mayur Rustagi
+1. Have done a few installations of Shark with customers using Hive; they love it. Would be good to maintain compatibility with the Metastore & QL till we have a substantial reason to break off (like BlinkDB). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double])

2014-03-30 Thread Mayur Rustagi
The Scala object needs to be sent to the workers to be used as an RDD; parallelize is a way to do that. What are you looking to do? You can serialize the Scala object to HDFS/disk & load it from there. Regards, Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
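A minimal sketch of the parallelize route mentioned above, assuming a spark-shell session where `sc` is already defined; the sample data is made up.

  import scala.collection.mutable.ArrayBuffer

  val buf = ArrayBuffer((1, 0.5), (2, 1.5), (3, 2.5))

  // ArrayBuffer is a Seq, so it can be handed to parallelize directly.
  val rdd = sc.parallelize(buf)   // org.apache.spark.rdd.RDD[(Int, Double)]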

Re: Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Shivaram Venkataraman
That is a good idea, though I am not sure how much it will help, as the time to rsync is also dependent on just the data size being copied. The other problem is that sometimes we have dependencies across packages, so the first needs to be running before the second can start, etc. However, I agree that it takes

Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Aureliano Buendia
Hi, Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup time. Perhaps the script could be restructured this way: instead of rsyncing N times per application, we could have

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0, which is not yet released. In 1.0, Spark applications can persist their state so that the UI can be reloaded after they have completed. - Patrick On Sun, Mar 30, 2014 at 10:30 AM, David Thomas wrote: > Is there a way to see 'Application Detail UI' page (at mast
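A hedged sketch of how that is expected to be switched on in the 1.0 line, via the event log settings; the app name and log directory below are assumptions.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("my-app")
    .set("spark.eventLog.enabled", "true")                 // record UI events
    .set("spark.eventLog.dir", "hdfs:///tmp/spark-events") // where to keep them
  val sc = new SparkContext(conf)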

Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Aaron Davidson
Looks like there is a "unionAll" function on SchemaRDD which will do what you want. The contract of RDD#union is unfortunately too general to allow it to return a SchemaRDD without downcasting. On Sun, Mar 30, 2014 at 7:56 AM, Manoj Samel wrote: > Hi, > > I am trying SparkSQL based on the exampl
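A short sketch of the difference, mirroring the people/teenagers example from the original question (names adjusted; `people` is assumed to be the SchemaRDD built as in the SQL programming guide):

  val olderThanTeens   = people.where('age > 19)
  val youngerThanTeens = people.where('age < 13)

  val viaUnionAll = olderThanTeens.unionAll(youngerThanTeens) // SchemaRDD, keeps duplicates
  val viaUnion    = olderThanTeens.union(youngerThanTeens)    // plain RDD[Row]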

Re: SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-30 Thread Aaron Davidson
Well, the error is coming from this case statement not matching on the BigDecimal type: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L41 This seems to be a bug because there is a corresponding Catalyst DataType for BigD

Re: SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-30 Thread smallmonkey...@hotmail.com
Can I get the whole operation? Then I can try to locate the error. smallmonkey...@hotmail.com From: Manoj Samel Date: 2014-03-31 01:16 To: user Subject: SparkSQL "where" with BigDecimal type gives stacktrace Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to Doub

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see the 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications; I would like to see the various numbers for the application after it has completed.

SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to Double works ... scala> case class JournalLine(account: String, credit: BigDecimal, debit: BigDecimal, date: String, company: String, currency: String, costcenter: String, region: String) defined class JournalLine
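For reference, the workaround mentioned above boils down to declaring the numeric fields as Double (or converting the values before registering the table); a sketch:

  // Works around the snapshot's limitation discussed later in this thread:
  // Catalyst's ScalaReflection does not yet map scala.BigDecimal fields.
  case class JournalLine(account: String, credit: Double, debit: Double,
                         date: String, company: String, currency: String,
                         costcenter: String, region: String)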

Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi, I am trying SparkSQL based on the example in the docs ... val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) val olderThanTeans = people.where('age > 19) val youngerThanTeans = people.where('age < 13) val

Error in SparkSQL Example

2014-03-30 Thread Manoj Samel
Hi, On http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html, I am trying to run the code under "Writing Language-Integrated Relational Queries" (I have the 1.0.0 snapshot). I am running into an error on val people: RDD[Person] // An RDD of case class objects, from the first example.
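The line that fails is a placeholder in the guide rather than runnable Scala; a sketch of how `people` is actually built in the guide's first example (the path is the guide's, so adjust it to your local copy):

  case class Person(name: String, age: Int)

  // Assumes a spark-shell session where `sc` is the SparkContext.
  val people = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))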

Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double])

2014-03-30 Thread yh18190
Hi, Can we directly convert a Scala collection to a Spark RDD type without using the parallelize method? Is there any way to create a custom converted RDD datatype from a Scala type using some typecast like that? Please suggest. -- View this message in context: http://apache-spark-user-list.1001

Re: Cross validation is missing in machine learning examples

2014-03-30 Thread Christopher Nguyen
Aureliano, you're correct that this is not "validation error", which is computed as the residuals on out-of-training-sample data, and helps minimize overfit variance. However, in this example, the errors are correctly referred to as "training error", which is what you might compute on a per-iterat
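A hedged sketch of what computing an actual validation error would look like in this setting: hold out a slice of the data, fit on the rest, and measure mean squared error on the held-out points. The 80/20 split, the SGD regression model, and the variable names are assumptions rather than the example's code; `data` is assumed to be an RDD[LabeledPoint].

  import org.apache.spark.SparkContext._
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  // randomSplit is available in newer Spark versions; older ones need a manual split.
  val Array(training, validation) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

  val model = LinearRegressionWithSGD.train(training, 100)

  // Training error (what the example reports) vs. validation error (held-out data).
  def mse(points: org.apache.spark.rdd.RDD[LabeledPoint]) =
    points.map(p => math.pow(model.predict(p.features) - p.label, 2)).mean()

  val trainingError   = mse(training)
  val validationError = mse(validation)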

Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
In particular, we are using this dataset: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 Ankur On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave wrote: > The GraphX team has been using Wikipedia dumps from > http://dumps.wikimedia.org/enwik

Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
The GraphX team has been using Wikipedia dumps from http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less convenient format than the Freebase dumps. In particular, an article may span multiple lines, so more involved input parsing is required. Dan Crankshaw (cc'd) wrote a driver t