Re: dataframe average error: Float does not take parameters

2015-10-21 Thread Carol McDonald
DataFrame = [min(count): bigint, avg(count): double]

scala> res.show
+----------+----------+
|min(count)|avg(count)|
+----------+----------+
|         1|       1.0|
+----------+----------+

scala> res.printSchema
root
 |-- min(coun

dataframe average error: Float does not take parameters

2015-10-21 Thread Carol McDonald
This used to work:

// What's the min number of bids per item? What's the average? What's the max?
auction.groupBy("item", "auctionid").count.agg(min("count"), avg("count"), max("count")).show
// MIN(count)  AVG(count)          MAX(count)
// 1           16.992025518341308  75

but this now gives an error: val
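The aggregation the Spark snippet performs can be sketched without Spark at all. This is a plain-Scala analogue (not the DataFrame API), and the bid data is invented for illustration:

```scala
// Plain-Scala analogue of groupBy("item", "auctionid").count followed by
// agg(min, avg, max) over the per-group counts. Sample data is made up.
case class Bid(auctionId: String, item: String)

val bids = Seq(
  Bid("a1", "cartier"), Bid("a1", "cartier"), Bid("a2", "palm"),
  Bid("a3", "xbox"), Bid("a3", "xbox"), Bid("a3", "xbox")
)

// count bids per (item, auctionid) group
val counts: Seq[Int] =
  bids.groupBy(b => (b.item, b.auctionId)).values.map(_.size).toSeq

// then aggregate the counts
val minCount = counts.min                        // 1
val avgCount = counts.sum.toDouble / counts.size // 2.0
val maxCount = counts.max                        // 3
```

The logic is identical; only the execution engine differs.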

Re: Top 10 count

2015-10-20 Thread Carol McDonald
// sort by the 2nd element
Sorting.quickSort(pairs)(Ordering.by[(String, Int, Int), Int](_._2))
// sort by the 3rd element, then the 1st
Sorting.quickSort(pairs)(Ordering[(Int, String)].on(x => (x._3, x._1)))

On Tue, Oct 20, 2015 at 11:33 AM, Carol McDonald wrote:
> this works
> …
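The two `Sorting.quickSort` calls above can be checked in isolation. This is a runnable sketch with invented sample triples:

```scala
import scala.util.Sorting

// sample (String, Int, Int) triples, made up for illustration
val pairs = Array(("b", 3, 1), ("a", 1, 3), ("c", 2, 2))

// sort in place by the 2nd element
Sorting.quickSort(pairs)(Ordering.by[(String, Int, Int), Int](_._2))
val by2nd = pairs.toList
// List(("a",1,3), ("c",2,2), ("b",3,1))

// sort in place by the 3rd element, then the 1st
Sorting.quickSort(pairs)(Ordering[(Int, String)].on(x => (x._3, x._1)))
val by3rdThen1st = pairs.toList
// List(("b",3,1), ("c",2,2), ("a",1,3))
```

`Ordering.by` takes the sort key directly, while `Ordering[...].on` reuses an existing tuple ordering on a projected key, which is handy for multi-field sorts.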

Re: Top 10 count

2015-10-20 Thread Carol McDonald
…can write "Ordering.by(_._2)" to be more concise (not 100% sure about the syntax off the top of my head).

On Tue, Oct 20, 2015 at 3:56 PM, Carol McDonald wrote:
>> To find the top 10 counts, which is better: using top(10) with an Ordering
>> on the…

Top 10 count

2015-10-20 Thread Carol McDonald
To find the top 10 counts, which is better: using top(10) with an Ordering on the value, or swapping the key and value and ordering on the key? For example, which is better below, or does it matter?

val top10 = logs.filter(log => log.responseCode != 200).map(log => (log.endpoint, 1)).reduceByKey(_ + _)
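On a local collection the two approaches are easy to compare. This is a plain-Scala sketch (not the RDD API; in Spark these would roughly correspond to `top(n)(Ordering.by(_._2))` versus `map(_.swap)` plus a sort on the key), with invented (endpoint, count) pairs:

```scala
// invented (endpoint, count) pairs for illustration
val counts = Seq(("/a", 5), ("/b", 12), ("/c", 3), ("/d", 9))

// 1) order on the value directly
val topByValue = counts.sortBy(-_._2).take(2)

// 2) swap key and value, order on the (now numeric) key, swap back
val topBySwap = counts.map(_.swap).sortBy(-_._1).take(2).map(_.swap)

// both yield List(("/b",12), ("/d",9))
```

Both produce the same result; with RDDs the practical difference is that `top(n)` avoids a full sort and shuffle, whereas swap-and-sort materializes a globally sorted dataset.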

Re: correct use of DStream foreachRDD

2015-08-28 Thread Carol McDonald
…nvertToPut)" should be sufficient. In slightly older versions of Spark you have to import SparkContext._ to get these implicits.)

On Fri, Aug 28, 2015 at 3:29 PM, Carol McDonald wrote:
> I would like to make sure that I am using the DStream foreachRDD
> opera…

correct use of DStream foreachRDD

2015-08-28 Thread Carol McDonald
I would like to make sure that I am using the DStream foreachRDD operation correctly. I would like to read from a DStream, transform the input, and write to HBase. The code below works, but I became confused when I read "Note that the function *func* is executed in the driver process". val
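The point behind that note is that the closure passed to foreachRDD runs on the driver, but RDD actions inside it (such as foreachPartition) run on the executors, so per-partition setup like opening a connection belongs inside foreachPartition. The pattern can be simulated without Spark; here partitions are plain Seqs and the "connection" is just a log entry, all invented for illustration:

```scala
import scala.collection.mutable.ArrayBuffer

// simulate an RDD as a list of partitions (data invented)
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))

val opened = ArrayBuffer.empty[Int] // log of "connection" openings
var written = 0

// the foreachPartition pattern: one connection per partition,
// reused for every record in that partition, then closed
partitions.zipWithIndex.foreach { case (partition, id) =>
  opened += id                                 // "open" one connection
  partition.foreach { record => written += 1 } // write each record with it
}                                              // "close" the connection

// three partitions -> three connections opened, six records written
```

Opening the connection per record instead would cost one connection per element; opening it outside the partition loop (on the driver) would require the connection object to be serializable, which database handles usually are not.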

Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Carol McDonald
I agree, I found this book very useful for getting started with Spark and Eclipse.

On Tue, Jul 28, 2015 at 11:10 AM, Petar Zecevic wrote:
> Sorry about self-promotion, but there's a really nice tutorial for setting
> up Eclipse for Spark in the "Spark in Action" book:
> http://www.manning.com/bona

Re: dataframes sql order by not total ordering

2015-07-21 Thread Carol McDonald
…otherwise subsequent operations (such as the join) could reorder the tuples.

On Mon, Jul 20, 2015 at 9:25 AM, Carol McDonald wrote:
>> the following query on the MovieLens dataset is sorting by the count of
>> ratings for a movie. It looks like the results are or…

dataframes sql order by not total ordering

2015-07-20 Thread Carol McDonald
The following query on the MovieLens dataset sorts by the count of ratings for a movie. It looks like the results are ordered by partition?

scala> val results = sqlContext.sql("select movies.title, movierates.maxr, movierates.minr, movierates.cntu from (SELECT ratings.product, max(ratings.
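Why per-partition ordering is not a total ordering is easy to show on plain collections. This sketch simulates two partitions (contents invented): sorting each partition and concatenating is not the same as sorting the whole dataset.

```scala
// simulate a dataset split across two partitions (values invented)
val partitions = Seq(Seq(9, 1, 5), Seq(4, 8, 2))

// sort within each partition, then concatenate: looks sorted per chunk only
val perPartition = partitions.map(_.sorted).flatten // List(1,5,9,2,4,8)

// a true total ordering sorts across all partitions
val global = partitions.flatten.sorted              // List(1,2,4,5,8,9)

val totallyOrdered = perPartition == global         // false
```

This is the symptom described in the thread: a sort applied before a shuffle-producing operation (or applied partition-locally) can leave results that are only ordered within each partition.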

Re: ALS run method versus ALS train versus ALS fit and transform

2015-07-17 Thread Carol McDonald
…API. Similar ideas, but a different API.

On Wed, Jul 15, 2015 at 9:55 PM, Carol McDonald wrote:
> In the Spark mllib examples MovieLensALS.scala ALS run is used, however in
> the movie recommendation with mllib tutorial ALS train is used, What is th…

ALS run method versus ALS train versus ALS fit and transform

2015-07-15 Thread Carol McDonald
In the Spark MLlib examples, MovieLensALS.scala uses ALS run, but the movie recommendation with MLlib tutorial uses ALS train. What is the difference, and when should you use one versus the other?

val model = new ALS() .setRank(params.rank) .setIterations(params.numIterati