Your SQL query will look something like this in DataFrames (but as Ted said, check the docs for the exact signatures). Note that the grouping columns are carried into the result automatically, so they only appear in groupBy, not in agg:

smallsales
  .join(times, "time_id")
  .join(channels, "channel_id")
  .groupBy("calendar_month_desc", "channel_desc")
  .agg(sum(col("amount_sold")).as("TotalSales"))
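In full, a minimal self-contained sketch of the same pipeline (Spark 1.x API, as used in this thread; it assumes the sc provided by spark-shell and that the three tables live in the current Hive database, loading them with table() instead of SQL):

import org.apache.spark.sql.functions.{col, sum}

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use oraclehadoop")

// Load the Hive tables as DataFrames rather than querying them with SQL
val smallsales = hiveContext.table("smallsales")
val times      = hiveContext.table("times")
val channels   = hiveContext.table("channels")

// Equi-join on the shared key columns, then group and aggregate;
// the grouping columns appear in the result automatically
val totals = smallsales
  .join(times, "time_id")
  .join(channels, "channel_id")
  .groupBy("calendar_month_desc", "channel_desc")
  .agg(sum(col("amount_sold")).as("TotalSales"))

totals.show()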
On 23 February 2016 at 01:50, Ted Yu <yuzhih...@gmail.com> wrote:

> Mich:
> Please refer to the following test suite for examples of various DataFrame
> operations:
>
> sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
>
> On Mon, Feb 22, 2016 at 4:39 PM, Mich Talebzadeh <
> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
>> Thanks Dean.
>>
>> I gather that if I wanted to get the whole thing done through FP, with little or no
>> use of SQL, then in the first instance I get the data set from Hive, i.e.
>>
>> val rs = HiveContext.sql("""SELECT t.calendar_month_desc, c.channel_desc,
>> SUM(s.amount_sold) AS TotalSales
>> FROM smallsales s, times t, channels c
>> WHERE s.time_id = t.time_id
>> AND s.channel_id = c.channel_id
>> GROUP BY t.calendar_month_desc, c.channel_desc
>> """)
>>
>> Could I use DataFrames to do the above SQL joins as well, i.e. can what I am
>> doing above also be done in DataFrames without any SQL? Is that possible,
>> or do I have to use some form of SQL?
>>
>> The rest I can do simply using DFs.
>>
>> Thanks
>>
>> On 23/02/2016 00:26, Dean Wampler wrote:
>>
>> Kevin gave you the answer you need, but I'd like to comment on your
>> subject line. SQL is a limited form of FP. Sure, there are no anonymous
>> functions and other limitations, but it's declarative, like good FP
>> programs should be, and it offers an important subset of the operators
>> ("combinators") you want.
>>
>> Also, on a practical note, use the DataFrame API whenever you can, rather
>> than dropping down to the RDD API, because the DataFrame API is far more
>> performant. It's a classic case where restricting your options enables more
>> aggressive optimizations behind the scenes. Michael Armbrust's talk at Spark
>> Summit East nicely made this point.
>> http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming
>>
>> dean
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>> Typesafe <http://typesafe.com>
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>>
>> On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott <kevin.r.mell...@gmail.com
>> > wrote:
>>
>>> In your example, the *rs* instance should be a DataFrame object. In
>>> other words, the result of *HiveContext.sql* is a DataFrame that you
>>> can manipulate using *filter*, *map*, etc.
>>>
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>>
>>> On Mon, Feb 22, 2016 at 5:16 PM, Mich Talebzadeh <
>>> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have data stored in Hive tables on which I want to do some simple
>>>> manipulation.
>>>>
>>>> Currently in Spark I get the result set using SQL against the Hive tables
>>>> and register it as a temporary table in Spark.
>>>>
>>>> Ideally I would like to get the result set into a DF and work on the DF
>>>> to slice and dice the data using functional programming with filter, map,
>>>> split, etc.
>>>>
>>>> I wanted to get some ideas on how to go about it.
>>>>
>>>> thanks
>>>>
>>>> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>
>>>> HiveContext.sql("use oraclehadoop")
>>>> val rs = HiveContext.sql("""SELECT t.calendar_month_desc,
>>>> c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>>> FROM smallsales s, times t, channels c
>>>> WHERE s.time_id = t.time_id
>>>> AND s.channel_id = c.channel_id
>>>> GROUP BY t.calendar_month_desc, c.channel_desc
>>>> """)
>>>> *rs.registerTempTable("tmp")*
>>>>
>>>> HiveContext.sql("""
>>>> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
>>>> FROM tmp
>>>> ORDER BY MONTH, CHANNEL
>>>> """).collect.foreach(println)
>>>>
>>>> HiveContext.sql("""
>>>> SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
>>>> FROM tmp
>>>> GROUP BY channel_desc
>>>> ORDER BY SALES DESC
>>>> """).collect.foreach(println)
>>>>
>>>> --
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> http://talebzadehmich.wordpress.com
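For completeness, the two follow-up reports in the quoted code can also be expressed directly against the rs DataFrame, with no registerTempTable at all. A sketch, assuming the Spark 1.x API used in this thread and the column names the query produces (calendar_month_desc, channel_desc, TotalSales):

import org.apache.spark.sql.functions.{col, max}

// Monthly sales per channel, ordered by month then channel
rs.select(col("calendar_month_desc").as("MONTH"),
          col("channel_desc").as("CHANNEL"),
          col("TotalSales"))
  .orderBy("MONTH", "CHANNEL")
  .collect.foreach(println)

// Highest monthly total per channel, largest first
rs.groupBy("channel_desc")
  .agg(max(col("TotalSales")).as("SALES"))
  .orderBy(col("SALES").desc)
  .collect.foreach(println)

Both should print the same rows as the SQL versions: orderBy, groupBy and agg are the DataFrame counterparts of ORDER BY, GROUP BY and the SQL aggregate functions.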