Mich: Please refer to the following test suite for examples of various DataFrame operations:

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
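For a flavour of the style used there, here is a minimal, self-contained sketch of typical DataFrame operations; the toy data and column names are made up purely for illustration, and sc is assumed to be an existing SparkContext:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
import sqlContext.implicits._

// Toy data; the column names are illustrative only
val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")

df.filter($"value" > 1)               // declarative predicate, no SQL string
  .groupBy($"key")
  .agg(sum($"value").as("total"))
  .orderBy($"total".desc)
  .show()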
On Mon, Feb 22, 2016 at 4:39 PM, Mich Talebzadeh <mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Thanks Dean.
>
> I gather that if I wanted to do the whole thing through FP, with little or
> no use of SQL, then in the first instance I would get the data set from
> Hive, i.e.
>
> val rs = HiveContext.sql("""SELECT t.calendar_month_desc, c.channel_desc,
>   SUM(s.amount_sold) AS TotalSales
> FROM smallsales s, times t, channels c
> WHERE s.time_id = t.time_id
> AND s.channel_id = c.channel_id
> GROUP BY t.calendar_month_desc, c.channel_desc
> """)
>
> Could I use the DataFrame API to do the above SQL joins as well, so that
> the whole thing is done without any SQL? Is that possible, or do I have to
> use some form of SQL?
>
> The rest I can do simply using DataFrames.
>
> Thanks
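For reference, the query above can indeed be written entirely with the DataFrame API. A minimal sketch with no SQL text, assuming the HiveContext val created in the original message below, and the table and column names used in the SQL:

import org.apache.spark.sql.functions.sum

// Load the Hive tables as DataFrames (table names as in the SQL above)
val sales    = HiveContext.table("smallsales")
val times    = HiveContext.table("times")
val channels = HiveContext.table("channels")

// The same joins, grouping, and aggregate, expressed as DataFrame combinators
val rs = sales
  .join(times, sales("time_id") === times("time_id"))
  .join(channels, sales("channel_id") === channels("channel_id"))
  .groupBy(times("calendar_month_desc"), channels("channel_desc"))
  .agg(sum(sales("amount_sold")).as("TotalSales"))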
> On 23/02/2016 00:26, Dean Wampler wrote:
>
> Kevin gave you the answer you need, but I'd like to comment on your subject
> line. SQL is a limited form of FP. Sure, there are no anonymous functions,
> among other limitations, but it is declarative, as good FP programs should
> be, and it offers an important subset of the operators ("combinators") you
> want.
>
> Also, on a practical note: use the DataFrame API whenever you can, rather
> than dropping down to the RDD API, because the DataFrame API is far more
> performant. It's a classic case where restricting your options enables more
> aggressive optimizations behind the scenes. Michael Armbrust's talk at
> Spark Summit East made this point nicely:
> http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>
>> In your example, the rs instance should be a DataFrame object. In other
>> words, the result of HiveContext.sql is a DataFrame that you can
>> manipulate using filter, map, etc.
>>
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>
>> On Mon, Feb 22, 2016 at 5:16 PM, Mich Talebzadeh <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>>
>>> Hi,
>>>
>>> I have data stored in Hive tables on which I want to do some simple
>>> manipulation.
>>>
>>> Currently in Spark I get the result set from the Hive tables using SQL
>>> and register it as a temporary table in Spark.
>>>
>>> Ideally I would like to get the result set into a DataFrame and work on
>>> the DataFrame to slice and dice the data using functional programming
>>> with filter, map, split, etc.
>>>
>>> I wanted to get some ideas on how to go about it.
>>>
>>> Thanks
>>>
>>> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>
>>> HiveContext.sql("use oraclehadoop")
>>>
>>> val rs = HiveContext.sql("""SELECT t.calendar_month_desc, c.channel_desc,
>>>   SUM(s.amount_sold) AS TotalSales
>>> FROM smallsales s, times t, channels c
>>> WHERE s.time_id = t.time_id
>>> AND s.channel_id = c.channel_id
>>> GROUP BY t.calendar_month_desc, c.channel_desc
>>> """)
>>>
>>> rs.registerTempTable("tmp")
>>>
>>> HiveContext.sql("""
>>> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
>>> FROM tmp
>>> ORDER BY MONTH, CHANNEL
>>> """).collect.foreach(println)
>>>
>>> HiveContext.sql("""
>>> SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
>>> FROM tmp
>>> GROUP BY channel_desc
>>> ORDER BY SALES DESC
>>> """).collect.foreach(println)
>>>
>>> --
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>
> --
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
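Finally, as Kevin points out, rs above is already a DataFrame, so the temporary table is optional. A sketch of the two follow-up queries done directly with DataFrame operators, with the column names taken from the query that produced rs:

import org.apache.spark.sql.functions.{col, max}

// First follow-up query: rename the columns, then order by month and channel
rs.select(col("calendar_month_desc").as("MONTH"),
          col("channel_desc").as("CHANNEL"),
          col("TotalSales"))
  .orderBy("MONTH", "CHANNEL")
  .collect.foreach(println)

// Second follow-up query: maximum monthly sales per channel
rs.groupBy("channel_desc")
  .agg(max("TotalSales").as("SALES"))
  .orderBy(col("SALES").desc)
  .collect.foreach(println)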