Your SQL query will look something like this in DataFrames (but as Ted
said, check the docs for the exact signatures):

smallsales
  .join(times, "time_id")
  .join(channels, "channel_id")
  .groupBy("calendar_month_desc", "channel_desc")
  .agg(sum(col("amount_sold")).as("TotalSales"))

Note that agg takes Column expressions rather than extra column-name
strings, and the grouping columns are carried through to the result
automatically, so they don't need to be repeated.

On 23 February 2016 at 01:50, Ted Yu <yuzhih...@gmail.com> wrote:

> Mich:
> Please refer to the following test suite for examples on various DataFrame
> operations:
>
> sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
>
> On Mon, Feb 22, 2016 at 4:39 PM, Mich Talebzadeh <
> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
>> Thanks Dean.
>>
>> I gather that if I wanted to do the whole thing through FP, with little
>> or no SQL, then in the first instance I would get the data set from
>> Hive, i.e.
>>
>> val rs = HiveContext.sql("""SELECT t.calendar_month_desc, c.channel_desc,
>> SUM(s.amount_sold) AS TotalSales
>> FROM smallsales s, times t, channels c
>> WHERE s.time_id = t.time_id
>> AND   s.channel_id = c.channel_id
>> GROUP BY t.calendar_month_desc, c.channel_desc
>> """)
>>
>> Could I also use DataFrames for the joins above, i.e. can what I am
>> doing in SQL be done entirely in the DataFrame API without SQL? Is that
>> possible, or do I have to use some form of SQL?
>>
>>
>>
>> The rest I can do simply using DFs.
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> On 23/02/2016 00:26, Dean Wampler wrote:
>>
>> Kevin gave you the answer you need, but I'd like to comment on your
>> subject line. SQL is a limited form of FP. Sure, it lacks anonymous
>> functions and has other limitations, but it's declarative, as good FP
>> programs should be, and it offers an important subset of the operators
>> ("combinators") you want.
>>
>> Also, on a practical note, use the DataFrame API whenever you can, rather
>> than dropping down to the RDD API, because the DataFrame API is far more
>> performant. It's a classic case where restricting your options enables more
>> aggressive optimizations behind the scenes. Michael Armbrust's talk at Spark
>> Summit East nicely made this point.
>> http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming
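>>
>> To make the contrast concrete, a rough sketch of the same aggregation
>> both ways, using the names from the example above and assuming
>> TotalSales comes back as a Double. The RDD version hands Spark opaque
>> functions, so Catalyst can't optimize it:
>>
>> // DataFrame: declarative; Catalyst sees the whole plan
>> rs.groupBy("channel_desc").agg(max(col("TotalSales")).as("SALES"))
>>
>> // RDD: same logic hand-rolled; no optimizer, no Tungsten code gen
>> rs.rdd
>>   .map(row => (row.getAs[String]("channel_desc"),
>>                row.getAs[Double]("TotalSales")))
>>   .reduceByKey((a, b) => math.max(a, b))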
>>
>> dean
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>> Typesafe <http://typesafe.com>
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>>
>> On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott <kevin.r.mell...@gmail.com
>> > wrote:
>>
>>> In your example, the *rs* instance should be a DataFrame object. In
>>> other words, the result of *HiveContext.sql* is a DataFrame that you
>>> can manipulate using *filter*, *map*, etc.
>>>
>>>
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
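>>>
>>> For example, a minimal sketch of working on rs directly (assumes
>>> import org.apache.spark.sql.functions.col; the threshold is made up):
>>>
>>> rs.filter(col("TotalSales") > 1000)
>>>   .select("calendar_month_desc", "channel_desc", "TotalSales")
>>>   .show()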
>>>
>>>
>>> On Mon, Feb 22, 2016 at 5:16 PM, Mich Talebzadeh <
>>> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have data stored in Hive tables on which I want to do some simple manipulation.
>>>>
>>>> Currently in Spark I do the following: I get the result set from the
>>>> Hive tables using SQL and register it as a temporary table in Spark.
>>>>
>>>> Ideally I would get the result set into a DF and work on the DF to slice
>>>> and dice the data using functional programming with filter, map, split etc.
>>>>
>>>> I wanted to get some ideas on how to go about it.
>>>>
>>>> thanks
>>>>
>>>> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>
>>>> HiveContext.sql("use oraclehadoop")
>>>> val rs = HiveContext.sql("""SELECT t.calendar_month_desc,
>>>> c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>>> FROM smallsales s, times t, channels c
>>>> WHERE s.time_id = t.time_id
>>>> AND   s.channel_id = c.channel_id
>>>> GROUP BY t.calendar_month_desc, c.channel_desc
>>>> """)
>>>> rs.registerTempTable("tmp")
>>>>
>>>>
>>>> HiveContext.sql("""
>>>> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
>>>> from tmp
>>>> ORDER BY MONTH, CHANNEL
>>>> """).collect.foreach(println)
>>>> HiveContext.sql("""
>>>> SELECT channel_desc AS CHANNEL, MAX(TotalSales)  AS SALES
>>>> FROM tmp
>>>> GROUP BY channel_desc
>>>> order by SALES DESC
>>>> """).collect.foreach(println)
>>>>
>>>>
>>>> --
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn  
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>>
>>
>>
>> --
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>
