Re: Using functional programming rather than SQL

2016-02-24 Thread Mich Talebzadeh
This is a point that I would like to clarify, please. These are my assumptions: * Data resides in Hive tables in a Hive database. * Data has to be extracted from these tables. Tables are ORC, so they have ORC optimizations (storage indexes, file stride (64MB chunks of data), rowset …

Re: Using functional programming rather than SQL

2016-02-24 Thread Mich Talebzadeh
Hi Koert, My bad. I used a smaller-sized "sales" table in the SQL plan. Kindly see my new figures. On 24/02/2016 20:05, Koert Kuipers wrote: > my assumption, which is apparently incorrect, was that the SQL gets > translated into a catalyst plan that is executed in spark. the dataframe > opera…

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
my assumption, which is apparently incorrect, was that the SQL gets translated into a catalyst plan that is executed in spark. the dataframe operations (referred to by Mich as the FP results) also get translated into a catalyst plan that is executed on the exact same spark platform. so unless the S…

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
i am still missing something. if it is executed in the source database, which is hive in this case, then it does need hive, no? how can you execute in hive without needing hive? On Wed, Feb 24, 2016 at 1:25 PM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > I never said it needs …

Re: Using functional programming rather than SQL

2016-02-24 Thread Mohannad Ali
My apologies, I definitely misunderstood. You are 100% correct. On Feb 24, 2016 19:25, "Sabarish Sasidharan" < sabarish.sasidha...@manthan.com> wrote: > I never said it needs one. All I said is that when calling context.sql() > the sql is executed in the source database (assuming the datasource is Hive…

Re: Using functional programming rather than SQL

2016-02-24 Thread Sabarish Sasidharan
I never said it needs one. All I said is that when calling context.sql() the SQL is executed in the source database (assuming the datasource is Hive or some RDBMS). Regards Sab On 24-Feb-2016 11:49 pm, "Mohannad Ali" wrote: > That is incorrect. HiveContext does not need a Hive instance to …

Re: Using functional programming rather than SQL

2016-02-24 Thread Mohannad Ali
That is incorrect. HiveContext does not need a Hive instance to run. On Feb 24, 2016 19:15, "Sabarish Sasidharan" < sabarish.sasidha...@manthan.com> wrote: > Yes > > Regards > Sab > On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote: > >> are you saying that HiveContext.sql(...) runs on hive, and not o…

Re: Using functional programming rather than SQL

2016-02-24 Thread Sabarish Sasidharan
Yes Regards Sab On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote: > are you saying that HiveContext.sql(...) runs on hive, and not on spark > sql? > > On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan < > sabarish.sasidha...@manthan.com> wrote: > >> When using SQL your full query, including the …

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
are you saying that HiveContext.sql(...) runs on hive, and not on spark sql? On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > When using SQL your full query, including the joins, were executed in > Hive(or RDBMS) and only the results were brought in…

Re: Using functional programming rather than SQL

2016-02-23 Thread Sabarish Sasidharan
When using SQL your full query, including the joins, was executed in Hive (or the RDBMS) and only the results were brought into the Spark cluster. In the FP case, the data for the 3 tables is first pulled into the Spark cluster and then the join is executed. Thus the time difference. It's not immedia…

Re: Using functional programming rather than SQL

2016-02-23 Thread Koert Kuipers
instead of: var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM sales") you should be able to do something like: val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID") it's not obvious to me why the dataframe (aka FP) version would be significantly slow…

Re: Using functional programming rather than SQL

2016-02-23 Thread Mich Talebzadeh
Hi, First, thanks everyone for their suggestions. Much appreciated. These were the original queries written in SQL and run against spark-shell: val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) println("\nStarted at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/M…

Re: Using functional programming rather than SQL

2016-02-22 Thread Michał Zieliński
Your SQL query will look something like this in DataFrames (but as Ted said, check the docs to see the signatures): smallsales .join(times, "time_id") .join(channels, "channel_id") .groupBy("calendar_month_desc", "channel_desc") .agg(sum(col("amount_sold")).as("TotalSales"), "calendar_month_desc", "ch…
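The semantics of that join/groupBy/agg pipeline can be mimicked with plain Scala collections, which makes the FP shape easy to see without a cluster. This is only an analog sketch: the case classes and sample rows below are made up, and in Spark the same operations run distributed over DataFrames.

```scala
// Pure-Scala collections analog of the DataFrame pipeline above.
// Column/table names follow the thread; the sample rows are hypothetical.
case class Sale(timeId: Int, channelId: Int, amountSold: Double)
case class Time(timeId: Int, calendarMonthDesc: String)
case class Channel(channelId: Int, channelDesc: String)

val smallsales = List(Sale(1, 10, 100.0), Sale(1, 10, 50.0), Sale(2, 20, 75.0))
val times      = List(Time(1, "2016-01"), Time(2, "2016-02"))
val channels   = List(Channel(10, "Direct"), Channel(20, "Internet"))

// join on time_id, then channel_id (nested-loop join over small lists)
val joined = for {
  s <- smallsales
  t <- times    if t.timeId == s.timeId
  c <- channels if c.channelId == s.channelId
} yield (t.calendarMonthDesc, c.channelDesc, s.amountSold)

// groupBy("calendar_month_desc", "channel_desc") + agg(sum("amount_sold"))
val totalSales: Map[(String, String), Double] =
  joined.groupBy { case (month, channel, _) => (month, channel) }
        .map { case (key, rows) => key -> rows.map(_._3).sum }

// e.g. totalSales(("2016-01", "Direct")) sums the two matching sales
```

The difference Sabarish points out still applies: in Spark the SQL form can push the joins down to Hive, while this collection-style version operates on data already pulled into memory.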

Re: Using functional programming rather than SQL

2016-02-22 Thread Ted Yu
Mich: Please refer to the following test suite for examples on various DataFrame operations: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala On Mon, Feb 22, 2016 at 4:39 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Thanks Dean. > > I gather if I …

Re: Using functional programming rather than SQL

2016-02-22 Thread Mich Talebzadeh
Thanks Dean. I gather if I wanted to get the whole thing through FP with little or no use of SQL, then in the first instance, as I get the data set from Hive (i.e., val rs = HiveContext.sql("""SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales FROM smallsales s, ti…

Re: Using functional programming rather than SQL

2016-02-22 Thread Koert Kuipers
however, to really enjoy functional programming i assume you also want to use lambdas in your map and filter, which means you need to convert the DataFrame to a Dataset, using df.as[SomeCaseClass]. Just be aware that it's somewhat early days for Dataset. On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott wrote: …
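The typed style that df.as[SomeCaseClass] enables can be sketched with ordinary Scala collections standing in for a Dataset. The Sale class and the numbers here are invented for illustration; the point is that fields are accessed through the compiler-checked case class rather than column-name strings.

```scala
// Dataset-style typed FP, sketched on a plain List. In Spark you would get
// this style on a DataFrame via df.as[Sale] (with an Encoder in scope).
case class Sale(amountSold: Double, timeId: Int, channelId: Int)

val sales = List(Sale(100.0, 1, 10), Sale(5.0, 1, 20), Sale(75.0, 2, 10))

// filter and map with ordinary lambdas over the case class
val doubledBigSales = sales
  .filter(s => s.amountSold >= 50.0)  // keep the larger sales
  .map(s => s.amountSold * 2)         // arbitrary per-row transformation

// doubledBigSales == List(200.0, 150.0)
```

A typo in a field name here fails at compile time, whereas a typo in a column-name string on a DataFrame only fails at runtime.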

Re: Using functional programming rather than SQL

2016-02-22 Thread Dean Wampler
Kevin gave you the answer you need, but I'd like to comment on your subject line. SQL is a limited form of FP. Sure, there are no anonymous functions and other limitations, but it's declarative, like good FP programs should be, and it offers an important subset of the operators ("combinators") you…

Re: Using functional programming rather than SQL

2016-02-22 Thread Kevin Mellott
In your example, the *rs* instance should be a DataFrame object. In other words, the result of *HiveContext.sql* is a DataFrame that you can manipulate using *filter*, *map*, etc. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext On Mon, Feb 22, 2016 at …
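Post-processing a SQL result functionally, as Kevin describes, can be sketched without Spark by representing rows as generic records, much as a DataFrame hands back untyped Rows. The row contents below are hypothetical; a Map[String, Any] stands in for Spark's Row.

```scala
// Rows from a SQL result as generic records (analog of DataFrame Rows).
val rs: List[Map[String, Any]] = List(
  Map("channel_desc" -> "Direct",   "total_sales" -> 150.0),
  Map("channel_desc" -> "Internet", "total_sales" -> 75.0)
)

// filter + map, the same shape you'd chain onto the DataFrame itself;
// note the casts that untyped row access forces on you
val bigChannels = rs
  .filter(r => r("total_sales").asInstanceOf[Double] > 100.0)
  .map(r => r("channel_desc").asInstanceOf[String])

// bigChannels == List("Direct")
```

Those asInstanceOf casts are exactly what Koert's suggestion of converting to a typed Dataset avoids.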

Using functional programming rather than SQL

2016-02-22 Thread Mich Talebzadeh
Hi, I have data stored in Hive tables on which I want to do some simple manipulation. Currently in Spark I do the following: I get the result set using SQL against the Hive tables and register it as a temporary table in Spark. Now ideally I can get the result set into a DF and work on the DF to slice a…