This is a point that I would like to clarify please.
These are my assumptions:
* Data resides in Hive tables in a Hive database
* Data has to be extracted from these tables. Tables are ORC, so they
benefit from ORC optimizations (storage indexes, 64MB stripes, row groups)
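For concreteness, a hypothetical sketch of the kind of table assumed above. The table and column names are borrowed from the queries later in this thread, but the column types and the stripe-size property are my assumptions, not details anyone posted:

// Hypothetical ORC table; the 64MB stripe size matches the assumption above.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    amount_sold DECIMAL(10,2),
    time_id     INT,
    channel_id  INT
  )
  STORED AS ORC
  TBLPROPERTIES ("orc.stripe.size"="67108864")
""")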
Hi Koert,
My bad. I used a smaller "sales" table in the SQL plan. Kindly see my
new figures.
On 24/02/2016 20:05, Koert Kuipers wrote:
> my assumption, which is apparently incorrect, was that the SQL gets
> translated into a catalyst plan that is executed in spark. the dataframe
> operations (referred to by Mich as the FP results) also get translated
> into a catalyst plan that is executed on the exact same spark platform.
my assumption, which is apparently incorrect, was that the SQL gets
translated into a catalyst plan that is executed in spark. the dataframe
operations (referred to by Mich as the FP results) also get translated into
a catalyst plan that is executed on the exact same spark platform.
i am still missing something. if it is executed in the source database,
which is hive in this case, then it does need hive, no? how can you execute
in hive without needing hive?
On Wed, Feb 24, 2016 at 1:25 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> I never said it needs one.
My apologies. I definitely misunderstood. You are 100% correct.
On Feb 24, 2016 19:25, "Sabarish Sasidharan" <
sabarish.sasidha...@manthan.com> wrote:
> I never said it needs one. All I said is that when calling context.sql()
> the sql is executed in the source database (assuming the datasource is
> Hive or some RDBMS).
I never said it needs one. All I said is that when calling context.sql()
the sql is executed in the source database (assuming the datasource is Hive
or some RDBMS).
Regards
Sab
On 24-Feb-2016 11:49 pm, "Mohannad Ali" wrote:
> That is incorrect. HiveContext does not need a Hive instance to run.
That is incorrect. HiveContext does not need a Hive instance to run.
On Feb 24, 2016 19:15, "Sabarish Sasidharan" <
sabarish.sasidha...@manthan.com> wrote:
> Yes
>
> Regards
> Sab
> On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote:
>
>> are you saying that HiveContext.sql(...) runs on hive, and not o
Yes
Regards
Sab
On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote:
> are you saying that HiveContext.sql(...) runs on hive, and not on spark
> sql?
>
> On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> When using SQL, your full query, including the joins, was executed in
>> Hive (or an RDBMS) and only the results were brought into the Spark
>> cluster.
are you saying that HiveContext.sql(...) runs on hive, and not on spark sql?
On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> When using SQL, your full query, including the joins, was executed in
> Hive (or an RDBMS) and only the results were brought into the Spark
> cluster.
When using SQL, your full query, including the joins, was executed in
Hive (or an RDBMS) and only the results were brought into the Spark
cluster. In the FP case, the data for the 3 tables is first pulled into the
Spark cluster and then the join is executed.
Thus the time difference.
instead of:

var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM sales")

you should be able to do something like:

val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")

it's not obvious to me why the dataframe (aka FP) version would be
significantly slower.
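One way to test that intuition (my suggestion, not something posted in the
thread) is to compare the plans Catalyst produces for the two forms; if
they match, the two versions should run identically. A minimal sketch,
assuming the same HiveContext and "sales" table as above:

// Both forms should compile down to the same Catalyst physical plan.
// explain(true) prints the logical and physical plans for comparison.
val sqlVersion = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM sales")
val fpVersion = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
sqlVersion.explain(true)
fpVersion.explain(true)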
Hi,
First, thanks everyone for the suggestions. Much appreciated.
These were the original queries, written in SQL and run in spark-shell:

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
println("\nStarted at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
Your SQL query will look something like this in DataFrames (but as Ted
said, check the docs to verify the signatures):

import org.apache.spark.sql.functions._  // for sum and col

val result = smallsales
  .join(times, "time_id")
  .join(channels, "channel_id")
  .groupBy("calendar_month_desc", "channel_desc")
  .agg(sum(col("amount_sold")).as("TotalSales"))
Mich:
Please refer to the following test suite for examples of various DataFrame
operations:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
On Mon, Feb 22, 2016 at 4:39 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> Thanks Dean.
>
> I gather if I wanted to get the whole thing through FP with little or no
> use of SQL ...
Thanks Dean.
I gather if I wanted to get the whole thing through FP with little or no
use of SQL, then in the first instance I get the data set from Hive, i.e.:

val rs = HiveContext.sql("""SELECT t.calendar_month_desc,
c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM smallsales s, times t, channels c
WHERE s.time_id = t.time_id
AND s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc""")
however to really enjoy functional programming i assume you also want to
use lambdas in your map and filter, which means you need to convert the
DataFrame to a Dataset, using df.as[SomeCaseClass]. Just be aware that it's
somewhat early days for Dataset.
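To make that concrete, here is a minimal sketch (my illustration, not from
the thread; the case class mirrors the three "sales" columns used earlier,
and the API shown is the Spark 1.6-era Dataset API):

// Hypothetical case class matching the columns selected below.
case class Sale(AMOUNT_SOLD: Double, TIME_ID: Int, CHANNEL_ID: Int)

import HiveContext.implicits._  // encoders needed by .as[...]

// Convert the untyped DataFrame to a typed Dataset, then use real lambdas.
val ds = HiveContext.table("sales")
  .select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
  .as[Sale]
val bigSaleAmounts = ds.filter(s => s.AMOUNT_SOLD > 100.0)
  .map(s => s.AMOUNT_SOLD)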
On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott wrote:
Kevin gave you the answer you need, but I'd like to comment on your subject
line. SQL is a limited form of FP. Sure, there are no anonymous functions
and other limitations, but it's declarative, like good FP programs should
be, and it offers an important subset of the operators ("combinators") you
need.
In your example, the *rs* instance should be a DataFrame object. In other
words, the result of *HiveContext.sql* is a DataFrame that you can
manipulate using *filter*, *map*, etc.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
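As an illustration (mine, not from the thread; the table and column names
are assumptions borrowed from the examples elsewhere in this discussion):

// HiveContext.sql(...) returns a DataFrame ...
val rs = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM sales")
// ... so DataFrame operations chain directly onto the result.
val bigSales = rs.filter(rs("AMOUNT_SOLD") > 100.0)
bigSales.show(10)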
Hi,
I have data stored in Hive tables on which I want to do some simple
manipulation.
Currently in Spark I do the following: I get the result set from the Hive
tables using SQL and register it as a temporary table in Spark.
Ideally I would like to get the result set into a DataFrame and work on the
DataFrame to slice and dice the data.
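For reference, a minimal sketch of the workflow being described (my
reconstruction for Spark 1.x; the query and table names are assumptions
taken from the examples above):

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Run SQL against Hive and register the result as a temporary table ...
val rs = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM sales")
rs.registerTempTable("tmp_sales")
// ... then keep working either in SQL or directly on the DataFrame.
HiveContext.sql("SELECT COUNT(*) FROM tmp_sales").show()
rs.groupBy("CHANNEL_ID").count().show()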