Re: Hive on Spark performance

Mich Talebzadeh Mon, 14 Mar 2016 08:38:09 -0700

Hi Wlodeck,

Let us look at this.


In Oracle I have two tables channels and sales. This code works in Oracle

  1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
  2  from channels c
  3* group by c.channel_id
s...@mydb.mich.LOCAL> /
CHANNEL_ID          R
---------- ----------
         2     516050
         3    1620984
         4     473664
         5          0
         9      18666

I have the same tables In Hive but the same query crashes!

hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
    > from channels c
    > group by c.channel_id
    > ;
NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN
precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple |
precedenceEqualExpressionSingle );])

The solution is to use a temporary table to keep the sum/group by from
sales table as an intermediate stage  (temporary tables are session
specific and they are created and dropped after you finish the session)

hive> create temporary table tmp as select channel_id, count(channel_id) as
total from sales group by channel_id;


Ok the rest is pretty easy

hive> select c.channel_id, c.channel_id * t.total as results
    > from channels c, tmp t
    > where c.channel_id = t.channel_id;

2.0     2800432.0
3.0     8802300.0
4.0     2583552.0
9.0     104013.0

HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 March 2016 at 14:22, ws <wlodek_sie...@yahoo.com> wrote:

> Hive 1.2.1.2.3.4.0-3485
> Spark 1.5.2
> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit
> Production
>
> ###
> SELECT
> f.description,
> f.item_number,
> sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id))
> as df_a
> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND
> f.item_number NOT LIKE 'DEV%'
> group by
> f.description,
> f.item_number
> ###
>
> This query works fine in oracle but not Hive or Spark.
> So the problem is: "sum(f.df_a * (select count(1) from e.mv_A_h_a where
> hb_h_name = r.h_id)) as df_a" field.
>
>
> Thanks,
> Wlodek
> --
>
>
> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Depending on the version of Hive on Spark engine.
>
> As far as I am aware the latest version of Hive that I am using (Hive 2)
> has improvements compared to the previous versions of Hive (0.14,1.2.1) on
> Spark engine.
>
> As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it
> is not the latest Spark but it is pretty good.
>
> What specific concerns do you have in mind?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
> On 13 March 2016 at 23:27, sjayatheertha <sjayatheer...@gmail.com> wrote:
>
> Just curious if you could share your experience on the performance of
> spark in your company? How much data do you process? And what's the latency
> you are getting with spark engine?
>
> Vidya
>
>
>
>
>

Re: Hive on Spark performance

Reply via email to