HPLSQL is part of Hive, but it is not yet fully integrated into Hive itself.  
It is still an external module that handles the control flow while passing the 
SQL statements to Hive via JDBC.  We’d like to integrate it fully with Hive’s 
parser, but we’re not there yet.
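To illustrate, an HPL/SQL script mixes procedural control flow with embedded SQL, roughly along these lines (a hedged sketch; the `sales` table name is just borrowed from this thread):

```sql
-- HPL/SQL sketch: the hplsql tool executes the procedural parts itself
-- and ships the embedded SELECT to Hive over JDBC
DECLARE cnt INT;
SELECT COUNT(*) INTO cnt FROM sales;
IF cnt > 0 THEN
  PRINT 'sales has ' || cnt || ' rows';
ELSE
  PRINT 'sales is empty';
END IF;
```

Hive itself never sees the DECLARE/IF logic; only the embedded SELECT reaches the Hive parser.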

Alan.

> On Feb 25, 2016, at 14:26, Mich Talebzadeh 
> <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> 
> Hi Gopal,
> 
>  
> Is HPLSQL integrated into Hive 2 as part of its SQL? 
> 
> Thanks,
> 
>  
> Mich
> 
> On 25/02/2016 10:38, Mich Talebzadeh wrote:
> 
>> Apologies: the job on Spark using functional programming was run against a 
>> bigger table.
>> 
>> The correct timing for Spark is 42 seconds.
>> 
>>  
>> On 25/02/2016 10:15, Mich Talebzadeh wrote:
>> 
>> Thanks Gopal. I have made the following observations so far:
>> 
>> Using the old MR engine you now get this message, which is fine:
>> 
>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
>> versions. Consider using a different execution engine (i.e. tez, spark) or 
>> using Hive 1.X releases.
>> 
>> use oraclehadoop;
>> --set hive.execution.engine=spark;
>> set hive.execution.engine=mr;
>> --
>> -- Get the total amount sold for each calendar month
>> --
>> 
>> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS 
>> StartTime;
>> 
>> CREATE TEMPORARY TABLE tmp AS
>> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS 
>> TotalSales
>> FROM smallsales s, times t, channels c
>> WHERE s.time_id = t.time_id
>> AND   s.channel_id = c.channel_id
>> GROUP BY t.calendar_month_desc, c.channel_desc
>> ;
>> 
>> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS 
>> FirstQuery;
>> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
>> from tmp
>> ORDER BY MONTH, CHANNEL LIMIT 5
>> ;
>> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS 
>> SecondQuery;
>> SELECT channel_desc AS CHANNEL, MAX(TotalSales)  AS SALES
>> FROM tmp
>> GROUP BY channel_desc
>> order by SALES DESC LIMIT 5
>> ;
>> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;
>> 
>> This batch returns results on MR in 2 min 3 sec.
>> 
>> If I change the engine to Hive 2 on Spark 1.3.1, I get the results back in 1 min 9 sec.
>> 
>>  
>> If I run the same job in the Spark 1.5.2 shell against the same tables, 
>> using functional programming and a HiveContext for table access:
>> 
>> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> println("\nStarted at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>> HiveContext.sql("use oraclehadoop")
>> val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
>> val c = HiveContext.table("channels").select("CHANNEL_ID", "CHANNEL_DESC")
>> val t = HiveContext.table("times").select("TIME_ID", "CALENDAR_MONTH_DESC")
>> println("\ncreating data set at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>> val rs = s.join(t, "time_id").join(c, "channel_id").groupBy("calendar_month_desc", "channel_desc").agg(sum("amount_sold").as("TotalSales"))
>> println("\nfirst query at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>> rs.orderBy("calendar_month_desc", "channel_desc").take(5).foreach(println)
>> println("\nsecond query at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>> rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES")).sort(desc("SALES")).take(5).foreach(println)
>> println("\nFinished at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>> 
>> I get the job done in under 8 min. OK, this is not a benchmark for Spark, 
>> but it shows that Hive 2 has improved significantly IMO. I also had Hive on 
>> Spark 1.3.1 crashing on certain large tables (I had to revert to MR), but 
>> there are no such issues now.
>> 
>> HTH
>> 
>> On 25/02/2016 09:13, Gopal Vijayaraghavan wrote:
>> 
>>> Correct, hence the question, as I have done some preliminary tests on Hive 2.
>>> I want to share insights with other people who have performed similar tests.
>> 
>> If you have feedback on Hive-2.0, I'm all ears.
>> 
>> I'm building up 2.1 features & fixes, so now would be a good time to bring
>> stuff up.
>> 
>> Speed mostly depends on whether you're using Hive-2.0 with LLAP or not -
>> if you're using the old engines, the plans still get much better (even for
>> MR).
>> 
>> Tez does get some stuff out of it, like the new shuffle join vertex
>> manager (hive.optimize.dynamic.partition.hashjoin).
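>> For example, this could be tried with settings along these lines (a sketch 
>> using only the names mentioned in this thread; defaults vary by release):
>> 
>> ```sql
>> set hive.execution.engine=tez;
>> set hive.optimize.dynamic.partition.hashjoin=true;
>> ```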
>> 
LLAP will still win out on <10s queries, because it takes roughly 10
mins for all the auto-generated vectorized classes to get JIT'd into tight
SIMD loops.
>> 
>> For something like TPC-H Q1, you can slowly see it turning all the null
>> checks into UncommonTrapBlob as the JIT slowly learns about the data &
>> finds .noNulls is always true.
>> 
>> Cheers,
>> Gopal
>> 
>> 
>> 
>>  
>>  
>> -- 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Cloud Technology 
>> Partners Ltd, its subsidiaries or their employees, unless expressly so 
>> stated. It is the responsibility of the recipient to ensure that this email 
>> is virus free, therefore neither Cloud Technology partners Ltd, its 
>> subsidiaries nor their employees accept any responsibility.
>> 
>> 
>  
>  
> 
> 
