Both Sybase and Hive run on the same host? Maybe that is part of the reason. I think with older versions of Spark, such as the one you have, predicate pushdown (for ORC and/or Parquet) does not work, which is another huge performance penalty. Many other optimizations probably do not apply either. In most cases Hive on Tez makes the most sense - it should definitely be a choice besides Spark as an execution engine.
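A minimal sketch of the two knobs above - execution engine and pushdown - using stock Hive session settings (values to be adjusted for your build):

set hive.execution.engine=tez;        -- or mr / spark
set hive.optimize.ppd=true;           -- predicate pushdown in the planner
set hive.optimize.index.filter=true;  -- let the ORC reader skip stripes/row groups via its indexes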
That being said, there are a lot of other tuning possibilities that probably go too far for a mailing list exchange. For example, I saw performance improvements in certain scenarios when using JDK 8, among many other tuning options. You can also think about using Ignite etc. It is a complex ecosystem where you can do a lot ...

> On 30 Dec 2015, at 20:26, Marcin Tustin <mtus...@handybook.com> wrote:
>
> I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much
> faster than MR in pretty much all cases. Also, with hive, you'll want to make
> sure you've performed optimizations like aligning ORC stripe sizes with HDFS
> block sizes, and concatenated your tables (not so much an optimization as a
> must for avoiding the small files problem).
>
>> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Thanks again Jorn.
>>
>> Both Hive and Sybase IQ are running on the same host. Yes, for Sybase IQ I
>> have compression enabled. The FACT table in IQ (sales) has LF (i.e. bitmap)
>> indexes on the time_id column. For the dimension table (times) I have
>> time_id defined as the primary key. Also, Sybase IQ creates FP (fast
>> projection) indexes on every column by default.
>>
>> Anyway, I am trying to download and build TEZ. Do we know which version of
>> TEZ works with Hive 1.2.1, please? 0.8 seems to be in alpha.
>>
>> Thanks,
>>
>> Mich Talebzadeh
>>
>> From: Jörn Franke [mailto:jornfra...@gmail.com]
>> Sent: 30 December 2015 16:29
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on
>> Spark compared to Sybase IQ columnar database
>>
>> Hmm, I think the execution engine TEZ has (currently) the most optimizations
>> for Hive. What about your hardware - is it the same? Do you also have
>> compression on Sybase?
>>
>> Alternatively you need to wait for Hive for interactive analytics (Tez 0.8 +
>> LLAP).
>>
>> On 30 Dec 2015, at 13:47, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Hi Jorn,
>>
>> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not
>> tried it on TEZ. I tried the query on the MR engine and it did not fare any
>> better.
>> I also ran it without the STDDEV function and found that the function did
>> not slow it down.
>>
>> I tried a simple query as follows, built on the sales FACT table (1e9 rows)
>> and the dimension table times (1,826 rows):
>>
>> --
>> -- Get the total amount sold for each calendar month
>> --
>> SELECT t.calendar_month_desc, SUM(s.amount_sold)
>> FROM sales s, times t
>> WHERE s.time_id = t.time_id
>> GROUP BY t.calendar_month_desc;
>>
>> Sybase IQ comes back in around 30 seconds:
>>
>> Started query at Dec 30 2015 08:14:33:399AM
>> (48 rows affected)
>> Finished query at Dec 30 2015 08:15:04:640AM
>>
>> Whereas Hive, with the following settings and running the same query:
>>
>> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>> set hive.optimize.bucketmapjoin=true;
>> set hive.optimize.bucketmapjoin.sortedmerge=true;
>>
>> comes back in:
>>
>> 48 rows selected (1514.687 seconds)
>>
>> I don't know what else can be done. Obviously this is all schema on read, so
>> I am not sure I can change the bucketing on the FACT table based on one
>> query alone!
>>
>> The times dimension table is defined as follows (middle columns trimmed):
>>
>> CREATE TABLE `times`(
>>   `time_id` timestamp,
>>   `day_name` varchar(9),
>>   `day_number_in_week` int,
>>   `day_number_in_month` int,
>>   `calendar_week_number` int,
>>   `fiscal_week_number` int,
>>   `week_ending_day` timestamp,
>>   `week_ending_day_id` bigint,
>>   `calendar_month_number` int,
>>   `fiscal_month_number` int,
>>   `calendar_month_desc` varchar(8),
>>   ...
>>   `days_in_fis_year` bigint,
>>   `end_of_cal_year` timestamp,
>>   `end_of_fis_year` timestamp)
>> CLUSTERED BY (
>>   time_id)
>> INTO 256 BUCKETS
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>> LOCATION
>>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times'
>> TBLPROPERTIES (
>>   'COLUMN_STATS_ACCURATE'='true',
>>   'numFiles'='1',
>>   'numRows'='1826',
>>   'orc.bloom.filter.columns'='TIME_ID',
>>   'orc.bloom.filter.fpp'='0.05',
>>   'orc.compress'='SNAPPY',
>>   'orc.create.index'='true',
>>   'orc.row.index.stride'='10000',
>>   'orc.stripe.size'='268435456',
>>   'rawDataSize'='0',
>>   'totalSize'='11155',
>>   'transient_lastDdlTime'='1451429900');
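For reference, a quick sanity check on the query above is whether Hive is turning the join on the 1,826-row times dimension into a map-side (broadcast) join at all; a minimal sketch using the stock switch and EXPLAIN, nothing here is specific to this cluster:

set hive.auto.convert.join=true;   -- allow small tables to be broadcast as a map join

EXPLAIN
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s
JOIN times t ON s.time_id = t.time_id
GROUP BY t.calendar_month_desc;

If the plan does not show a Map Join Operator on the times side, the tiny dimension is being shuffled together with the billion-row fact table.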
>> From: Jörn Franke [mailto:jornfra...@gmail.com]
>> Sent: 30 December 2015 08:28
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on
>> Spark compared to Sybase IQ columnar database
>>
>> Have you tried it with Hive on TEZ? It contains (currently) more
>> optimizations than Hive on Spark. I assume you use the latest Hive version.
>>
>> Additionally, you may want to think about calculating statistics (depending
>> on your configuration you need to trigger it) - I am not sure if Spark can
>> use them.
>>
>> I am not sure if bloom filters on the columns you mention make sense. You
>> may also want to increase the stride size (depending on your data).
>>
>> Currently you bucket by a lot of fields, which may not make sense. You may
>> also want to sort the data by customer id in the table. You also seem to
>> have a lot of reducers, which you may want to decrease.
>>
>> Have you tried it without "HAVING stddev_samp"? Is the query exactly the
>> same as in Sybase?
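Triggering the statistics Jörn mentions is plain HiveQL; a minimal sketch against the sales table from the original post below (the column list is just the two columns this particular query touches):

ANALYZE TABLE sales COMPUTE STATISTICS;                                   -- table-level stats (row count, sizes)
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount_sold;  -- column stats for the optimizer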
>> On 29 Dec 2015, at 11:53, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Hi,
>>
>> I have a fact table in Hive, imported from Sybase IQ via SQOOP, with 1
>> billion rows:
>>
>> show create table sales;
>>
>> CREATE TABLE `sales`(
>>   `prod_id` bigint,
>>   `cust_id` bigint,
>>   `time_id` timestamp,
>>   `channel_id` bigint,
>>   `promo_id` bigint,
>>   `quantity_sold` decimal(10,0),
>>   `amount_sold` decimal(10,0))
>> CLUSTERED BY (
>>   prod_id,
>>   cust_id,
>>   time_id,
>>   channel_id,
>>   promo_id)
>> INTO 256 BUCKETS
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>> LOCATION
>>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'
>> TBLPROPERTIES (
>>   'COLUMN_STATS_ACCURATE'='true',
>>   'last_modified_by'='hduser',
>>   'last_modified_time'='1451305626',
>>   'numFiles'='11',
>>   'numRows'='1000000000',
>>   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',
>>   'orc.bloom.filter.fpp'='0.05',
>>   'orc.compress'='SNAPPY',
>>   'orc.create.index'='true',
>>   'orc.row.index.stride'='10000',
>>   'orc.stripe.size'='268435456',
>>   'rawDataSize'='296000000000',
>>   'totalSize'='2678882153',
>>   'transient_lastDdlTime'='1451305626');
>>
>> I run the following query, against the sales table only, in Hive:
>>
>> SELECT
>>     rs.Customer_ID
>>   , rs.Number_of_orders
>>   , rs.Total_customer_amount
>>   , rs.Average_order
>>   , rs.Standard_deviation
>> FROM
>> (
>>   SELECT cust_id AS Customer_ID,
>>          COUNT(amount_sold) AS Number_of_orders,
>>          SUM(amount_sold) AS Total_customer_amount,
>>          AVG(amount_sold) AS Average_order,
>>          stddev_samp(amount_sold) AS Standard_deviation
>>   FROM sales
>>   GROUP BY cust_id
>>   HAVING SUM(amount_sold) > 94000
>>      AND AVG(amount_sold) < stddev_samp(amount_sold)
>> ) rs
>> ORDER BY
>> -- Total_customer_amount DESC
>>    3 DESC;
>>
>> Hive comes back in 17 minutes with 5,948 rows:
>>
>> bl -f sales.hql > sales.log
>> Connecting to jdbc:hive2://rhes564:10010/default
>> Connected to: Apache Hive (version 1.2.1)
>> Driver: Hive JDBC (version 1.2.1)
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
>> No rows affected (0.097 seconds)
>> No rows affected (0.001 seconds)
>> No rows affected (0.001 seconds)
>> No rows affected (0.038 seconds)
>> INFO : Warning: Using constant number 3 in order by. If you try to use position alias
>> when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
>> INFO :
>> Query Hive on Spark job[0] stages:
>> INFO : 0
>> INFO : 1
>> INFO : 2
>> INFO :
>> Status: Running (Hive on Spark job[0])
>> INFO : Job Progress Format
>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>> INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11      Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11      Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished  Stage-1_0: 0(+2)/1009    Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished  Stage-1_0: 6(+2)/1009    Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished  Stage-1_0: 30(+2)/1009   Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished  Stage-1_0: 59(+2)/1009   Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished  Stage-1_0: 90(+2)/1009   Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished  Stage-1_0: 124(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished  Stage-1_0: 160(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished  Stage-1_0: 193(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished  Stage-1_0: 228(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished  Stage-1_0: 262(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished  Stage-1_0: 298(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished  Stage-1_0: 338(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished  Stage-1_0: 376(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished  Stage-1_0: 417(+3)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished  Stage-1_0: 460(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished  Stage-1_0: 502(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished  Stage-1_0: 542(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished  Stage-1_0: 584(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished  Stage-1_0: 624(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished  Stage-1_0: 667(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished  Stage-1_0: 709(+3)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished  Stage-1_0: 754(+3)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished  Stage-1_0: 797(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished  Stage-1_0: 844(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished  Stage-1_0: 888(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished  Stage-1_0: 934(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished  Stage-1_0: 981(+2)/1009  Stage-2_0: 0/1
>> INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished  Stage-1_0: 1009/1009 Finished  Stage-2_0: 0(+1)/1
>> INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished  Stage-1_0: 1009/1009 Finished  Stage-2_0: 1/1 Finished
>> INFO : Status: Finished successfully in 1036.00 seconds
>> 5,948 rows selected (1074.817 seconds)
>>
>> So it returns 5,948 rows in 17 minutes. In contrast, IQ returns 5,947 rows
>> in 23 seconds.
>>
>> Sybase IQ is a columnar database, so each column is created as a fast
>> projection (FP) index by default. In addition I have created LF (bitmap)
>> indexes on the dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID,
>> PROMO_ID). This query only touches CUST_ID.
>>
>> My suspicion is that the standard deviation function stddev_samp() could be
>> the bottleneck?
>>
>> Thanks,
>>
>> Mich Talebzadeh
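One way to test the stddev_samp() suspicion without changing anything else is to compute the sample standard deviation from plain SUM/COUNT aggregates, using s^2 = (n*sum(x^2) - (sum(x))^2) / (n*(n-1)); a sketch against the sales table above (the naive formula can lose precision, so it is only meant as a timing comparison):

SELECT cust_id,
       SQRT( (COUNT(amount_sold) * SUM(amount_sold * amount_sold)
               - SUM(amount_sold) * SUM(amount_sold))
             / (COUNT(amount_sold) * (COUNT(amount_sold) - 1)) ) AS manual_stddev
FROM sales
GROUP BY cust_id
HAVING COUNT(amount_sold) > 1;   -- single-row groups would divide by zero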