I agree but Spark 1.3.1 on Hive is the only one I have managed to make it work. Still it is twice as fast as Hive on MapReduce.
Just to clarify my understanding is that the optimiser is provided by Hive and is the same for both executions engines. Is there anything specific that Spark 1.3.1 lacks compared to Spark 1.5.1 when executing the query? Thanks Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility. From: Jörn Franke [mailto:jornfra...@gmail.com] Sent: 31 December 2015 18:44 To: user@hive.apache.org Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database You are using an old version of Spark and it cannot leverage all optimizations of Hive, so I think that your conclusion cannot be as easy as you might think. On 31 Dec 2015, at 19:34, Mich Talebzadeh <m...@peridale.co.uk <mailto:m...@peridale.co.uk> > wrote: Ok guys. I have not succeeded in installing TEZ. Yet so I can try the query on TEZ as well. Just to remind that the query is used is pretty common. Get the total amount sold for each calendar month from sales (I billion rows) and times SELECT t.calendar_month_desc, SUM(s.amount_sold) FROM sales s, times t WHERE s.time_id = t.time_id GROUP BY t.calendar_month_desc; In total 48 rows are returned back Now having thought about It, granted TEZ is going to be faster than MR as it is basically MR with DAG thrown at it. On the other Spark will have both DAG and in-memory calculation. The results are as follow: Optimiser Engine Timing Compression Total Table size Hive MapReduce 4673.035 seconds Snappy totalSize=2678882153 = 2.5GB Hive Spark 1.3.1 1578.817 seconds Snappy Columnar Sybase IQ 30.000 seconds Native 5GB It is pretty obvious that Spark outperforms MapReduce more than twice even taking into account the number of rows on the FACT table and frankly I would not have thought that TEZ is going to beat Spark (to be seen). Having said that Hive storage is twice more efficient but I am not sure what one can do to improve the performance. Table in Hive is stored as ORC table and it has crossed my mind that maybe we should think about storing every column of an ORC table as an index. That may improve the performance further. HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility. From: Marcin Tustin [mailto:mtus...@handybook.com] Sent: 30 December 2015 19:27 To: user@hive.apache.org <mailto:user@hive.apache.org> Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR in pretty much all cases. Also, with hive, you'll make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem). On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk <mailto:m...@peridale.co.uk> > wrote: Thanks again Jorn. Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column. For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates FP (fast projection) indexes on every column by default. Anyway I am trying to download and build TEZ. Do we know which version of TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha Thanks Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.