I'm afraid I use the HDP distribution so I haven't yet had to compile
anything. (Incidentally, this isn't a recommendation of HDP over anything
else).
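
That said, since the failure is confined to the tez-ui module's npm step, one
thing that might be worth trying (untested; this is plain Maven, nothing
TEZ-specific) is excluding that module from the reactor:

mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true -pl '!tez-ui'

(-pl with a leading '!' excludes a module; it needs Maven 3.2.1 or later, so
your 3.2.5 should be fine.)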

On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Thanks Marcin
>
>
>
> Trying to build TEZ 0.7 in
>
>
>
> /usr/lib/apache-tez-0.7.0-src
>
>
>
> using
>
>
>
> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>
>
>
> with mvn version 3.2.5 (as opposed to 3.3), as I read that it can be built
> OK with 3.2.5; I had hit the same error as below with 3.3
>
>
>
> mvn --version
>
> Apache Maven *3.2.5* (12a6b3acb947671f09b81f49094c53f426d8cea1;
> 2014-12-14T17:29:23+00:00)
>
> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>
> Java version: 1.7.0_25, vendor: Oracle Corporation
>
> Java home: /usr/java/jdk1.7.0_25/jre
>
>
>
> *I get this error*
>
>
>
> [INFO] tez-ui ............................................. FAILURE [  0.411 s]
>
> [DEBUG] -- end configuration --
>
> [INFO] Running 'npm install --color=false' in
> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>
> [INFO]
> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh:
> line 3: 23781 Aborted                 "$@"
>
>
>
>
>
> [ERROR] Failed to execute goal
> com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
> project tez-ui: Failed to run task: 'npm install --color=false' failed.
> (error code 134) -> [Help 1]
>
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
> project tez-ui: Failed to run task
>
>
>
>
>
> Any ideas? There is little information available on the net.
>
>
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books *"A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> Co-author of *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only; if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free; therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 30 December 2015 19:27
>
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
> I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is
> much faster than MR in pretty much all cases. Also, with Hive, you'll want
> to make sure you've performed optimizations like aligning ORC stripe sizes
> with HDFS block sizes, and concatenating your tables (not so much an
> optimization as a must for avoiding the small-files problem).
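>
> For reference, roughly what I mean (a sketch only; the stripe size and the
> use of your sales table here are illustrative):
>
> -- Align the ORC stripe size with a 256MB HDFS block size.
> -- Note: this only affects files written after the change.
> ALTER TABLE sales SET TBLPROPERTIES ('orc.stripe.size'='268435456');
>
> -- Merge small ORC files into larger ones (ORC tables only).
> ALTER TABLE sales CONCATENATE;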
>
>
>
> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Thanks again Jorn.
>
>
>
>
>
> Both Hive and Sybase IQ are running on the same host. Yes, for Sybase IQ I
> have compression enabled. The FACT table in IQ (sales) has an LF (read:
> bitmap) index on the time_id column. For the dimension table (times) I have
> time_id defined as the primary key. Sybase IQ also creates FP (fast
> projection) indexes on every column by default.
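>
> For reference, the IQ-side definitions look roughly like this (a sketch;
> the index name is illustrative):
>
> -- Sybase IQ: LF (bitmap) index on the fact table join column
> CREATE LF INDEX sales_time_lf ON sales (time_id);
>
> -- Primary key on the dimension table
> ALTER TABLE times ADD PRIMARY KEY (time_id);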
>
>
>
> Anyway, I am trying to download and build TEZ. Do we know which version of
> TEZ works with Hive 1.2.1, please? 0.8 seems to be in alpha.
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
>
>
>
> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
> *Sent:* 30 December 2015 16:29
>
>
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
>
> Hmm, I think the TEZ execution engine currently has the most optimizations
> for Hive. What about your hardware - is it the same? Do you also have
> compression enabled on Sybase?
>
> Alternatively, you could wait for Hive's interactive analytics (TEZ 0.8 +
> LLAP).
>
>
> On 30 Dec 2015, at 13:47, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi Jorn,
>
>
>
> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not
> tried it on TEZ. I tried the query on the MR engine and it did not fare
> better. I also ran it without the STDDEV function and found that the
> function did not slow it down.
>
>
>
> I tried a simple query as follows, built on the sales FACT table (1e9
> rows) and the dimension table times (1,826 rows):
>
>
>
> --
> -- Get the total amount sold for each calendar month
> --
> SELECT t.calendar_month_desc, SUM(s.amount_sold)
> FROM sales s, times t WHERE s.time_id = t.time_id
> GROUP BY t.calendar_month_desc;
>
>
>
> Now Sybase IQ comes back in around 30 seconds.
>
>
>
> Started query at Dec 30 2015 08:14:33:399AM
>
> (48 rows affected)
>
> Finished query at Dec 30 2015 08:15:04:640AM
>
>
>
> Whereas Hive, with the following settings and running the same query,
>
>
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
>
>
>
> comes back in:
>
>
>
> 48 rows selected (1514.687 seconds)
>
>
>
> I don’t know what else can be done. Obviously this is all schema-on-read,
> so I am not sure I can change the bucketing on the FACT table based on one
> query alone!
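>
> If I were to re-bucket for this one join, it would be something like the
> following (hypothetical DDL, untested; bucketing and sorting on the join
> key so a sort-merge-bucket join can kick in):
>
> CREATE TABLE sales_by_time (
>   prod_id bigint, cust_id bigint, time_id timestamp,
>   channel_id bigint, promo_id bigint,
>   quantity_sold decimal(10,0), amount_sold decimal(10,0))
> CLUSTERED BY (time_id) SORTED BY (time_id) INTO 256 BUCKETS
> STORED AS ORC;
>
> set hive.enforce.bucketing=true;  -- make the INSERT honour the buckets (pre-Hive 2.0)
> set hive.enforce.sorting=true;
> INSERT OVERWRITE TABLE sales_by_time SELECT * FROM sales;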
>
>
>
>
>
>
>
> +--------------------------------------------------------------------+--+
> |                           createtab_stmt                           |
> +--------------------------------------------------------------------+--+
> | CREATE TABLE `times`(                                              |
> |   `time_id` timestamp,                                             |
> |   `day_name` varchar(9),                                           |
> |   `day_number_in_week` int,                                        |
> |   `day_number_in_month` int,                                       |
> |   `calendar_week_number` int,                                      |
> |   `fiscal_week_number` int,                                        |
> |   `week_ending_day` timestamp,                                     |
> |   `week_ending_day_id` bigint,                                     |
> |   `calendar_month_number` int,                                     |
> |   `fiscal_month_number` int,                                       |
> |   `calendar_month_desc` varchar(8),                                |
> ----------
> |   `days_in_fis_year` bigint,                                       |
> |   `end_of_cal_year` timestamp,                                     |
> |   `end_of_fis_year` timestamp)                                     |
> | CLUSTERED BY (                                                     |
> |   time_id)                                                         |
> | INTO 256 BUCKETS                                                   |
> | ROW FORMAT SERDE                                                   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |
> | STORED AS INPUTFORMAT                                              |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |
> | OUTPUTFORMAT                                                       |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |
> | LOCATION                                                           |
> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times'  |
> | TBLPROPERTIES (                                                    |
> |   'COLUMN_STATS_ACCURATE'='true',                                  |
> |   'numFiles'='1',                                                  |
> |   'numRows'='1826',                                                |
> |   'orc.bloom.filter.columns'='TIME_ID',                            |
> |   'orc.bloom.filter.fpp'='0.05',                                   |
> |   'orc.compress'='SNAPPY',                                         |
> |   'orc.create.index'='true',                                       |
> |   'orc.row.index.stride'='10000',                                  |
> |   'orc.stripe.size'='268435456',                                   |
> |   'rawDataSize'='0',                                               |
> |   'totalSize'='11155',                                             |
> |   'transient_lastDdlTime'='1451429900')                            |
> +--------------------------------------------------------------------+--+
>
>
>
> ;
>
>
>
>
>
>
>
>
> *From:* Jörn Franke [mailto:jornfra...@gmail.com <jornfra...@gmail.com>]
> *Sent:* 30 December 2015 08:28
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
> Have you tried it with Hive on TEZ? It currently contains more
> optimizations than Hive on Spark.
>
> I assume you use the latest Hive version.
>
> Additionally, you may want to think about calculating statistics
> (depending on your configuration, you may need to trigger this yourself) -
> I am not sure if Spark can use them.
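>
> For example, statistics are gathered with something like:
>
> ANALYZE TABLE sales COMPUTE STATISTICS;
> ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;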
>
> I am not sure if bloom filters on the columns you mention make sense. You
> may also want to increase stride size (depending on your data).
>
> Currently you bucket by a lot of fields, which may not make sense. You
> may also want to sort the data by cust_id in the table.
>
> You also seem to have a lot of reducers, which you may want to decrease.
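>
> For example (illustrative values only):
>
> -- fewer, larger reducers: raise the data volume handled per reducer
> set hive.exec.reducers.bytes.per.reducer=268435456;
> -- or cap the reducer count outright
> set hive.exec.reducers.max=100;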
>
>
>
> Have you tried it without the "HAVING ... stddev_samp" condition? Is the
> query exactly the same as in Sybase?
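>
> That is, something like:
>
> SELECT cust_id, COUNT(amount_sold), SUM(amount_sold), AVG(amount_sold)
> FROM sales
> GROUP BY cust_id
> HAVING SUM(amount_sold) > 94000;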
>
>
> On 29 Dec 2015, at 11:53, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi,
>
>
>
> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1
> billion rows as follows:
>
>
>
> show create table sales;
>
>
> +--------------------------------------------------------------------+--+
> |                           createtab_stmt                           |
> +--------------------------------------------------------------------+--+
> | CREATE TABLE `sales`(                                              |
> |   `prod_id` bigint,                                                |
> |   `cust_id` bigint,                                                |
> |   `time_id` timestamp,                                             |
> |   `channel_id` bigint,                                             |
> |   `promo_id` bigint,                                               |
> |   `quantity_sold` decimal(10,0),                                   |
> |   `amount_sold` decimal(10,0))                                     |
> | CLUSTERED BY (                                                     |
> |   prod_id,                                                         |
> |   cust_id,                                                         |
> |   time_id,                                                         |
> |   channel_id,                                                      |
> |   promo_id)                                                        |
> | INTO 256 BUCKETS                                                   |
> | ROW FORMAT SERDE                                                   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |
> | STORED AS INPUTFORMAT                                              |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |
> | OUTPUTFORMAT                                                       |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |
> | LOCATION                                                           |
> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'  |
> | TBLPROPERTIES (                                                    |
> |   'COLUMN_STATS_ACCURATE'='true',                                  |
> |   'last_modified_by'='hduser',                                     |
> |   'last_modified_time'='1451305626',                               |
> |   'numFiles'='11',                                                 |
> |   'numRows'='1000000000',                                          |
> |   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
> |   'orc.bloom.filter.fpp'='0.05',                                   |
> |   'orc.compress'='SNAPPY',                                         |
> |   'orc.create.index'='true',                                       |
> |   'orc.row.index.stride'='10000',                                  |
> |   'orc.stripe.size'='268435456',                                   |
> |   'rawDataSize'='296000000000',                                    |
> |   'totalSize'='2678882153',                                        |
> |   'transient_lastDdlTime'='1451305626')                            |
> +--------------------------------------------------------------------+--+
>
>
>
> I run the following query, against the sales table only, in Hive:
>
>
>
> SELECT
>           rs.Customer_ID
>         , rs.Number_of_orders
>         , rs.Total_customer_amount
>         , rs.Average_order
>         , rs.Standard_deviation
> FROM
> (
>         SELECT cust_id AS Customer_ID,
>         COUNT(amount_sold) AS Number_of_orders,
>         SUM(amount_sold) AS Total_customer_amount,
>         AVG(amount_sold) AS Average_order,
>         stddev_samp(amount_sold) AS Standard_deviation
>         FROM sales
>         GROUP BY cust_id
>         HAVING SUM(amount_sold) > 94000
>         AND AVG(amount_sold) < stddev_samp(amount_sold)
> ) rs
> ORDER BY
>           -- Total_customer_amount DESC
>           3 DESC
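>
> (Note: the positional "3 DESC" is only honoured when
> hive.groupby.orderby.position.alias is enabled, i.e.
>
> set hive.groupby.orderby.position.alias=true;
>
> otherwise Hive ignores it, as the warning in the log below shows.)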
>
>
>
> Hive comes back in 17 minutes with 5,948 rows
>
>
>
> bl -f sales.hql > sales.log
>
> Connecting to jdbc:hive2://rhes564:10010/default
>
> Connected to: Apache Hive (version 1.2.1)
>
> Driver: Hive JDBC (version 1.2.1)
>
> Transaction isolation: TRANSACTION_REPEATABLE_READ
>
> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
>
> No rows affected (0.097 seconds)
>
> No rows affected (0.001 seconds)
>
> No rows affected (0.001 seconds)
>
> No rows affected (0.038 seconds)
>
> INFO  : Warning: Using constant number 3 in order by. If you try to use
> position alias when hive.groupby.orderby.position.alias is false, the
> position alias will be ignored.
>
> INFO  :
>
> Query Hive on Spark job[0] stages:
>
> INFO  : 0
>
> INFO  : 1
>
> INFO  : 2
>
> INFO  :
>
> Status: Running (Hive on Spark job[0])
>
> INFO  : Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> INFO  : 2015-12-29 09:33:25,815 Stage-0_0: 0/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:28,829 Stage-0_0: 0/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>
> INFO  : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished  Stage-1_0: 0(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished  Stage-1_0: 6(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished  Stage-1_0: 30(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished  Stage-1_0: 59(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished  Stage-1_0: 90(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished  Stage-1_0: 124(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished  Stage-1_0: 160(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished  Stage-1_0: 193(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished  Stage-1_0: 228(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished  Stage-1_0: 262(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished  Stage-1_0: 298(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished  Stage-1_0: 338(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished  Stage-1_0: 376(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished  Stage-1_0: 417(+3)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished  Stage-1_0: 460(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished  Stage-1_0: 502(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished  Stage-1_0: 542(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished  Stage-1_0: 584(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished  Stage-1_0: 624(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished  Stage-1_0: 667(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished  Stage-1_0: 709(+3)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished  Stage-1_0: 754(+3)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished  Stage-1_0: 797(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished  Stage-1_0: 844(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished  Stage-1_0: 888(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished  Stage-1_0: 934(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished  Stage-1_0: 981(+2)/1009  Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished  Stage-1_0: 1009/1009 Finished  Stage-2_0: 0(+1)/1
> INFO  : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished  Stage-1_0: 1009/1009 Finished  Stage-2_0: 1/1 Finished
>
> INFO  : Status: Finished successfully in 1036.00 seconds
>
> *5,948 rows selected (1074.817 seconds)*
>
>
>
> So Hive returns 5,948 rows in 17 minutes. In contrast, IQ returns 5,947
> rows in 23 seconds.
>
>
>
> Sybase IQ is a columnar database, so every column gets a fast projection
> (FP) index by default. In addition I have created LF (bitmap) indexes on
> the dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID),
> though this query only touches CUST_ID.
>
>
>
> My suspicion is that the standard deviation function stddev_samp() could
> be the bottleneck.
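>
> One way to check (a sketch) would be to time the aggregation with the
> function on its own:
>
> SELECT cust_id, stddev_samp(amount_sold) FROM sales GROUP BY cust_id;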
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
>
>
>
>
>
>
>
>

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
led by Fidelity
