Hi Unit, *For new tables*
Disable stats autogathering in Hive when creating a new table and populating it SET hive.stats.autogather=false; *Already existing tables* As a work-around you can try this on the already existing tables by manually alter the numRows to -1 ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES ('numRows'='-1'); Example 0: jdbc:hive2://rhes564:10010/default> create table testme as select * from sales_staging limit 1000; 0: jdbc:hive2://rhes564:10010/default> desc formatted testme; +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ | col_name | data_type | comment | +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ | # col_name | data_type | comment | | | NULL | NULL | | prod_id | double | | | cust_id | double | | | time_id | string | | | channel_id | double | | | promo_id | double | | | quantity_sold | double | | | amount_sold | double | | | | NULL | NULL | | # Detailed Table Information | NULL | NULL | | Database: | oraclehadoop | NULL | | Owner: | hduser | NULL | | CreateTime: | Sat Apr 30 00:31:17 BST 2016 | NULL | | LastAccessTime: | UNKNOWN | NULL | | Retention: | 0 | NULL | | Location: | hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme | NULL | | Table Type: | MANAGED_TABLE | NULL | | Table Parameters: | NULL | NULL | | | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} | | | numFiles | 1 | | | numRows | 1000 | | | rawDataSize | 54853 | | | totalSize | 55853 | | | transient_lastDdlTime | 1461972677 | | | NULL | NULL | | # Storage Information | NULL | NULL | | SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL | | InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL | | OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL | | Compressed: | No | NULL | | Num Buckets: | -1 | NULL | | Bucket Columns: | [] | NULL | | Sort Columns: | [] | NULL | | Storage Desc Params: | NULL | NULL | | | serialization.format | 1 | +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ 0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme SET TBLPROPERTIES ('numRows'='-1');* 0: jdbc:hive2://rhes564:10010/default> desc formatted testme; +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ | col_name | data_type | comment | +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ | # col_name | data_type | comment | | | NULL | NULL | | prod_id | double | | | cust_id | double | | | time_id | string | | | channel_id | double | | | promo_id | double | | | quantity_sold | double | | | amount_sold | double | | | | NULL | NULL | | # Detailed Table Information | NULL | NULL | | Database: | oraclehadoop | NULL | | Owner: | hduser | NULL | | CreateTime: | Sat Apr 30 00:31:17 BST 2016 | NULL | | LastAccessTime: | UNKNOWN | NULL | | Retention: | 0 | NULL | | Location: | hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme | NULL | | Table Type: | MANAGED_TABLE | NULL | | Table Parameters: | NULL | NULL | | | last_modified_by | hduser | | | last_modified_time | 1461973002 | | | numFiles | 1 | | | numRows | -1 | | | rawDataSize | 54853 | | | totalSize | 55853 | | | transient_lastDdlTime | 1461973002 | | | NULL | NULL | | # Storage Information | NULL | NULL | | SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL | | InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL | | OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL | | Compressed: | No | NULL | | Num Buckets: | -1 | NULL | | Bucket Columns: | [] | NULL | | Sort Columns: | [] | NULL | | Storage Desc Params: | NULL | NULL | | | serialization.format | 1 | +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ Hopefully that will turn off the autogather feature for existing tables. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 29 April 2016 at 23:32, Udit Mehta <ume...@groupon.com> wrote: > Hi, > > Thanks for the replies. > We have a scenario where we have an ETL job inserting into a table with > thousands of partitions using dynamic partitioning. We have certain SLA's > within which we would like the job to finish and sometimes there are > scenarios where they are missed (extra data or a busy cluster). I > understand that stats are essential for Hive CBO but we are trying to > explore how much overhead do these stats collection add to the job runtime. > A lot of these tables are intermediary tables so having stats for them > might not be entirely necessary. > > I just wanted to figure if there was a easy way to disable the stats and > then compare the performance. > > Mich, can you give more information on how to disable it in the table > struct as I cant find any documentation on it. > > Thanks again. > Udit > > On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <pxi...@apache.org> > wrote: > >> Hi Udit, >> >> Could u be more specific about your problem? Like, what settings you >> have, what query you run and what is the result and what result do you >> expect? >> >> From what you said, my understanding is that, you want to wipe out >> the basic stats for existing tables? And, could u also let us know why you >> would like to get rid of the stats? Stats is crucial for Hive CBO to work >> and we are moving towards the direction to make table/column stats >> collection automatically. It seems that you prefer an opposite direction. >> There is nothing wrong here and we would like to listen to your idea and >> motivation so that we can better design Hive stats collection. Thanks! >> >> Best >> Pengcheng >> >> >> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <ume...@groupon.com> wrote: >> >>> Any insights on this? >>> >>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <ume...@groupon.com> wrote: >>> >>>> Update: Realized this works if we create a fresh table with this config >>>> already disabled but does not work if there is already a table created when >>>> this config was enabled. We now need to figure out how to disable this >>>> config for a table created when this config was true. >>>> >>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <ume...@groupon.com> wrote: >>>> >>>>> Hive version we are using is 1.2.1. >>>>> >>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <ume...@groupon.com> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> We need to disable the Hive autogather stats optimization by >>>>>> disabling "*hive.stats.autogather*" but for some reason, the config >>>>>> change doesnt seem to go through. We modified this config in the >>>>>> hive-site.xml and restarted the Hive metastore. We also made this change >>>>>> explicitly in the job but it doesnt seem to help. >>>>>> >>>>>> >>>>>> >>>>>> *set hive.stats.autogather=false;* >>>>>> Does anyone know the right way to disable this config since we dont >>>>>> want to compute stats in out jobs. >>>>>> >>>>>> Thanks, >>>>>> Udit >>>>>> >>>>> >>>>> >>>> >>> >> >