thanks Mich. I will test this out and get back to you! On Fri, Apr 29, 2016 at 4:42 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> apologies should read "Udit" > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 30 April 2016 at 00:35, Mich Talebzadeh <mich.talebza...@gmail.com> > wrote: > >> Hi Unit, >> >> *For new tables* >> >> Disable stats autogathering in Hive when creating a new table and >> populating it >> >> SET hive.stats.autogather=false; >> >> *Already existing tables* >> >> As a work-around you can try this on the already existing tables by manually >> alter the numRows to -1 >> >> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES >> ('numRows'='-1'); >> >> Example >> >> 0: jdbc:hive2://rhes564:10010/default> create table testme as select * >> from sales_staging limit 1000; >> >> 0: jdbc:hive2://rhes564:10010/default> desc formatted testme; >> >> >> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ >> | col_name | >> data_type | comment | >> >> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ >> | # col_name | >> data_type | >> comment | >> | | >> NULL | >> NULL | >> | prod_id | >> double >> | | >> | cust_id | >> double >> | | >> | time_id | >> string >> | | >> | channel_id | >> double >> | | >> | promo_id | >> double >> | | >> | quantity_sold | >> double >> | | >> | amount_sold | >> double >> | | >> | | >> NULL | >> NULL | >> | # Detailed Table Information | >> NULL | >> NULL | >> | Database: | >> oraclehadoop | >> NULL | >> | Owner: | >> hduser | >> NULL | >> | CreateTime: | Sat Apr 30 00:31:17 BST >> 2016 | NULL | >> | LastAccessTime: | >> UNKNOWN | >> NULL | >> | Retention: | >> 0 | >> NULL | >> | Location: | >> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme | >> NULL | >> | Table Type: | >> MANAGED_TABLE | >> NULL | >> | Table Parameters: | >> NULL | >> NULL | >> | | >> COLUMN_STATS_ACCURATE | >> {\"BASIC_STATS\":\"true\"} | >> | | >> numFiles | >> 1 | >> | | >> numRows | >> 1000 | >> | | >> rawDataSize | >> 54853 | >> | | >> totalSize | >> 55853 | >> | | >> transient_lastDdlTime | >> 1461972677 | >> | | >> NULL | >> NULL | >> | # Storage Information | >> NULL | >> NULL | >> | SerDe Library: | >> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | >> NULL | >> | InputFormat: | >> org.apache.hadoop.mapred.TextInputFormat | >> NULL | >> | OutputFormat: | >> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | >> NULL | >> | Compressed: | >> No | >> NULL | >> | Num Buckets: | >> -1 | >> NULL | >> | Bucket Columns: | >> [] | >> NULL | >> | Sort Columns: | >> [] | >> NULL | >> | Storage Desc Params: | >> NULL | >> NULL | >> | | >> serialization.format | >> 1 | >> >> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ >> >> 0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme SET >> TBLPROPERTIES ('numRows'='-1');* >> >> 0: jdbc:hive2://rhes564:10010/default> desc formatted testme; >> >> >> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ >> | col_name | >> data_type | comment | >> >> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ >> | # col_name | >> data_type | >> comment | >> | | >> NULL | >> NULL | >> | prod_id | >> double >> | | >> | cust_id | >> double >> | | >> | time_id | >> string >> | | >> | channel_id | >> double >> | | >> | promo_id | >> double >> | | >> | quantity_sold | >> double >> | | >> | amount_sold | >> double >> | | >> | | >> NULL | >> NULL | >> | # Detailed Table Information | >> NULL | >> NULL | >> | Database: | >> oraclehadoop | >> NULL | >> | Owner: | >> hduser | >> NULL | >> | CreateTime: | Sat Apr 30 00:31:17 BST >> 2016 | NULL | >> | LastAccessTime: | >> UNKNOWN | >> NULL | >> | Retention: | >> 0 | >> NULL | >> | Location: | >> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme | >> NULL | >> | Table Type: | >> MANAGED_TABLE | >> NULL | >> | Table Parameters: | >> NULL | >> NULL | >> | | >> last_modified_by | >> hduser | >> | | >> last_modified_time | >> 1461973002 | >> | | >> numFiles | >> 1 | >> | | >> numRows | >> -1 | >> | | >> rawDataSize | >> 54853 | >> | | >> totalSize | >> 55853 | >> | | >> transient_lastDdlTime | >> 1461973002 | >> | | >> NULL | >> NULL | >> | # Storage Information | >> NULL | >> NULL | >> | SerDe Library: | >> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | >> NULL | >> | InputFormat: | >> org.apache.hadoop.mapred.TextInputFormat | >> NULL | >> | OutputFormat: | >> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | >> NULL | >> | Compressed: | >> No | >> NULL | >> | Num Buckets: | >> -1 | >> NULL | >> | Bucket Columns: | >> [] | >> NULL | >> | Sort Columns: | >> [] | >> NULL | >> | Storage Desc Params: | >> NULL | >> NULL | >> | | >> serialization.format | >> 1 | >> >> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ >> >> Hopefully that will turn off the autogather feature for existing tables. >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 29 April 2016 at 23:32, Udit Mehta <ume...@groupon.com> wrote: >> >>> Hi, >>> >>> Thanks for the replies. >>> We have a scenario where we have an ETL job inserting into a table with >>> thousands of partitions using dynamic partitioning. We have certain SLA's >>> within which we would like the job to finish and sometimes there are >>> scenarios where they are missed (extra data or a busy cluster). I >>> understand that stats are essential for Hive CBO but we are trying to >>> explore how much overhead do these stats collection add to the job runtime. >>> A lot of these tables are intermediary tables so having stats for them >>> might not be entirely necessary. >>> >>> I just wanted to figure if there was a easy way to disable the stats and >>> then compare the performance. >>> >>> Mich, can you give more information on how to disable it in the table >>> struct as I cant find any documentation on it. >>> >>> Thanks again. >>> Udit >>> >>> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <pxi...@apache.org> >>> wrote: >>> >>>> Hi Udit, >>>> >>>> Could u be more specific about your problem? Like, what settings >>>> you have, what query you run and what is the result and what result do you >>>> expect? >>>> >>>> From what you said, my understanding is that, you want to wipe out >>>> the basic stats for existing tables? And, could u also let us know why you >>>> would like to get rid of the stats? Stats is crucial for Hive CBO to work >>>> and we are moving towards the direction to make table/column stats >>>> collection automatically. It seems that you prefer an opposite direction. >>>> There is nothing wrong here and we would like to listen to your idea and >>>> motivation so that we can better design Hive stats collection. Thanks! >>>> >>>> Best >>>> Pengcheng >>>> >>>> >>>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <ume...@groupon.com> wrote: >>>> >>>>> Any insights on this? >>>>> >>>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <ume...@groupon.com> >>>>> wrote: >>>>> >>>>>> Update: Realized this works if we create a fresh table with this >>>>>> config already disabled but does not work if there is already a table >>>>>> created when this config was enabled. We now need to figure out how to >>>>>> disable this config for a table created when this config was true. >>>>>> >>>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <ume...@groupon.com> >>>>>> wrote: >>>>>> >>>>>>> Hive version we are using is 1.2.1. >>>>>>> >>>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <ume...@groupon.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> We need to disable the Hive autogather stats optimization by >>>>>>>> disabling "*hive.stats.autogather*" but for some reason, the >>>>>>>> config change doesnt seem to go through. We modified this config in the >>>>>>>> hive-site.xml and restarted the Hive metastore. We also made this >>>>>>>> change >>>>>>>> explicitly in the job but it doesnt seem to help. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> *set hive.stats.autogather=false;* >>>>>>>> Does anyone know the right way to disable this config since we dont >>>>>>>> want to compute stats in out jobs. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Udit >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >