Stuck .need help I created a small table with multiple partition desc (id int ,term int) partitioned by id ,whenever I run analyze on any id I am getting perfectly good answers . I am unable to figure out the difference each file is making .
New table Table Parameters: transient_lastDdlTime 1406016417 # Storage Information SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: serialization.format 1 inserting insert into table statdevp partition(id) select id,term from statdev where id is not null and term is not null analyze analyze table statdevp partition(id=11) compute statistics for columns id; I am able to see all values in part_col_stat for the partitions I am running analyze and the orginal table : desc Table Parameters: last_modified_by XXXXXXXXX last_modified_time 1406047797 transient_lastDdlTime 1406047797 # Storage Information SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: serialization.format 1 load_inst_id int src_filename string server_date date # Partition Information # col_name data_type comment server_date date insertion insert into devStat partition(server_date) select load_inst_id,src_filename,server_ts,server_date from rrslog_ext where server_date='2013-11-01' and load_inst_id is not null limit 100 analyze analyze table devstat partition(server_date='2013-11-30') compute statistics for columns load_inst_id; any help will be highly appreciated .stuck here long time … thanks in advance . From: Navdeep Agrawal [mailto:navdeep_agra...@symantec.com] Sent: Wednesday, July 23, 2014 3:17 PM To: user@hive.apache.org Subject: RE: Hive Statistics No I have not set these to mysql db . when I set them to the one I am using for hive I am getting stat publisher not getting initialized .but if I have not set these parameters why every time a new row is getting created in mysql db in part_col_stats table . From: Andre Araujo [mailto:ara...@pythian.com] Sent: Wednesday, July 23, 2014 1:22 PM To: user Subject: Re: Hive Statistics Hi, Navdeep, Please note that the configuration for the stats database is separate from the configuration for the metastore db. Can you confirm you have both to use a mysql db? The properties for the stats db are: hive.stats.dbclass= hive.stats.dbconnectionstring= On 23 July 2014 16:07, Navdeep Agrawal <navdeep_agra...@symantec.com<mailto:navdeep_agra...@symantec.com>> wrote: Thank you Nitin for reply. I am using mysql database ,and also I can see new row created for the partition ,but all values are zero . I think explicitly giving mysql data base wont make a difference . From: Nitin Pawar [mailto:nitinpawar...@gmail.com<mailto:nitinpawar...@gmail.com>] Sent: Tuesday, July 22, 2014 11:05 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Re: Hive Statistics by default hive stores the statistics in derby database. If you want a persistent look at column statistics, you may want to create mysql based database for column statistics. Your queries look fine On Tue, Jul 22, 2014 at 10:50 PM, Navdeep Agrawal <navdeep_agra...@symantec.com<mailto:navdeep_agra...@symantec.com>> wrote: Hi , i am trying to compute statistics on ORC File but i am unable see any changes in PART_COL_STATS as well on using set hive.compute.query.using.stats=true; set hive.stats.reliable=true; set hive.stats.fetch.column.stats=true; set hive.stats.fetch.partition.stats=true; set hive.cbo.enable=true; to get max value of a column it is running full Map reduce on column .. what i want to use is max value stored in meta store ,but i am unable to catch these statistics . my table desc is load_inst_id int src_filename string server_date date my analyze query is analyze table mytable partition(server_date=’2013-11-30′) compute statistics for columns load_inst_id; i am always getting 0 as loadinstant id ,i have to turn off my hive.compute.query.using.stats to get correct result(through map reduce max(load_inst_id)) -- Nitin Pawar -- André Araújo Big Data Consultant/Solutions Architect The Pythian Group - Australia - www.pythian.com<http://www.pythian.com> Office (calls from within Australia): 1300 366 021 x1270 Office (international): +61 2 8016 7000 x270 OR +1 613 565 8696 x1270 Mobile: +61 410 323 559 Fax: +61 2 9805 0544 IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com<mailto:ara...@pythian.com> @ GTalk “Success is not about standing at the top, it's the steps you leave behind.” — Iker Pou (rock climber) --