Thanks Owen,
I got a bit confused comparing ORC with what I know about indexes in relational 
databases. Still need to understand it a bit better.
Regards
From: Owen O'Malley [mailto:omal...@apache.org] 
Sent: 19 January 2016 17:57
To: user@hive.apache.org; Ashok Kumar <ashok34...@yahoo.com>
Cc: Jörn Franke <jornfra...@gmail.com>
Subject: Re: ORC files and statistics

On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:
Thank you both.  So if I have a Hive table of ORC type and it contains 100K 
rows, there will be 10 row groups of 10K rows each?
  Yes 
  Within each row group there will be min, max, count(distinct_value) and sum 
for each column within that row group. Does count mean the count of distinct 
values, including null occurrences, for that column?
  Actually, it is just count, not count distinct. Newer versions of Hive also 
have the option of including bloom filters for some columns. That enables fast 
searches for particular values in columns that aren't sorted. 
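To illustrate how per-row-group statistics let a reader skip data, here is a minimal sketch in Python. It is not ORC's actual implementation: the names RowGroupStats, build_stats and groups_to_read are hypothetical, the column holds plain integers with no nulls, and the 10K row-group size is taken from the numbers in this thread.

```python
from dataclasses import dataclass
from typing import List

ROW_GROUP_SIZE = 10_000  # rows per row group, as assumed in this thread


@dataclass
class RowGroupStats:
    # Hypothetical container for the per-row-group statistics discussed above.
    minimum: int
    maximum: int
    count: int  # plain count of values, not count distinct
    total: int  # sum of the values


def build_stats(values: List[int], group_size: int = ROW_GROUP_SIZE) -> List[RowGroupStats]:
    """Compute min, max, count and sum for each consecutive row group."""
    stats = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        stats.append(RowGroupStats(min(chunk), max(chunk), len(chunk), sum(chunk)))
    return stats


def groups_to_read(stats: List[RowGroupStats], target: int) -> List[int]:
    """Return the row groups whose [min, max] range could contain target."""
    return [i for i, s in enumerate(stats) if s.minimum <= target <= s.maximum]


column = list(range(100_000))  # a sorted column of 100K rows
stats = build_stats(column)
candidates = groups_to_read(stats, 42_123)
print(len(stats), candidates)
```

On a sorted column only one row group survives the min/max check. On an unsorted column the ranges of most row groups overlap, so min/max alone prunes very little, which is the case the bloom filters mentioned above are meant to help with.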
  Also, if the table contains 5 columns, will there be 5x10 row groups in total?
  The ORC files are laid out in stripes that correspond to roughly 64MB 
compressed. Each column within a stripe is laid out together. The row groups 
are a feature of the index and correspond to how many entries the index has. So 
yes, within a file with 100k rows, which obviously will be a single stripe, the 
index will have 10 row groups for each column for a total of 50 entries in the 
index. (The index is also laid out in columns so the reader only loads the 
parts of the index it needs for the columns it is reading.)

.. Owen
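The arithmetic in the reply above can be checked with a small Python sketch. It assumes a single stripe and 10K rows per row group, as in the example; index_entries is a hypothetical helper, not part of any ORC or Hive API.

```python
import math


def index_entries(num_rows: int, num_columns: int, rows_per_group: int = 10_000):
    """Return (row groups per column, total index entries) for one stripe."""
    # A partially filled last row group still needs its own index entry.
    row_groups = math.ceil(num_rows / rows_per_group)
    return row_groups, row_groups * num_columns


row_groups, entries = index_entries(100_000, 5)
print(row_groups, entries)  # 10 50
```

Because the index is stored per column, a query that reads only 2 of the 5 columns would, under the same assumptions, need to load only 20 of those 50 entries.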
  

  
