Thanks Owen,
I got a bit confused comparing ORC with what I know about indexes in relational
databases. Still need to understand it a bit better.
Regards
From: Owen O'Malley [mailto:omal...@apache.org]
Sent: 19 January 2016 17:57
To: user@hive.apache.org; Ashok Kumar <ashok34...@yahoo.com>
Cc: Jörn Franke <jornfra...@gmail.com>
Subject: Re: ORC files and statistics
On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:
Thank you both. So if I have a Hive table of ORC type and it contains 100K
rows, there will be 10 row groups of 10K rows each.
Yes
Within each row group there will be min, max, count(distinct_value) and sum
for each column within that row group. Does count mean the count of distinct
values, including null occurrences, for that column?
Actually, it is just count, not count distinct. Newer versions of Hive also
have the option of including bloom filters for some columns. That enables fast
searches for particular values in columns that aren't sorted.
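To make the role of the per-row-group statistics concrete, here is a rough Python sketch of how min and max can be used to skip row groups when searching for a value. It is a toy model, not ORC's actual reader code: the names RowGroupStats, build_stats and groups_to_read are made up for illustration, the 10,000-row stride just matches the numbers in this thread, and treating count as the number of non-null values is my assumption about what "just count" means.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Toy per-row-group statistics for one integer column (hypothetical)."""
    minimum: Optional[int]  # None when the group holds only nulls
    maximum: Optional[int]
    count: int              # plain count of non-null values, not distinct (assumed)
    total: int              # sum of the non-null values


def build_stats(values: List[Optional[int]], stride: int = 10_000) -> List[RowGroupStats]:
    """Split a column into row groups of `stride` rows and summarize each one."""
    stats = []
    for start in range(0, len(values), stride):
        group = [v for v in values[start:start + stride] if v is not None]
        if group:
            stats.append(RowGroupStats(min(group), max(group), len(group), sum(group)))
        else:
            stats.append(RowGroupStats(None, None, 0, 0))
    return stats


def groups_to_read(stats: List[RowGroupStats], target: int) -> List[int]:
    """Return indexes of row groups whose [min, max] range could contain target."""
    return [
        i for i, s in enumerate(stats)
        if s.count > 0 and s.minimum <= target <= s.maximum
    ]


# A sorted column of 100K rows gives 10 row groups; only one can hold 42,000.
column = list(range(100_000))
column_stats = build_stats(column)
candidates = groups_to_read(column_stats, 42_000)
```

On sorted data the min/max ranges do not overlap, so a lookup touches a single row group. On unsorted data the ranges tend to overlap and min/max prunes very little, which is the case the bloom filters mentioned above are meant to help with.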
Also, if the table contains 5 columns, will there be 5x10 row groups in total?
The ORC files are laid out in stripes that correspond to roughly 64MB
compressed. Each column within a stripe is laid out together. The row groups
are a feature of the index and correspond to how many entries the index has. So
yes, within a file with 100k rows, which obviously will be a single stripe, the
index will have 10 row groups for each column for a total of 50 entries in the
index. (The index is also laid out in columns so the reader only loads the
parts of the index it needs for the columns it is reading.)
.. Owen
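For reference, the index arithmetic from the reply can be written as a small Python sketch. The function name index_entries is hypothetical, and the 10,000-row stride is an assumption that matches the 10K row groups discussed in this thread; it also assumes all rows fit in a single stripe, as in the 100K-row example.

```python
def index_entries(num_rows: int, num_columns: int, stride: int = 10_000) -> tuple:
    """Return (row groups per column, total index entries) for a single stripe."""
    # Ceiling division: a partially filled last row group still gets an entry.
    row_groups = (num_rows + stride - 1) // stride
    return row_groups, row_groups * num_columns


# 100K rows and 5 columns: 10 row groups per column, 50 index entries in total.
row_groups, total_entries = index_entries(100_000, 5)
```

Because the index is laid out by column, a query that reads 2 of the 5 columns would only need the 20 entries belonging to those columns.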