Re: ORC files and statistics

Ashok Kumar Tue, 19 Jan 2016 09:46:05 -0800

Thank you both.
So if I have a Hive table stored as ORC and it contains 100K rows, there will be 
10 row groups of 10K rows each.
Within each row group there will be min, max, count(distinct_value) and sum for 
each column. Does count mean the count of distinct values, including null 
occurrences, for that column?
Also, if the table contains 5 columns, will there be 5x10 row groups in total?
Thanks again


On Tuesday, 19 January 2016, 17:35, Jörn Franke <jornfra...@gmail.com> wrote:
 

Just be aware that you should insert the data sorted at least on the most 
discriminating column of your WHERE clause.
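
A rough sketch of why the sort order matters, assuming row groups of 10,000 rows that each carry only a min and a max for one column. This is plain Python, not the ORC reader; the helper names (min_max_per_group, groups_matching) are made up for illustration.

```python
import random

ROW_GROUP_SIZE = 10_000  # rows per row group, as described in this thread


def min_max_per_group(values, group_size=ROW_GROUP_SIZE):
    """Return one (min, max) pair per consecutive group of rows."""
    stats = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        stats.append((min(chunk), max(chunk)))
    return stats


def groups_matching(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps [lo, hi]."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)


random.seed(0)
sorted_values = list(range(100_000))   # data inserted in sorted order
shuffled_values = sorted_values[:]
random.shuffle(shuffled_values)        # same data, inserted in random order

# Predicate comparable to: WHERE col BETWEEN 15000 AND 15999
sorted_hits = groups_matching(min_max_per_group(sorted_values), 15_000, 15_999)
shuffled_hits = groups_matching(min_max_per_group(shuffled_values), 15_000, 15_999)

print("row groups to read (sorted):  ", sorted_hits)
print("row groups to read (shuffled):", shuffled_hits)
```

With sorted input the min/max ranges of the row groups do not overlap, so a selective predicate rules out most of them. With unsorted input nearly every row group spans the whole value range and few or none can be skipped.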
On 19 Jan 2016, at 17:27, Owen O'Malley <omal...@apache.org> wrote:


It has both. Each index has statistics of min, max, count, and sum for each 
column in the row group of 10,000 rows. It also has the location of the start 
of each row group, so that the reader can jump straight to the beginning of the 
row group. The reader takes a SearchArgument (e.g., age > 100) that limits which 
rows are required for the query, so it can avoid reading an entire file, or at 
least sections of the file.
.. Owen
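
To make the description above concrete, here is a minimal sketch of a row-group index and of how a predicate such as age > 100 can rule out row groups from their statistics alone. It is an illustration in plain Python, not ORC's implementation or API: RowGroupStats, build_index and groups_to_read are hypothetical names, start_row stands in for the stored position of the row group, and count here is simply the number of values in the group (null handling is left out).

```python
from dataclasses import dataclass
from typing import List

ROW_GROUP_SIZE = 10_000  # rows per row group, as stated above


@dataclass
class RowGroupStats:
    start_row: int  # stand-in for the stored start position of the row group
    minimum: int
    maximum: int
    count: int
    total: int      # the "sum" statistic


def build_index(values: List[int], group_size: int = ROW_GROUP_SIZE) -> List[RowGroupStats]:
    """Compute min, max, count and sum for each row group of one column."""
    index = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        index.append(RowGroupStats(start, min(chunk), max(chunk), len(chunk), sum(chunk)))
    return index


def groups_to_read(index: List[RowGroupStats], threshold: int) -> List[RowGroupStats]:
    """Keep only row groups that might contain a value > threshold."""
    return [g for g in index if g.maximum > threshold]


# 100K rows of an "age" column: all values below 90, except one row set to 120.
ages = [(i * 7) % 90 for i in range(100_000)]
ages[25_000] = 120

index = build_index(ages)
candidates = groups_to_read(index, 100)  # predicate: age > 100

print("row groups in file:", len(index))
print("row groups to read:", [g.start_row for g in candidates])
```

Only the row group whose max exceeds 100 has to be read; the reader can jump to its start and skip the rest. The statistics can only prove that a row group has no matching rows, so the rows inside a surviving group still have to be filtered.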
On Tue, Jan 19, 2016 at 7:50 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:

 Hi,
I have read some notes on ORC files in Hive and indexes.
The document describes the indexes but makes reference to statistics:
Indexes
ORC provides three level of indexes within each file: file level - statistics 
about the values in each column across the entire file
(View on orc.apache.org)


I am confused, as it seems to mix up indexes with statistics. Can someone 
clarify this?
Thanks
