RE: ORC files and statistics

Mich Talebzadeh Tue, 19 Jan 2016 13:52:21 -0800

Sure you have to move the mind set to Hive.


In general in RDBMS an index is a different construct from the statistics kept 
in histograms for the table and index columns itself. For example in Oracle:

 

Statistics for a column include:

 

•       Minimum value for the column

•       Maximum value for the column

•       Number of rows where the value of the column is null

•       Number of distinct values for the column

•       Optionally, one of two different flavours of histogram describing skews 
in the table. 

 

So by and large statistics for a column in a typical RDBMS  is practically the 
same as  statistics collected in ORC table for row group (batch of 10K rows), 
excluding in Hive the number of distinct values (mitigated by bloom filter in 
Hive as Owen alluded to).

 

I believe there is no such thing as an “external index” of value in Hive. Sure 
you can create even bitmap indexes but they don’t add any value IMO. So the 
best we have is using startistics collected on ORC tables (AKA ORC internal 
indexes) to narrow down the query results and that is where the value of ORC 
table construct comes. In an RDBMS an index has the column value and RowID of 
the row in the table thus helping optimiser to quickly narrow down the result 
set through the index usage. That is what an internal index in Hive supposerd 
to do.

 

HTH,

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

From: Ashok Kumar [mailto:ashok34...@yahoo.com] 
Sent: 19 January 2016 20:36
To: User <user@hive.apache.org>
Subject: Re: ORC files and statistics

 

Thanks Owen,

 

I got a bit confused comparing ORC with what I know about indexes in relational 
databases. Still need to understand it a bit better.

 

Regards

 

From: Owen O'Malley [mailto:omal...@apache.org] 
Sent: 19 January 2016 17:57
To: user@hive.apache.org <mailto:user@hive.apache.org> ; Ashok Kumar 
<ashok34...@yahoo.com <mailto:ashok34...@yahoo.com> >
Cc: Jörn Franke <jornfra...@gmail.com <mailto:jornfra...@gmail.com> >
Subject: Re: ORC files and statistics

 

On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <ashok34...@yahoo.com 
<mailto:ashok34...@yahoo.com> > wrote:

Thank you both.

 

So if I have a Hive table of ORC type and it contains 100K rows, there will be 
10 row groups of 10K row each.

 

Yes

 

 

within each row group there will be min, max, count(distint_value) and sum for 
each column within that row group. is count mean count of distinct values 
including null occurrence for that column?.

 

Actually, it is just count, not count distinct. Newer versions of Hive also 
have the option of including bloom filters for some columns. That enables fast 
searches for particular values in columns that aren't sorted.

 

 

also if the table contains 5 columns will there be 5x10 row groups in total?

 

The ORC files are laid out in stripes that correspond to roughly ~64MB 
compressed. Each column within a stripe is laid out together. The row groups 
are a feature of the index and correspond to how many entries the index has. So 
yes, within a file with 100k rows, which obviously will be a single stripe, the 
index will have 10 row groups for each column for a total of 50 entries in the 
index. (The index is also laid out in columns so the reader only loads the 
parts of the index it needs for the columns it is reading.)

 

.. Owen

RE: ORC files and statistics

Reply via email to