Sure you have to move the mind set to Hive.
In general in RDBMS an index is a different construct from the statistics kept in histograms for the table and index columns itself. For example in Oracle: Statistics for a column include: • Minimum value for the column • Maximum value for the column • Number of rows where the value of the column is null • Number of distinct values for the column • Optionally, one of two different flavours of histogram describing skews in the table. So by and large statistics for a column in a typical RDBMS is practically the same as statistics collected in ORC table for row group (batch of 10K rows), excluding in Hive the number of distinct values (mitigated by bloom filter in Hive as Owen alluded to). I believe there is no such thing as an “external index” of value in Hive. Sure you can create even bitmap indexes but they don’t add any value IMO. So the best we have is using startistics collected on ORC tables (AKA ORC internal indexes) to narrow down the query results and that is where the value of ORC table construct comes. In an RDBMS an index has the column value and RowID of the row in the table thus helping optimiser to quickly narrow down the result set through the index usage. That is what an internal index in Hive supposerd to do. HTH, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility. From: Ashok Kumar [mailto:ashok34...@yahoo.com] Sent: 19 January 2016 20:36 To: User <user@hive.apache.org> Subject: Re: ORC files and statistics Thanks Owen, I got a bit confused comparing ORC with what I know about indexes in relational databases. Still need to understand it a bit better. Regards From: Owen O'Malley [mailto:omal...@apache.org] Sent: 19 January 2016 17:57 To: user@hive.apache.org <mailto:user@hive.apache.org> ; Ashok Kumar <ashok34...@yahoo.com <mailto:ashok34...@yahoo.com> > Cc: Jörn Franke <jornfra...@gmail.com <mailto:jornfra...@gmail.com> > Subject: Re: ORC files and statistics On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <ashok34...@yahoo.com <mailto:ashok34...@yahoo.com> > wrote: Thank you both. So if I have a Hive table of ORC type and it contains 100K rows, there will be 10 row groups of 10K row each. Yes within each row group there will be min, max, count(distint_value) and sum for each column within that row group. is count mean count of distinct values including null occurrence for that column?. Actually, it is just count, not count distinct. Newer versions of Hive also have the option of including bloom filters for some columns. That enables fast searches for particular values in columns that aren't sorted. also if the table contains 5 columns will there be 5x10 row groups in total? The ORC files are laid out in stripes that correspond to roughly ~64MB compressed. Each column within a stripe is laid out together. The row groups are a feature of the index and correspond to how many entries the index has. So yes, within a file with 100k rows, which obviously will be a single stripe, the index will have 10 row groups for each column for a total of 50 entries in the index. (The index is also laid out in columns so the reader only loads the parts of the index it needs for the columns it is reading.) .. Owen