Actually, it is not that different from traditional relational databases such as Oracle Exadata, which supports storage indexes and, for DWH scenarios, recommends avoiding traditional indexes, which are better suited to OLTP workloads.
> On 19 Jan 2016, at 21:35, Ashok Kumar <ashok34...@yahoo.com> wrote:
>
> Thanks Owen,
>
> I got a bit confused comparing ORC with what I know about indexes in
> relational databases. Still need to understand it a bit better.
>
> Regards
>
> From: Owen O'Malley [mailto:omal...@apache.org]
> Sent: 19 January 2016 17:57
> To: user@hive.apache.org; Ashok Kumar <ashok34...@yahoo.com>
> Cc: Jörn Franke <jornfra...@gmail.com>
> Subject: Re: ORC files and statistics
>
> On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:
> Thank you both.
>
> So if I have a Hive table of ORC type and it contains 100K rows, there will
> be 10 row groups of 10K rows each.
>
> Yes
>
> Within each row group there will be min, max, count(distinct_value) and sum
> for each column within that row group. Does count mean the count of distinct
> values, including null occurrences, for that column?
>
> Actually, it is just count, not count distinct. Newer versions of Hive also
> have the option of including bloom filters for some columns. That enables
> fast searches for particular values in columns that aren't sorted.
>
> Also, if the table contains 5 columns, will there be 5x10 row groups in total?
>
> The ORC files are laid out in stripes that correspond to roughly ~64MB
> compressed. Each column within a stripe is laid out together. The row groups
> are a feature of the index and correspond to how many entries the index has.
> So yes, within a file with 100k rows, which obviously will be a single
> stripe, the index will have 10 row groups for each column for a total of 50
> entries in the index. (The index is also laid out in columns so the reader
> only loads the parts of the index it needs for the columns it is reading.)
>
> .. Owen
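The stripe and statistics layout Owen describes can be inspected directly from an ORC file's metadata. Below is a minimal sketch using the standalone ORC Java Reader API (org.apache.orc); the file path and the cast to IntegerColumnStatistics are illustrative assumptions, not something from the thread.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.IntegerColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

public class OrcStatsDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to one ORC file of the Hive table.
        Path path = new Path("/tmp/mytable/part-00000.orc");
        Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(new Configuration()));

        // Stripes are the ~64MB compressed chunks; a 100k-row table
        // will typically fit in a single stripe.
        System.out.println("rows = " + reader.getNumberOfRows());
        for (StripeInformation stripe : reader.getStripes()) {
            System.out.println("stripe: rows=" + stripe.getNumberOfRows()
                    + " dataLength=" + stripe.getDataLength());
        }

        // File-level column statistics: count of values, plus min/max/sum
        // for numeric columns. The 10k-row row-group entries Owen mentions
        // live in the per-stripe row index inside the file.
        ColumnStatistics[] stats = reader.getStatistics();
        for (int col = 0; col < stats.length; col++) {
            System.out.print("column " + col + ": count=" + stats[col].getNumberOfValues());
            if (stats[col] instanceof IntegerColumnStatistics) {
                IntegerColumnStatistics ics = (IntegerColumnStatistics) stats[col];
                System.out.print(" min=" + ics.getMinimum()
                        + " max=" + ics.getMaximum()
                        + " sum=" + ics.getSum());
            }
            System.out.println();
        }
    }
}
```

For a quick look without writing code, Hive also ships an orcfiledump utility (`hive --orcfiledump <path>`) that prints the stripe layout and column statistics. Bloom filters, as Owen notes, are opt-in per column; they are typically enabled at table creation time via the `orc.bloom.filter.columns` table property.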