Hi Prasanth,
In your statement below “ORC will keep buffering the rows in memory. Orc has its own memory manager which will monitor the memory consumption every 5000 rows. If the memory consumption exceeds the stripe size (in this case 16MB) then it will flush the stripe. “ I gather this happens when the data is read from ORC stripes. To be clear does this mean that rows are read in batch 0f 5000 into memory/cache until the memory consumed equals the size of stripe then the rows in mermory will be flushed back to disk.? On that basis does this imply that at any time the max rows that can be read into buffer will be one stripe size regardless? Thanks, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility. From: Prasanth Jayachandran [mailto:pjayachand...@hortonworks.com] Sent: 20 January 2016 18:53 To: user@hive.apache.org Subject: Re: ORC table stats from orcfiledump Please find answers inline. Thanks Prasanth On Jan 20, 2016, at 12:32 PM, Mich Talebzadeh <m...@peridale.co.uk <mailto:m...@peridale.co.uk> > wrote: Hi, I did some tests on ORC table by creating a simple ORC table as below CREATE TABLE orctest ( PROD_ID bigint , CUST_ID bigint , TIME_ID timestamp , CHANNEL_ID bigint , PROMO_ID bigint , QUANTITY_SOLD decimal(10) , AMOUNT_SOLD decimal(10) ) CLUSTERED BY (PROD_ID) INTO 256 BUCKETS STORED AS ORC --TBLPROPERTIES ( "orc.compress"="SNAPPY", TBLPROPERTIES ( "orc.create.index"="true", "orc.stripe.size"="16777216", "orc.row.index.stride"="10000" ) ; show create table orctest; insert into orctest select * from smallsales limit 5000000 ; analyze table orctest compute statistics; I deliberately set the stripe size to be 16MB and row index size to be 10,000 Table is created with 5Million rows as below NFO : Table oraclehadoop.orctest stats: [numFiles=1, numRows=5000000, totalSize=32140689, rawDataSize=1480000000] Now if I dump the content of ORC file I see as below Questions I have: 1. ORC will use be default ZLIB. That means that oRC file will use some fo Looks like your question is incomplete. 2. I specified the stripe size of 16MB. It has created 3 stripes of 172,0000 + 172,0000 + 156,000 rows. Does it decide on the number of stripes based on the total table size which seems to be 30MB (File length: 32140689 bytes) ORC will keep buffering the rows in memory. Orc has its own memory manager which will monitor the memory consumption every 5000 rows. If the memory consumption exceeds the stripe size (in this case 16MB) then it will flush the stripe. 3. There is no mention of statistics for row index in batch of 10,000? Is this correct Depending on which version of hive you are using you can use the orcfiledump option —rowindex followed by csv list of columns for which you want to print the row index. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility You should be seeing something like this https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/orc_file_dump.q.out#L187 4. Finally under Stripes:what are those statistics? Are they offset values? Each stripe consists of columns from the schema and each column consists of “streams”. The “Stripes” section first prints out the starting offset of the stripe, length of data segment, length of stripe footer and length of index. First segment within a stripe is index followed by data segment and finally the footer. The data segment consists are columns laid one after another and within each column there are streams. The name, offset and length of each stream are also mentioned in the “Stripes” section. For example. column 3 is of type timestamp. There are 3 streams for timestamp column ROW_INDEX (which stores min/max stats, position information etc.), DATA stream stores time in seconds and SECONDARY stream stores nano seconds. If you have enabled bloom filter then you will also see BLOOM_FILTER stream following the ROW_INDEX stream. You can see the BLOOM_FILTER stream here https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/orc_file_dump.q.out#L154 hive --orcfiledump /user/hive/warehouse/oraclehadoop.db/orctest/000000_0 Compression: ZLIB Compression size: 262144 Type: struct<_col0:bigint,_col1:bigint,_col2:timestamp,_col3:bigint,_col4:bigint,_col5:decimal(10,0),_col6:decimal(10,0)> Stripe Statistics: Stripe 1: Column 0: count: 1720000 hasNull: false Column 1: count: 1720000 hasNull: false min: 13 max: 148 sum: 134770570 Column 2: count: 1720000 hasNull: false min: 2 max: 101000 sum: 12541768461 Column 3: count: 1720000 hasNull: false min: 883612800000 max: 1009756800000 Column 4: count: 1720000 hasNull: false min: 2 max: 9 sum: 4923112 Column 5: count: 1720000 hasNull: false min: 33 max: 999 sum: 1678552095 Column 6: count: 1720000 hasNull: false min: 1 max: 1 sum: 1720000 Column 7: count: 1720000 hasNull: false min: 6 max: 1783 sum: 184490864 Stripe 2: Column 0: count: 1720000 hasNull: false Column 1: count: 1720000 hasNull: false min: 13 max: 148 sum: 134793536 Column 2: count: 1720000 hasNull: false min: 2 max: 101000 sum: 12545167155 Column 3: count: 1720000 hasNull: false min: 883612800000 max: 1009756800000 Column 4: count: 1720000 hasNull: false min: 2 max: 9 sum: 4924236 Column 5: count: 1720000 hasNull: false min: 33 max: 999 sum: 1678720214 Column 6: count: 1720000 hasNull: false min: 1 max: 1 sum: 1720000 Column 7: count: 1720000 hasNull: false min: 6 max: 1783 sum: 184666196 Stripe 3: Column 0: count: 1560000 hasNull: false Column 1: count: 1560000 hasNull: false min: 13 max: 148 sum: 122265414 Column 2: count: 1560000 hasNull: false min: 2 max: 101000 sum: 11367296072 Column 3: count: 1560000 hasNull: false min: 883612800000 max: 1009756800000 Column 4: count: 1560000 hasNull: false min: 2 max: 9 sum: 4465701 Column 5: count: 1560000 hasNull: false min: 33 max: 999 sum: 1522502365 Column 6: count: 1560000 hasNull: false min: 1 max: 1 sum: 1560000 Column 7: count: 1560000 hasNull: false min: 6 max: 1783 sum: 167524289 File Statistics: Column 0: count: 5000000 hasNull: false Column 1: count: 5000000 hasNull: false min: 13 max: 148 sum: 391829520 Column 2: count: 5000000 hasNull: false min: 2 max: 101000 sum: 36454231688 Column 3: count: 5000000 hasNull: false min: 883612800000 max: 1009756800000 Column 4: count: 5000000 hasNull: false min: 2 max: 9 sum: 14313049 Column 5: count: 5000000 hasNull: false min: 33 max: 999 sum: 4879774674 Column 6: count: 5000000 hasNull: false min: 1 max: 1 sum: 5000000 Column 7: count: 5000000 hasNull: false min: 6 max: 1783 sum: 536681349 Stripes: Stripe: offset: 3 data: 11039881 rows: 1720000 tail: 137 index: 14370 Stream: column 0 section ROW_INDEX start: 3 length 30 Stream: column 1 section ROW_INDEX start: 33 length 1853 Stream: column 2 section ROW_INDEX start: 1886 length 2415 Stream: column 3 section ROW_INDEX start: 4301 length 2284 Stream: column 4 section ROW_INDEX start: 6585 length 1451 Stream: column 5 section ROW_INDEX start: 8036 length 1674 Stream: column 6 section ROW_INDEX start: 9710 length 1714 Stream: column 7 section ROW_INDEX start: 11424 length 2949 Stream: column 1 section DATA start: 14373 length 1797330 Stream: column 2 section DATA start: 1811703 length 3280925 Stream: column 3 section DATA start: 5092628 length 3592057 Stream: column 3 section SECONDARY start: 8684685 length 1797 Stream: column 4 section DATA start: 8686482 length 590101 Stream: column 5 section DATA start: 9276583 length 124066 Stream: column 6 section DATA start: 9400649 length 7653 Stream: column 6 section SECONDARY start: 9408302 length 1797 Stream: column 7 section DATA start: 9410099 length 1642358 Stream: column 7 section SECONDARY start: 11052457 length 1797 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DIRECT_V2 Encoding column 3: DIRECT_V2 Encoding column 4: DIRECT_V2 Encoding column 5: DIRECT_V2 Encoding column 6: DIRECT_V2 Encoding column 7: DIRECT_V2 Stripe: offset: 11054391 data: 11041929 rows: 1720000 tail: 137 index: 14292 Stream: column 0 section ROW_INDEX start: 11054391 length 30 Stream: column 1 section ROW_INDEX start: 11054421 length 1858 Stream: column 2 section ROW_INDEX start: 11056279 length 2398 Stream: column 3 section ROW_INDEX start: 11058677 length 2242 Stream: column 4 section ROW_INDEX start: 11060919 length 1443 Stream: column 5 section ROW_INDEX start: 11062362 length 1669 Stream: column 6 section ROW_INDEX start: 11064031 length 1714 Stream: column 7 section ROW_INDEX start: 11065745 length 2938 Stream: column 1 section DATA start: 11068683 length 1797175 Stream: column 2 section DATA start: 12865858 length 3283080 Stream: column 3 section DATA start: 16148938 length 3591894 Stream: column 3 section SECONDARY start: 19740832 length 1797 Stream: column 4 section DATA start: 19742629 length 590201 Stream: column 5 section DATA start: 20332830 length 123640 Stream: column 6 section DATA start: 20456470 length 7653 Stream: column 6 section SECONDARY start: 20464123 length 1797 Stream: column 7 section DATA start: 20465920 length 1642895 Stream: column 7 section SECONDARY start: 22108815 length 1797 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DIRECT_V2 Encoding column 3: DIRECT_V2 Encoding column 4: DIRECT_V2 Encoding column 5: DIRECT_V2 Encoding column 6: DIRECT_V2 Encoding column 7: DIRECT_V2 Stripe: offset: 22110749 data: 10016151 rows: 1560000 tail: 138 index: 13096 Stream: column 0 section ROW_INDEX start: 22110749 length 30 Stream: column 1 section ROW_INDEX start: 22110779 length 1704 Stream: column 2 section ROW_INDEX start: 22112483 length 2201 Stream: column 3 section ROW_INDEX start: 22114684 length 2051 Stream: column 4 section ROW_INDEX start: 22116735 length 1333 Stream: column 5 section ROW_INDEX start: 22118068 length 1534 Stream: column 6 section ROW_INDEX start: 22119602 length 1552 Stream: column 7 section ROW_INDEX start: 22121154 length 2691 Stream: column 1 section DATA start: 22123845 length 1628303 Stream: column 2 section DATA start: 23752148 length 2980652 Stream: column 3 section DATA start: 26732800 length 3257688 Stream: column 3 section SECONDARY start: 29990488 length 1633 Stream: column 4 section DATA start: 29992121 length 535364 Stream: column 5 section DATA start: 30527485 length 112530 Stream: column 6 section DATA start: 30640015 length 6934 Stream: column 6 section SECONDARY start: 30646949 length 1633 Stream: column 7 section DATA start: 30648582 length 1489781 Stream: column 7 section SECONDARY start: 32138363 length 1633 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DIRECT_V2 Encoding column 3: DIRECT_V2 Encoding column 4: DIRECT_V2 Encoding column 5: DIRECT_V2 Encoding column 6: DIRECT_V2 Encoding column 7: DIRECT_V2 File length: 32140689 bytes Padding length: 0 bytes Padding ratio: 0% Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.