Thanks, Prasanth, for these helpful answers.

 

 

Just to clarify:

 

1.    ORC will use be default ZLIB. That means that oRC file will use some fo

 

Looks like your question is incomplete. 





It was meant to say: ORC will use ZLIB by default. That means the ORC file will 
be compressed with the default codec, ZLIB.
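
For reference, the codec can also be set explicitly as a table property (the 
commented-out SNAPPY line in the DDL further down does the same for Snappy), e.g.:

TBLPROPERTIES ("orc.compress"="ZLIB")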

 

With regard to printing row index info: my version of Hive is 1.2.1, so I can 
use the following (here printing the row index for columns 1 and 2)

 

hive --orcfiledump --rowindex 1,2 
/user/hive/warehouse/oraclehadoop.db/orctest/000000_0

 

------

 

    Row group indices for column 1:

      Entry 0: count: 10000 hasNull: false min: 13 max: 148 sum: 779823 
positions: 0,0,0

      Entry 1: count: 10000 hasNull: false min: 13 max: 148 sum: 778310 
positions: 0,19725,58

      Entry 2: count: 10000 hasNull: false min: 13 max: 148 sum: 786886 
positions: 0,39656,38

      Entry 3: count: 10000 hasNull: false min: 13 max: 148 sum: 783344 
positions: 0,59084,258

      Entry 4: count: 10000 hasNull: false min: 13 max: 148 sum: 784912 
positions: 0,78592,451

      Entry 5: count: 10000 hasNull: false min: 13 max: 148 sum: 788195 
positions: 0,99270,9

      Entry 6: count: 10000 hasNull: false min: 13 max: 148 sum: 787092 
positions: 0,118972,62

      Entry 7: count: 10000 hasNull: false min: 13 max: 148 sum: 786975 
positions: 0,138221,341

      Entry 8: count: 10000 hasNull: false min: 13 max: 148 sum: 783331 
positions: 0,158534,122

      Entry 9: count: 10000 hasNull: false min: 13 max: 148 sum: 783495 
positions: 0,177872,364

 

      ...

      Entry 144: count: 10000 hasNull: false min: 13 max: 148 sum: 780480 
positions: 1379619,233684,287

      Entry 145: count: 10000 hasNull: false min: 13 max: 148 sum: 784335 
positions: 1379619,253973,77

      Entry 146: count: 10000 hasNull: false min: 13 max: 148 sum: 781325 
positions: 1517743,11373,250

      Entry 147: count: 10000 hasNull: false min: 13 max: 148 sum: 787695 
positions: 1517743,31528,118

      Entry 148: count: 10000 hasNull: false min: 13 max: 148 sum: 787770 
positions: 1517743,51579,17

      Entry 149: count: 10000 hasNull: false min: 13 max: 148 sum: 782642 
positions: 1517743,70753,325

      Entry 150: count: 10000 hasNull: false min: 13 max: 148 sum: 786487 
positions: 1517743,90362,418

      Entry 151: count: 10000 hasNull: false min: 13 max: 148 sum: 783408 
positions: 1517743,110981,69

      Entry 152: count: 10000 hasNull: false min: 13 max: 148 sum: 783365 
positions: 1517743,130708,81

      Entry 153: count: 10000 hasNull: false min: 13 max: 148 sum: 782312 
positions: 1517743,150598,46

      Entry 154: count: 10000 hasNull: false min: 13 max: 148 sum: 783407 
positions: 1517743,169883,319

      Entry 155: count: 10000 hasNull: false min: 13 max: 148 sum: 785221 
positions: 1517743,189838,275

 

      Row group indices for column 2:

      Entry 0: count: 10000 hasNull: false min: 2 max: 100985 sum: 72798735 
positions: 0,0,0

      Entry 1: count: 10000 hasNull: false min: 3 max: 100984 sum: 73305148 
positions: 0,18624,272

 

And this is repeated for every column, in row groups of 10,000 rows. However, I 
would have expected 5,000,000 / 10,000 = 500 entries, but I only see entries up to 155!
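
(For reference: the row group index is stored per stripe, so if the entries are 
counted stripe by stripe the arithmetic would be 1,720,000 / 10,000 = 172 entries 
for each of the first two stripes and 1,560,000 / 10,000 = 156 entries for the 
last one.)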

 

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

From: Prasanth Jayachandran [mailto:pjayachand...@hortonworks.com] 
Sent: 20 January 2016 18:53
To: user@hive.apache.org
Subject: Re: ORC table stats from orcfiledump

 

Please find answers inline. 

 

Thanks

Prasanth

 

On Jan 20, 2016, at 12:32 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

 

Hi,

 

I did some tests on ORC by creating a simple ORC table as below:

 

CREATE TABLE orctest

(

  PROD_ID        bigint                       ,

  CUST_ID        bigint                       ,

  TIME_ID        timestamp                    ,

  CHANNEL_ID     bigint                       ,

  PROMO_ID       bigint                       ,

  QUANTITY_SOLD  decimal(10)                  ,

  AMOUNT_SOLD    decimal(10)

)

CLUSTERED BY (PROD_ID) INTO 256 BUCKETS

STORED AS ORC

--TBLPROPERTIES ( "orc.compress"="SNAPPY",

TBLPROPERTIES (

"orc.create.index"="true",

"orc.stripe.size"="16777216",

"orc.row.index.stride"="10000" )

;

show create table orctest;

insert into orctest

select * from smallsales limit 5000000

;

analyze table orctest compute statistics;

 

 

 

I deliberately set the stripe size to 16MB and the row index stride to 10,000 rows.

 

The table is created with 5 million rows, as shown below:

 

INFO  : Table oraclehadoop.orctest stats: [numFiles=1, numRows=5000000, 
totalSize=32140689, rawDataSize=1480000000]
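
(In passing: rawDataSize=1480000000 against totalSize=32140689 suggests the 
ZLIB-compressed file is roughly 1/46th of the estimated raw data size.)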

 

 

Now if I dump the content of the ORC file I see the output below.

 

Questions I have:

 

1.    ORC will use be default ZLIB. That means that oRC file will use some fo

 

Looks like your question is incomplete. 





2. I specified a stripe size of 16MB. It has created 3 stripes of 1,720,000 + 
1,720,000 + 1,560,000 rows. Does it decide on the number of stripes based on the 
total table size, which seems to be about 30MB (File length: 32140689 bytes)?

 

ORC will keep buffering the rows in memory. ORC has its own memory manager 
which monitors the memory consumption every 5,000 rows. If the memory 
consumption exceeds the stripe size (in this case 16MB) then it flushes the 
stripe. 
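
A very rough sketch of that buffering behaviour, in Python-style pseudocode 
(the names are invented for illustration; this is not the actual ORC writer code):

    # Illustrative sketch only -- invented names, not the real ORC writer API.
    STRIPE_SIZE = 16 * 1024 * 1024    # orc.stripe.size (16 MB in this example)
    MEMORY_CHECK_ROWS = 5000          # memory manager checks every 5,000 rows

    def write_rows(rows, buffer_row, buffered_bytes, flush_stripe):
        """Buffer rows; flush a stripe when the buffered size exceeds the stripe size."""
        rows_since_flush = 0
        for row in rows:
            buffer_row(row)           # encode the row into in-memory column buffers
            rows_since_flush += 1
            if rows_since_flush % MEMORY_CHECK_ROWS == 0 and buffered_bytes() > STRIPE_SIZE:
                flush_stripe()        # write index, data and stripe footer, reset buffers
                rows_since_flush = 0
        if rows_since_flush > 0:
            flush_stripe()            # final, possibly smaller, stripe at end of file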





3.    There is no mention of statistics for the row index in batches of 10,000 
rows? Is this correct?

 

Depending on which version of Hive you are using, you can use the orcfiledump 
option --rowindex followed by a CSV list of the columns for which you want to 
print the row index. 
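
For example, to print the row index for columns 1 and 2 (substitute your own 
file path for the placeholder):

hive --orcfiledump --rowindex 1,2 <path to the ORC file>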

 

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility

 

You should be seeing something like this:

https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/orc_file_dump.q.out#L187





4. Finally, under Stripes: what are those statistics? Are they offset values?

 

 

Each stripe consists of columns from the schema and each column consists of 
“streams”. The “Stripes” section first prints out the starting offset of the 
stripe, the length of the data segment, the length of the stripe footer and the 
length of the index. The first segment within a stripe is the index, followed by 
the data segment and finally the footer. The data segment consists of columns 
laid out one after another, and within each column there are streams. The name, 
offset and length of each stream are also mentioned in the “Stripes” section. 
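
For example, from your dump below: the first stripe starts at offset 3 with 
index 14370, data 11039881 and tail 137, and 3 + 14370 + 11039881 + 137 = 
11054391, which is exactly the offset reported for the second stripe.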

 

For example, column 3 is of type timestamp. There are 3 streams for a timestamp 
column: ROW_INDEX (which stores min/max stats, position information etc.), a DATA 
stream which stores the time in seconds, and a SECONDARY stream which stores 
nanoseconds. If you have enabled bloom filters then you will also see a 
BLOOM_FILTER stream following the ROW_INDEX stream. You can see the BLOOM_FILTER 
stream here: 
https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/orc_file_dump.q.out#L154
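
In your dump this shows up for the timestamp column in the first stripe as, for 
example:

    Stream: column 3 section ROW_INDEX start: 4301 length 2284
    Stream: column 3 section DATA start: 5092628 length 3592057
    Stream: column 3 section SECONDARY start: 8684685 length 1797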





 

 

hive --orcfiledump  /user/hive/warehouse/oraclehadoop.db/orctest/000000_0

 

 

Compression: ZLIB

Compression size: 262144

Type: 
struct<_col0:bigint,_col1:bigint,_col2:timestamp,_col3:bigint,_col4:bigint,_col5:decimal(10,0),_col6:decimal(10,0)>

 

Stripe Statistics:

  Stripe 1:

    Column 0: count: 1720000 hasNull: false

    Column 1: count: 1720000 hasNull: false min: 13 max: 148 sum: 134770570

    Column 2: count: 1720000 hasNull: false min: 2 max: 101000 sum: 12541768461

    Column 3: count: 1720000 hasNull: false min: 883612800000 max: 1009756800000

    Column 4: count: 1720000 hasNull: false min: 2 max: 9 sum: 4923112

    Column 5: count: 1720000 hasNull: false min: 33 max: 999 sum: 1678552095

    Column 6: count: 1720000 hasNull: false min: 1 max: 1 sum: 1720000

    Column 7: count: 1720000 hasNull: false min: 6 max: 1783 sum: 184490864

  Stripe 2:

    Column 0: count: 1720000 hasNull: false

    Column 1: count: 1720000 hasNull: false min: 13 max: 148 sum: 134793536

    Column 2: count: 1720000 hasNull: false min: 2 max: 101000 sum: 12545167155

    Column 3: count: 1720000 hasNull: false min: 883612800000 max: 1009756800000

    Column 4: count: 1720000 hasNull: false min: 2 max: 9 sum: 4924236

    Column 5: count: 1720000 hasNull: false min: 33 max: 999 sum: 1678720214

    Column 6: count: 1720000 hasNull: false min: 1 max: 1 sum: 1720000

    Column 7: count: 1720000 hasNull: false min: 6 max: 1783 sum: 184666196

  Stripe 3:

    Column 0: count: 1560000 hasNull: false

    Column 1: count: 1560000 hasNull: false min: 13 max: 148 sum: 122265414

    Column 2: count: 1560000 hasNull: false min: 2 max: 101000 sum: 11367296072

    Column 3: count: 1560000 hasNull: false min: 883612800000 max: 1009756800000

    Column 4: count: 1560000 hasNull: false min: 2 max: 9 sum: 4465701

    Column 5: count: 1560000 hasNull: false min: 33 max: 999 sum: 1522502365

    Column 6: count: 1560000 hasNull: false min: 1 max: 1 sum: 1560000

    Column 7: count: 1560000 hasNull: false min: 6 max: 1783 sum: 167524289

 

File Statistics:

  Column 0: count: 5000000 hasNull: false

  Column 1: count: 5000000 hasNull: false min: 13 max: 148 sum: 391829520

  Column 2: count: 5000000 hasNull: false min: 2 max: 101000 sum: 36454231688

  Column 3: count: 5000000 hasNull: false min: 883612800000 max: 1009756800000

  Column 4: count: 5000000 hasNull: false min: 2 max: 9 sum: 14313049

  Column 5: count: 5000000 hasNull: false min: 33 max: 999 sum: 4879774674

  Column 6: count: 5000000 hasNull: false min: 1 max: 1 sum: 5000000

  Column 7: count: 5000000 hasNull: false min: 6 max: 1783 sum: 536681349

 

Stripes:

  Stripe: offset: 3 data: 11039881 rows: 1720000 tail: 137 index: 14370

    Stream: column 0 section ROW_INDEX start: 3 length 30

    Stream: column 1 section ROW_INDEX start: 33 length 1853

    Stream: column 2 section ROW_INDEX start: 1886 length 2415

    Stream: column 3 section ROW_INDEX start: 4301 length 2284

    Stream: column 4 section ROW_INDEX start: 6585 length 1451

    Stream: column 5 section ROW_INDEX start: 8036 length 1674

    Stream: column 6 section ROW_INDEX start: 9710 length 1714

    Stream: column 7 section ROW_INDEX start: 11424 length 2949

    Stream: column 1 section DATA start: 14373 length 1797330

    Stream: column 2 section DATA start: 1811703 length 3280925

    Stream: column 3 section DATA start: 5092628 length 3592057

    Stream: column 3 section SECONDARY start: 8684685 length 1797

    Stream: column 4 section DATA start: 8686482 length 590101

    Stream: column 5 section DATA start: 9276583 length 124066

    Stream: column 6 section DATA start: 9400649 length 7653

    Stream: column 6 section SECONDARY start: 9408302 length 1797

    Stream: column 7 section DATA start: 9410099 length 1642358

    Stream: column 7 section SECONDARY start: 11052457 length 1797

    Encoding column 0: DIRECT

    Encoding column 1: DIRECT_V2

    Encoding column 2: DIRECT_V2

    Encoding column 3: DIRECT_V2

    Encoding column 4: DIRECT_V2

    Encoding column 5: DIRECT_V2

    Encoding column 6: DIRECT_V2

    Encoding column 7: DIRECT_V2

  Stripe: offset: 11054391 data: 11041929 rows: 1720000 tail: 137 index: 14292

    Stream: column 0 section ROW_INDEX start: 11054391 length 30

    Stream: column 1 section ROW_INDEX start: 11054421 length 1858

    Stream: column 2 section ROW_INDEX start: 11056279 length 2398

    Stream: column 3 section ROW_INDEX start: 11058677 length 2242

    Stream: column 4 section ROW_INDEX start: 11060919 length 1443

    Stream: column 5 section ROW_INDEX start: 11062362 length 1669

    Stream: column 6 section ROW_INDEX start: 11064031 length 1714

    Stream: column 7 section ROW_INDEX start: 11065745 length 2938

    Stream: column 1 section DATA start: 11068683 length 1797175

    Stream: column 2 section DATA start: 12865858 length 3283080

    Stream: column 3 section DATA start: 16148938 length 3591894

    Stream: column 3 section SECONDARY start: 19740832 length 1797

    Stream: column 4 section DATA start: 19742629 length 590201

    Stream: column 5 section DATA start: 20332830 length 123640

    Stream: column 6 section DATA start: 20456470 length 7653

    Stream: column 6 section SECONDARY start: 20464123 length 1797

    Stream: column 7 section DATA start: 20465920 length 1642895

    Stream: column 7 section SECONDARY start: 22108815 length 1797

    Encoding column 0: DIRECT

    Encoding column 1: DIRECT_V2

    Encoding column 2: DIRECT_V2

    Encoding column 3: DIRECT_V2

    Encoding column 4: DIRECT_V2

    Encoding column 5: DIRECT_V2

    Encoding column 6: DIRECT_V2

    Encoding column 7: DIRECT_V2

  Stripe: offset: 22110749 data: 10016151 rows: 1560000 tail: 138 index: 13096

    Stream: column 0 section ROW_INDEX start: 22110749 length 30

    Stream: column 1 section ROW_INDEX start: 22110779 length 1704

    Stream: column 2 section ROW_INDEX start: 22112483 length 2201

    Stream: column 3 section ROW_INDEX start: 22114684 length 2051

    Stream: column 4 section ROW_INDEX start: 22116735 length 1333

    Stream: column 5 section ROW_INDEX start: 22118068 length 1534

    Stream: column 6 section ROW_INDEX start: 22119602 length 1552

    Stream: column 7 section ROW_INDEX start: 22121154 length 2691

    Stream: column 1 section DATA start: 22123845 length 1628303

    Stream: column 2 section DATA start: 23752148 length 2980652

    Stream: column 3 section DATA start: 26732800 length 3257688

    Stream: column 3 section SECONDARY start: 29990488 length 1633

    Stream: column 4 section DATA start: 29992121 length 535364

    Stream: column 5 section DATA start: 30527485 length 112530

    Stream: column 6 section DATA start: 30640015 length 6934

    Stream: column 6 section SECONDARY start: 30646949 length 1633

    Stream: column 7 section DATA start: 30648582 length 1489781

    Stream: column 7 section SECONDARY start: 32138363 length 1633

    Encoding column 0: DIRECT

    Encoding column 1: DIRECT_V2

    Encoding column 2: DIRECT_V2

    Encoding column 3: DIRECT_V2

    Encoding column 4: DIRECT_V2

    Encoding column 5: DIRECT_V2

    Encoding column 6: DIRECT_V2

    Encoding column 7: DIRECT_V2

 

File length: 32140689 bytes

Padding length: 0 bytes

Padding ratio: 0%

 

 

 


 
