Please find answers inline.

Thanks
Prasanth

_____________________________
From: Alberto Ramón <a.ramonporto...@gmail.com>
Sent: Friday, March 31, 2017 9:32 AM
Subject: ORC slipt
To: <user@hive.apache.org>

Some doubts about ORC:

1- Is hive.exec.orc.default.buffer.size used for read or write?

It is configurable only during write. The writer records this buffer size in the file footer, and readers use that value during decompression.

2- Is orc.stripe.size compressed or uncompressed?

Both. The stripe size is essentially the sum of all buffers of all columns (plus the dictionary size) held in memory.

3- Must orc.stripe.size be a multiple of the HDFS block size?

It is optimal to make the HDFS block size a multiple of the stripe size. Otherwise the writer will adjust the size of the last stripe within a block so that it does not straddle the HDFS block boundary, or it will pad the remaining space if that space is less than 5% of the block size. Note that the HDFS block size used for ORC files can be configured via orc.block.size and is independent of the cluster-wide block size. The default stripe size is 64 MB and the default block size is 256 MB.

4- When reading an ORC file, does the number of mappers depend on the number of HDFS blocks or the number of stripes?

It depends. If predicate pushdown is enabled, each split could have one or more stripes. If predicate pushdown is disabled, adjacent stripes are grouped together, up to the HDFS block size, to form a single split.

Let's say we have 3 stripes. With predicate pushdown enabled, if the 2nd stripe does not satisfy the predicate, then the 1st and 3rd stripes will become 2 separate splits and the 2nd stripe will be ignored. If predicate pushdown is disabled, all 3 stripes will together form a single split, because their combined size is below the block size.

The number of splits will also vary based on the input format and the execution engine.

5- Is hive.exec.orc.split.strategy used for read?

Yes.
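
To make the 3-stripe example in answer 4 concrete, here is a minimal Python sketch of the grouping rule as described above. It is only a toy model, not Hive's actual split generation code: the function name plan_splits, its parameters, and the rule "close the current split when adding a stripe would exceed the block size" are assumptions made for illustration.

```python
from typing import List, Optional

MB = 1024 * 1024


def plan_splits(
    stripe_sizes: List[int],
    block_size: int,
    keep: Optional[List[bool]] = None,
) -> List[List[int]]:
    """Group stripe indices into splits (hypothetical helper, toy model).

    stripe_sizes: size of each stripe in bytes, in file order.
    block_size:   HDFS block size in bytes (the grouping limit).
    keep:         per-stripe result of predicate pushdown; None means
                  pushdown is disabled, so every stripe is kept.
    """
    if keep is None:
        keep = [True] * len(stripe_sizes)

    splits: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0

    for index, (size, wanted) in enumerate(zip(stripe_sizes, keep)):
        if not wanted:
            # An eliminated stripe is skipped and ends the current group,
            # so only adjacent surviving stripes share a split.
            if current:
                splits.append(current)
            current, current_bytes = [], 0
            continue
        if current and current_bytes + size > block_size:
            # Adding this stripe would pass the block size: start a new split.
            splits.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size

    if current:
        splits.append(current)
    return splits


stripes = [64 * MB] * 3  # three stripes at the default stripe size

# Pushdown disabled: 192 MB fits within one 256 MB block.
print(plan_splits(stripes, 256 * MB))

# Pushdown enabled, 2nd stripe does not satisfy the predicate.
print(plan_splits(stripes, 256 * MB, keep=[True, False, True]))
```

With the defaults mentioned in answer 3, the first call groups all three stripes into one split and the second call yields two single-stripe splits. The real number of splits still depends on the input format, the execution engine and hive.exec.orc.split.strategy.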