Yin - Fantastic! That is exactly the type of explanation of settings I'd like to see: not just what a setting does, but the tradeoffs and how it is applied in the real world. Have you played with the stride length (orc.row.index.stride) at all?

On Wed, Nov 13, 2013 at 1:13 PM, Yin Huai <huaiyin....@gmail.com> wrote:

> Hi John,
>
> Here is my experience on the stripe size. For a given table, when the
> stripe size is increased, the size of a column in a stripe increases, which
> means the ORC reader can read a column from disks in a more efficient way
> because the reader can sequentially read more data (assuming the reader and
> the HDFS block are co-located). But, a larger stripe size may decrease the
> number of concurrent Map tasks reading an ORC file because a Map task needs
> to process at least one stripe (it seems a stripe is not splittable right now).
> If you can get enough degree of parallelism, I think increasing the stripe
> size generally gives you better data reading efficiency in one task.
> However, on HDDs, the benefit from increasing the stripe size on data
> reading efficiency in a Map task is getting smaller with the increase of
> the stripe size. So, for a table with only a few columns (assuming a single
> ORC file is used), using a smaller stripe size may not significantly affect
> data reading efficiency in a task, and you can potentially have more
> concurrent tasks to read this ORC file. So, I think you need to trade off
> the data reading efficiency in a single task (larger stripe size -> better
> data reading efficiency in a task) and the degree of parallelism (smaller
> stripe size -> more concurrent tasks to read an ORC file) when determining
> the right stripe size.
>
> btw, I have a paper studying file formats and it has some related
> content. Here is the link:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-13-5.pdf
>
> Thanks,
>
> Yin
>
>
> On Tue, Nov 12, 2013 at 8:51 PM, Lefty Leverenz
> <leftylever...@gmail.com> wrote:
>
>> If you get some useful advice, let's improve the doc.
>>
>> -- Lefty
>>
>>
>> On Tue, Nov 12, 2013 at 6:15 PM, John Omernik <j...@omernik.com> wrote:
>>
>>> I am looking for guidance (read examples) on tuning ORC settings for my
>>> data. I see the documentation that shows the defaults, as well as a brief
>>> description of what it is. What I am looking for is some examples of
>>> things to try. *Note: I understand that nobody wants to make sweeping
>>> declarations like "set this setting" without knowing the data* That said, I
>>> would love to see some examples, specifically around:
>>>
>>> orc.row.index.stride
>>>
>>> orc.compress.size
>>>
>>> orc.stripe.size
>>>
>>>
>>> For example, I'd love to see some statements like:
>>>
>>>
>>> If your data has lots of columns of small data, and you'd like better x,
>>> try changing y setting because this allows hive to do z when querying.
>>>
>>>
>>> If your data has few columns of large data, try changing y and this
>>> allows hive to do z while querying.
>>>
>>>
>>> It would be really neat to see some examples so we can get in and tune
>>> our data. Right now, everything is a crapshoot for me, and I don't know if
>>> there are detrimental effects that may make themselves known later.
>>>
>>>
>>> Any input would be welcome.
>>>
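
For anyone who wants to put rough numbers on the stripe size tradeoff Yin describes, here is a minimal back-of-the-envelope sketch in Python. It is not anything from Hive or ORC itself: the function name stripe_tradeoff, the 1 GB file, and the 4-column table are all made up for illustration, and it assumes every column takes an equal share of each stripe and ignores compression, which real data will not honor. It only encodes the two statements from the thread: a Map task needs at least one whole stripe (so the stripe count caps parallelism), and a bigger stripe means a bigger contiguous chunk per column.

```python
import math

MB = 1024 * 1024


def stripe_tradeoff(file_bytes, stripe_bytes, num_columns):
    """Rough model of the stripe-size tradeoff (illustrative only).

    Returns (max_concurrent_tasks, column_chunk_bytes).
    """
    # A Map task needs at least one whole stripe, so the number of
    # stripes in the file is an upper bound on concurrent readers.
    max_concurrent_tasks = math.ceil(file_bytes / stripe_bytes)

    # ASSUMPTION: all columns are equally wide, so each column gets an
    # equal slice of the stripe. Larger slices mean longer sequential reads.
    column_chunk_bytes = stripe_bytes // num_columns

    return max_concurrent_tasks, column_chunk_bytes


if __name__ == "__main__":
    # Hypothetical single 1 GB ORC file with 4 columns.
    for stripe_mb in (64, 256):
        tasks, chunk = stripe_tradeoff(1024 * MB, stripe_mb * MB, num_columns=4)
        print(f"stripe={stripe_mb} MB -> at most {tasks} tasks, "
              f"~{chunk // MB} MB sequential read per column per stripe")
```

Under these assumptions, a wide table shrinks the per-column chunk quickly, which is where a larger stripe should help most; a table with only a few columns already gets large per-column chunks from a small stripe, so the extra parallelism may be the better buy, as Yin suggests.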