Thanks, Alan. I get your point about parallel I/O. But what about data transfer over the network? Does all the conversion happen locally (that is, are the blocks of text data on a given node written as ORC on that same node), or does some repartitioning need to happen? My main question is about data exchange over the network: if it happens, then more data and more nodes will mean more transfer and partitioning time.
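To make the two cases concrete, here is a rough back-of-envelope sketch, not a description of what Hive actually does. The function name `load_time` and the per-node throughput figures are hypothetical. It assumes data is spread evenly across nodes, every node has the same disk and network rate, and the network is not oversubscribed. It ignores HDFS replication traffic.

```python
def load_time(total_mb, nodes, disk_mb_s=100, net_mb_s=100, shuffle=False):
    """Rough wall-clock estimate (seconds) for the text-to-ORC load.

    disk_mb_s and net_mb_s are hypothetical per-node rates, not measurements.
    """
    per_node_mb = total_mb / nodes
    # Every node converts its own share at the same time as the others,
    # so the local I/O time depends only on the per-node share.
    io_time = per_node_mb / disk_mb_s
    if not shuffle:
        return io_time
    # Assumed all-to-all repartition: each node keeps 1/nodes of its share
    # and sends the rest, again in parallel with every other node.
    sent_mb = per_node_mb * (nodes - 1) / nodes
    return io_time + sent_mb / net_mb_s


# Doubling data and nodes together, with no network exchange:
print(load_time(1_000_000, 10))                 # 1000.0
print(load_time(2_000_000, 20))                 # 1000.0

# The same scale-up with a full repartition step:
print(load_time(1_000_000, 10, shuffle=True))   # 1900.0
print(load_time(2_000_000, 20, shuffle=True))   # 1950.0
```

Under these assumptions the purely local case stays flat, and even the repartitioning case grows only slightly, because each node still sends at most its own share. Whether the real job has such a step at all is exactly what I am asking.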
Thanks.

On Tue, Nov 17, 2015 at 9:39 AM, Alan Gates <alanfga...@gmail.com> wrote:

> The reads and writes both happen in parallel, so as more nodes are
> available for read and write, at least in this case, the time stays roughly
> the same.
>
> Alan.
>
> James Pirz <james.p...@gmail.com>
> November 16, 2015 at 21:23
> Hi,
>
> I am using Hive 1.2 with ORC tables on Hadoop 2.6 on a cluster.
> I load data into an ORC table by reading the data from an external table
> on raw text files and using insert statement:
>
> INSERT into TABLE myorctab SELECT * FROM mytxttab;
>
> I ran a simple scale-up test to find out how the loading time increases as
> I double the size of data and nodes. I realized that the total time remains
> more or less the same (scales properly).
>
> I am just wondering why this is happening, as naively I think if I make
> the number of partitions and size of data double, the time should also be
> roughly double as the system needs to partition twice amount of data as it
> was doing before among twice number of partitions. Am I missing something
> here ?
>
> Thnx