Thanks Alan.
I get your point about parallel I/O, but what about data transfer over the
network? Does all the conversion happen locally (I mean, will the blocks of
text data on a specific node be stored as ORC on the same node), or does
some repartitioning need to happen?
My main question is about data exchange over the network. If it happens, then
more data and more nodes will result in more transfer/partitioning time.
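
To frame the question concretely, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the disk and network rates, and the idea that a repartitioning step would send a (nodes - 1) / nodes share of each node's data to other nodes over that node's own link. It is not a description of what Hive actually does for this INSERT.

```python
def load_time_s(total_gb, nodes, disk_gb_per_s=0.1, net_gb_per_s=0.1, shuffle=False):
    """Toy wall-clock estimate for a parallel load; all rates are made-up placeholders."""
    per_node_gb = total_gb / nodes            # each node handles an equal share
    local_io_s = per_node_gb / disk_gb_per_s  # local read/write, done in parallel
    if not shuffle:
        return local_io_s
    # Assumed repartitioning: only rows hashed to *other* nodes cross the network.
    remote_gb = per_node_gb * (nodes - 1) / nodes
    return local_io_s + remote_gb / net_gb_per_s


# Doubling data and nodes together keeps the per-node share constant.
print(load_time_s(100, 10), load_time_s(200, 20))
print(load_time_s(100, 10, shuffle=True), load_time_s(200, 20, shuffle=True))
```

In this toy model the no-shuffle time is identical at both scales, while the shuffle case grows slightly because a larger fraction of each node's data is remote. Whether the real job has such a network term at all is exactly what I am asking.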

Thanks.

On Tue, Nov 17, 2015 at 9:39 AM, Alan Gates <alanfga...@gmail.com> wrote:

> The reads and writes both happen in parallel, so as more nodes are
> available for read and write, at least in this case, the time stays roughly
> the same.
>
> Alan.
>
> James Pirz <james.p...@gmail.com>
> November 16, 2015 at 21:23
> Hi,
>
> I am using Hive 1.2 with ORC tables on Hadoop 2.6 on a cluster.
> I load data into an ORC table by reading it from an external table
> over raw text files, using an INSERT statement:
>
> INSERT INTO TABLE myorctab SELECT * FROM mytxttab;
>
> I ran a simple scale-up test to find out how the loading time increases as
> I double both the size of the data and the number of nodes. I found that the
> total time remains more or less the same (it scales properly).
>
> I am just wondering why this is happening. Naively, I would think that if I
> double the number of partitions and the size of the data, the time should
> also roughly double, as the system needs to partition twice the amount of
> data as before among twice the number of partitions. Am I missing something
> here?
>
> Thanks
>
>
