Re: Storage handler guidance

Dean Arnold Tue, 24 Jun 2014 19:57:30 -0700

Never mind: after digging through the sources I found the relevant bits to
answer my own question. The InputFormat generates arbitrary InputSplits, which
define how the input data source is partitioned, and OutputFormats are simply
instantiated in the mappers/reducers, so writes are partitioned implicitly.
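
For anyone finding this thread later, here is a rough sketch of the read-side
idea. It is plain Java and deliberately does not use Hadoop's classes:
SplitSketch, RangeSplit and computeSplits are hypothetical names made up for
illustration. In Hadoop's mapred API the corresponding hook is
InputFormat.getSplits(JobConf, int), and the framework normally runs one map
task per returned split, which is where the read parallelism comes from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, NOT Hadoop's API: shows how a source of totalRows rows
// might be carved into independent ranges, one per parallel reader.
final class SplitSketch {

    /** A half-open row range [start, start + length) handled by one task. */
    static final class RangeSplit {
        final long start;
        final long length;

        RangeSplit(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "RangeSplit[start=" + start + ", length=" + length + "]";
        }
    }

    /**
     * Divides totalRows into at most numSplits contiguous ranges whose sizes
     * differ by at most one row. Empty ranges are never returned.
     */
    static List<RangeSplit> computeSplits(long totalRows, int numSplits) {
        if (totalRows < 0 || numSplits <= 0) {
            throw new IllegalArgumentException(
                "totalRows must be >= 0 and numSplits must be > 0");
        }
        List<RangeSplit> splits = new ArrayList<>();
        long base = totalRows / numSplits;      // minimum rows per split
        long remainder = totalRows % numSplits; // first 'remainder' splits get one extra row
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long length = base + (i < remainder ? 1 : 0);
            if (length == 0) {
                break; // fewer rows than requested splits
            }
            splits.add(new RangeSplit(start, length));
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Each printed range would be read by its own task in parallel.
        for (RangeSplit split : computeSplits(10, 3)) {
            System.out.println(split);
        }
    }
}
```

A real storage handler would presumably derive its ranges from whatever
partitioning the external source exposes (key ranges, shards, and so on)
rather than from a simple row count.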




On Thu, Jun 19, 2014 at 1:01 PM, Dean Arnold <[email protected]> wrote:

> I haven't been able to find an explicit reference, so I'm hoping someone
> can clarify for me:
>
> Do storage handler reads/writes get executed as parallel resources, i.e.,
> in an INSERT...SELECT... from a storage handler, will multiple storage
> handler instances be created to read from the data source (using
> partitioning or some other scheme) ?
>
> Likewise, will INSERT into a storage handler be executed using multiple
> streams ?
>
> FYI: I need to stream data into/out of Hive from/to parallel-efficient
> data sources, and would prefer to avoid landing everything in HDFS first,
> especially if the ultimate Hive file format is ORC, i.e., avoid multiple file
> copies when moving terabytes between data sources and sinks. The storage
> handler mechanism seems a very elegant solution *if* it supports true
> parallel stream operations.
>
> TIA,
> Dean
>
