Never mind, after digging through the source code I found the relevant bits to answer my own question. The InputFormat generates arbitrary InputSplits, which define how the input data source is partitioned, and OutputFormats are simply instantiated inside the mappers/reducers, so writes are implicitly partitioned by task.
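
To make the split idea concrete, here is a rough plain-Java sketch of the kind of planning a storage handler's InputFormat could do when asked for splits. It deliberately uses no Hadoop or Hive classes: `RangeSplit` and `RangeSplitPlanner` are hypothetical names of mine, and the assumption that the source can be addressed as a contiguous row range is mine as well. A real handler would return Hadoop InputSplit objects from its InputFormat, and each split would typically be read by its own task.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an InputSplit: a half-open row range [start, end). */
final class RangeSplit {
    final long start;
    final long end;

    RangeSplit(long start, long end) {
        this.start = start;
        this.end = end;
    }

    long length() {
        return end - start;
    }
}

/** Hypothetical planner; models the partitioning decision only, not the Hadoop API. */
final class RangeSplitPlanner {

    /**
     * Divides totalRows into at most numSplits contiguous ranges whose sizes
     * differ by at most one row. Empty ranges are never emitted, so fewer
     * splits come back when totalRows is smaller than numSplits.
     */
    static List<RangeSplit> getSplits(long totalRows, int numSplits) {
        if (totalRows < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and numSplits > 0");
        }
        List<RangeSplit> splits = new ArrayList<>();
        long base = totalRows / numSplits;       // minimum rows per split
        long remainder = totalRows % numSplits;  // first 'remainder' splits get one extra row
        long start = 0;
        for (int i = 0; i < numSplits && start < totalRows; i++) {
            long size = base + (i < remainder ? 1 : 0);
            splits.add(new RangeSplit(start, start + size));
            start += size;
        }
        return splits;
    }
}
```

Calling `RangeSplitPlanner.getSplits(10, 3)` yields the ranges [0,4), [4,7) and [7,10), each of which a separate reader could pull from the data source in parallel. The write side needs no equivalent planner under this model, since each task just opens its own writer.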

On Thu, Jun 19, 2014 at 1:01 PM, Dean Arnold <[email protected]> wrote:

> I haven't been able to find an explicit reference, hoping some one can
> clarify for me:
>
> Do storage handler reads/write get executed as parallel resources, i.e.,
> in an INSERT...SELECT... from a storage handler, will multiple storage
> handler instances be created to read from the data source (using
> partitioning or some other scheme) ?
>
> Likewise, will INSERT into a storage handler be executed using multiple
> streams ?
>
> FYI: I need to stream data into/out of Hive from/to parallel-efficient
> data sources, and would prefer to avoid landing everything in HDFS 1st, esp
> if the ultimate Hive file format is ORC, i.e, avoid multiple file copies,
> esp when moving terabytes between data sources and sinks. The storage
> handler mechanism seems a very elegant solution *if* it supports true
> parallel stream operations.
>
> TIA,
> Dean
