[jira] [Updated] (HIVE-6234) Implement fast vectorized InputFormat extension for text files

Eric Hanson (JIRA) Mon, 20 Jan 2014 14:06:06 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eric Hanson updated HIVE-6234:
------------------------------

    Description: 
Implement support for vectorized scan input of text files (plain text with 
configurable record and field separators). This should work for CSV files, tab 
delimited files, etc. 

The goal is to provide high-performance reading of these files using vectorized 
scans, and to do so as an extension of existing Hive. Then, if vectorized query 
execution is enabled, existing tables based on text files will benefit 
immediately, without the need to switch to a different input format. After 
upgrading to Hive bits that support this, faster, vectorized processing over 
existing text tables should just work when vectorization is enabled.

Another goal is to go beyond simply layering a vectorized row batch iterator 
on top of the existing row iterator. It should be possible to, say, read a 
chunk of data into a byte buffer (several thousand or even a million rows), and 
then read data from it directly into vectorized row batches. Object creation 
should be minimized to save allocation time and GC overhead. If it is possible 
to save CPU for values like dates and numbers by caching the translation from 
string to the final data type, that should ideally be implemented.

  was:
Implement support for vectorized scan input of text files (plain text with 
configurable record and fields separators). This should work for CSV files, tab 
delimited files, etc. 

The goal is to provide high-performance reading of these files using vectorized 
scans, and also to do it as an extension of existing Hive. Then, if vectorized 
query is enabled, existing tables based on text files will be able to benefit 
immediately without the need to use a different input format.

Another goal is to go beyond a simple layering of vectorized row batch iterator 
over the top of the existing row iterator. It should be possible to, say, read 
a chunk of data into a byte buffer (several thousand or even million rows), and 
then read data from it into vectorized row batches directly. Object creations 
should be minimized to save allocation time and GC overhead. If it is possible 
to save CPU for values like dates and numbers by caching the translation from 
string to the final data type, that should ideally be implemented.


> Implement fast vectorized InputFormat extension for text files
> --------------------------------------------------------------
>
>                 Key: HIVE-6234
>                 URL: https://issues.apache.org/jira/browse/HIVE-6234
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>            Assignee: Eric Hanson
>
> Implement support for vectorized scan input of text files (plain text with 
> configurable record and field separators). This should work for CSV files, 
> tab delimited files, etc. 
>
> The goal is to provide high-performance reading of these files using 
> vectorized scans, and to do so as an extension of existing Hive. Then, if 
> vectorized query execution is enabled, existing tables based on text files 
> will benefit immediately, without the need to switch to a different input 
> format. After upgrading to Hive bits that support this, faster, vectorized 
> processing over existing text tables should just work when vectorization is 
> enabled.
>
> Another goal is to go beyond simply layering a vectorized row batch iterator 
> on top of the existing row iterator. It should be possible to, say, read a 
> chunk of data into a byte buffer (several thousand or even a million rows), 
> and then read data from it directly into vectorized row batches. Object 
> creation should be minimized to save allocation time and GC overhead. If it 
> is possible to save CPU for values like dates and numbers by caching the 
> translation from string to the final data type, that should ideally be 
> implemented.
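
The byte-buffer idea in the description can be illustrated with a minimal 
sketch. It is not Hive code and not the design proposed in this issue: the 
class TextBatchSketch, the LongBatch type, and the fillBatch method are 
hypothetical names invented for illustration, every column is assumed to hold 
an integer, and the end of the buffer is treated as the end of a record. It 
parses delimited bytes straight into column arrays without creating a String 
or a row object per record. The caching of string-to-type translations 
mentioned above is not shown.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; all names here are hypothetical, not Hive APIs. */
public class TextBatchSketch {

    /** Minimal stand-in for a vectorized batch: one long[] per column. */
    static final class LongBatch {
        final long[][] cols;       // cols[column][row]
        final boolean[][] isNull;  // isNull[column][row]
        int size;                  // number of valid rows in the batch

        LongBatch(int numCols, int capacity) {
            cols = new long[numCols][capacity];
            isNull = new boolean[numCols][capacity];
        }
    }

    /**
     * Parses records from buf[start, end) into the batch until the batch is
     * full or the buffer is exhausted. Returns the offset of the first
     * unconsumed byte. Assumption: the end of the buffer ends a record.
     */
    static int fillBatch(byte[] buf, int start, int end,
                         byte fieldSep, byte recordSep, LongBatch batch) {
        int numCols = batch.cols.length;
        int capacity = batch.cols[0].length;
        int pos = start;
        batch.size = 0;
        while (pos < end && batch.size < capacity) {
            int row = batch.size;
            int col = 0;
            int fieldStart = pos;
            while (true) {
                // Treat running off the buffer like hitting a record separator.
                byte b = pos < end ? buf[pos] : recordSep;
                if (b == fieldSep || b == recordSep) {
                    if (col < numCols) {
                        parseLong(buf, fieldStart, pos, batch, col, row);
                    }
                    col++;
                    pos++;
                    fieldStart = pos;
                    if (b == recordSep) {
                        break;
                    }
                } else {
                    pos++;
                }
            }
            // Columns missing from a short record become nulls.
            for (; col < numCols; col++) {
                batch.isNull[col][row] = true;
            }
            batch.size++;
        }
        return Math.min(pos, end);
    }

    /** Parses ASCII digits in place; empty or malformed fields become null.
     *  Overflow is not checked in this sketch. */
    private static void parseLong(byte[] buf, int from, int to,
                                  LongBatch batch, int col, int row) {
        int i = from;
        boolean negative = i < to && buf[i] == '-';
        if (negative) {
            i++;
        }
        boolean valid = i < to;
        long value = 0;
        for (; valid && i < to; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                valid = false;
            } else {
                value = value * 10 + digit;
            }
        }
        batch.isNull[col][row] = !valid;
        batch.cols[col][row] = valid ? (negative ? -value : value) : 0L;
    }

    public static void main(String[] args) {
        byte[] data = "1,20,300\n4,,-6\n7,8\n".getBytes(StandardCharsets.US_ASCII);
        LongBatch batch = new LongBatch(3, 2);  // reused across calls
        int pos = 0;
        while (pos < data.length) {
            pos = fillBatch(data, pos, data.length, (byte) ',', (byte) '\n', batch);
            for (int r = 0; r < batch.size; r++) {
                StringBuilder sb = new StringBuilder();
                for (int c = 0; c < 3; c++) {
                    if (c > 0) sb.append(' ');
                    sb.append(batch.isNull[c][r] ? "null" : Long.toString(batch.cols[c][r]));
                }
                System.out.println(sb);
            }
        }
    }
}
```

The single LongBatch is reused for every call, so the only allocations are 
made once up front. A real reader would also have to carry a partial record 
over from one buffer to the next and support types other than integers.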



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
