[ https://issues.apache.org/jira/browse/HIVE-23158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Panagiotis Garefalakis updated HIVE-23158:
------------------------------------------
    Description: 
The S3A filesystem client (inherited from Hadoop) supports the notion of input 
policies.
These policies tune the behaviour of the HTTP requests that are used for 
reading different file types such as TEXT or ORC.

For formats such as ORC and Parquet that do a lot of seek operations, there is 
an optimized RANDOM mode that reads files only partially instead of reading 
them in full (the default).

I suggest adding some extra logic to HiveInputFormat to make sure we optimize 
RecordReader requests for random IO when data is stored on S3A using formats 
such as ORC or Parquet.
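
A minimal sketch of the kind of check this implies, not the code from the 
attached patch. The class and method names (S3aInputPolicySketch, choosePolicy) 
and the format list are hypothetical; the only real name is the Hadoop S3A 
property fs.s3a.experimental.input.fadvise, which accepts "random". The sketch 
avoids Hadoop classes so that it stays self-contained.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper; illustrates the idea only, not the HIVE-23158 patch. */
public final class S3aInputPolicySketch {

    /** Hadoop S3A property that selects the input policy. */
    public static final String FADVISE_KEY = "fs.s3a.experimental.input.fadvise";

    /** Assumed list of seek-heavy formats, matched by a simple name. */
    private static final Set<String> RANDOM_IO_FORMATS = Set.of("orc", "parquet");

    private S3aInputPolicySketch() {}

    /**
     * Returns "random" when the path is on S3A and the format is seek-heavy.
     * Returns empty otherwise, meaning the configured policy is left untouched.
     */
    public static Optional<String> choosePolicy(URI path, String format) {
        if (path == null || format == null) {
            return Optional.empty();
        }
        boolean onS3a = "s3a".equalsIgnoreCase(path.getScheme());
        boolean seekHeavy =
            RANDOM_IO_FORMATS.contains(format.toLowerCase(Locale.ROOT));
        return (onS3a && seekHeavy) ? Optional.of("random") : Optional.empty();
    }

    public static void main(String[] args) {
        URI orcFile = URI.create("s3a://example-bucket/warehouse/t/part-0.orc");
        // A caller would copy the value into the job configuration under
        // FADVISE_KEY before the RecordReader opens the file.
        choosePolicy(orcFile, "ORC")
            .ifPresent(policy -> System.out.println(FADVISE_KEY + "=" + policy));
    }
}
```

Returning an empty Optional for non-S3A paths or text formats keeps whatever 
policy the user already configured, so only the seek-heavy case is overridden.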

  was:
S3A filesystem client (inherited by Hadoop) supports the notion of input 
policies.
These policies tune the behaviour of HTTP requests that are used for reading 
different filetypes such as TEXT or ORC.

For formats such as ORC and Parquet do a lot of seek operations, thus there is 
an optimized RANDOM mode that reads files only partially instead of fully 
(default).

I am suggesting to add some extra logic as part of HiveInputFormat to make sure 
we optimize for random IO when data is stored on S3A using formats such as ORC 
or Parquet.


> Optimize S3A recordReader policy for Random IO formats
> ------------------------------------------------------
>
>                 Key: HIVE-23158
>                 URL: https://issues.apache.org/jira/browse/HIVE-23158
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Trivial
>              Labels: pull-request-available
>         Attachments: HIVE-23158.01.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The S3A filesystem client (inherited from Hadoop) supports the notion of 
> input policies.
> These policies tune the behaviour of the HTTP requests that are used for 
> reading different file types such as TEXT or ORC.
> For formats such as ORC and Parquet that do a lot of seek operations, there 
> is an optimized RANDOM mode that reads files only partially instead of 
> reading them in full (the default).
> I suggest adding some extra logic to HiveInputFormat to make sure we 
> optimize RecordReader requests for random IO when data is stored on S3A 
> using formats such as ORC or Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
