[ 
https://issues.apache.org/jira/browse/IMPALA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959764#comment-17959764
 ] 

Zoltán Borók-Nagy commented on IMPALA-8523:
-------------------------------------------

It is also possible to set the file length as an option in the builder API; 
S3A then skips the HEAD check.
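
A minimal sketch of that idea, not Impala's actual implementation. It assumes a Hadoop release that defines the standard option key fs.option.openfile.length (believed to be 3.3.5 and later). The class and method names are hypothetical, and the Hadoop calls appear only in comments so that the snippet compiles without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: collects per-file options for an openFile() builder. */
class KnownLengthOpenOptions {
    // Option key assumed to be defined by newer Hadoop releases.
    static final String OPENFILE_LENGTH = "fs.option.openfile.length";

    /** Returns the options to pass when the caller already knows the file length. */
    static Map<String, String> forLength(long lengthBytes) {
        if (lengthBytes < 0) {
            throw new IllegalArgumentException("length must be non-negative: " + lengthBytes);
        }
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(OPENFILE_LENGTH, Long.toString(lengthBytes));
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts = forLength(1048576L);
        // With Hadoop available, the entries would be applied roughly like this:
        //   FutureDataInputStreamBuilder b = fs.openFile(path);
        //   for (Map.Entry<String, String> e : opts.entrySet()) {
        //       b.opt(e.getKey(), e.getValue());
        //   }
        //   FSDataInputStream in = b.build().get();
        System.out.println(opts);
    }
}
```

The length passed here would normally come from metadata the caller already holds, such as a cached file listing.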

> Migrate hdfsOpen to builder-based openFile API
> ----------------------------------------------
>
>                 Key: IMPALA-8523
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8523
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Sahil Takiar
>            Priority: Major
>
> When opening files via libhdfs we call {{hdfsOpen}} which ultimately calls 
> {{FileSystem#open(Path f, int bufferSize)}}. As of HADOOP-15229, the 
> HDFS-client now exposes a new API for opening files called {{openFile}}. The 
> new API has two advantages: (1) it can specify file-specific 
> configuration values in a builder-based manner (see {{o.a.h.fs.FSBuilder}} 
> for details), and (2) it can open files asynchronously (see 
> {{o.a.h.fs.FutureDataInputStreamBuilder}} for details).
> The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS 
> open calls). To avoid overlap between IMPALA-7738 and the async file opens in 
> {{openFile}}, HADOOP-15691 can be used to check which filesystems open files 
> asynchronously and which ones don't (currently only S3A opens files 
> asynchronously).
> The main use case for the new {{openFile}} API is Impala-S3 performance. 
> Performance benchmarks have shown that setting 
> {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet files can 
> significantly improve performance; however, this setting also adversely 
> affects scans of non-splittable file formats such as gzipped files (see 
> HADOOP-13203). One solution to this issue is to just document that setting 
> {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet improves 
> performance; however, a better solution would be to use the new {{openFile}} 
> API to specify different values of fadvise depending on the file type.
> This work is dependent on exposing the new {{openFile}} API via libhdfs 
> (HDFS-14478).
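
The per-file-type fadvise selection described in the issue could look roughly like the sketch below. This is an illustration under stated assumptions, not Impala's actual logic: the class name is hypothetical, choosing by file extension is a simplification (a real scanner would more likely use the table's file format metadata), and the fallback to "normal" for other files is an assumption. The Hadoop calls appear only in comments so that the snippet compiles without Hadoop on the classpath.

```java
import java.util.Locale;

/** Hypothetical helper: picks an S3A input policy for a given file. */
class FadviseChooser {
    // Option key named in the issue description.
    static final String FADVISE_KEY = "fs.s3a.experimental.input.fadvise";

    /** Returns the fadvise value to request for the given file name. */
    static String policyFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".parquet")) {
            // Columnar format read with many seeks: random access helps.
            return "random";
        }
        if (name.endsWith(".gz")) {
            // Non-splittable stream read front to back.
            return "sequential";
        }
        // Assumed default for everything else.
        return "normal";
    }

    public static void main(String[] args) {
        String file = "part-00000.parquet";
        // With Hadoop available, the policy would be applied roughly like this:
        //   FSDataInputStream in = fs.openFile(path)
        //       .opt(FADVISE_KEY, policyFor(path.getName()))
        //       .build()
        //       .get();
        System.out.println(FADVISE_KEY + "=" + policyFor(file));
    }
}
```

Setting the option per file through the builder leaves the cluster-wide default untouched, so gzipped scans are not penalized when Parquet scans request random access.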


