[
https://issues.apache.org/jira/browse/IMPALA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959764#comment-17959764
]
Zoltán Borók-Nagy commented on IMPALA-8523:
-------------------------------------------
It is also possible to set the file length as an option in the builder API;
S3A then skips the HEAD check.
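A minimal sketch of how a caller might collect that length hint before handing it to the builder. The key name {{fs.option.openfile.length}} is, to my understanding, the standard option in newer Hadoop releases, but treat it as an assumption to verify against the Hadoop version in use. The class {{OpenOptionsSketch}} and the helper {{optionsFor}} are hypothetical names for illustration and are not part of Impala or Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OpenOptionsSketch {
    // Assumed key for the known-length hint; verify it against the Hadoop version in use.
    static final String LENGTH_KEY = "fs.option.openfile.length";

    /** Hypothetical helper: collect per-file open options; a negative length means "unknown". */
    static Map<String, String> optionsFor(long knownLength) {
        Map<String, String> opts = new LinkedHashMap<>();
        if (knownLength >= 0) {
            // Only pass the hint when the caller really knows the length.
            opts.put(LENGTH_KEY, Long.toString(knownLength));
        }
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(optionsFor(1048576L)); // length known: one entry
        System.out.println(optionsFor(-1L));      // length unknown: no entries
    }
}
```

Each entry would then be applied to the builder returned by {{FileSystem#openFile(Path)}} through {{opt(key, value)}} before calling {{build()}}.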
> Migrate hdfsOpen to builder-based openFile API
> ----------------------------------------------
>
> Key: IMPALA-8523
> URL: https://issues.apache.org/jira/browse/IMPALA-8523
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Sahil Takiar
> Priority: Major
>
> When opening files via libhdfs we call {{hdfsOpen}}, which ultimately calls
> {{FileSystem#open(Path f, int bufferSize)}}. As of HADOOP-15229, the
> HDFS client exposes a new API for opening files called {{openFile}}. The
> new API has two advantages: (1) it can specify file-specific
> configuration values in a builder-based manner (see {{o.a.h.fs.FSBuilder}}
> for details), and (2) it can open files asynchronously (see
> {{o.a.h.fs.FutureDataInputStreamBuilder}} for details).
> The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS
> open calls). To avoid overlap between IMPALA-7738 and the async file opens in
> {{openFile}}, HADOOP-15691 can be used to check which filesystems open files
> asynchronously and which ones don't (currently only S3A opens files
> asynchronously).
> The main use case for the new {{openFile}} API is Impala-S3 performance.
> Performance benchmarks have shown that setting
> {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet files can
> significantly improve performance. However, this setting also adversely
> affects scans of non-splittable file formats such as gzipped files (see
> HADOOP-13203). One solution is simply to document that setting
> {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet improves
> performance. A better solution would be to use the new {{openFile}}
> API to specify different values of fadvise depending on the file type.
> This work depends on exposing the new {{openFile}} API via libhdfs
> (HDFS-14478).
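The per-file-type idea in the description can be sketched roughly as follows. This is only an illustration under stated assumptions: the class {{FadviseChooser}} and the helper {{policyFor}} are hypothetical, and the extension-to-policy mapping (Parquet to random, gzip to sequential, everything else to normal) is a guess at a reasonable default, not Impala's actual behavior.

```java
import java.util.Locale;

public class FadviseChooser {
    // Configuration key named in the issue description.
    static final String FADVISE_KEY = "fs.s3a.experimental.input.fadvise";

    /** Hypothetical helper: choose an S3A input policy from the file name. */
    static String policyFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".parquet")) {
            return "random";      // columnar format, read with many seeks
        }
        if (name.endsWith(".gz")) {
            return "sequential";  // non-splittable, read front to back
        }
        return "normal";          // assumed fallback for other file types
    }

    public static void main(String[] args) {
        for (String f : new String[] {"part-0.parquet", "logs.csv.gz", "data.txt"}) {
            System.out.println(f + " -> " + FADVISE_KEY + "=" + policyFor(f));
        }
        // With the Hadoop client on the classpath, the chosen value would be passed as:
        //   fs.openFile(path).opt(FADVISE_KEY, policyFor(path.getName())).build()
        // where build() returns a CompletableFuture<FSDataInputStream>.
    }
}
```

Using {{opt}} rather than {{must}} keeps the setting advisory, so filesystems that do not understand the key can ignore it.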