[jira] [Resolved] (HADOOP-19199) Include FileStatus when opening a file from FileSystem

Steve Loughran (Jira) Mon, 10 Jun 2024 07:08:10 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-19199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran resolved HADOOP-19199.
-------------------------------------
    Resolution: Duplicate

Closing as a duplicate of HADOOP-15229. 

I absolutely agree the HEAD requests are needless, which is why we added exactly 
the feature you wanted in 2019, *five years ago*. And with HADOOP-16202, you only 
need to pass in the file length, so if you can store that in your manifests, 
then you can skip the HEAD call (s3a; abfs still needs it).

The problem we have is therefore not that the Hadoop library lacks this; it is 
that libraries and applications haven't taken it up. Why not? Because they want 
to compile against versions of Hadoop that are over 10 years old, which means 
that all the improvements we have made are wasted. Although private forks can do 
this, it's very hard to get this taken up consistently, and people like you 
and me suffer in wasted time and money.

What can be done? Well, I have concluded that trying to get the projects to 
upgrade doesn't work, and waiting for the libraries to "get up-to-date" is a 
moving target as we are always trying to improve in this area. Instead, all our 
new work is being targeted at being "reflection-friendly" and expecting the 
initial take-up to be through reflection. In HADOOP-19131 I am exporting the 
existing openFile() API (which uses a builder and returns an asynchronously 
evaluated input stream) as an easy-to-reflect function:

{code}
public static FSDataInputStream fileSystem_openFile(
      final FileSystem fs,
      final Path path,
      final String policy,
      final FileStatus status,
      final Long length,
      final Map<String, String> options) throws IOException {
{code}
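
As a rough illustration of what take-up through reflection might look like on 
the library side, here is a minimal sketch that binds to the function above by 
name, so the calling code still compiles against an old Hadoop release. The 
helper names (ReflectiveBinding, findStatic, findOpenFile) are hypothetical, and 
the owning class org.apache.hadoop.io.wrappedio.WrappedIO is an assumption about 
where HADOOP-19131 puts the method; check the patch before relying on it.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;

/**
 * Sketch: bind to a static method by name at runtime, so a library has no
 * compile-time dependency on a newer Hadoop release.
 */
final class ReflectiveBinding {

  private ReflectiveBinding() {
  }

  /**
   * Look up a public static method. Parameter types are passed as class
   * names so that they are also resolved lazily.
   *
   * @return the method, or empty if the class, one of the parameter types,
   *         or the method itself is missing (or the method is not static).
   */
  static Optional<Method> findStatic(String className, String methodName,
      String... parameterTypeNames) {
    try {
      Class<?> owner = Class.forName(className);
      Class<?>[] parameterTypes = new Class<?>[parameterTypeNames.length];
      for (int i = 0; i < parameterTypeNames.length; i++) {
        parameterTypes[i] = Class.forName(parameterTypeNames[i]);
      }
      Method method = owner.getMethod(methodName, parameterTypes);
      return Modifier.isStatic(method.getModifiers())
          ? Optional.of(method)
          : Optional.empty();
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // Old Hadoop release (or no Hadoop at all) on the classpath.
      return Optional.empty();
    }
  }

  /**
   * ASSUMPTION: the owning class name is a guess for illustration only.
   * The method name and parameter order follow the signature quoted above.
   */
  static Optional<Method> findOpenFile() {
    return findStatic(
        "org.apache.hadoop.io.wrappedio.WrappedIO",
        "fileSystem_openFile",
        "org.apache.hadoop.fs.FileSystem",
        "org.apache.hadoop.fs.Path",
        "java.lang.String",
        "org.apache.hadoop.fs.FileStatus",
        "java.lang.Long",
        "java.util.Map");
  }
}
```

If findOpenFile() returns empty, the caller would fall back to the classic 
FileSystem.open(Path); otherwise it would call 
Method.invoke(null, fs, path, policy, status, length, options) and unwrap any 
InvocationTargetException back into an IOException.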

The "policy" is also critical as it tells the storage layer what access policy 
you want, such as random or sequential. I'm going to add an explicit "parquet" 
policy here too, which hints to the library that footer caching would be good.
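
For illustration, the hints discussed here could be assembled like this. The 
two option keys are the ones the openFile() builder has accepted since 
HADOOP-16202; the class and method names are hypothetical, "parquet" is the 
proposed policy and not yet a released one, and the assumption that a store 
falls back to the next policy in a comma-separated list when it does not 
recognise the first should be checked against the filesystem specification.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: collect openFile() hints as plain string options. */
final class OpenFileHints {

  /** Read policy option key of the openFile() builder. */
  static final String READ_POLICY = "fs.option.openfile.read.policy";

  /** File length option key of the openFile() builder. */
  static final String LENGTH = "fs.option.openfile.length";

  private OpenFileHints() {
  }

  /**
   * Build the option map.
   *
   * @param policies preferred policies, most specific first
   * @param knownLength file length from a manifest, or negative if unknown
   */
  static Map<String, String> options(List<String> policies, long knownLength) {
    Map<String, String> options = new LinkedHashMap<>();
    // ASSUMPTION: a comma-separated list is read as "first recognised wins".
    options.put(READ_POLICY, String.join(",", policies));
    if (knownLength >= 0) {
      // Only declare a length the caller actually knows.
      options.put(LENGTH, Long.toString(knownLength));
    }
    return options;
  }
}
```

A Parquet reader holding the length from its manifest might call 
OpenFileHints.options(Arrays.asList("parquet", "random"), length) and pass each 
entry to the builder's opt(key, value) method.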

What can you do then? Other than just waiting for this to happen? Help us get 
this through the stack. We need it in: parquet, iceberg, spark, avro. 

Can you start by reviewing HADOOP-19131 and seeing how well you think it will 
integrate? *Anything you can do in terms of Proof of Concept PRs using this 
patch* would help us identify problems before the hadoop patch is merged.


> Include FileStatus when opening a file from FileSystem
> ------------------------------------------------------
>
>                 Key: HADOOP-19199
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19199
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 3.4.0
>            Reporter: Oliver Caballero Alvarez
>            Priority: Major
>              Labels: pull-request-available
>
> The FileSystem abstract class does not let you use the FileStatus of a file 
> you already hold when opening that file, which means that implementations of 
> the open method have to request the FileStatus of the same file again, making 
> unnecessary requests.
> A very clear example is seen in today's latest version of the parquet-hadoop 
> implementation, where:
> https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java
> Although creating the HadoopInputFile requires looking up the file's 
> FileStatus, only the path is passed when opening it, since that is all the 
> FileSystem API allows. This implies that the implementation will almost 
> certainly, in its open function, check that the file exists or fetch its 
> details, performing the same operation again to collect the FileStatus.
>  
> This would simply be resolved by taking the latest current version:
>  
> [https://github.com/apache/hadoop/blob/release-3.4.0-RC3/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java]
> and including the following:
>  
>   public FSDataInputStream open(FileStatus f) throws IOException {
>     return this.open(f.getPath(),
>         this.getConf().getInt("io.file.buffer.size", 4096));
>   }
>  
> This would be backward compatible with all current FileSystems, and because 
> it is implemented in FileSystem itself, it could be used whenever this 
> information is already known.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
