steveloughran commented on code in PR #3079: URL: https://github.com/apache/parquet-java/pull/3079#discussion_r1863804254
########## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java: ########## @@ -70,9 +93,38 @@ public long getLength() { return stat.getLen(); } + /** + * Open the file. + * <p>Uses {@code FileSystem.openFile()} so that + * the existing FileStatus can be passed down: saves a HEAD request on cloud storage. + * and ignored everywhere else. + * + * @return the input stream. + * + * @throws InterruptedIOException future was interrupted + * @throws IOException if something went wrong + * @throws RuntimeException any nested RTE thrown + */ @Override public SeekableInputStream newStream() throws IOException { - return HadoopStreams.wrap(fs.open(stat.getPath())); + FSDataInputStream stream; + try { + // this method is async so that implementations may do async HEAD head + // requests. Not done in S3A/ABFS when a file status passed down (as is done here) + final CompletableFuture<FSDataInputStream> future = fs.openFile(stat.getPath()) + .withFileStatus(stat) + .opt(OPENFILE_READ_POLICY_KEY, PARQUET_READ_POLICY) + .build(); + stream = awaitFuture(future); + } catch (RuntimeException e) { + // S3A < 3.3.5 would raise illegal path exception if the openFile path didn't + // equal the path in the FileStatus; Hive virtual FS could create this condition. + // As the path to open is derived from stat.getPath(), this condition seems + // near-impossible to create -but is handled here for due diligence. + stream = fs.open(stat.getPath()); Review Comment: good point. at debug? I'm never a fan of excessively noisy libraries as they ruin the lives of people downstream, especially if messages for developers end up being printed in user logs far too often. In fact, I'm not above dynamically patching log levels to shut those libraries up (HADOOP-19272)... stack trace then, debug. error text itself? Probably the same unless there's a desire to log exactly once during the life of a process, which could be done trivially -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org