[ https://issues.apache.org/jira/browse/HADOOP-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arnaud Nauwynck resolved HADOOP-19345.
--------------------------------------
    Resolution: Duplicate

> AzureBlobFileSystem.open() should override readVectored() much more efficiently for small reads
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-19345
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19345
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools
>            Reporter: Arnaud Nauwynck
>            Priority: Major
>
> In hadoop-azure, there are severe performance problems when reading a file
> in a fragmented way: reading many small file fragments, even through the
> readVectored() Hadoop API, results in distinct HTTPS requests (TCP
> connection establishment + TLS handshake + request).
>
> Internally, at the lowest level, hadoop-azure uses the class
> HttpURLConnection (dating back to JDK 1.0), and the read-ahead threads do
> not sufficiently solve the problem.
>
> The hadoop-azure implementation of readVectored() should strike a
> compromise between reading extra data that is thrown away (the holes
> between ranges) and establishing too many HTTPS connections.
>
> Currently, AzureBlobFileSystem#open() returns a stream with the default,
> inefficient implementation of readVectored():
> {code:java}
>   private FSDataInputStream open(final Path path,
>       final Optional<OpenFileParameters> parameters) throws IOException {
>     ...
>     InputStream inputStream = getAbfsStore().openFileForRead(qualifiedPath,
>         parameters, statistics, tracingContext);
>     // <== FSDataInputStream is not efficiently overriding readVectored()!
>     return new FSDataInputStream(inputStream);
>   }
> {code}
> See the default implementation of FSDataInputStream.readVectored():
> {code:java}
>   public void readVectored(List<? extends FileRange> ranges,
>       IntFunction<ByteBuffer> allocate) throws IOException {
>     ((PositionedReadable) this.in).readVectored(ranges, allocate);
>   }
> {code}
> It delegates to the underlying stream, AbfsInputStream, which does not
> override the default method from PositionedReadable:
> {code:java}
>   default void readVectored(List<? extends FileRange> ranges,
>       IntFunction<ByteBuffer> allocate) throws IOException {
>     VectoredReadUtils.readVectored(this, ranges, allocate);
>   }
> {code}
> AbfsInputStream should override this method and internally accept making
> fewer HTTPS calls, by merging nearby ranges into a single request and
> discarding the returned data that falls into the holes between ranges (see
> the first sketch at the end of this message).
>
> This amounts to honouring the tuning parameter that Hadoop's
> FSDataInputStream (implementing PositionedReadable) already exposes:
> {code:java}
>   /**
>    * What is the smallest reasonable seek?
>    * @return the minimum number of bytes
>    */
>   default int minSeekForVectorReads() {
>     return 4 * 1024;
>   }
> {code}
> Even this 4096 value is very conservative; it should be redefined by the
> ABFS implementation to 4 MB or even 8 MB (see the second sketch below).
>
> Ask ChatGPT: "on Azure Storage, what is the speed of getting 8 MB of a page
> blob, compared to the time to establish an HTTPS TLS handshake?"
> The response (untrusted, since it comes from ChatGPT) says: an HTTPS/TLS
> handshake (~100–300 ms) is generally slower than downloading 8 MB from a
> page blob (Standard tier: ~100–200 ms, Premium tier: ~30–50 ms).
>
> The Azure ABFS client already sets up by default many threads for
> read-ahead prefetching (4 MB of data), but this is NOT sufficient, and it
> is less efficient than simply implementing correctly what is already in the
> Hadoop API: readVectored(). It also has the drawback of reading tons of
> useless data (past the requested Parquet blocks) that is never used.
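>
> Here is a minimal sketch of such a range-coalescing readVectored(). It is
> not the actual ABFS code: the helper class, the MERGE_GAP constant and its
> 4 MB value are assumptions for illustration; only the Hadoop types
> (FileRange, PositionedReadable) are real API:
> {code:java}
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.util.ArrayList;
> import java.util.Comparator;
> import java.util.List;
> import java.util.concurrent.CompletableFuture;
> import java.util.function.IntFunction;
>
> import org.apache.hadoop.fs.FileRange;
> import org.apache.hadoop.fs.PositionedReadable;
>
> /** Hypothetical helper: serve many small ranges with few remote reads. */
> public final class CoalescingVectoredRead {
>
>   /** Ranges separated by less than this gap share one remote read
>    *  (assumed value, per the 4 MB suggestion above). */
>   private static final long MERGE_GAP = 4 * 1024 * 1024;
>
>   public static void readVectored(PositionedReadable in,
>       List<? extends FileRange> ranges,
>       IntFunction<ByteBuffer> allocate) throws IOException {
>     // Sort by offset so nearby ranges become adjacent.
>     List<FileRange> sorted = new ArrayList<>(ranges);
>     sorted.sort(Comparator.comparingLong(FileRange::getOffset));
>
>     int i = 0;
>     while (i < sorted.size()) {
>       // Extend the merged span while the hole to the next range is small.
>       long start = sorted.get(i).getOffset();
>       long end = start + sorted.get(i).getLength();
>       int j = i + 1;
>       while (j < sorted.size()
>           && sorted.get(j).getOffset() - end <= MERGE_GAP) {
>         end = Math.max(end,
>             sorted.get(j).getOffset() + sorted.get(j).getLength());
>         j++;
>       }
>       // One positioned read (one HTTPS request in ABFS) for the whole
>       // span; a real implementation would also cap the merged size.
>       byte[] merged = new byte[(int) (end - start)];
>       in.readFully(start, merged, 0, merged.length);
>       // Slice each requested range out of the big buffer; bytes that fall
>       // into the holes are simply dropped.
>       for (int k = i; k < j; k++) {
>         FileRange r = sorted.get(k);
>         ByteBuffer buf = allocate.apply(r.getLength());
>         buf.put(merged, (int) (r.getOffset() - start), r.getLength());
>         buf.flip();
>         r.setData(CompletableFuture.completedFuture(buf));
>       }
>       i = j;
>     }
>   }
> }
> {code}
> With a gap threshold in the megabyte range, a typical Parquet footer plus
> a handful of column chunks collapse into one or two GET requests instead
> of one request per range.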
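>
> And a second, one-method sketch of raising the tuning parameter inside
> AbfsInputStream (the 4 MB value is this ticket's suggestion, not a
> committed default):
> {code:java}
>   @Override
>   public int minSeekForVectorReads() {
>     // Below this distance, reading through the hole is cheaper than
>     // issuing a new HTTPS request (see the latency comparison above).
>     return 4 * 1024 * 1024;
>   }
> {code}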