[jira] [Created] (HADOOP-16241) S3AInputStream PositionReadable should perform ranged read on dedicated stream

Sahil Takiar (JIRA) Mon, 08 Apr 2019 13:47:14 -0700

Sahil Takiar created HADOOP-16241:
-------------------------------------

             Summary: S3AInputStream PositionReadable should perform ranged 
read on dedicated stream 
                 Key: HADOOP-16241
                 URL: https://issues.apache.org/jira/browse/HADOOP-16241
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs/s3
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar



The current implementation of {{PositionReadable}} in {{S3AInputStream}} is 
pretty close to the default implementation in {{FsInputStream}}.

This JIRA proposes overriding the {{read(long position, byte[] buffer, int 
offset, int length)}} method and re-implementing the {{readFully(long position, 
byte[] buffer, int offset, int length)}} method in S3A.

The new implementation would perform a "ranged read" on a dedicated object 
stream (rather than the shared one). Prototypes have shown this to bring a 
considerable performance improvement to readers who are only interested in 
reading a random chunk of the file at a time (e.g. Impala, although I would 
assume HBase would benefit from this as well).

Setting {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} is helpful for 
clients that rely on pread, but has a few drawbacks:
 * Unless the client explicitly sets fadvise to RANDOM, they will get at least 
one connection reset when the backwards seek is issued (after which fadvise 
automatically switches to RANDOM)
 * Data is only read in 64 kb chunks, so for a large read, several GET requests 
must be issued to S3 to fetch the data; while the 64 kb chunk value is 
configurable, it is hard to set a reasonable value for variable length preads
 * If the readahead value is too big, closing the input stream can take 
considerable time because the stream has to be drained of data before it can be 
closed

The new implementation of {{PositionReadable}} would issue a 
{{GetObjectRequest}} with the range specified by {{position}} and the size of 
the given buffer. The data would be read from the {{S3ObjectInputStream}} and 
then closed at the end of the method. This stream would be independent of the 
{{wrappedStream}} currently maintained by S3A.

This brings the following benefits:
 * The {{PositionedReadable}} methods can be thread-safe without a 
{{synchronized}} block, which allows clients to concurrently call pread methods 
on the same {{S3AInputStream}} instance
 * preads will request all the data at once rather than requesting it in chunks 
via the readahead logic
 * Avoids performing potentially expensive seeks when performing preads



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org

[jira] [Created] (HADOOP-16241) S3AInputStream PositionReadable should perform ranged read on dedicated stream

Reply via email to