Re: improving efficiency and reducing runtime using S3 read optimization

Stolojan, Bogdan Tue, 31 Aug 2021 03:27:09 -0700

That's a pretty awesome improvement, congratulations!
Would like to join the call too!


Bogdan

On 30/08/2021, 02:57, "Bhalchandra Pandit" <kpan...@pinterest.com.INVALID> 
wrote:

    Sorry for the late reply. Certainly. I will be happy to chat on zoom with
    those who are interested in it.

    To answer your questions:

       1. I am in the Seattle area (PST).
       2. I do have sufficient access to S3. I will be able to run tests or
       benchmarks as necessary.
       3. I have not yet played with fadvise semantics. However, I will be
       happy to explore and also contribute help.

    Kumar

    On Thu, Aug 26, 2021 at 8:50 AM Steve Loughran <ste...@cloudera.com.invalid>
    wrote:

    > really nice piece of work!
    >
    > Would you be up to taking part in a zoom call open to all interested
    > developers where you talk about what you've done, and we can discuss what
    > to do next? It'd have to be the week after next, as too many of us s3a
    > devs are offline next week. What timezone are you in?
    >
    > Here's the patch process for S3A:
    >
    > https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html
    >
    > It's essentially:
    > * if you aren't set up to test against S3, we don't have time to debug it
    > for you, and Yetus can't run the tests.
    > * knowing your test setup ensures that you are being honest about running
    > tests, and helps us understand, when things start failing, what differences
    > in test setups different people have.
    >
    > For input stream optimisation, I'd like this to (at least initially) go in
    > alongside the normal stream, so we don't break things there.
    >
    > I have a neglected PR which is designed to let the caller specify what
    > read policy they want for a file:
    > https://github.com/apache/hadoop/pull/2584
    >
    > for s3a if the caller asked for "parquet" or "orc" we'd switch to the new
    > stream; for "whole-file" it'd be parallel with big block prefetching
    > (32+MB)
    >
    > Also we're collecting lots of stats now in IOStatistics: if you call
    > Stream.toString() you get these. We'd want more here.
    >
    >
    > I note the document said that we don't parallelize reads (true, vectored
    > IO is still neglected), and that the cost of seek is high because of the
    > need to abort/negotiate the TLS connection.
    > fs.s3a.experimental.input.fadvise=random only does short block reads so
    > doesn't need to abort; fadvise=normal starts in sequential and switches to
    > random on the first backwards seek. Have you played with these?
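
As a rough illustration of the fadvise behaviour described above, here is a toy
Python model of the mode switching. It is not the S3A code: the class name
ToyStream and its fields are made up for this sketch, and it only tracks which
mode a stream would be in, assuming "normal" starts in sequential and flips to
random on the first backwards seek, as stated above.

```python
class ToyStream:
    """Hypothetical model of the fadvise policies; not the real S3A stream."""

    def __init__(self, policy="normal"):
        if policy not in ("normal", "sequential", "random"):
            raise ValueError("unknown policy: " + policy)
        self.policy = policy
        self.pos = 0
        # "random" starts in random mode; "normal" and "sequential"
        # both start out reading sequentially.
        self.mode = "random" if policy == "random" else "sequential"

    def seek(self, target):
        # Only "normal" adapts: the first backwards seek switches it to
        # random mode, and it stays there for the rest of the stream's life.
        if self.policy == "normal" and target < self.pos:
            self.mode = "random"
        self.pos = target


stream = ToyStream("normal")
stream.seek(4096)   # forward seek: still sequential
stream.seek(128)    # backwards seek: switches to random
print(stream.mode)
```

In this model a stream that only ever seeks forward under "normal" never leaves
sequential mode, which is the case where a seek may mean aborting the
connection.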
    >
    >
    > On Thu, 26 Aug 2021 at 03:33, Bhalchandra Pandit
    > <kpan...@pinterest.com.invalid> wrote:
    >
    > > Hi All,
    > > I work for Pinterest. I developed a technique for vastly improving read
    > > throughput when reading from the S3 file system. It not only helps the
    > > sequential read case (like reading a SequenceFile) but also
    > > significantly improves read throughput of a random access case (like
    > > reading Parquet). This technique has been very useful in significantly
    > > improving efficiency of the data processing jobs at Pinterest.
    > >
    > > I would like to contribute that feature to Apache Hadoop. More details
    > > on this technique are available in this blog post I wrote recently:
    > >
    > > https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
    > >
    > > I would like to know if you believe it to be a useful contribution. If
    > > so, I will follow the steps outlined on the how to contribute
    > > <https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute>
    > > page.
    > >
    > > Kumar
    > >
    >


