That's a pretty awesome improvement, congratulations! Would like to join the call too!
Bogdan

On 30/08/2021, 02:57, "Bhalchandra Pandit" <kpan...@pinterest.com.INVALID> wrote:

Sorry for the late reply.

Certainly. I will be happy to chat on zoom with those who are interested in it.

To answer your questions:
1. I am in the Seattle area (PST).
2. I do have sufficient access to S3. I will be able to run tests or benchmarks as necessary.
3. I have not yet played with fadvise semantics. However, I will be happy to explore and also contribute help.

Kumar

On Thu, Aug 26, 2021 at 8:50 AM Steve Loughran <ste...@cloudera.com.invalid> wrote:

> really nice piece of work!
>
> Would you be up to taking part in a zoom call open to all interested
> developers where you talk about what you've done, and we can discuss what
> to do next? It'd have to be the week after next, as too many of us s3a devs
> are offline next week. What timezone are you in?
>
> Here's the patch process for S3A,
>
> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html
>
> It's essentially
> * if you aren't set up to test against S3 we don't have time to debug it
> for you -and Yetus can't run them.
> * knowing your test set up ensure that you are being honest about running
> tests and helps us understand when things start failing, what difference in
> test set-ups different people have.
>
> For input stream optimisation, I'd like this to (at least initially) go in
> alongside the normal stream, so we don't break things there.
>
> I have a neglected PR which is designed to let the caller specify what read
> policy they want for a file
> https://github.com/apache/hadoop/pull/2584
>
> for s3a if the caller asked for "parquet" or "orc" we'd switch to the new
> stream; for "whole-file" it'd be parallel with big block prefetching
> (32+MB)
>
> Also we're collecting lots of stats now in IOStatistics: if you call
> Stream.toString() you get these. We'd want more here.
>
>
> I note the document said that we don't parallelize reads (true, vectored IO
> is still neglected), and that cost of seek is high because of need to
> abort/negotiate the TLS connection. fs.s3a.experimental.input.fadvise =
> random only does short block reads so doesn't need to abort, fadvise=normal
> starts in sequential and switches to random on the first backwards seek.
> have you played with these?
>
>
> On Thu, 26 Aug 2021 at 03:33, Bhalchandra Pandit
> <kpan...@pinterest.com.invalid> wrote:
>
> > Hi All,
> > I work for Pinterest. I developed a technique for vastly improving read
> > throughput when reading from the S3 file system. It not only helps the
> > sequential read case (like reading a SequenceFile) but also significantly
> > improves read throughput of a random access case (like reading Parquet).
> > This technique has been very useful in significantly improving efficiency
> > of the data processing jobs at Pinterest.
> >
> > I would like to contribute that feature to Apache Hadoop. More details on
> > this technique are available in this blog I wrote recently:
> >
> > https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
> >
> > I would like to know if you believe it to be a useful contribution. If so,
> > I will follow the steps outlined on the how to contribute
> > <https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute>
> > page.
> >
> > Kumar
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
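
The fadvise behaviour Steve describes in the thread (fadvise=normal starts in sequential and switches to random on the first backwards seek, while random stays random) can be illustrated with a toy model. This is a minimal Python sketch, not S3A code: the class ReadPolicyTracker, its methods, and the constants are hypothetical names introduced for illustration, and the model only tracks which policy would be in effect, without doing any I/O.

```python
SEQUENTIAL = "sequential"
RANDOM = "random"
NORMAL = "normal"


class ReadPolicyTracker:
    """Toy model of the fadvise policy switch described in the thread.

    Hypothetical illustration only; this is not how S3A is implemented.
    """

    def __init__(self, configured: str = NORMAL) -> None:
        if configured not in (NORMAL, RANDOM):
            raise ValueError(f"unsupported policy in this sketch: {configured}")
        self.configured = configured
        self.position = 0
        # "normal" starts out sequential; "random" is random from the start.
        self.effective = SEQUENTIAL if configured == NORMAL else RANDOM

    def read(self, length: int) -> None:
        """Advance the stream position as if `length` bytes were read."""
        if length < 0:
            raise ValueError("length must be non-negative")
        self.position += length

    def seek(self, target: int) -> str:
        """Move to `target` and return the policy in effect afterwards."""
        if target < 0:
            raise ValueError("target must be non-negative")
        # Assumption taken from the thread: under "normal", the first
        # backwards seek switches the stream to random, and it stays there.
        if self.configured == NORMAL and target < self.position:
            self.effective = RANDOM
        self.position = target
        return self.effective


if __name__ == "__main__":
    stream = ReadPolicyTracker(NORMAL)
    stream.read(4096)
    print(stream.seek(8192))  # forward seek
    print(stream.seek(1024))  # backwards seek
```

In this model a forward seek leaves a "normal" stream in sequential mode, and only a seek to a position behind the current one flips it to random. A columnar reader that jumps back to a footer would trigger the switch early, which may be why the random access case comes up in the discussion.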