anujmodi2021 commented on PR #6069: URL: https://github.com/apache/hadoop/pull/6069#issuecomment-1738835515
> I've actually been thinking of adding a similar option to the s3a client for third party non-https support.
>
> In my head though, the generation of the upload MD5 hash could be done as data is written to the buffer/file in the org.apache.hadoop.fs.store.DataBlocks class, in the `org.apache.hadoop.fs.store.DataBlocks.DataBlock.write(byte[] buffer, int offset, int length)` call
>
> * the data comes in as an array: no need to reload/copy
> * its often intermingled with other work, so no end-of-block delays.
> * if the application is mixing compute with write, it may not add any delay.
>
> I would suggest you add it there as it means I'd switch to that class for the s3a output stream and pick up your work too: no duplicate code and better test coverage.
>
> I guess the issue here is that abfs client appends in individual post requests, there's not enough of a match between DataBlock size and the http requests, except for very small files. Correct?
>
> Propose you use a DurationTracker to actually count time spent processing md5 headers; can add a new IOStatistic to the store for this. This allows for the cost of enabling to be measured/reported.

Hi Steve, thanks for the review.

On your query about whether the MD5 computation should be moved to DataBlocks.write():

1. I don't think it would reduce the cost of the array copy. In today's production code, when we call append from the output stream we always send offset 0 and length equal to the length of the data block, so there is effectively a direct mapping between data blocks and append calls, and the check in client.append() skips the array copy whenever the offset is 0. The array copy exists only in case we ever send a non-zero offset in the future. If that happens, computing the MD5 hash in DataBlocks.write() would be wrong, because the whole data block would not be appended in a single request (see the sketch at the end of this comment).

2. DataBlocks.write() runs on the main thread, so computing the MD5 hash of the whole data block there would be computationally expensive compared to the current approach, where it is computed in parallel by the worker threads doing the appends.

Let me know your thoughts on this.
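Regarding point 1, for illustration only: a minimal sketch of how the per-request Content-MD5 value could be derived from the (buffer, offset, length) the client receives. This is a hypothetical helper using plain JDK classes, not the actual AbfsClient code; the point is that the digest covers exactly the bytes one append request sends, which is why computing it per data block in DataBlocks.write() would only line up when a block maps to a single append.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public final class AppendMd5Sketch {

  /**
   * Hypothetical helper (not the actual AbfsClient code): computes the
   * Content-MD5 header value for one append request over exactly the
   * byte range that request will send.
   */
  static String computeAppendMd5(byte[] buffer, int offset, int length)
      throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    // Digest only the (offset, length) range of this append; when offset == 0
    // and length == buffer.length this covers the whole data block.
    md5.update(buffer, offset, length);
    // Content-MD5 is the Base64 encoding of the 128-bit digest.
    return Base64.getEncoder().encodeToString(md5.digest());
  }
}
```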