anujmodi2021 commented on PR #6069: URL: https://github.com/apache/hadoop/pull/6069#issuecomment-1738835515
> I've actually been thinking of adding a similar option to the s3a client for third party non-https support.
>
> In my head though, the generation of the upload MD5 hash could be done as data is written to the buffer/file in the org.apache.hadoop.fs.store.DataBlocks class, in the `org.apache.hadoop.fs.store.DataBlocks.DataBlock.write(byte[] buffer, int offset, int length)` call
>
> * the data comes in as an array: no need to reload/copy
> * its often intermingled with other work, so no end-of-block delays.
> * if the application is mixing compute with write, it may not add any delay.
>
> I would suggest you add it there as it means I'd switch to that class for the s3a output stream and pick up your work too: no duplicate code and better test coverage.
>
> I guess the issue here is that abfs client appends in individual post requests, there's not enough of a match between DataBlock size and the http requests, except for very small files. Correct?
>
> Propose you use a DurationTracker to actually count time spent processing md5 headers; can add a new IOStatistic to the store for this. This allows for the cost of enabling to be measured/reported.

Hi Steve, thanks for the review.

On your query about whether the MD5 computation should be moved to DataBlocks.write():

1. I don't think it would reduce the cost of the array copy. In today's production code, when we call append from the output stream we always send offset 0 and length equal to the length of the data block, so there is effectively a direct mapping between data blocks and append calls, and the check in client.append() skips the array copy whenever the offset is 0. The array copy exists only in case we ever send a non-zero offset in the future. If that happens, computing the MD5 hash in DataBlocks.write() would be wrong, because the whole data block would not be appended in a single request (see the sketch at the end of this comment).

2. DataBlocks.write() runs on the main thread, so computing the MD5 hash of the whole data block there would be computationally expensive compared to the current approach, where it is computed in parallel by the worker threads doing the appends.

Let me know your thoughts on this.
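Regarding point 1, for illustration only: a minimal sketch of how the per-request Content-MD5 value could be derived from the (buffer, offset, length) the client receives. This is a hypothetical helper using plain JDK classes, not the actual AbfsClient code; the point is that the digest covers exactly the bytes one append request sends, which is why computing it per data block in DataBlocks.write() would only line up when a block maps to a single append.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public final class AppendMd5Sketch {

  /**
   * Hypothetical helper (not the actual AbfsClient code): computes the
   * Content-MD5 header value for one append request over exactly the
   * byte range that request will send.
   */
  static String computeAppendMd5(byte[] buffer, int offset, int length)
      throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    // Digest only the (offset, length) range of this append; when offset == 0
    // and length == buffer.length this covers the whole data block.
    md5.update(buffer, offset, length);
    // Content-MD5 is the Base64 encoding of the 128-bit digest.
    return Base64.getEncoder().encodeToString(md5.digest());
  }
}
```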