[ https://issues.apache.org/jira/browse/HADOOP-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769521#comment-17769521 ]
ASF GitHub Bot commented on HADOOP-18910:
-----------------------------------------

steveloughran commented on code in PR #6069:
URL: https://github.com/apache/hadoop/pull/6069#discussion_r1338307275

##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java:
##########
@@ -88,7 +91,8 @@ public class AbfsClient implements Closeable {
   public static final Logger LOG = LoggerFactory.getLogger(AbfsClient.class);
   public static final String HUNDRED_CONTINUE_USER_AGENT =
       SINGLE_WHITE_SPACE + HUNDRED_CONTINUE + SEMICOLON;
-
+  public static final String MD5_ERROR = "The MD5 value specified in the request "

Review Comment:
   Should be placed in AbfsConstants so it is more visible; add javadocs to explain its purpose, and that it comes with a 400.

##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java:
##########
@@ -1412,6 +1444,102 @@ private void appendIfNotEmpty(StringBuilder sb, String regEx,
     }
   }

+  /**
+   * Add MD5 hash as request header to the append request.
+   * @param requestHeaders
+   * @param reqParams
+   * @param buffer
+   */
+  private void addCheckSumHeaderForWrite(List<AbfsHttpHeader> requestHeaders,
+      final AppendRequestParameters reqParams, final byte[] buffer)
+      throws AbfsRestOperationException {
+    try {
+      MessageDigest md5Digest = MessageDigest.getInstance(MD5);
+      byte[] dataToBeWritten = new byte[reqParams.getLength()];
+
+      if (reqParams.getoffset() == 0 && reqParams.getLength() == buffer.length) {
+        dataToBeWritten = buffer;
+      } else {
+        System.arraycopy(buffer, reqParams.getoffset(), dataToBeWritten, 0,
+            reqParams.getLength());
+      }
+
+      byte[] md5Bytes = md5Digest.digest(dataToBeWritten);
+      String md5Hash = Base64.getEncoder().encodeToString(md5Bytes);
+      requestHeaders.add(new AbfsHttpHeader(CONTENT_MD5, md5Hash));
+    } catch (NoSuchAlgorithmException ex) {
+      throw new AbfsRuntimeException(ex);
+    }
+  }
+
+  /**
+   * Verify the checksum information received from the server for the data read.
+   * @param buffer stores the data received from server
+   * @param result HTTP Operation Result
+   * @param bufferOffset Position where data returned by server is saved in buffer
+   * @throws AbfsRestOperationException
+   */
+  private void verifyCheckSumForRead(final byte[] buffer,
+      final AbfsHttpOperation result, final int bufferOffset)
+      throws AbfsRestOperationException {
+    // Number of bytes returned by the server could be less than or equal to what
+    // the caller requests. In case it is less, extra bytes will be initialized to 0.
+    // The server-returned MD5 hash is computed only on what the server returned:
+    // we need to take the exact data the server returned, compute its MD5 hash,
+    // and check that the computed hash equals what the server returned.
+    int numberOfBytesRead = (int) result.getBytesReceived();
+    if (numberOfBytesRead == 0) {
+      return;
+    }
+    byte[] dataRead = new byte[numberOfBytesRead];
+
+    if (bufferOffset == 0 && numberOfBytesRead == buffer.length) {
+      dataRead = buffer;
+    } else {
+      System.arraycopy(buffer, bufferOffset, dataRead, 0, numberOfBytesRead);
+    }
+
+    try {
+      MessageDigest md5Digest = MessageDigest.getInstance(MD5);

Review Comment:
   This creation/exception handling is repeated a lot. Pull it out; maybe even create one per input stream.

##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java:
##########
@@ -861,6 +875,12 @@ private boolean checkUserError(int responseStatusCode) {
         && responseStatusCode < HttpURLConnection.HTTP_INTERNAL_ERROR);
   }

+  private boolean isMd5ChecksumError(final AzureBlobFileSystemException e) {

Review Comment:
   If you make this static and package-private you can add a unit test to verify that it is handled; add javadocs.

##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java:
##########
@@ -1412,6 +1444,102 @@ (same hunk as above, anchored in verifyCheckSumForRead)
+      MessageDigest md5Digest = MessageDigest.getInstance(MD5);

Review Comment:
   Can this be isolated/made static for unit tests?

##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java:
##########
@@ -761,6 +764,11 @@ public AbfsRestOperation append(final String path, final byte[] buffer,
       requestHeaders.add(new AbfsHttpHeader(USER_AGENT, userAgentRetry));
     }

+    // Add MD5 hash of request content as request header if feature is enabled
+    if (isChecksumValidationEnabled()) {

Review Comment:
   OK, so Md5Mismatch is the stable error code to look for?

> ABFS: Adding Support for MD5 Hash based integrity verification of the request content during transport
> ------------------------------------------------------------------------------------------------------
>
>           Key: HADOOP-18910
>           URL: https://issues.apache.org/jira/browse/HADOOP-18910
>       Project: Hadoop Common
>    Issue Type: Sub-task
>    Components: fs/azure
>      Reporter: Anuj Modi
>      Assignee: Anuj Modi
>      Priority: Major
>        Labels: pull-request-available
>
> Azure Storage supports Content-MD5 request headers in both the Read and Append APIs.
> Read: [Path - Read - REST API (Azure Storage Services) | Microsoft Learn|https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/read]
> Append: [Path - Update - REST API (Azure Storage Services) | Microsoft Learn|https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/update]
>
> This change makes the client-side changes to support them. In a read request, the client sends the appropriate header, in response to which the server returns the MD5 hash of the data it sends back. The client then tallies this against the MD5 hash it computes from the data received.
> In an append request, the client computes the MD5 hash of the data being sent and specifies it in the appropriate header. The server, on finding that header, tallies it against the MD5 hash it computes on the data it received.
> This whole checksum-validation support is guarded behind a config. The config is disabled by default because, with the use of HTTPS, the integrity of the data is preserved anyway; this is an additional data-integrity check, and it has a performance impact as well.
> Users can decide whether to enable it by setting the following config to *"true"* or *"false"* respectively: *"fs.azure.enable.checksum.validation"*

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
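[Editor's sketch] The review comments above ask for the repeated `MessageDigest` creation/exception handling to be pulled out into one static, testable helper. A minimal sketch of what that refactor could look like — the class and method names (`Md5Util`, `createMd5Digest`, `md5AsBase64`) are hypothetical and not from the PR:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Hypothetical helper (name not from the PR): isolates MD5 digest creation and hashing. */
public final class Md5Util {

  private Md5Util() {
  }

  /** Create an MD5 digest; every conformant JRE must provide MD5, so failure is a hard error. */
  static MessageDigest createMd5Digest() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      // MD5 is a mandatory MessageDigest algorithm, so this is effectively unreachable.
      throw new IllegalStateException("MD5 digest unavailable", e);
    }
  }

  /** Base64-encoded MD5 of buffer[offset, offset + length), the Content-MD5 wire format. */
  static String md5AsBase64(byte[] buffer, int offset, int length) {
    MessageDigest digest = createMd5Digest();
    // update(buf, off, len) hashes the slice directly, avoiding the defensive
    // System.arraycopy that both methods in the diff currently perform.
    digest.update(buffer, offset, length);
    return Base64.getEncoder().encodeToString(digest.digest());
  }
}
```

Being static and side-effect free, `md5AsBase64` can be unit-tested against known MD5 vectors, which also covers the partial-buffer (non-zero offset) path.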
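[Editor's sketch] Per the issue description, the feature is opt-in via the named config key. A sketch of enabling it (the property name is from the issue text; placement in core-site.xml is the usual Hadoop convention, assumed here):

```xml
<!-- Opt in to MD5 checksum validation on ABFS reads and appends.
     Disabled by default: HTTPS already protects transport integrity,
     and per-request hashing has a performance cost. -->
<property>
  <name>fs.azure.enable.checksum.validation</name>
  <value>true</value>
</property>
```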