[
https://issues.apache.org/jira/browse/HADOOP-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945280#comment-17945280
]
ASF GitHub Bot commented on HADOOP-19543:
-----------------------------------------
manika137 commented on code in PR #7614:
URL: https://github.com/apache/hadoop/pull/7614#discussion_r2048453607
##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatus.java:
##########
@@ -532,6 +532,88 @@ public void testEmptyContinuationToken() throws Exception {
.describedAs("Listing Size Not as expected").hasSize(1);
}
+ /**
+ * Test to verify that listStatus returns the correct file statuses
+ * after removing duplicates across multiple iterations of list blobs.
+ * Also verifies that for a non-empty explicit directory, the entry
+ * corresponding to the marker blob is returned.
+ * @throws Exception if the test fails.
+ */
+ @Test
+ public void testDuplicateEntriesAcrossListBlobIterations() throws Exception {
+ AzureBlobFileSystem fs = Mockito.spy(getFileSystem());
+ AzureBlobFileSystemStore store = Mockito.spy(fs.getAbfsStore());
+ store.getAbfsConfiguration().setListMaxResults(1);
+ AbfsClient client = Mockito.spy(store.getClient());
+
+ Mockito.doReturn(store).when(fs).getAbfsStore();
+ Mockito.doReturn(client).when(store).getClient();
+
+ /*
+ * Following entries will be created inside the root path.
+ * 0. /A - implicit directory without any marker blob
+ * 1. /a - marker file for explicit directory
+ * 2. /a/file1 - normal file inside explicit directory
+ * 3. /b - normal file inside root
+ * 4. /c - marker file for explicit directory
+ * 5. /c.bak - marker file for explicit directory
+ * 6. /c.bak/file2 - normal file inside explicit directory
+ * 7. /c/file3 - normal file inside explicit directory
+ * 8. /d - implicit directory
+ * 9. /e - marker file for explicit directory
+ * 10. /e/file4 - normal file inside explicit directory
+ */
+ // Create Path 0
+ createAzCopyFolder(new Path("/A"));
+
+ // Create Path 1 and 2.
+ fs.create(new Path("/a/file1"));
+
+ // Create Path 3
+ fs.create(new Path("/b"));
+
+ // Create Path 4 and 7
+ fs.create(new Path("/c/file3"));
+
+ // Create Path 5 and 6
+ fs.create(new Path("/c.bak/file2"));
+
+ // Create Path 8
+ createAzCopyFolder(new Path("/d"));
+
+ // Create Path 9 and 10
+ fs.create(new Path("/e/file4"));
+
+ FileStatus[] fileStatuses = fs.listStatus(new Path(ROOT_PATH));
+
+ // Assert that client.listPath was called 11 times: with
+ // listMaxResults set to 1, this means the server returned
+ // 11 entries in total, duplicates included.
+ Mockito.verify(client, Mockito.times(11))
+ .listPath(eq(ROOT_PATH), eq(false), eq(1), any(), any(), any());
+
+ // Assert that after duplicate removal, only 7 unique entries are
+ // returned. The 11 raw entries include a duplicate pair (marker blob
+ // plus directory entry) for each of the four non-empty explicit
+ // directories /a, /c, /c.bak and /e.
+ Assertions.assertThat(fileStatuses.length)
+ .describedAs("List size is not as expected").isEqualTo(7);
+
+ // Assert that for duplicates, the entry corresponding to the marker
+ // blob is returned.
+ assertImplicitDirectoryFileStatus(fileStatuses[0], fs.makeQualified(new Path("/A")));
+ assertExplicitDirectoryFileStatus(fileStatuses[1], fs.makeQualified(new Path("/a")));
+ assertFilePathFileStatus(fileStatuses[2], fs.makeQualified(new Path("/b")));
+ assertExplicitDirectoryFileStatus(fileStatuses[3], fs.makeQualified(new Path("/c")));
+ assertExplicitDirectoryFileStatus(fileStatuses[4], fs.makeQualified(new Path("/c.bak")));
+ assertImplicitDirectoryFileStatus(fileStatuses[5], fs.makeQualified(new Path("/d")));
+ assertExplicitDirectoryFileStatus(fileStatuses[6], fs.makeQualified(new Path("/e")));
+
+ // Assert that there are no duplicates in the returned file statuses.
+ for (int i = 0; i < fileStatuses.length; i++) {
Review Comment:
   Could we use a set here instead of a double loop? Something like:

   Set<Path> uniquePaths = new HashSet<>();
   for (FileStatus fileStatus : fileStatuses) {
     Assertions.assertThat(uniquePaths.add(fileStatus.getPath()))
         .describedAs("Duplicate entries found")
         .isTrue();
   }
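   (As a side note on this suggestion: org.apache.hadoop.fs.Path implements
   equals() and hashCode() on its underlying URI, so HashSet membership
   should be a reliable duplicate check here.)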
> ABFS: [FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across
> Iterations
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-19543
> URL: https://issues.apache.org/jira/browse/HADOOP-19543
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.5.0, 3.4.1
> Reporter: Anuj Modi
> Assignee: Anuj Modi
> Priority: Blocker
> Labels: pull-request-available
>
> On FNS-Blob, the List Blobs API is known to return duplicate entries for
> non-empty explicit directories: one entry corresponds to the directory
> itself, and another corresponds to the marker blob that the driver
> internally creates and maintains to mark that path as a directory. This
> behaviour was already known, and it was handled by removing such duplicate
> entries from the set of entries returned within a single list iteration.
> However, when a partition split causes such duplicate entries to be
> returned in separate iterations, there was no handling for it, and the
> caller could get back a result containing duplicate entries, as happened
> here. The logic to remove duplicates was designed before partition splits
> were understood to be possible.
> This PR fixes that bug.
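
For context, a minimal sketch of the cross-iteration dedupe idea (hypothetical
helper names, not the actual ABFS patch): carry a set of already-seen paths
across paginated List Blobs calls, so that a marker blob and its directory
entry returned in different iterations collapse to a single entry. In this
sketch the first occurrence wins; the actual fix keeps the entry
corresponding to the marker blob, as the test above asserts.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.function.Function;
    import org.apache.hadoop.fs.FileStatus;

    public class DedupedListingSketch {
      /** One page of a listing; a stand-in for the real List Blobs response. */
      interface ListPage {
        List<FileStatus> entries();
        String continuationToken(); // null when the listing is complete
      }

      static List<FileStatus> listDeduped(Function<String, ListPage> listBlobsPage) {
        Set<String> seen = new HashSet<>();
        List<FileStatus> merged = new ArrayList<>();
        String token = null;
        do {
          ListPage page = listBlobsPage.apply(token); // one List Blobs call
          for (FileStatus status : page.entries()) {
            // Duplicates share a path across iterations; keep the first seen.
            if (seen.add(status.getPath().toUri().getPath())) {
              merged.add(status);
            }
            // else: duplicate surfaced by a partition split; dropped
          }
          token = page.continuationToken();
        } while (token != null);
        return merged;
      }
    }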
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]