[ https://issues.apache.org/jira/browse/HADOOP-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945280#comment-17945280 ]

ASF GitHub Bot commented on HADOOP-19543:
-----------------------------------------

manika137 commented on code in PR #7614:
URL: https://github.com/apache/hadoop/pull/7614#discussion_r2048453607


##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatus.java:
##########
@@ -532,6 +532,88 @@ public void testEmptyContinuationToken() throws Exception {
         .describedAs("Listing Size Not as expected").hasSize(1);
   }
 
+  /**
+   * Test to verify that listStatus returns the correct file statuses
+   * after removing duplicates across multiple iterations of list blobs.
+   * Also verifies that for a non-empty explicit directory,
+   * the entry corresponding to the marker blob is returned.
+   * @throws Exception if the test fails.
+   */
+  @Test
+  public void testDuplicateEntriesAcrossListBlobIterations() throws Exception {
+    AzureBlobFileSystem fs = Mockito.spy(getFileSystem());
+    AzureBlobFileSystemStore store = Mockito.spy(fs.getAbfsStore());
+    store.getAbfsConfiguration().setListMaxResults(1);
+    AbfsClient client = Mockito.spy(store.getClient());
+
+    Mockito.doReturn(store).when(fs).getAbfsStore();
+    Mockito.doReturn(client).when(store).getClient();
+
+    /*
+     * The following entries will be created under the root path:
+     * 0. /A - implicit directory without any marker blob
+     * 1. /a - marker file for explicit directory
+     * 2. /a/file1 - normal file inside explicit directory
+     * 3. /b - normal file inside root
+     * 4. /c - marker file for explicit directory
+     * 5. /c.bak - marker file for explicit directory
+     * 6. /c.bak/file2 - normal file inside explicit directory
+     * 7. /c/file3 - normal file inside explicit directory
+     * 8. /d - implicit directory
+     * 9. /e - marker file for explicit directory
+     * 10. /e/file4 - normal file inside explicit directory
+     */
+    // Create Path 0
+    createAzCopyFolder(new Path("/A"));
+
+    // Create Path 1 and 2.
+    fs.create(new Path("/a/file1"));
+
+    // Create Path 3
+    fs.create(new Path("/b"));
+
+    // Create Path 4 and 7
+    fs.create(new Path("/c/file3"));
+
+    // Create Path 5 and 6
+    fs.create(new Path("/c.bak/file2"));
+
+    // Create Path 8
+    createAzCopyFolder(new Path("/d"));
+
+    // Create Path 9 and 10
+    fs.create(new Path("/e/file4"));
+
+    FileStatus[] fileStatuses = fs.listStatus(new Path(ROOT_PATH));
+
+    // Assert that client.listPath was called 11 times.
+    // This verifies that the server returned 11 entries in total.
+    Mockito.verify(client, Mockito.times(11))
+        .listPath(eq(ROOT_PATH), eq(false), eq(1), any(), any(), any());
+
+    // Assert that after duplicate removal, only 7 unique entries are returned.
+    Assertions.assertThat(fileStatuses.length)
+        .describedAs("List size is not expected").isEqualTo(7);
+
+    // Assert that for duplicates, the entry corresponding to the marker blob is returned.
+    assertImplicitDirectoryFileStatus(fileStatuses[0], fs.makeQualified(new Path("/A")));
+    assertExplicitDirectoryFileStatus(fileStatuses[1], fs.makeQualified(new Path("/a")));
+    assertFilePathFileStatus(fileStatuses[2], fs.makeQualified(new Path("/b")));
+    assertExplicitDirectoryFileStatus(fileStatuses[3], fs.makeQualified(new Path("/c")));
+    assertExplicitDirectoryFileStatus(fileStatuses[4], fs.makeQualified(new Path("/c.bak")));
+    assertImplicitDirectoryFileStatus(fileStatuses[5], fs.makeQualified(new Path("/d")));
+    assertExplicitDirectoryFileStatus(fileStatuses[6], fs.makeQualified(new Path("/e")));
+
+    // Assert that there are no duplicates in the returned file statuses.
+    for (int i = 0; i < fileStatuses.length; i++) {

Review Comment:
   Could we use a set here instead of the double loop? Something like:
   Set<Path> uniquePaths = new HashSet<>();
   for (FileStatus fileStatus : fileStatuses) {
     Assertions.assertThat(uniquePaths.add(fileStatus.getPath()))
         .describedAs("Duplicate entries found")
         .isTrue();
   }





> ABFS: [FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across 
> Iterations
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-19543
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19543
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.5.0, 3.4.1
>            Reporter: Anuj Modi
>            Assignee: Anuj Modi
>            Priority: Blocker
>              Labels: pull-request-available
>
> On FNS-Blob, the List Blobs API is known to return duplicate entries for
> non-empty explicit directories: one entry corresponds to the directory
> itself, and the other corresponds to the marker blob that the driver
> internally creates and maintains to mark that path as a directory. This
> behaviour was already known, and it was handled by removing such duplicates
> from the set of entries returned within a single list iteration.
> However, due to a possible partition split on the server, the two duplicate
> entries can be returned in separate iterations. That case was not handled,
> so the caller could get back a result containing duplicates, as happened
> here. The de-duplication logic was designed before partition splits were
> taken into account.
> This PR fixes that bug.
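
For illustration only (not the actual patch in PR #7614), here is a minimal
sketch of de-duplicating across list iterations: key every entry on its path
and, when the same path arrives twice, keep the marker-blob entry, matching
what the test above asserts. The class name CrossIterationDeduper and the
isMarkerBlob() predicate are hypothetical stand-ins; the real driver
identifies marker blobs from blob metadata.

    import java.util.Collection;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;

    /** Sketch only: de-duplicates listing entries across pages by path. */
    final class CrossIterationDeduper {

      /** One entry per path, preserving server-returned order. */
      private final Map<Path, FileStatus> entriesByPath = new LinkedHashMap<>();

      /** Feed every entry from every list iteration through this method. */
      void accept(FileStatus entry) {
        entriesByPath.merge(entry.getPath(), entry,
            // On a duplicate path, prefer the marker-blob entry.
            (existing, incoming) -> isMarkerBlob(incoming) ? incoming : existing);
      }

      /** The de-duplicated listing once all iterations are consumed. */
      Collection<FileStatus> list() {
        return entriesByPath.values();
      }

      /** Hypothetical stand-in; the real check inspects blob metadata. */
      private static boolean isMarkerBlob(FileStatus status) {
        return status.isDirectory();
      }
    }

A map keyed on path handles the case the test exercises with /c and /c.bak:
in lexicographic blob order, the marker blob "c" and the directory prefix
"c/" are not adjacent ("c.bak" sorts between them), so adjacency-based
de-duplication within one page cannot catch them once a partition split
scatters them across iterations.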



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
