[ https://issues.apache.org/jira/browse/SOLR-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432598#comment-17432598 ]
Houston Putman commented on SOLR-15702:
---------------------------------------

OK, so the previous solution did not fix the tests; in fact it made them worse (by removing the hack fix used previously).

After deeper investigation, I believe the issue is not S3Mock having delayed state (and thus not properly showing that files exist). Instead, S3Repository does not adhere to the API guidelines for {{createDirectory(URI)}}, namely that it should not recreate directories that already exist.

The Solr backup commands call {{createDirectory}} at various times, and also check that the same directory exists (via {{pathExists}}) at other times. The issue is that during the distributed backup commands (sent to different nodes for each shard), one node might be calling {{createDirectory}} while another node is calling {{pathExists}} for the same directory. In this case, the S3Mock code fails because it does not properly lock files for concurrent read/write.

We can easily fix this by making {{S3Repository.createDirectory(URI)}} respect the API guidelines and check that the directory does not exist before trying to create it.

Side note: I noticed that GCSRepository also has this problem. I will raise a separate issue for it.

> Stabilize S3 Tests
> ------------------
>
>                 Key: SOLR-15702
>                 URL: https://issues.apache.org/jira/browse/SOLR-15702
>             Project: Solr
>          Issue Type: Test
>          Components: contrib - S3 Repository
>    Affects Versions: 8.10
>            Reporter: Houston Putman
>            Assignee: Houston Putman
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently all 3 tests in the {{S3IncrementalBackupTest}} fail sporadically
> (between 1-5% of the time on the Jenkins instances).
> From my investigations, these failures mainly happen because S3Repository
> will create a directory node, and then assume it exists later on. However,
> the S3Mock may be too slow, and later on, when the assumption is made that
> this directory node exists, S3Mock returns that the directory node does not
> exist.
> The first fix for this was to just always check twice to see if a node is
> there. This is kind of hacky, but it gives the S3Repository one more chance
> to find the right information.
> However, in the v2 S3 API there is a concept of
> [waiters|https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/using.html#using-use-waiters],
> which will allow us to wait until S3 verifies the state we are looking for
> (i.e. that the directory node exists). We can either put this waiter in after
> creating the node, and not return until the waiter says the node is created,
> or we can put it in when checking whether the node exists. I think the former
> is preferable, but we can do testing to see which actually performs better.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
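
To illustrate the fix proposed in the comment (have {{createDirectory(URI)}} check for the directory before writing it), here is a minimal Java sketch. It is not the actual S3Repository code: {{InMemoryDirectoryStore}}, its write counter, and the trailing-slash normalization are all hypothetical names and choices made for the example, and an in-memory map stands in for S3.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical in-memory stand-in for an object store.
 * This is an illustration only, not the real S3Repository.
 */
class InMemoryDirectoryStore {
    // Keys are normalized directory paths; the value is just a marker.
    private final Map<String, Boolean> directories = new ConcurrentHashMap<>();
    // Counts how many times a directory marker was actually written.
    private final AtomicInteger writes = new AtomicInteger();

    /** Returns true if a directory marker exists for the given path. */
    boolean pathExists(URI path) {
        return directories.containsKey(normalize(path));
    }

    /**
     * Creates the directory only if it does not exist yet, so an existing
     * directory is never rewritten while another caller may be reading it.
     */
    void createDirectory(URI path) {
        if (pathExists(path)) {
            return; // already there: do not touch it again
        }
        directories.put(normalize(path), Boolean.TRUE);
        writes.incrementAndGet();
    }

    /** Number of directory markers written so far. */
    int writeCount() {
        return writes.get();
    }

    // Assumption for this sketch: directories are keyed with a trailing slash.
    private static String normalize(URI path) {
        String p = path.getPath();
        return p.endsWith("/") ? p : p + "/";
    }
}
```

With this shape, calling {{createDirectory}} twice for the same URI results in a single write, and {{pathExists}} keeps returning true between the two calls. Note that the check followed by the write is not atomic in this sketch; it only avoids rewriting a directory that is already present, which is the case described in the comment.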