GitHub user StephanEwen opened a pull request: https://github.com/apache/flink/pull/2999
[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent lost files/directories when called concurrently This is mainly relevant to tests and Local Mini Cluster executions. The `FileOutputFormat` and its subclasses rely on `FileSystem::initOutPathLocalFS()` to prepare the output directory. When multiple parallel output writers call that method, there is a slim chance that one parallel threads deletes the others directory. The checks that the method has are not bullet proof. I believe that this is the cause for many Travis test instabilities that we observed over time. Simply synchronizing that method per process should do the trick. Since it is a rare initialization method, and only relevant in tests & local mini cluster executions, it should be a price that is okay to pay. I see no other way, as we do not have simple access to an atomic "check and delete and recreate" file operation. The synchronization also makes many "re-try" code paths obsolete (there should be no re-tries needed on proper file systems). ### Tests This is tricky to test. The test in `InitOutputPathTest.java` uses a series of latch to re-produce the problematic thread execution interleaving to validate the problem. The properly fixed variant cannot use that interleaving (because it fixes the problem, duh), but pushes the thread interleaving best-effort towards the case where the problem would occur, were the method not properly synchronized. Sounds weird, I know. You can merge this pull request into a Git repository by running: $ git pull https://github.com/StephanEwen/incubator-flink fs_fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2999.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2999 ---- commit 7a4d5b2f53e3b8c27a5224e405cf1b671f474a58 Author: Stephan Ewen <se...@apache.org> Date: 2016-12-13T18:15:11Z [tests] Add 'CheckedThread' as a common test utility commit e25b436f1de18be0dc2c3b02b82bf0b8203f0b44 Author: Stephan Ewen <se...@apache.org> Date: 2016-12-13T18:12:12Z [FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent lost files when called concurrently. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---