GitHub user NicoK opened a pull request: https://github.com/apache/flink/pull/4402
[FLINK-7261][blob] extend BlobStore#get/put with boolean return values ## What is the purpose of the change We'd like the `PermanentBlobCache` to be able to distinguish between HA and non-HA cases to reduce superfluous error messages for non-HA setups (through `VoidBlobStore`) but keep the general path of trying to a file from a HA `FileSystemBlobStore` first, before downloading from the `BlobServer`. ## Brief change log - Extend `BlobStore#get` and `BlobStore#put` with boolean return values that indicate whether the operation was performed (successfully). - Make use of the return value inside `PermanentBlobCache` to check whether the HA filesystem download was successful and continue moving to a persisted file. Please note that this PR is based on #4381 in a series to implement FLIP-19. ## Verifying this change This change is already covered by existing tests, such as `BlobCacheGetTest` ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): **(no)** - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **(no)** - The serializers: **(no)** - The runtime per-record code paths (performance sensitive): **(no)** - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **(yes)**: ## Documentation - Does this pull request introduce a new feature? **(no)** - If yes, how is the feature documented? **(JavaDocs)** You can merge this pull request into a Git repository by running: $ git pull https://github.com/NicoK/flink flink-6916-7261 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/4402.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4402 ---- commit 1f86efd013a81e84ba1556de0a04e4ac70229f79 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-22T15:31:17Z [FLINK-7053][blob] remove code duplication in BlobClientSslTest This lets BlobClientSslTest extend BlobClientTest as most of its implementation came from there and was simply copied. commit ff083e850edb8f5f383f54f19519116e10308d61 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-23T09:40:34Z [FLINK-7053][blob] verify some of the buffers returned by GET commit e138569019f5c75b70a21085e4829cb6cd1e93bc Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-23T10:04:10Z [FLINK-7053][blob] use TemporaryFolder for local BLOB dir in unit tests This replaces the use of some temporary directory where it is not guaranteed that it will be deleted after the test. commit 70f06d9b38c5d8afb50b6c18b7b3a4e07e2bb3da Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-21T12:45:31Z [FLINK-7054][blob] remove LibraryCacheManager#getFile() This was only used in tests where it is avoidable but if used anywhere else, it may have caused cleanup issues. commit 5463f65aeb3ed519b51ccc78abfcc003bf02b3f8 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-16T21:13:56Z [FLINK-7054][hotfix] fix a checkstyle error commit 12052c0b8c8b869f9b6bd6f9d53c9ed6e362c631 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-21T14:14:15Z [FLINK-7055][blob] refactor getURL() to the more generic getFile() The fact that we always returned URL objects is a relic of the BlobServer's only use for URLClassLoader. Since we'd like to extend its use, returning File objects instead is more generic. commit 9dd7ba652ff1b09fb5bf44e8b6b5b885a56873ab Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-21T16:04:43Z [FLINK-7056][blob] add API to allow job-related BLOBs to be stored commit 90f779ba66728979977bd794e8f68b207471dbc2 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-23T17:17:07Z [FLINK-7056][blob] refactor the new API for job-related BLOBs For a cleaner API, instead of having a nullable jobId parameter, use two methods: one for job-related BLOBs, another for job-unrelated ones. commit 6d1fda80c6f9869676030fd86177bbc7e981b1f9 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-16T21:22:12Z [FLINK-7056][hotfix] fix a checkstyle error commit cdbb0fb32052c8163bc83df151a0c8354846e4dc Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-27T16:29:44Z [FLINK-7057][blob] move ref-counting from the LibraryCacheManager to the BlobCache Also change from BlobKey-based ref-counting to job-based ref-counting which is simpler and the mode we want to use from now on. Deferred cleanup (as before) is currently not implemented yet (TODO). At the BlobServer, no ref-counting will be used but the cleanup will happen when the job enters a final state (TODO). commit 8617c9e8ffbdece8d02815ef71dc09f2d764453a Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-28T09:31:39Z [FLINK-7057][blob] change to a cleaner API for BlobService#registerJob() commit 20cf9bf1fd2b382b6d45dc7b15c7be9bed7f5f16 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-28T12:09:11Z [FLINK-7057][blob] implement deferred cleanup at the BlobCache Whenever a job is not referenced at the BlobCache anymore, we set a TTL and let the cleanup task remove it when this is hit and the task is run. For now, this means that a BLOB will be retained at most (2 * ConfigConstants.LIBRARY_CACHE_MANAGER_CLEANUP_INTERVAL) seconds after not being referenced anymore. We do this so that a recovery still has the chance to use existing files rather than to download them again. commit 270f3957b6d5e9061d528158a86eb1837ad2786c Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-28T15:17:06Z [FLINK-7057][blob] integrate cleanup of job-related JARs from the BlobServer TODO: an integration test that verifies that this is actually done when desired and not performed when not, e.g. if the job did not reach a final execution state commit 39300a32c90c82f8044cfd6d5fcaf1b83b250c87 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-30T12:52:19Z [FLINK-7057][tests] extract FailingBlockingInvokable from CoordinatorShutdownTest commit 56113349799669326074d269d6c07bb19bd45788 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-30T12:56:14Z [FLINK-7057][blob] add an integration test for the BlobServer cleanup This ensures that BLOB files are actually deleted when a job enters a final state. commit 93861f1f624c5b514fcdd041b592e0c40c66d103 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T09:25:29Z [FLINK-7057][tests] refrain from catching an exception just to fail the test removes code like this in the BLOB store unit tests: catch (Exception e) { e.printStackTrace(); fail(e.getMessage()); } commit 80a05142a68f8a3d0aaf1d447ab87a1e506364a4 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T11:45:33Z [FLINK-7057][blob] fix BlobServer#cleanupJob() being too eager Instead of deleting the job's directory, it was deleting the parent storage directory. commit de4ca2a7c676e7e1a82172ce680ae5a5f6693df4 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T15:03:50Z [FLINK-7057][blob] fix BlobServer cleanup integration * the test did not check the correct directories for cleanup * the test did not honour the test timeout commit dc58fa5d5ffc299b493dc1a78f26aa1914036151 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T15:11:18Z [FLINK-7057][blob] test and fix BlobServer cleanup for a failed job submission commit d1de87bfd34c5bb1f2dde0402bcceac6013c150b Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T07:55:29Z [FLINK-7057][blob] rework the LibraryCacheManager API Since ref-counting has moved to the BlobCache, the BlobLibraryCacheManager is just a thin wrapper to get a user class loader by retrieving BLOBs from the BlobCache/BlobServer. Therefore, move the job-registration/-release out of it, too, and restrict its use to the task manager where the BlobCache is used (on the BlobServer, jobs do not need registration since they are only used once and will be deleted when they enter a final state). This makes the BlobServer and BlobCache instances available at the JobManager and TaskManager instances, respectively, also enabling future use cases outside of the LibraryCacheManager. commit b62f1f640e33126523eaaa7bd51f0b24962c67fd Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T09:50:07Z [hotfix] increase Scala checkstyle maxParameters to 20 commit 5e9a9a7cd779a8e0d28947bcfa64cc16c02857f2 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T09:51:28Z [FLINK-7057][blob] address PR comments commit 5fa960676281fc1473105c849fbce9cb22cacacb Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T12:41:18Z [FLINK-7057][blob] fix JobManagerLeaderElectionTest commit a6ca6812af7b4130e77775fd3896f3f44a357334 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T20:49:12Z [FLINK-7057][blob] re-introduce some ref-counting for BlobLibraryCacheManager Apparently, we do need to return the same ClassLoader for different (parallel) tasks of a job running on the same task manager. Therefore, keep the initial task registration implementation that was removed with 8331fbb208d975e0c1ec990344c14315ea08dd4a and only adapt it here. This also restores some tests and adds new combinations not tested before. commit da1be581529813ec359f1ac09a00bc564a3812dd Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-06T07:38:17Z [FLINK-7057][blob] address PR comments commit b02807e75cea960fd2c89e49656be2e6f6e32394 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-10T11:40:00Z [FLINK-7057][tests] fix (manual/ignored) BlobCacheCleanupTest#testJobDeferredCleanup() commit 0f9f7d7f90497708cd2589b34573f10be0b9d331 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-16T21:30:43Z [FLINK-7057][hotfix] fix a checkstyle error commit ba540ae1445a8144aff389c5056f34b54a7cbbdb Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-20T09:16:38Z [FLINK-7057][blob] remove the extra lock object from BlobCache We can lock on jobRefCounters instead, which is what we are guarding anyway. commit 871a590b5c1b3897e26e516e027a86345e3a3f4f Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-20T09:17:21Z [FLINK-7057][blob] minor improvements to the TTL in BlobCache Do not use Long.MAX_VALUE as a code for "keep forever". Also add more comments. commit c48575fe81367810f2fd5ae082170f349d3e7176 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-20T15:12:24Z [FLINK-7057][blob] replace "library-cache-manager.cleanup.interval" with "blob.service.cleanup.interval" Since we moved the cleanup to the BLOB service classes, this only makes sense. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---