GitHub user NicoK opened a pull request: https://github.com/apache/flink/pull/4358
[FLINK-7068][blob] change BlobService sub-classes for permanent and transient BLOBs This splits up the BLOB store classes into two different use cases which is the biggest change of FLIP-19: * a `PermanentBlobStore` should resemble use cases for BLOBs that are permanently stored for a job's life time (HA and non-HA). * a `TransientBlobStore` should reflect BLOB offloading for logs, RPC, etc. which even does not have to be reflected by files. The `BlobServer` implements both while the `BlobCache` provides access to individual implementations of the two. HA files may only be uploaded by the `BlobServer` (and the (now) only publicly useful method of `BlobClient` to upload jar files). All other accesses should go through the two BLOB store classes. This PR is based upon #4238 in a series to implement FLINK-6916. It contains some development history and should be squashed. - [X] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [X] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [X] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/NicoK/flink flink-6916-7068 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/4358.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4358 ---- commit 1f86efd013a81e84ba1556de0a04e4ac70229f79 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-22T15:31:17Z [FLINK-7053][blob] remove code duplication in BlobClientSslTest This lets BlobClientSslTest extend BlobClientTest as most of its implementation came from there and was simply copied. commit ff083e850edb8f5f383f54f19519116e10308d61 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-23T09:40:34Z [FLINK-7053][blob] verify some of the buffers returned by GET commit e138569019f5c75b70a21085e4829cb6cd1e93bc Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-23T10:04:10Z [FLINK-7053][blob] use TemporaryFolder for local BLOB dir in unit tests This replaces the use of some temporary directory where it is not guaranteed that it will be deleted after the test. commit 70f06d9b38c5d8afb50b6c18b7b3a4e07e2bb3da Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-21T12:45:31Z [FLINK-7054][blob] remove LibraryCacheManager#getFile() This was only used in tests where it is avoidable but if used anywhere else, it may have caused cleanup issues. commit 5463f65aeb3ed519b51ccc78abfcc003bf02b3f8 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-16T21:13:56Z [FLINK-7054][hotfix] fix a checkstyle error commit 12052c0b8c8b869f9b6bd6f9d53c9ed6e362c631 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-21T14:14:15Z [FLINK-7055][blob] refactor getURL() to the more generic getFile() The fact that we always returned URL objects is a relic of the BlobServer's only use for URLClassLoader. Since we'd like to extend its use, returning File objects instead is more generic. commit 9dd7ba652ff1b09fb5bf44e8b6b5b885a56873ab Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-21T16:04:43Z [FLINK-7056][blob] add API to allow job-related BLOBs to be stored commit 90f779ba66728979977bd794e8f68b207471dbc2 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-23T17:17:07Z [FLINK-7056][blob] refactor the new API for job-related BLOBs For a cleaner API, instead of having a nullable jobId parameter, use two methods: one for job-related BLOBs, another for job-unrelated ones. commit 6d1fda80c6f9869676030fd86177bbc7e981b1f9 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-16T21:22:12Z [FLINK-7056][hotfix] fix a checkstyle error commit cdbb0fb32052c8163bc83df151a0c8354846e4dc Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-27T16:29:44Z [FLINK-7057][blob] move ref-counting from the LibraryCacheManager to the BlobCache Also change from BlobKey-based ref-counting to job-based ref-counting which is simpler and the mode we want to use from now on. Deferred cleanup (as before) is currently not implemented yet (TODO). At the BlobServer, no ref-counting will be used but the cleanup will happen when the job enters a final state (TODO). commit 8617c9e8ffbdece8d02815ef71dc09f2d764453a Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-28T09:31:39Z [FLINK-7057][blob] change to a cleaner API for BlobService#registerJob() commit 20cf9bf1fd2b382b6d45dc7b15c7be9bed7f5f16 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-28T12:09:11Z [FLINK-7057][blob] implement deferred cleanup at the BlobCache Whenever a job is not referenced at the BlobCache anymore, we set a TTL and let the cleanup task remove it when this is hit and the task is run. For now, this means that a BLOB will be retained at most (2 * ConfigConstants.LIBRARY_CACHE_MANAGER_CLEANUP_INTERVAL) seconds after not being referenced anymore. We do this so that a recovery still has the chance to use existing files rather than to download them again. commit 270f3957b6d5e9061d528158a86eb1837ad2786c Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-28T15:17:06Z [FLINK-7057][blob] integrate cleanup of job-related JARs from the BlobServer TODO: an integration test that verifies that this is actually done when desired and not performed when not, e.g. if the job did not reach a final execution state commit 39300a32c90c82f8044cfd6d5fcaf1b83b250c87 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-30T12:52:19Z [FLINK-7057][tests] extract FailingBlockingInvokable from CoordinatorShutdownTest commit 56113349799669326074d269d6c07bb19bd45788 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-06-30T12:56:14Z [FLINK-7057][blob] add an integration test for the BlobServer cleanup This ensures that BLOB files are actually deleted when a job enters a final state. commit 93861f1f624c5b514fcdd041b592e0c40c66d103 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T09:25:29Z [FLINK-7057][tests] refrain from catching an exception just to fail the test removes code like this in the BLOB store unit tests: catch (Exception e) { e.printStackTrace(); fail(e.getMessage()); } commit 80a05142a68f8a3d0aaf1d447ab87a1e506364a4 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T11:45:33Z [FLINK-7057][blob] fix BlobServer#cleanupJob() being too eager Instead of deleting the job's directory, it was deleting the parent storage directory. commit de4ca2a7c676e7e1a82172ce680ae5a5f6693df4 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T15:03:50Z [FLINK-7057][blob] fix BlobServer cleanup integration * the test did not check the correct directories for cleanup * the test did not honour the test timeout commit dc58fa5d5ffc299b493dc1a78f26aa1914036151 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-03T15:11:18Z [FLINK-7057][blob] test and fix BlobServer cleanup for a failed job submission commit d1de87bfd34c5bb1f2dde0402bcceac6013c150b Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T07:55:29Z [FLINK-7057][blob] rework the LibraryCacheManager API Since ref-counting has moved to the BlobCache, the BlobLibraryCacheManager is just a thin wrapper to get a user class loader by retrieving BLOBs from the BlobCache/BlobServer. Therefore, move the job-registration/-release out of it, too, and restrict its use to the task manager where the BlobCache is used (on the BlobServer, jobs do not need registration since they are only used once and will be deleted when they enter a final state). This makes the BlobServer and BlobCache instances available at the JobManager and TaskManager instances, respectively, also enabling future use cases outside of the LibraryCacheManager. commit b62f1f640e33126523eaaa7bd51f0b24962c67fd Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T09:50:07Z [hotfix] increase Scala checkstyle maxParameters to 20 commit 5e9a9a7cd779a8e0d28947bcfa64cc16c02857f2 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T09:51:28Z [FLINK-7057][blob] address PR comments commit 5fa960676281fc1473105c849fbce9cb22cacacb Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T12:41:18Z [FLINK-7057][blob] fix JobManagerLeaderElectionTest commit a6ca6812af7b4130e77775fd3896f3f44a357334 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-04T20:49:12Z [FLINK-7057][blob] re-introduce some ref-counting for BlobLibraryCacheManager Apparently, we do need to return the same ClassLoader for different (parallel) tasks of a job running on the same task manager. Therefore, keep the initial task registration implementation that was removed with 8331fbb208d975e0c1ec990344c14315ea08dd4a and only adapt it here. This also restores some tests and adds new combinations not tested before. commit da1be581529813ec359f1ac09a00bc564a3812dd Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-06T07:38:17Z [FLINK-7057][blob] address PR comments commit b02807e75cea960fd2c89e49656be2e6f6e32394 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-10T11:40:00Z [FLINK-7057][tests] fix (manual/ignored) BlobCacheCleanupTest#testJobDeferredCleanup() commit 0f9f7d7f90497708cd2589b34573f10be0b9d331 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-16T21:30:43Z [FLINK-7057][hotfix] fix a checkstyle error commit 742488145b6e5ae94f29371915b2d5b08e4ff655 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-11T09:19:20Z [FLINK-7068][blob] start introducing a new BLOB storage abstraction This is incomplete and may not compile and/or run tests successfully yet. commit de88cede820418628779957af76a9d49b535c26b Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-11T10:29:38Z [FLINK-7068][blob] remove BlobView from TransientBlobCache The transient BLOB cache is not supposed to work with the HA store since it only serves non-HA files. commit 99edd8966e3032110be8d2e240440da13dd2f655 Author: Nico Kruber <n...@data-artisans.com> Date: 2017-07-11T13:04:00Z [FLINK-7068][blob] remove unnecessary use of BlobClient ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---