Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6147#discussion_r195065351

    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/cache/DistributedCache.java ---
    @@ -40,6 +41,14 @@
     @Public
     public class DistributedCache {
    +    /**
    +     * An entry for a single file or directory that should be cached.
    +     *
    +     * <p>Entries have different semantics for local directories depending on where we are in the job-submission process.
    +     * After registration through the API {@code filePath} denotes the original directory.
    +     * Before the job is submitted to the cluster directories are zipped, at which point {@code filePath} denotes the path to the local zip.
    +     * After the upload to the cluster, {@code filePath} denotes the (server-side) copy of the zip.
    +     */
         public static class DistributedCacheEntry implements Serializable {
    --- End diff --

It might be out of scope for this PR, but I think `DistributedCacheEntry` mixes too many responsibilities. On the one hand, it is used to transport cache entry information such as `isZipped`, `blobKey` and `isExecutable`, which are only relevant for job submission. On the other hand, it also contains information about which files to transmit to the cluster at job creation time. I think it would be a good idea to separate these responsibilities. As a side effect, we would no longer have nullable fields such as `blobKey` in this class.
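To illustrate the separation suggested above, here is a minimal sketch in plain Java. It is not taken from the PR or from Flink: `CacheEntrySketch`, `FileToShip`, `SubmittedEntry` and `uploaded(...)` are hypothetical names, and `byte[]` is only a stand-in for whatever key type the blob server hands back. The idea is that the type created at registration time carries only what must be shipped, while the type used during submission can only be constructed once a blob key exists, so no field needs to be nullable.

```java
import java.util.Objects;

/** Hypothetical sketch; none of these types exist in Flink. */
final class CacheEntrySketch {

    /** Registration-time view: which file or directory to ship, and nothing else. */
    static final class FileToShip {
        final String filePath;
        final boolean isExecutable;

        FileToShip(String filePath, boolean isExecutable) {
            this.filePath = Objects.requireNonNull(filePath, "filePath");
            this.isExecutable = isExecutable;
        }

        /** Assumed to be called only after the upload, so a blob key is always available. */
        SubmittedEntry uploaded(String serverSidePath, byte[] blobKey, boolean isZipped) {
            return new SubmittedEntry(serverSidePath, blobKey, isZipped, isExecutable);
        }
    }

    /** Submission-time view: every field is mandatory, so blobKey is never null. */
    static final class SubmittedEntry {
        final String filePath;
        final byte[] blobKey;
        final boolean isZipped;
        final boolean isExecutable;

        SubmittedEntry(String filePath, byte[] blobKey, boolean isZipped, boolean isExecutable) {
            this.filePath = Objects.requireNonNull(filePath, "filePath");
            // Defensive copy; rejects a missing key instead of storing null.
            this.blobKey = Objects.requireNonNull(blobKey, "blobKey").clone();
            this.isZipped = isZipped;
            this.isExecutable = isExecutable;
        }
    }

    public static void main(String[] args) {
        // Paths and key bytes are made-up illustration values.
        FileToShip registered = new FileToShip("/data/dictionaries", false);
        SubmittedEntry submitted =
                registered.uploaded("/blobs/dictionaries.zip", new byte[] {1, 2, 3}, true);
        System.out.println(submitted.filePath + " zipped=" + submitted.isZipped);
    }
}
```

Whether the split should look like this, or whether the submission-side data belongs somewhere else entirely, is a design question for a follow-up; the sketch only shows that the nullable `blobKey` disappears once the two roles have separate types.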
---