I like making the naming consistent. I never thought of the intermediate data sets as being strictly produced by a vertex, though, so I am unsure whether we should use that exact naming scheme or one that decouples the results from the term "VertexResult".
On Tue, Mar 31, 2015 at 5:27 PM, Kostas Tzoumas <ktzou...@apache.org> wrote:

> I like the fact that the naming scheme follows some logic.
>
> I also like that we have two easy to understand concepts:
> - Operator (be that in any of the above representations)
> - Result (of executing an operator)
>
> +1
>
> On Tue, Mar 31, 2015 at 4:50 PM, Ufuk Celebi <u...@apache.org> wrote:
>
> > On a high level we call intermediate data produced by programs
> > "intermediate results". For example in a WordCount map-reduce program the
> > map function produces an intermediate result, which consists of (word, 1)
> > pairs and the reduce function consumes this intermediate result. Kostas has
> > recently added documentation explaining the core concepts [1].
> >
> > The naming of classes related to intermediate results is inconsistent (and
> > probably confusing).
> >
> > - In JobGraphs (internal low-level API to define programs) they are called
> >   IntermediateDataSet and identified by IntermediateDataSetIDs.
> >
> > - In ExecutionGraphs (JobManager structure used for state
> >   tracking/scheduling) they are called IntermediateResult at the
> >   ExecutionJobVertex (identified by IntermediateDataSetID) and
> >   IntermediateResultPartition at the ExecutionVertex (identified by
> >   IntermediateResultPartitionID).
> >
> > - At runtime (TaskManager) they are called ResultPartition and identified
> >   by ResultPartitionID (composition of ExecutionAttemptID and
> >   IntermediateResultPartitionID). These are further subpartitioned into
> >   ResultSubpartition instances.
> >
> > I propose to get the naming more in line with the existing naming scheme
> > and prefix it with the corresponding management structures:
> >
> > 1) IntermediateDataSet => JobVertexResult (identified by JobVertexResultID)
> > 2) IntermediateResult => ExecutionJobVertexResult (identified by
> >    JobVertexResultID)
> > 3) IntermediateResultPartition => ExecutionVertexResult (identified by
> >    ExecutionVertexResultID)
> > 4) ResultPartition => Result
> > 5) ResultSubpartition => ResultPartition
> >
> > These names are non-user facing, but still at the core of the system. I
> > think that consistent naming of these classes will make it easier for new
> > contributors to get an overview of how single components relate to each
> > other (the prefixes indicate this). In the docs, we can still refer to the
> > high-level concept as "intermediate results".
> >
> > What's your opinion on this? I think now is a good time to think about
> > this stuff, because the core classes have only been added recently to the
> > system. Feel free to propose alternatives. :-)
> >
> > – Ufuk
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
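
For comparison, the five renames proposed in the quoted message can be written down as a simple lookup table. This is only an illustrative sketch in Python (the actual classes are Java). `PROPOSED_RENAMES`, `proposed_name`, and the string-typed ID fields are names made up for this sketch, not Flink code.

```python
from dataclasses import dataclass

# Old name -> proposed name, copied from items 1) to 5) of the quoted proposal.
PROPOSED_RENAMES = {
    "IntermediateDataSet": "JobVertexResult",
    "IntermediateResult": "ExecutionJobVertexResult",
    "IntermediateResultPartition": "ExecutionVertexResult",
    "ResultPartition": "Result",
    "ResultSubpartition": "ResultPartition",
}


def proposed_name(current_name: str) -> str:
    """Return the proposed name; names outside the proposal stay unchanged."""
    return PROPOSED_RENAMES.get(current_name, current_name)


@dataclass(frozen=True)
class ResultPartitionID:
    """Runtime ID, described in the thread as a composition of two IDs.

    Plain strings stand in for the real ID types (an assumption of this sketch).
    """
    execution_attempt_id: str
    intermediate_result_partition_id: str


if __name__ == "__main__":
    for old in PROPOSED_RENAMES:
        print(f"{old} -> {proposed_name(old)}")
```

One thing the table makes visible: "ResultPartition" is both an old name (item 4) and a new name (item 5), so the renames would have to be applied in one pass or in a careful order. Renaming ResultSubpartition first and then applying item 4 as a plain text replacement would turn both classes into "Result".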