At a high level, we call the intermediate data produced by programs
"intermediate results". For example, in a WordCount map-reduce program, the
map function produces an intermediate result consisting of (word, 1) pairs,
and the reduce function consumes this intermediate result. Kostas has recently
added documentation explaining the core concepts [1].
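
To make the WordCount example concrete, here is a minimal sketch in plain
Python (not Flink code). The function names map_words and reduce_counts and
the sample input are made up for illustration; the only point is that the list
of (word, 1) pairs sitting between the two functions is what we call the
intermediate result.

```python
from collections import defaultdict


def map_words(line):
    """Map side: emit one (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce side: consume the (word, 1) pairs and sum them per word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


lines = ["to be or not to be"]  # hypothetical sample input

# The intermediate result: produced by map, consumed by reduce.
intermediate = [pair for line in lines for pair in map_words(line)]

result = reduce_counts(intermediate)
```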

The naming of classes related to intermediate results is inconsistent (and 
probably confusing).

- In JobGraphs (internal low-level API to define programs) they are called 
IntermediateDataSet and identified by IntermediateDataSetIDs.

- In ExecutionGraphs (JobManager structure used for state tracking/scheduling) 
they are called IntermediateResult at the ExecutionJobVertex (identified by 
IntermediateDataSetID) and IntermediateResultPartition at the ExecutionVertex 
(identified by IntermediateResultPartitionID).

- At runtime (TaskManager) they are called ResultPartition and identified by 
ResultPartitionID (composition of ExecutionAttemptID and 
IntermediateResultPartitionID). These are further subpartitioned into 
ResultSubpartition instances.
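
The composed runtime ID can be sketched as follows. This is plain Python for
illustration only, not the actual Flink classes: the field names and the
string IDs are assumptions, and the class below merely mirrors the
"composition of ExecutionAttemptID and IntermediateResultPartitionID"
described above. One consequence of composing the two is that two attempts
that produce the same logical partition get distinct runtime IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultPartitionID:
    """Illustrative stand-in for the runtime ID (not the real Flink class)."""
    partition_id: str  # stands in for IntermediateResultPartitionID
    attempt_id: str    # stands in for ExecutionAttemptID


# Hypothetical IDs: the same logical partition produced by two attempts.
first_attempt = ResultPartitionID("partition-0", "attempt-1")
retry = ResultPartitionID("partition-0", "attempt-2")

same_logical_partition = first_attempt.partition_id == retry.partition_id
same_runtime_id = first_attempt == retry
```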

I propose bringing the naming more in line with the existing naming scheme by
prefixing each class with the corresponding management structure:

1) IntermediateDataSet => JobVertexResult (identified by JobVertexResultID)
2) IntermediateResult => ExecutionJobVertexResult (identified by 
JobVertexResultID)
3) IntermediateResultPartition => ExecutionVertexResult (identified by 
ExecutionVertexResultID)
4) ResultPartition => Result
5) ResultSubpartition => ResultPartition
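
The mapping above, written out as a lookup table in plain Python (a sketch for
illustration; the rename helper is hypothetical and only the class names are
taken from the list, the ID classes are left out). It highlights one detail of
the proposal: ResultPartition is both an old name (4) and a new name (5), so
each old name has to be looked up exactly once rather than applying the
renames one after another.

```python
# Old class name -> proposed class name, as listed in 1) to 5).
RENAMES = {
    "IntermediateDataSet": "JobVertexResult",
    "IntermediateResult": "ExecutionJobVertexResult",
    "IntermediateResultPartition": "ExecutionVertexResult",
    "ResultPartition": "Result",
    "ResultSubpartition": "ResultPartition",
}


def rename(name):
    """Look up an old name once; unknown names are returned unchanged."""
    return RENAMES.get(name, name)


# Names that appear both as an old name and as a proposed new name.
reused = sorted(set(RENAMES) & set(RENAMES.values()))
```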

These names are not user-facing, but they are still at the core of the system.
I think that consistent naming of these classes will make it easier for new
contributors to get an overview of how the individual components relate to
each other (the prefixes indicate this). In the docs, we can still refer to
the high-level concept as "intermediate results".

What's your opinion on this? I think now is a good time to think about this 
stuff, because the core classes have only been added recently to the system. 
Feel free to propose alternatives. :-)

– Ufuk

[1] 
https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
