Re: [DISCUSS] Inconsistent naming of intermediate results

Henry Saputra Tue, 31 Mar 2015 11:17:57 -0700

As one of the devs who has recently been tracing the runtime portion of
the code, +1 for renaming to bring the names in line with the concepts.


One thing I would like to have is an immediate change to the
documentation [1] as part of the renaming PR. Otherwise

We then need to file a follow-up ticket to update Kostas' awesome wiki page [2].

- Henry

[1] http://ci.apache.org/projects/flink/flink-docs-master/internal_job_scheduling.html
[2] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

On Tue, Mar 31, 2015 at 7:50 AM, Ufuk Celebi <u...@apache.org> wrote:
> On a high level we call intermediate data produced by programs "intermediate 
> results". For example in a WordCount map-reduce program the map function 
> produces an intermediate result, which consists of (word, 1) pairs and the 
> reduce function consumes this intermediate result. Kostas has recently added 
> documentation explaining the core concepts [1].
>
> The naming of classes related to intermediate results is inconsistent (and 
> probably confusing).
>
> - In JobGraphs (internal low-level API to define programs) they are called 
> IntermediateDataSet and identified by IntermediateDataSetIDs.
>
> - In ExecutionGraphs (JobManager structure used for state 
> tracking/scheduling) they are called IntermediateResult at the 
> ExecutionJobVertex (identified by IntermediateDataSetID) and 
> IntermediateResultPartition at the ExecutionVertex (identified by 
> IntermediateResultPartitionID).
>
> - At runtime (TaskManager) they are called ResultPartition and identified by 
> ResultPartitionID (composition of ExecutionAttemptID and 
> IntermediateResultPartitionID). These are further subpartitioned into 
> ResultSubpartition instances.
>
> I propose to bring the naming more in line with the existing naming scheme and 
> prefix it with the corresponding management structures:
>
> 1) IntermediateDataSet => JobVertexResult (identified by JobVertexResultID)
> 2) IntermediateResult => ExecutionJobVertexResult (identified by 
> JobVertexResultID)
> 3) IntermediateResultPartition => ExecutionVertexResult (identified by 
> ExecutionVertexResultID)
> 4) ResultPartition => Result
> 5) ResultSubpartition => ResultPartition
>
> These names are non-user facing, but still at the core of the system. I think 
> that consistent naming of these classes will make it easier for new 
> contributors to get an overview of how single components relate to each other 
> (the prefixes indicate this). In the docs, we can still refer to the 
> high-level concept as "intermediate results".
>
> What's your opinion on this? I think now is a good time to think about this 
> stuff, because the core classes have only been added recently to the system. 
> Feel free to propose alternatives. :-)
>
> – Ufuk
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
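
For reference, the renaming proposed in the quoted message can be written down as a lookup table. The sketch below is only an illustration, not part of any Flink tooling: the `RENAMES` table and the `rename` helper are hypothetical names, and the ID entries are taken from the "identified by" notes in items 1) and 3). It also shows one practical wrinkle of the proposal: items 4) and 5) reuse the name ResultPartition, so the old names have to be replaced as whole identifiers in a single pass rather than one after another.

```python
import re

# Old name -> proposed name, copied from items 1) to 5) of the quoted proposal.
# The helper and table names are hypothetical; only the class names come from the thread.
RENAMES = {
    "IntermediateDataSet": "JobVertexResult",
    "IntermediateDataSetID": "JobVertexResultID",
    "IntermediateResult": "ExecutionJobVertexResult",
    "IntermediateResultPartition": "ExecutionVertexResult",
    "IntermediateResultPartitionID": "ExecutionVertexResultID",
    "ResultPartition": "Result",
    "ResultSubpartition": "ResultPartition",
}

# Match whole identifiers only (word boundaries on both sides), longest names first.
_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted((re.escape(name) for name in RENAMES), key=len, reverse=True))
    + r")\b"
)


def rename(text):
    """Apply all proposed renames to `text` in a single pass.

    A single pass matters: replacing ResultSubpartition with ResultPartition
    and then ResultPartition with Result sequentially would rename the
    same identifier twice.
    """
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], text)


if __name__ == "__main__":
    print(rename("Each ResultPartition is split into ResultSubpartition instances."))
```

Identifiers the proposal does not mention, such as ResultPartitionID, are left untouched by this sketch because only whole-word matches are replaced.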
