Re: [DISCUSS] Inconsistent naming of intermediate results

Stephan Ewen Tue, 31 Mar 2015 09:17:28 -0700

I like getting the consistency in there.

I never thought of the intermediate data sets as being strictly produced
by a vertex, so I am unsure whether we should use that exact naming scheme,
or one that disconnects the results from the term "VertexResult".


On Tue, Mar 31, 2015 at 5:27 PM, Kostas Tzoumas <ktzou...@apache.org> wrote:

> I like the fact that the naming scheme follows some logic.
>
> I also like that we have two easy to understand concepts:
> - Operator (be that in any of the above representations)
> - Result (of executing an operator)
>
> +1
>
> On Tue, Mar 31, 2015 at 4:50 PM, Ufuk Celebi <u...@apache.org> wrote:
>
> > At a high level, we call intermediate data produced by programs
> > "intermediate results". For example, in a WordCount map-reduce program the
> > map function produces an intermediate result, which consists of (word, 1)
> > pairs, and the reduce function consumes this intermediate result. Kostas
> > has recently added documentation explaining the core concepts [1].
> >
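
To make the WordCount example concrete, here is a minimal sketch in plain
Python. It only illustrates the concept and is not Flink code; the function
names map_phase and reduce_phase are made up for this sketch.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit one (word, 1) pair per word; this list is the intermediate result."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Consume the intermediate result and sum the counts per word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# The list of (word, 1) pairs sits between the two functions.
intermediate_result = map_phase(["to be or not", "to be"])
word_counts = reduce_phase(intermediate_result)
```

Here intermediate_result plays the role of the "intermediate result": it is
produced by the map step and consumed by the reduce step.
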
> > The naming of classes related to intermediate results is inconsistent (and
> > probably confusing).
> >
> > - In JobGraphs (internal low-level API to define programs) they are called
> > IntermediateDataSet and identified by IntermediateDataSetIDs.
> >
> > - In ExecutionGraphs (JobManager structure used for state
> > tracking/scheduling) they are called IntermediateResult at the
> > ExecutionJobVertex (identified by IntermediateDataSetID) and
> > IntermediateResultPartition at the ExecutionVertex (identified by
> > IntermediateResultPartitionID).
> >
> > - At runtime (TaskManager) they are called ResultPartition and identified
> > by ResultPartitionID (composition of ExecutionAttemptID and
> > IntermediateResultPartitionID). These are further subpartitioned into
> > ResultSubpartition instances.
> >
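
The identifier relationships described above can be sketched roughly as
follows. This is a simplified model in Python, not the actual Flink classes:
the class names are taken from the description, but the field names
(value, producer_attempt, partition) and the string-valued IDs are
assumptions made for illustration.

```python
from dataclasses import dataclass

# Simplified stand-ins named after the classes mentioned above.
# All fields are assumptions for this sketch, not Flink's real layout.


@dataclass(frozen=True)
class IntermediateDataSetID:
    """Identifies an IntermediateDataSet (JobGraph) and its IntermediateResult."""
    value: str


@dataclass(frozen=True)
class IntermediateResultPartitionID:
    """Identifies an IntermediateResultPartition at an ExecutionVertex."""
    value: str


@dataclass(frozen=True)
class ExecutionAttemptID:
    """Identifies one execution attempt of the producing task."""
    value: str


@dataclass(frozen=True)
class ResultPartitionID:
    """Runtime ID: composition of the attempt ID and the partition ID."""
    producer_attempt: ExecutionAttemptID
    partition: IntermediateResultPartitionID


partition = IntermediateResultPartitionID("p-0")
first_try = ResultPartitionID(ExecutionAttemptID("attempt-1"), partition)
retry = ResultPartitionID(ExecutionAttemptID("attempt-2"), partition)
```

Under these assumptions, first_try and retry refer to the same logical
partition (first_try.partition == retry.partition) but are distinct runtime
IDs, because the attempt ID is part of the composite.
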
> > I propose to get the naming more in line with the existing naming scheme
> > and prefix it with the corresponding management structures:
> >
> > 1) IntermediateDataSet => JobVertexResult (identified by JobVertexResultID)
> > 2) IntermediateResult => ExecutionJobVertexResult (identified by
> > JobVertexResultID)
> > 3) IntermediateResultPartition => ExecutionVertexResult (identified by
> > ExecutionVertexResultID)
> > 4) ResultPartition => Result
> > 5) ResultSubpartition => ResultPartition
> >
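
As a quick way to reason about the proposal, the five renames can be written
down as a lookup table. The sketch below is plain Python for illustration
only; PROPOSED_RENAMES, rename, and reused are hypothetical names, not part
of any Flink tooling.

```python
# Proposed renames from the list above (current class name -> proposed name).
PROPOSED_RENAMES = {
    "IntermediateDataSet": "JobVertexResult",
    "IntermediateResult": "ExecutionJobVertexResult",
    "IntermediateResultPartition": "ExecutionVertexResult",
    "ResultPartition": "Result",
    "ResultSubpartition": "ResultPartition",
}


def rename(current_name):
    """Look up the proposed name once; names not in the table stay unchanged."""
    return PROPOSED_RENAMES.get(current_name, current_name)


# Names that appear both as a current name and as a proposed name.
reused = sorted(set(PROPOSED_RENAMES) & set(PROPOSED_RENAMES.values()))
```

With this table, reused contains only "ResultPartition": under the proposal
that name would move from today's ResultPartition to today's
ResultSubpartition, so the mapping has to be applied in a single pass rather
than chained.
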
> > These names are non-user facing, but still at the core of the system. I
> > think that consistent naming of these classes will make it easier for new
> > contributors to get an overview of how single components relate to each
> > other (the prefixes indicate this). In the docs, we can still refer to the
> > high-level concept as "intermediate results".
> >
> > What's your opinion on this? I think now is a good time to think about
> > this stuff, because the core classes have only been added recently to the
> > system. Feel free to propose alternatives. :-)
> >
> > – Ufuk
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
>
