Re: [DISCUSS] FLIP-118: Improve Flink’s ID system

Zhu Zhu Tue, 31 Mar 2020 01:55:54 -0700

Thanks for proposing this improvement Yangze. Big +1 for the overall
proposal. It can help a lot in troubleshooting.


Here are a few questions for it:
1. Shall we make JobVertexID a composition of JobID and a topology index?
This would help in the session cluster case, so that we can identify which
tasks are from which jobs along with the rework of ExecutionAttemptID.
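
To make question 1 concrete, here is a minimal sketch of such a
composition. The names JobId and CompositeJobVertexId, the UUID backing and
the "_" separator are assumptions for illustration only; they are not the
existing Flink classes or the format proposed in the FLIP.

```java
import java.util.Objects;
import java.util.UUID;

/** Sketch only: hypothetical stand-ins, not Flink's actual ID classes. */
final class CompositeIdSketch {

    /** Hypothetical job-level ID backed by 128 bits. */
    static final class JobId {
        private final UUID value;

        JobId(UUID value) {
            this.value = Objects.requireNonNull(value);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof JobId && value.equals(((JobId) o).value);
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }

        @Override
        public String toString() {
            // Render as 32 hex characters without dashes.
            return value.toString().replace("-", "");
        }
    }

    /** Hypothetical vertex ID composed of the owning job's ID and a topology index. */
    static final class CompositeJobVertexId {
        private final JobId jobId;
        private final int topologyIndex;

        CompositeJobVertexId(JobId jobId, int topologyIndex) {
            if (topologyIndex < 0) {
                throw new IllegalArgumentException("topologyIndex must be >= 0");
            }
            this.jobId = Objects.requireNonNull(jobId);
            this.topologyIndex = topologyIndex;
        }

        JobId getJobId() {
            return jobId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CompositeJobVertexId)) {
                return false;
            }
            CompositeJobVertexId other = (CompositeJobVertexId) o;
            return topologyIndex == other.topologyIndex && jobId.equals(other.jobId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(jobId, topologyIndex);
        }

        @Override
        public String toString() {
            // The job is readable directly from the vertex ID in a log line.
            return jobId + "_" + topologyIndex;
        }
    }

    public static void main(String[] args) {
        JobId job = new JobId(new UUID(1L, 2L));
        System.out.println(new CompositeJobVertexId(job, 3));
    }
}
```

With something like this, a log line that prints a vertex ID in a session
cluster would already tell us which job the vertex belongs to.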

2. You mentioned that "Add the producer info to the string literal of
IntermediateDataSetID". Do you mean to make IntermediateDataSetID a
composition of JobVertexID and a consumer index?
How about we still keep IntermediateDataSetID independent from JobVertexID,
but just print the producing relationships in logs? I think keeping
IntermediateDataSetID independent may be better considering the cross job
result usages in interactive query cases.
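
A rough sketch of the alternative in question 2: the data set ID stays an
opaque value, and the producing relationship is kept next to it and only
rendered into the log message. The class ProducerLoggingSketch, its methods
and the message wording are hypothetical and only illustrate the idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Sketch only: keeps the producer relationship outside of the data set ID. */
final class ProducerLoggingSketch {

    // Opaque data set ID -> ID string of the producing vertex.
    private final Map<UUID, String> producerByDataSet = new LinkedHashMap<>();

    /** Records the relationship and returns the line that would be logged. */
    String registerAndDescribe(UUID dataSetId, String producerVertexId) {
        producerByDataSet.put(dataSetId, producerVertexId);
        return "IntermediateDataSet " + dataSetId
                + " is produced by JobVertex " + producerVertexId;
    }

    /** Looks up the producer later, e.g. when a consumer from another job shows up. */
    String producerOf(UUID dataSetId) {
        return producerByDataSet.get(dataSetId);
    }

    public static void main(String[] args) {
        ProducerLoggingSketch sketch = new ProducerLoggingSketch();
        UUID dataSet = new UUID(0L, 7L);
        System.out.println(sketch.registerAndDescribe(dataSet, "vertex-3"));
    }
}
```

The ID itself carries no job or vertex information here, so reusing a
result from a different job would not make the ID misleading.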

3. The new IDs will become larger with this rework. The
TaskDeploymentDescriptor can become much larger since it is mainly composed
of a variety of IDs. I'm not sure how much larger it would be, but it could
mean more memory and CPU cost, and result in more frequent GCs, messages
exceeding the Akka frame size limit, and the main thread being blocked for
longer. This should not be a problem in most cases but might be a problem
for large scale jobs. Shall we have a benchmark for it?
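
As a starting point for such a benchmark, a minimal sketch that only
compares raw serialized sizes. The layouts are assumptions: 16 bytes for an
opaque ID (two longs), and 24 bytes for a composite ID (a 16-byte parent ID
plus two int indices). The real layouts depend on the final design, and a
real benchmark would also have to measure serialization time and GC
pressure.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch only: compares serialized sizes of two assumed ID layouts. */
final class IdSizeSketch {

    /** Serializes `count` opaque IDs, each assumed to be two longs (16 bytes). */
    static int opaqueIdBytes(int count) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < count; i++) {
                out.writeLong(i);
                out.writeLong(i + 1L);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Serializes `count` composite IDs: assumed 16-byte parent ID plus two ints. */
    static int compositeIdBytes(int count) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < count; i++) {
                out.writeLong(i);      // parent ID, upper half
                out.writeLong(i + 1L); // parent ID, lower half
                out.writeInt(i);       // e.g. a subtask index
                out.writeInt(0);       // e.g. an attempt number
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int count = 100_000;
        System.out.println("opaque:    " + opaqueIdBytes(count) + " bytes");
        System.out.println("composite: " + compositeIdBytes(count) + " bytes");
    }
}
```

Under these assumed layouts the composite form is 1.5x the size per ID, so
the interesting question is how many IDs a TaskDeploymentDescriptor carries
for a large job.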

Thanks,
Zhu Zhu

On Tue, Mar 31, 2020 at 2:19 PM Yangze Guo <[email protected]> wrote:

> Thank you all for the feedback! Sorry for the belated reply.
>
> @Till
> I'm +1 for your two ideas and I'd like to move these two out of the
> scope of this FLIP since the pipelined region scheduling is ongoing
> work now.
> I also agree that we should not make the InstanceID in
> TaskExecutorConnection a composition of the ResourceID plus a
> monotonically increasing value. Thanks a lot for your explanation.
>
> @Konstantin @Yang
> Regarding the PodName of TaskExecutor on K8s, I second Yang's
> suggestion. It makes sense to me to let users export RESOURCE_ID and
> have the TM respect it. Users need to guarantee that there are no
> collisions between different TMs.
>
> Best,
> Yangze Guo
>
>
> On Tue, Mar 31, 2020 at 12:25 AM Steven Wu <[email protected]> wrote:
> >
> > +1 on allowing user defined resourceId for taskmanager
> >
> > On Sun, Mar 29, 2020 at 7:24 PM Yang Wang <[email protected]> wrote:
> >
> > > Hi Konstantin,
> > >
> > > I think it is a good idea. Currently, our users also report a similar
> > > issue with the resourceId of standalone clusters. When we start a
> > > standalone cluster now, the `TaskManagerRunner` always generates a UUID
> > > for the resourceId. It is used to register with the jobmanager, and it
> > > is not convenient to match it with the real taskmanager, especially in
> > > container environments.
> > >
> > > I think a possible solution is to support a user-defined resourceId.
> > > We could get it from the environment. For standalone on K8s, we could
> > > set the "RESOURCE_ID" env to the pod name so that it is easier to match
> > > the taskmanager with the K8s pod.
> > >
> > > Moreover, I am afraid we cannot set the pod name from the resourceId. I
> > > think you could only set the "deployment.meta.name", since the pod name
> > > is generated by K8s in the pattern
> > > {deployment.meta.name}-{rc.uuid}-{uuid}. Instead, it works the other
> > > way around: we set the resourceId to the pod name.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > On Sun, Mar 29, 2020 at 8:06 PM Konstantin Knauf <[email protected]> wrote:
> > >
> > > > Hi Yangze, Hi Till,
> > > >
> > > > thank you for working on this topic. I believe it will make debugging
> > > > large Apache Flink deployments much more feasible.
> > > >
> > > > I was wondering whether it would make sense to allow the user to
> > > > specify the Resource ID in standalone setups? For example, many users
> > > > still implicitly use standalone clusters on Kubernetes (the native
> > > > support is still experimental) and in these cases it would be
> > > > interesting to also set the PodName as the ResourceID. What do you
> > > > think?
> > > >
> > > > Cheers,
> > > >
> > > > Konstantin
> > > >
> > > > On Thu, Mar 26, 2020 at 6:49 PM Till Rohrmann <[email protected]> wrote:
> > > >
> > > > > Hi Yangze,
> > > > >
> > > > > thanks for creating this FLIP. I think it is a very good improvement
> > > > > helping our users and ourselves better understand what's going on in
> > > > > Flink.
> > > > >
> > > > > Creating the ResourceIDs with host information/pod name is a good
> > > > > idea.
> > > > >
> > > > > Also deriving ExecutionGraph IDs from their superset ID is a good
> > > > > idea.
> > > > >
> > > > > The InstanceID is used for fencing purposes. I would not make it a
> > > > > composition of the ResourceID + a monotonically increasing number.
> > > > > The problem is that in case of an RM failure the InstanceIDs would
> > > > > start from 0 again and this could lead to collisions.
> > > > >
> > > > > Logging more information on how the different runtime IDs are
> > > > > correlated is also a good idea.
> > > > >
> > > > > Two other ideas for simplifying the ids are the following:
> > > > >
> > > > > * The SlotRequestID was introduced because the SlotPool was a
> > > > > separate RpcEndpoint a while ago. With this no longer being the case
> > > > > I think we could remove the SlotRequestID and replace it with the
> > > > > AllocationID.
> > > > > * Instead of creating new SlotRequestIDs for multi task slots one
> > > > > could derive them from the SlotRequestID used for requesting the
> > > > > underlying AllocatedSlot.
> > > > >
> > > > > Given that the slot sharing logic will most likely be reworked with
> > > > > the pipelined region scheduling, we might be able to resolve these
> > > > > two points as part of the pipelined region scheduling effort.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Thu, Mar 26, 2020 at 10:51 AM Yangze Guo <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > We would like to start a discussion thread on "FLIP-118: Improve
> > > > > > Flink’s ID system"[1].
> > > > > >
> > > > > > This FLIP mainly discusses the following issues, aiming to enhance
> > > > > > the readability of IDs in logs and help users debug failures:
> > > > > >
> > > > > > - Enhance the readability of the string literals of IDs. Most of
> > > > > > them are hashcodes, e.g. ExecutionAttemptID, which do not provide
> > > > > > much meaningful information and are hard for users to recognize and
> > > > > > compare.
> > > > > > - Log the ID’s lineage information to make debugging more
> > > > > > convenient. Currently, the log does not always show the lineage
> > > > > > information between IDs. Finding out the relationships between
> > > > > > entities identified by given IDs is a common demand, e.g., which
> > > > > > AllocationID's slot is assigned to satisfy the slot request with a
> > > > > > given SlotRequestID. Without such lineage information, it’s
> > > > > > impossible to track the end-to-end lifecycle of an Execution or a
> > > > > > Task now, which makes debugging difficult.
> > > > > >
> > > > > > Key changes proposed in the FLIP are as follows:
> > > > > >
> > > > > > - Add location information to distributed components
> > > > > > - Add topology information to graph components
> > > > > > - Log the ID’s lineage information
> > > > > > - Expose the identifiers of distributed components to users
> > > > > >
> > > > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > > > forward to your feedback.
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148643521
> > > > > >
> > > > > > Best,
> > > > > > Yangze Guo
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Konstantin Knauf | Head of Product
> > > >
> > >
>
