Hi Yangze,

thanks for creating this FLIP. I think it is a very good improvement that
helps our users and ourselves better understand what's going on in
Flink.

Creating the ResourceIDs with host information/pod name is a good idea.

Also deriving ExecutionGraph IDs from their superset ID is a good idea.

The InstanceID is used for fencing purposes. I would not make it a
composition of the ResourceID + a monotonically increasing number. The
problem is that in case of a RM failure the InstanceIDs would start from 0
again and this could lead to collisions.
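
To illustrate the collision concern, here is a minimal sketch. The class names
(CountingResourceManager, RandomResourceManager) are hypothetical stand-ins
and not Flink's actual classes; the sketch only assumes that the counter lives
in RM memory and is lost on failure.

```java
import java.util.UUID;

public class InstanceIdCollisionSketch {

    /** Hypothetical RM that builds IDs as ResourceID + increasing number. */
    static final class CountingResourceManager {
        // In-memory counter: assumed to be lost when the RM process fails.
        private long nextRegistration = 0;

        String register(String resourceId) {
            return resourceId + "_" + nextRegistration++;
        }
    }

    /** Hypothetical RM that hands out a fresh random ID per registration. */
    static final class RandomResourceManager {
        String register(String resourceId) {
            // The ResourceID is deliberately not part of the fencing token here.
            return UUID.randomUUID().toString();
        }
    }

    /** Registers the same resource with an RM before and after a simulated failure. */
    static boolean counterIdsCollideAcrossRestart(String resourceId) {
        String before = new CountingResourceManager().register(resourceId);
        // A new object stands in for the RM restarted after a failure.
        String after = new CountingResourceManager().register(resourceId);
        return before.equals(after);
    }

    static boolean randomIdsCollideAcrossRestart(String resourceId) {
        String before = new RandomResourceManager().register(resourceId);
        String after = new RandomResourceManager().register(resourceId);
        return before.equals(after);
    }

    public static void main(String[] args) {
        String resourceId = "taskmanager-pod-1";
        System.out.println("counter-based IDs collide: "
                + counterIdsCollideAcrossRestart(resourceId));
        System.out.println("random IDs collide: "
                + randomIdsCollideAcrossRestart(resourceId));
    }
}
```

With the counter-based scheme, the registration before and after the restart
gets the same ID, so a message fenced with the old ID could be mistaken for
one belonging to the new registration.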

Logging more information on how the different runtime IDs are correlated is
also a good idea.

Two other ideas for simplifying the IDs are the following:

* The SlotRequestID was introduced because the SlotPool was a separate
RpcEndpoint a while ago. With this no longer being the case I think we
could remove the SlotRequestID and replace it with the AllocationID.
* Instead of creating new SlotRequestIDs for multi task slots one could
derive them from the SlotRequestID used for requesting the underlying
AllocatedSlot.
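
One possible reading of the second point, as a rough sketch only: the ID type
below (SketchSlotRequestId) and its methods are hypothetical and are not
Flink's SlotRequestID API. It assumes a derived ID is simply the parent ID
plus an index suffix.

```java
import java.util.UUID;

public class DerivedSlotRequestIdSketch {

    /** Hypothetical ID type used only for this illustration. */
    static final class SketchSlotRequestId {
        private final String value;

        private SketchSlotRequestId(String value) {
            this.value = value;
        }

        /** ID of the request for the underlying AllocatedSlot. */
        static SketchSlotRequestId generate() {
            return new SketchSlotRequestId(UUID.randomUUID().toString());
        }

        /** Derives the ID of a multi task slot from this ID. */
        SketchSlotRequestId deriveChild(int index) {
            return new SketchSlotRequestId(value + "_" + index);
        }

        /** True if this ID was derived (directly or transitively) from parent. */
        boolean isDerivedFrom(SketchSlotRequestId parent) {
            return value.startsWith(parent.value + "_");
        }

        @Override
        public String toString() {
            return value;
        }
    }

    public static void main(String[] args) {
        SketchSlotRequestId root = SketchSlotRequestId.generate();
        SketchSlotRequestId multiTaskSlot = root.deriveChild(0);
        // The derived ID carries its parent as a prefix, so logs show the lineage.
        System.out.println(root + " -> " + multiTaskSlot);
    }
}
```

The parent ID is then visible in every derived ID, which fits the FLIP's goal
of making the relationship between IDs readable in the logs.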

Given that the slot sharing logic will most likely be reworked with the
pipelined region scheduling, we might be able to resolve these two points
as part of the pipelined region scheduling effort.

Cheers,
Till

On Thu, Mar 26, 2020 at 10:51 AM Yangze Guo <karma...@gmail.com> wrote:

> Hi everyone,
>
> We would like to start a discussion thread on "FLIP-118: Improve
> Flink’s ID system"[1].
>
> This FLIP mainly discusses the following issues, aiming to enhance the
> readability of IDs in logs and to help users debug failures:
>
> - Enhance the readability of the string literals of IDs. Most of them
> are hashcodes, e.g. ExecutionAttemptID, which do not provide much
> meaningful information and are hard to recognize and compare for
> users.
> - Log the ID’s lineage information to make debugging more convenient.
> Currently, the log does not always show the lineage information
> between IDs. Finding out relationships between entities identified by
> given IDs is a common demand, e.g., which slot (identified by its
> AllocationID) is assigned to satisfy the slot request with a given
> SlotRequestID. In the absence of such lineage information, it is
> currently impossible to track the end-to-end lifecycle of an Execution
> or a Task, which makes debugging difficult.
>
> Key changes proposed in the FLIP are as follows:
>
> - Add location information to distributed components
> - Add topology information to graph components
> - Log the ID’s lineage information
> - Expose the identifiers of distributed components to users
>
> Please find more details in the FLIP wiki document [1]. Looking forward to
> your feedback.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148643521
>
> Best,
> Yangze Guo
>
