Thank you all for the feedback, and sorry for the belated reply.

@Till I'm +1 on your two ideas. I'd like to move them out of the scope of this FLIP, since the pipelined region scheduling work is still ongoing. I also agree that the InstanceID in TaskExecutorConnection should not be composed of the ResourceID plus a monotonically increasing value. Thanks a lot for your explanation.
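To make the collision concern concrete, here is a minimal sketch. It is not Flink code: `CounterIdGenerator`, `collidesAfterFailover` and the `<resourceId>-<n>` format are hypothetical names made up for illustration, and it assumes the counter lives only in RM memory, so an RM failure resets it to 0.

```java
import java.util.UUID;

/** Illustration only: hypothetical names, not Flink's actual classes. */
final class InstanceIdCollisionSketch {

    /** Hypothetical scheme: ResourceID plus a counter kept only in RM memory. */
    static final class CounterIdGenerator {
        private long next = 0; // lost on RM failure, so it restarts at 0

        String newInstanceId(String resourceId) {
            return resourceId + "-" + next++;
        }
    }

    /** Alternative: a random ID that does not depend on any RM state. */
    static String randomInstanceId() {
        return UUID.randomUUID().toString();
    }

    /** Returns true if the first ID issued before and after a simulated RM failover is identical. */
    static boolean collidesAfterFailover(String resourceId) {
        String before = new CounterIdGenerator().newInstanceId(resourceId);
        // RM fails over: the new RM starts with a fresh generator.
        String after = new CounterIdGenerator().newInstanceId(resourceId);
        return before.equals(after);
    }

    public static void main(String[] args) {
        // Both registrations get "tm-host-1-0", so the old and new ID cannot be told apart.
        System.out.println(collidesAfterFailover("tm-host-1")); // prints "true"
        // Two random IDs compared for equality.
        System.out.println(randomInstanceId().equals(randomInstanceId()));
    }
}
```

Since the InstanceID is used for fencing, two registrations of the same TaskExecutor sharing one ID across an RM failover is exactly the case to avoid. An ID that does not depend on RM state sidesteps it.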
@Konstantin @Yang Regarding the PodName of the TaskExecutor on K8s, I second Yang's suggestion. It makes sense to me to let the user export RESOURCE_ID and have the TM respect it. The user needs to guarantee that there are no collisions between different TMs.

Best,
Yangze Guo

On Tue, Mar 31, 2020 at 12:25 AM Steven Wu <stevenz...@gmail.com> wrote:
>
> +1 on allowing user defined resourceId for taskmanager
>
> On Sun, Mar 29, 2020 at 7:24 PM Yang Wang <danrtsey...@gmail.com> wrote:
>
> > Hi Konstantin,
> >
> > I think it is a good idea. Our users currently report a similar issue
> > with the resourceId in standalone clusters. When we start a standalone
> > cluster now, the `TaskManagerRunner` always generates a UUID for the
> > resourceId. It is used to register at the jobmanager, but it is not
> > convenient to match with the real taskmanager, especially in container
> > environments.
> >
> > I think a possible solution is to support a user defined resourceId.
> > We could get it from the environment. For standalone on K8s, we could
> > set the "RESOURCE_ID" env to the pod name so that it is easier to match
> > the taskmanager with the K8s pod.
> >
> > Moreover, I am afraid we cannot set the pod name to the resourceId.
> > You can only set the "deployment.meta.name", since the pod name is
> > generated by K8s in the pattern {deployment.meta.name}-{rc.uuid}-{uuid}.
> > Instead, we would do the reverse and set the resourceId to the pod name.
> >
> > Best,
> > Yang
> >
> > Konstantin Knauf <konstan...@ververica.com> wrote on Sun, Mar 29, 2020
> > at 8:06 PM:
> >
> > > Hi Yangze, Hi Till,
> > >
> > > thank you for working on this topic. I believe it will make debugging
> > > large Apache Flink deployments much more feasible.
> > >
> > > I was wondering whether it would make sense to allow the user to specify
> > > the Resource ID in standalone setups?
> > > For example, many users still implicitly use standalone clusters on
> > > Kubernetes (the native support is still experimental), and in these
> > > cases it would be interesting to also set the PodName as the
> > > ResourceID. What do you think?
> > >
> > > Cheers,
> > >
> > > Konstantin
> > >
> > > On Thu, Mar 26, 2020 at 6:49 PM Till Rohrmann <trohrm...@apache.org>
> > > wrote:
> > >
> > > > Hi Yangze,
> > > >
> > > > thanks for creating this FLIP. I think it is a very good improvement
> > > > that helps our users and ourselves better understand what's going on
> > > > in Flink.
> > > >
> > > > Creating the ResourceIDs with host information/pod name is a good idea.
> > > >
> > > > Also deriving ExecutionGraph IDs from their superset ID is a good idea.
> > > >
> > > > The InstanceID is used for fencing purposes. I would not make it a
> > > > composition of the ResourceID + a monotonically increasing number. The
> > > > problem is that in case of an RM failure the InstanceIDs would start
> > > > from 0 again, and this could lead to collisions.
> > > >
> > > > Logging more information on how the different runtime IDs are
> > > > correlated is also a good idea.
> > > >
> > > > Two other ideas for simplifying the IDs are the following:
> > > >
> > > > * The SlotRequestID was introduced because the SlotPool was a separate
> > > > RpcEndpoint a while ago. With this no longer being the case, I think we
> > > > could remove the SlotRequestID and replace it with the AllocationID.
> > > > * Instead of creating new SlotRequestIDs for multi task slots, one could
> > > > derive them from the SlotRequestID used for requesting the underlying
> > > > AllocatedSlot.
> > > >
> > > > Given that the slot sharing logic will most likely be reworked with the
> > > > pipelined region scheduling, we might be able to resolve these two
> > > > points as part of the pipelined region scheduling effort.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Thu, Mar 26, 2020 at 10:51 AM Yangze Guo <karma...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We would like to start a discussion thread on "FLIP-118: Improve
> > > > > Flink’s ID system"[1].
> > > > >
> > > > > This FLIP mainly discusses the following issues, aiming to enhance
> > > > > the readability of IDs in logs and to help users debug in case of
> > > > > failures:
> > > > >
> > > > > - Enhance the readability of the string literals of IDs. Most of them
> > > > > are hashcodes, e.g. ExecutionAttemptID, which do not provide much
> > > > > meaningful information and are hard for users to recognize and
> > > > > compare.
> > > > > - Log the ID’s lineage information to make debugging more convenient.
> > > > > Currently, the log does not always show the lineage information
> > > > > between IDs. Finding out relationships between entities identified by
> > > > > given IDs is a common demand, e.g., which slot (identified by its
> > > > > AllocationID) is assigned to satisfy which slot request (identified
> > > > > by its SlotRequestID). In the absence of such lineage information,
> > > > > it is currently impossible to track the end-to-end lifecycle of an
> > > > > Execution or a Task, which makes debugging difficult.
> > > > >
> > > > > Key changes proposed in the FLIP are as follows:
> > > > >
> > > > > - Add location information to distributed components
> > > > > - Add topology information to graph components
> > > > > - Log the ID’s lineage information
> > > > > - Expose the identifiers of distributed components to the user
> > > > >
> > > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > > forward to your feedback.
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148643521
> > > > >
> > > > > Best,
> > > > > Yangze Guo
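
For reference, the user-defined resourceId idea discussed in this thread (read "RESOURCE_ID" from the environment, otherwise keep generating a UUID as the `TaskManagerRunner` does today) could look roughly like the sketch below. This is only an assumption about the shape of such a change, not the actual Flink implementation: `ResourceIdResolver` and `resolve` are hypothetical names, and trimming or ignoring blank values is a design choice made here for illustration.

```java
import java.util.Map;
import java.util.UUID;

/** Illustration only: hypothetical helper, not part of Flink. */
final class ResourceIdResolver {

    /** Environment variable name proposed in this thread. */
    static final String RESOURCE_ID_ENV = "RESOURCE_ID";

    /**
     * Returns the user-provided resource ID if one is set and non-blank,
     * otherwise falls back to a freshly generated UUID string.
     */
    static String resolve(Map<String, String> env) {
        String fromEnv = env.get(RESOURCE_ID_ENV);
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            // The user is responsible for keeping this unique across TMs.
            return fromEnv.trim();
        }
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // With RESOURCE_ID exported (e.g. set to the pod name), that value is printed;
        // otherwise a random UUID is printed.
        System.out.println(resolve(System.getenv()));
    }
}
```

On Kubernetes, the pod name can be exposed to the container as an environment variable through the downward API (a `fieldRef` to `metadata.name`), which would give the TM a RESOURCE_ID that matches its pod without the user picking names by hand.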