Re: Apache Flink - Uid and name for Flink sources and sinks

Dian Fu Sun, 24 Nov 2019 17:58:35 -0800

Hi Mans,

Please see my reply inline below.



> 在 2019年11月25日，上午5:42，M Singh <[email protected]> 写道：
> 
> Thanks Dian for your answers.
> 
> A few more questions:
> 
> 1. If I do not assign uids to operators/sources and sinks - I am assuming the 
> framework assigns it one.  Now how does another run of the the same 
> application using the previous runs savepoint/checkpoint match it's 
> tasks/operators to the savepoint/checkpoint state of the application ? 

You are right that the framework will generate an uid for an operator if it's 
not assigned. The uid is generated in a deterministic way to ensure that the 
uid for the same operator remains the same as previous runs(under certain 
conditions). 

The uid generation algorithm:
https://github.com/apache/flink/blob/fd511c345eac31f03b801ff19dbf1f8c86aae760/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamGraphHasherV2.java#L78
 
<https://github.com/apache/flink/blob/fd511c345eac31f03b801ff19dbf1f8c86aae760/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamGraphHasherV2.java#L78>


> 2. Is the operatorID in the checkpoint state the same as uid ?  
> 3. Do you have any pointer as to how an operatorID is generated for the 
> checkpoint and who can it be mapped to back to the operator for 
> troubleshooting purposes ?

The OperatorID is constructed from the uid and they are the same:
https://github.com/apache/flink/blob/66b979efc7786edb1a207339b8670d0e82c459a7/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java#L292
 
<https://github.com/apache/flink/blob/66b979efc7786edb1a207339b8670d0e82c459a7/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java#L292>
> 
> Regarding id attribute - I meant the following:
> 
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java#L139
>  
> <https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java#L139>
> However, I realized that this is not unique across applications runs and so 
> not a good candidate.
> 
> Thanks again for your help.
> 
> 
> 
> 
> 
> On Sunday, November 24, 2019, 04:55:55 AM EST, Dian Fu 
> <[email protected]> wrote:
> 
> 
> 1. Should we assign uid and name to the sources and sinks too ?  
> >> If the sources/sinks have used state, you should assign uid for them. This 
> >> is usually true for sources. 
> 
> 2. What are the pros and cons of adding uid to sources and sinks ?
> >> I'm not seeing the cons for assigning uid to sources and sinks. So I guess 
> >> assigning the uids for sources/sinks is always a good practice.
> 
> 3. The sinks have uid and hashUid - which is the preferred attribute to use  
> for allowing job restarts ?
> >> Could you see if this could answer you question: 
> >> https://stackoverflow.com/questions/46112142/apache-flink-set-operator-uid-vs-uidhash
> >>  
> >> <https://stackoverflow.com/questions/46112142/apache-flink-set-operator-uid-vs-uidhash>
> 
> 4. If sink and sources uid are not provided in the application, can they 
> still maintain state across job restarts from checkpoints ?
> >> It depends on whether the sources/sinks uses state. I think most sources 
> >> use state to maintaining the read offset.
>   
> 5. Can the sinks and sources without uid restart from savepoints ?
> >> The same as above.
> 
> 6. The data streams have an attribute id -  How is this generated and can 
> this be used for creating a uid for the sink ?  
> >> Not sure what do you mean by "attribute id". Could you give some more 
> >> detailed information about it?
> 
> Regards,
> Dian
> 
> On Fri, Nov 22, 2019 at 6:27 PM M Singh <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> Hi Folks - Please let me know if you have any advice on the best practices 
> for setting uid for sources and sinks.  Thanks.  Mans
> On Thursday, November 21, 2019, 10:10:49 PM EST, M Singh 
> <[email protected] <mailto:[email protected]>> wrote:
> 
> 
> Hi Folks:
> 
> I am assigning uid and name for all stateful processors in our application 
> and wanted to find out the following:
> 
> 1. Should we assign uid and name to the sources and sinks too ?  
> 2. What are the pros and cons of adding uid to sources and sinks ?
> 3. The sinks have uid and hashUid - which is the preferred attribute to use  
> for allowing job restarts ?
> 4. If sink and sources uid are not provided in the application, can they 
> still maintain state across job restarts from checkpoints ?  
> 5. Can the sinks and sources without uid restart from savepoints ?
> 6. The data streams have an attribute id -  How is this generated and can 
> this be used for creating a uid for the sink ?  
> 
> Thanks for your help.
> 
> Mans

Re: Apache Flink - Uid and name for Flink sources and sinks

Reply via email to