When running Flink in high availability mode, I've been seeing a high number of 
UnknownKvStateKeyGroupLocation errors from queryable state calls.


If I put a simple getKvState call into a loop that runs once a second, 
sometimes I get the expected results and sometimes 
UnknownKvStateKeyGroupLocation is thrown. This is not associated with a query 
timeout (i.e. a network issue).
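
A minimal sketch of such a polling loop is below, to make the test setup 
concrete. It is not the actual job code: QueryProbe, probe, and the stand-in 
query in main are hypothetical names, and the stand-in only simulates an 
intermittent failure. The real getKvState call, together with whatever wait is 
used on its result, would replace the stand-in.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;

public class QueryProbe {

    /**
     * Calls the query `attempts` times, pausing `pauseMillis` between calls,
     * and tallies each outcome: "OK" for a normal return, otherwise the simple
     * class name of the root cause of the thrown exception.
     */
    public static Map<String, Integer> probe(Callable<?> query, int attempts, long pauseMillis)
            throws InterruptedException {
        Map<String, Integer> tally = new TreeMap<>();
        for (int i = 0; i < attempts; i++) {
            String outcome;
            try {
                query.call();
                outcome = "OK";
            } catch (Exception e) {
                // Walk to the root cause, since client libraries often wrap
                // the interesting exception in another one.
                Throwable root = e;
                while (root.getCause() != null) {
                    root = root.getCause();
                }
                outcome = root.getClass().getSimpleName();
            }
            tally.merge(outcome, 1, Integer::sum);
            System.out.printf("attempt %d at %d ms: %s%n", i, System.currentTimeMillis(), outcome);
            if (i < attempts - 1) {
                Thread.sleep(pauseMillis);
            }
        }
        return tally;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the real query: fails on every third call.
        int[] calls = {0};
        Callable<String> standIn = () -> {
            if (calls[0]++ % 3 == 2) {
                throw new IllegalStateException("simulated lookup failure");
            }
            return "value";
        };
        System.out.println(probe(standIn, 6, 1000L));
    }
}
```

Logging a timestamp per attempt makes it possible to line up each failure 
against the job manager log, for example to see whether failures cluster 
around registration messages.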


From looking at the Flink source code, this problem stems from 
lookup.getKvStateServerAddress returning null. I know all the task managers 
are registering state with the job manager, because I see the "Key value state 
registered for job xx under name yy" messages in the job server log.


Is there anything else I should be looking for? I have several jobs I am 
querying state on, and this seems isolated to only one. I've gone over the 
differences between the jobs very closely, but they are all built from the 
same template.


What would cause lookup.getKvStateServerAddress to sometimes succeed and 
sometimes fail?

