Hey Joe! This sounds odd. Are any failures (JobManager or TaskManager) or leader elections being reported? Such events should show up in the JobManager/TaskManager logs.
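
For context on why failures matter here: as far as I understand, the location lookup is resolved per key group, so it can return null for keys whose key group has no registered server address at that moment (for example, while tasks are restarting), while other keys still resolve fine. Below is a minimal sketch of that idea in Java. It is a simplified model and not Flink's actual code: the class KeyGroupLookupSketch, its methods, and the address string are made up for illustration, and the key-to-key-group mapping is a plain hash modulo rather than Flink's real assignment.

```java
import java.util.Arrays;

/** Simplified, hypothetical model of a per-key-group location table (not Flink's class). */
public class KeyGroupLookupSketch {

    // One slot per key group; null means no server address is registered for it.
    private final String[] addresses;

    public KeyGroupLookupSketch(int numKeyGroups) {
        this.addresses = new String[numKeyGroups];
    }

    /** Registers a server address for the inclusive key group range [start, end]. */
    public void register(int start, int end, String address) {
        Arrays.fill(addresses, start, end + 1, address);
    }

    /** Clears the inclusive range [start, end], e.g. when the owning task goes away. */
    public void unregister(int start, int end) {
        Arrays.fill(addresses, start, end + 1, null);
    }

    /** Returns the registered address for the key group, or null if there is none. */
    public String lookup(int keyGroup) {
        return addresses[keyGroup];
    }

    /** Assumption: plain hash modulo; a real system may scramble the hash first. */
    public static int keyGroupFor(Object key, int numKeyGroups) {
        return Math.floorMod(key.hashCode(), numKeyGroups);
    }

    public static void main(String[] args) {
        int numKeyGroups = 8;
        KeyGroupLookupSketch table = new KeyGroupLookupSketch(numKeyGroups);

        // Only half of the key groups are registered (hypothetical address).
        table.register(0, 3, "taskmanager-1");

        for (int key = 0; key < numKeyGroups; key++) {
            int keyGroup = keyGroupFor(key, numKeyGroups);
            String address = table.lookup(keyGroup);
            System.out.println("key " + key + " -> key group " + keyGroup + " -> "
                    + (address == null ? "unknown location" : address));
        }
    }
}
```

In this model, queries for some keys succeed and others fail depending only on which key groups are registered at the time of the call. If something similar is happening in your job, the failing queries should line up with particular keys or with restart events in the logs.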

On Tue, May 16, 2017 at 2:28 PM, Joe Olson <jo4...@outlook.com> wrote:
> When running Flink in high availability mode, I've been seeing a high number
> of UnknownKvStateKeyGroupLocation errors being returned when using queryable
> state calls.
>
> If I put a simple getKvState call into a loop executing every second, and
> call it repeatedly, sometimes I will get the expected results, sometimes I
> will get UnknownKvStateKeyGroupLocation thrown. This is not associated with
> a query timeout (network issue).
>
> From looking at the Flink source code, this problem stems from a failure of
> lookup.getKvStateServerAddress returning null. I know all the task managers
> are registering state with the job manager, because I see the "Key value
> state registered for job xx under name yy" messages in the job server log.
>
> Anything else I should be looking for? I have several jobs I am querying
> state on, and this seems isolated to only one. I've gone over very closely
> the difference between the jobs, but they all built from the same template.
>
> What would cause a lookup.getKvStateServerAddress to sometimes succeed, and
> sometimes to fail?