Hi Seye,

Thanks for digging into the problem.

As Vino and Jörn suggested, this looks like a bug, so please file a JIRA issue. It would also be nice if you could post the issue link here, so that it is tied to the related discussion.

Cheers,
Kostas

> On Oct 14, 2018, at 9:46 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> You have to file an issue. One workaround, to see if this really fixes your
> problem, could be to use reflection to mark this method as public and then
> call it (that is of course nothing for production code). You can also try a
> newer Flink version.
>
>> On 13.10.2018, at 18:02, Seye Jin <seyej...@gmail.com> wrote:
>>
>> I recently upgraded from Flink 1.3 to 1.4 and leverage the Queryable State
>> client in my application. I have 1 JobManager and 5 TaskManagers, all
>> serviced behind Kubernetes. A large state is built and distributed evenly
>> across the task managers, and the client can query state for a specified
>> key.
>>
>> Issue: if a task manager dies and a new one gets spun up (automatically),
>> the QS states successfully recover on the new nodes/task slots, but I start
>> to get timeout exceptions when the client queries for a key, even if I
>> reset or re-deploy the client jobs.
>>
>> I have been trying to triage this and figure out a way to remediate the
>> issue. I found that KvStateClientProxyHandler, which is internal and not
>> exposed to user code, has a forceUpdate flag that can reset the cached
>> KvStateLocations (plus InetAddresses), but it defaults to false and cannot
>> be overridden.
>>
>> I was wondering if anyone knows how to remediate this kind of issue, or if
>> there is a way to make the JobManager aware that the task manager location
>> in its cache is no longer valid.
>>
>> Any tip to resolve this will be appreciated (I can't downgrade back to 1.3
>> or upgrade past 1.4).
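
For context, a minimal sketch of the query path described above, written against the Flink 1.4 QueryableStateClient. The proxy host and port, job id, state name, key, and value types are placeholders for whatever the actual job uses, not values taken from this thread:

import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;

public class QueryStateSketch {

    public static void main(String[] args) throws Exception {
        // Connect to the queryable state proxy on a TaskManager
        // (behind Kubernetes this would be the service address).
        QueryableStateClient client = new QueryableStateClient("proxy-host", 9069);

        JobID jobId = JobID.fromHexString("<job-id>"); // hypothetical job id
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("my-state", Long.class);

        // Query the value for a given key; this is the call that starts
        // timing out once a TaskManager has been replaced.
        CompletableFuture<ValueState<Long>> future = client.getKvState(
                jobId,
                "my-queryable-state",   // name given via asQueryableState(...)
                "some-key",
                BasicTypeInfo.STRING_TYPE_INFO,
                descriptor);

        System.out.println("value = " + future.get().value());
        client.shutdown();
    }
}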
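
And a minimal sketch of the reflection workaround Jörn mentions. It assumes you can get hold of the KvStateClientProxyHandler instance at runtime, and that you know the exact name and parameter types of its private lookup method (the one taking the boolean forceUpdate flag); those internals are assumptions about Flink 1.4, not a documented API, and as Jörn says this is nothing for production code:

import java.lang.reflect.Method;

public final class ForceUpdateHack {

    /**
     * Invokes a private method on the given handler instance via reflection.
     * methodName and parameterTypes must match the internal method exactly
     * (e.g. the location lookup that accepts the boolean forceUpdate flag).
     */
    public static Object invokePrivate(
            Object proxyHandler,
            String methodName,
            Class<?>[] parameterTypes,
            Object... args) throws Exception {

        Method m = proxyHandler.getClass()
                .getDeclaredMethod(methodName, parameterTypes);
        m.setAccessible(true); // bypass the private modifier

        // Invoke with forceUpdate=true among the args so the stale
        // KvStateLocation is re-fetched instead of served from cache.
        return m.invoke(proxyHandler, args);
    }
}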