Hi Stephan,
I guess this is the case. Our cluster is a bit overloaded network-wise,
so sometimes a Task Manager gets disconnected, which causes a restart of
the entire job,
leading to multiple segfaults in other Task Managers and prolonging recovery.
We're upgrading the network, hopefully the p
Hi,
I would assume that those segfaults are only observed *after* a job is already
in the process of canceling? This is a known problem, but currently "accepted"
behaviour after discussions with Stephan and Aljoscha (in CC). From that
discussion, the background is that the native RocksDB resour
Hi,
We are using processing-time timers to implement some state cleanup logic.
After switching from FsStateBackend to RocksDB, we encounter a lot of segfaults
from the Time Trigger threads when accessing/clearing state values.
We currently use the latest 1.3-SNAPSHOT, with the patch upgrading RocksDB