BTW, do you have a bare minimum application which can reproduce the issue? On Fri, Feb 14, 2025 at 5:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> Hi Peter, > > I've had a look on this and seems like this problem or a similar one still > exists in 1.20. > Since I've not done in-depth analysis I can only say this by taking a look > at the last comments, like: > > Faced an issue on Flink 1.20.0 > > but there are more of such... > > So all in all I've the feeling that this area still has some edge case or > race issue(s) left. > > BR, > G > > > On Tue, Jan 7, 2025 at 4:26 PM Peter Westermann <no.westerm...@genesys.com> > wrote: > >> In the past, we have infrequently had issues with corrupted timer state >> (see https://issues.apache.org/jira/browse/FLINK-23886), which resolve, >> when we go back to a previous savepoint and restart the job from that. >> Recently (since we upgraded to Flink 1.19.1), we have seen more state >> corruption issues for checkpoint/savepoint data in RocksDB. Each time, the >> resolution has been to go back to a previous savepoint and replay events >> and everything was fine. >> >> Issues we have seen: >> >> - Exceptions when deserializing MapState entries, with an >> EOFException at the start of key deserialization: >> >> j.io.EOFException: null >> >> at >> o.a.f.c.m.DataInputDeserializer.readUnsignedShort(DataInputDeserializer.java:339) >> >> at >> o.a.f.c.m.DataInputDeserializer.readUTF(DataInputDeserializer.java:251) >> >> at >> c.g.a.f.o.t.s.ObservationKeySerializer.deserialize(ObservationKeySerializer.java:68) >> >> at >> c.g.a.f.o.t.s.ObservationKeySerializer.deserialize(ObservationKeySerializer.java:17) >> >> at >> o.a.f.c.s.s.RocksDBMapState.deserializeUserKey(RocksDBMapState.java:390) >> >> at o.a.f.c.s.s.RocksDBMapState.access$000(RocksDBMapState.java:66) >> >> at >> o.a.f.c.s.s.RocksDBMapState$RocksDBMapEntry.getKey(RocksDBMapState.java:493) >> >> ... 27 common frames omitted >> >> Wrapped by: o.a.f.u.FlinkRuntimeException: Error while deserializing the >> user key. >> >> at >> o.a.f.c.s.s.RocksDBMapState$RocksDBMapEntry.getKey(RocksDBMapState.java:496) >> >> at >> c.g.a.f.o.t.ObservationsWrapper.filterObservations(ObservationsWrapper.java:83) >> >> - Same with ValueState, with an EOFException immediatel when >> deserializing >> >> java.io.EOFException: null >> >> at >> o.a.f.c.m.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:329) >> >> at o.a.f.t.StringValue.readString(StringValue.java:781) >> >> at o.a.f.a.c.t.b.StringSerializer.deserialize(StringSerializer.java:73) >> >> at o.a.f.a.c.t.b.StringSerializer.deserialize(StringSerializer.java:31) >> >> at o.a.f.a.j.t.r.PojoSerializer.deserialize(PojoSerializer.java:463) >> >> at o.a.f.c.s.s.RocksDBValueState.value(RocksDBValueState.java:88) >> >> ... 23 common frames omitted >> >> - Exceptions for operators with multiple ValueStates, where >> deserialization fails due to type issues: >> >> java.lang.IllegalArgumentException: Can not set java.util.Map field >> com.genesys.analytics.flink.user.processing.window.AggregateStatusStats.durationStats >> to java.lang.Double >> at >> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167) >> at >> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171) >> at >> java.base/jdk.internal.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81) >> at java.base/java.lang.reflect.Field.set(Field.java:799) >> at >> org.apache.flink.api.java.typeutils.runtime.PojoSerializer.deserialize(PojoSerializer.java:465) >> at >> org.apache.flink.contrib.streaming.state.RocksDBValueState.value(RocksDBValueState.java:88) >> This one almost looks like one ValueState<A> overwrote the other >> ValueState<B> >> >> >> >> Our setup is Flink 1.19.1 deployed on AWS EC2 with RocksDB as the state >> backend and S3 for checkpoint/savepoint storage. We’ve had very few issues >> with this since 2017, but multiple incidents in the last few months. Is >> there anything that changed on the Flink side? >> >> >> >> >> Peter Westermann >> >> Director, Analytics Software Architect >> >> Genesys Cloud - Analytics >> >> [image: signature_1853126556] >> >> peter.westerm...@genesys.com <richard.sch...@genesys.com> >> >> [image: signature_1853126556] >> >> [image: signature_414073906] <http://www.genesys.com/> >> >> [image: signature_1141770227] <https://twitter.com/Genesys>[image: >> signature_1333433069] >> <http://www.linkedin.com/company/601919?trk=tyah>[image: >> signature_769303212] >> <https://plus.google.com/+Genesyslab?rel=publisher>[image: >> signature_561365254] <https://www.facebook.com/Genesys/>[image: >> signature_262758434] <https://www.youtube.com/Genesys>[image: >> signature_2090310858] <http://blog.genesys.com/> >> >> >> >> >> >> >> >