Is it related to state size, or can it happen with small states too?
Are you using async I/O [1]?

Some pattern would be useful; otherwise it's hard to do anything 🤷🏻‍♂️
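
One pattern that may be worth checking in the quoted traces: the MapState
failure is thrown from readUnsignedShort inside readUTF, which would be
consistent with the stored bytes ending before the 2-byte length prefix is
complete (for example an empty or nearly empty entry) rather than being cut
off mid-payload. Below is a minimal sketch of that distinction. It uses plain
java.io streams as a stand-in for Flink's DataInputDeserializer, and the class
and method names are hypothetical, not taken from Flink or from the job.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class: java.io streams stand in for Flink's serializers.
public class TruncatedStateBytesDemo {

    /** Serializes a key with writeUTF: a 2-byte length prefix followed by the encoded bytes. */
    public static byte[] writeKey(String key) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Tries to read the key back and reports where the input ran out, if it did. */
    public static String classify(byte[] stored) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored))) {
            in.readUTF();
            return "ok";
        } catch (EOFException e) {
            // Fewer than 2 bytes means the length prefix itself could not be read.
            return stored.length < 2 ? "EOF in length prefix" : "EOF in payload";
        } catch (IOException e) {
            return "other: " + e;
        }
    }

    public static void main(String[] args) {
        byte[] full = writeKey("observation-1");
        System.out.println(classify(full));                    // intact entry
        System.out.println(classify(new byte[0]));             // empty entry
        System.out.println(classify(Arrays.copyOf(full, 5)));  // entry cut off mid-payload
    }
}
```

If the failing entries could be classified along these lines (empty vs.
truncated vs. bytes of a different type), that would already narrow things
down a lot.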

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/operators/asyncio/

BR,
G


On Fri, Feb 14, 2025 at 6:17 PM Peter Westermann <no.westerm...@genesys.com>
wrote:

> Since this doesn’t happen consistently, we have never been able to reproduce
> it. We just run into it sporadically across a large number of different
> Flink jobs (and since this is in our production environment, we can’t
> export savepoints/checkpoints for further analysis). My guess is that
> there’s a race condition somewhere that causes this.
>
> This may be a coincidence, but it seems that all state corruption issues
> occurred in operators that contain multiple ValueStates (or a MapState and
> a ValueState).
>
>
>
>
> Peter Westermann
>
> Director, Analytics Software Architect
>
> Genesys Cloud - Analytics
>
> *From: *Gabor Somogyi <gabor.g.somo...@gmail.com>
> *Date: *Friday, February 14, 2025 at 11:54 AM
> *To: *Peter Westermann <no.westerm...@genesys.com>
> *Cc: *user@flink.apache.org <user@flink.apache.org>
> *Subject: *Re: Issues with state corruption in RocksDB
>
> BTW, do you have a bare minimum application which can reproduce the issue?
>
>
>
> On Fri, Feb 14, 2025 at 5:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
> Hi Peter,
>
>
>
> I've had a look at this, and it seems that this problem, or a similar one,
> still exists in 1.20.
>
> Since I haven't done an in-depth analysis, I can only say this based on the
> latest comments, like:
>
> > Faced an issue on Flink 1.20.0
>
> but there are more like it...
>
> So all in all, I have the feeling that this area still has some edge case or
> race issue(s) left.
>
>
>
> BR,
>
> G
>
>
>
>
>
> On Tue, Jan 7, 2025 at 4:26 PM Peter Westermann <no.westerm...@genesys.com>
> wrote:
>
> In the past, we have infrequently had issues with corrupted timer state
> (see https://issues.apache.org/jira/browse/FLINK-23886), which are resolved
> when we go back to a previous savepoint and restart the job from it.
> Recently (since we upgraded to Flink 1.19.1), we have seen more state
> corruption issues for checkpoint/savepoint data in RocksDB. Each time, the
> resolution has been to go back to a previous savepoint and replay events,
> after which everything was fine.
>
> Issues we have seen:
>
>    - Exceptions when deserializing MapState entries, with an EOFException
>    at the start of key deserialization:
>
> j.io.EOFException: null
>      at o.a.f.c.m.DataInputDeserializer.readUnsignedShort(DataInputDeserializer.java:339)
>      at o.a.f.c.m.DataInputDeserializer.readUTF(DataInputDeserializer.java:251)
>      at c.g.a.f.o.t.s.ObservationKeySerializer.deserialize(ObservationKeySerializer.java:68)
>      at c.g.a.f.o.t.s.ObservationKeySerializer.deserialize(ObservationKeySerializer.java:17)
>      at o.a.f.c.s.s.RocksDBMapState.deserializeUserKey(RocksDBMapState.java:390)
>      at o.a.f.c.s.s.RocksDBMapState.access$000(RocksDBMapState.java:66)
>      at o.a.f.c.s.s.RocksDBMapState$RocksDBMapEntry.getKey(RocksDBMapState.java:493)
>      ... 27 common frames omitted
> Wrapped by: o.a.f.u.FlinkRuntimeException: Error while deserializing the user key.
>      at o.a.f.c.s.s.RocksDBMapState$RocksDBMapEntry.getKey(RocksDBMapState.java:496)
>      at c.g.a.f.o.t.ObservationsWrapper.filterObservations(ObservationsWrapper.java:83)
>
>    - The same with ValueState, with an EOFException immediately when
>    deserializing:
>
> java.io.EOFException: null
>      at o.a.f.c.m.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:329)
>      at o.a.f.t.StringValue.readString(StringValue.java:781)
>      at o.a.f.a.c.t.b.StringSerializer.deserialize(StringSerializer.java:73)
>      at o.a.f.a.c.t.b.StringSerializer.deserialize(StringSerializer.java:31)
>      at o.a.f.a.j.t.r.PojoSerializer.deserialize(PojoSerializer.java:463)
>      at o.a.f.c.s.s.RocksDBValueState.value(RocksDBValueState.java:88)
>      ... 23 common frames omitted
>
>    - Exceptions for operators with multiple ValueStates, where
>    deserialization fails due to type issues:
>
> java.lang.IllegalArgumentException: Can not set java.util.Map field com.genesys.analytics.flink.user.processing.window.AggregateStatusStats.durationStats to java.lang.Double
>     at java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
>     at java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
>     at java.base/jdk.internal.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.deserialize(PojoSerializer.java:465)
>     at org.apache.flink.contrib.streaming.state.RocksDBValueState.value(RocksDBValueState.java:88)
>
> This one almost looks like one ValueState<A> overwrote the other
> ValueState<B>.
>
>
>
> Our setup is Flink 1.19.1 deployed on AWS EC2 with RocksDB as the state
> backend and S3 for checkpoint/savepoint storage. We’ve had very few issues
> with this since 2017, but multiple incidents in the last few months. Is
> there anything that changed on the Flink side?
>
>
>
>
> Peter Westermann
>
> Director, Analytics Software Architect
>
> Genesys Cloud - Analytics
>
