BTW, do you have a bare-minimum application that can reproduce the issue?

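In case it helps narrow things down, here is a self-contained, non-Flink sketch (plain java.io and reflection; the class and field names are made up to loosely mirror the traces below) of how a writer/reader schema mismatch can produce exactly these two symptoms: an EOFException partway through deserialization, and an IllegalArgumentException when a field of one type receives a value of another:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.Map;

public class SchemaMismatchDemo {

    // Stand-in POJO (hypothetical, mirroring the durationStats field in the trace).
    public static class Stats {
        public Map<String, Long> durationStats;
    }

    // "Old" writer: serializes a record as a single double.
    static byte[] writeOldSchema() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeDouble(42.0);
        }
        return buf.toByteArray();
    }

    // A "new" reader expecting (UTF string, double) misreads the double's
    // first two bytes as a string length and runs off the 8-byte record.
    static boolean mismatchedReadFails(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            in.readUTF();     // bogus length prefix -> reads past the buffer
            in.readDouble();  // never reached
            return false;
        } catch (EOFException e) {
            return true;      // the symptom in the first two stack traces
        } catch (IOException e) {
            return true;      // other decode failure (e.g. malformed UTF)
        }
    }

    // Reflectively setting a Map field to a Double reproduces the
    // IllegalArgumentException symptom from the third stack trace.
    static boolean wrongTypeSetFails() throws Exception {
        Field f = Stats.class.getField("durationStats");
        try {
            f.set(new Stats(), Double.valueOf(1.0));
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("EOF on mismatched read: " + mismatchedReadFails(writeOldSchema()));
        System.out.println("IllegalArgument on wrong-type set: " + wrongTypeSetFails());
    }
}
```

To be clear, this only illustrates the symptoms, not a diagnosis — it just shows that both exception patterns are what you would expect if the reader ends up decoding bytes that were written under a different schema (or under a different state's schema).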
On Fri, Feb 14, 2025 at 5:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

> Hi Peter,
>
> I've had a look at this, and it seems like this problem or a similar one
> still exists in 1.20.
> Since I've not done an in-depth analysis, I can only say this based on the
> latest comments, such as:
> > Faced an issue on Flink 1.20.0
>
> but there are more like that...
>
> So, all in all, I have the feeling that this area still has some edge
> case or race issue(s) left.
>
> BR,
> G
>
>
> On Tue, Jan 7, 2025 at 4:26 PM Peter Westermann <no.westerm...@genesys.com>
> wrote:
>
>> In the past, we have occasionally had issues with corrupted timer state
>> (see https://issues.apache.org/jira/browse/FLINK-23886), which resolve
>> when we go back to a previous savepoint and restart the job from it.
>> Recently (since we upgraded to Flink 1.19.1), we have seen more
>> state-corruption issues for checkpoint/savepoint data in RocksDB. Each
>> time, the resolution has been to go back to a previous savepoint and
>> replay events, and everything was fine afterwards.
>>
>> Issues we have seen:
>>
>>    - Exceptions when deserializing MapState entries, with an
>>    EOFException at the start of key deserialization:
>>
>> j.io.EOFException: null
>>      at o.a.f.c.m.DataInputDeserializer.readUnsignedShort(DataInputDeserializer.java:339)
>>      at o.a.f.c.m.DataInputDeserializer.readUTF(DataInputDeserializer.java:251)
>>      at c.g.a.f.o.t.s.ObservationKeySerializer.deserialize(ObservationKeySerializer.java:68)
>>      at c.g.a.f.o.t.s.ObservationKeySerializer.deserialize(ObservationKeySerializer.java:17)
>>      at o.a.f.c.s.s.RocksDBMapState.deserializeUserKey(RocksDBMapState.java:390)
>>      at o.a.f.c.s.s.RocksDBMapState.access$000(RocksDBMapState.java:66)
>>      at o.a.f.c.s.s.RocksDBMapState$RocksDBMapEntry.getKey(RocksDBMapState.java:493)
>>      ... 27 common frames omitted
>>
>> Wrapped by: o.a.f.u.FlinkRuntimeException: Error while deserializing the user key.
>>      at o.a.f.c.s.s.RocksDBMapState$RocksDBMapEntry.getKey(RocksDBMapState.java:496)
>>      at c.g.a.f.o.t.ObservationsWrapper.filterObservations(ObservationsWrapper.java:83)
>>
>>    - Same with ValueState, with an EOFException immediately when
>>    deserializing:
>>
>> java.io.EOFException: null
>>      at o.a.f.c.m.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:329)
>>      at o.a.f.t.StringValue.readString(StringValue.java:781)
>>      at o.a.f.a.c.t.b.StringSerializer.deserialize(StringSerializer.java:73)
>>      at o.a.f.a.c.t.b.StringSerializer.deserialize(StringSerializer.java:31)
>>      at o.a.f.a.j.t.r.PojoSerializer.deserialize(PojoSerializer.java:463)
>>      at o.a.f.c.s.s.RocksDBValueState.value(RocksDBValueState.java:88)
>>      ... 23 common frames omitted
>>
>>    - Exceptions for operators with multiple ValueStates, where
>>    deserialization fails due to type issues:
>>
java.lang.IllegalArgumentException: Can not set java.util.Map field
>> com.genesys.analytics.flink.user.processing.window.AggregateStatusStats.durationStats
>> to java.lang.Double
>>      at java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
>>      at java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
>>      at java.base/jdk.internal.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81)
>>      at java.base/java.lang.reflect.Field.set(Field.java:799)
>>      at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.deserialize(PojoSerializer.java:465)
>>      at org.apache.flink.contrib.streaming.state.RocksDBValueState.value(RocksDBValueState.java:88)
>>
>> This one almost looks like one ValueState<A> overwrote the other
>> ValueState<B>.
>>
>>
>>
>> Our setup is Flink 1.19.1 deployed on AWS EC2 with RocksDB as the state
>> backend and S3 for checkpoint/savepoint storage. We’ve had very few issues
>> with this since 2017, but multiple incidents in the last few months. Is
>> there anything that changed on the Flink side?
>>
>>
>>
>>
>> Peter Westermann
>>
>> Director, Analytics Software Architect
>>
>> Genesys Cloud - Analytics
>>
>> peter.westerm...@genesys.com
>>
>
