Some additional info: it doesn't seem to happen the first time I start the jobs or restore them from a savepoint. It happens when jobs are failing over after a task manager failure.
This could be an issue caused by a non-empty rocks directory (that was somehow in an inconsistent state), but that should not happen as the instanceDbPath is deleted before opening.

Gyula

Gyula Fóra <gyula.f...@gmail.com> wrote (on Thu, Aug 25, 2016, 23:28):

> Stephan,
>
> I ported the fix for the concurrency issue from the Flink commit so now
> that should be fine. I ran some fail/restore tests and that specific issue
> hasn't appeared again.
>
> However I now get many segfaults in the initializeForJob method where the
> RocksDB instance is opened. Just for the record, this is the exact same code
> as we have in Flink now:
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f12b018f51f, pid=12576, tid=139668190197504
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 )
> # Problematic frame:
> # C [libc.so.6+0x7b51f]
> ...
> Stack: [0x00007f0708ccf000,0x00007f0708dd0000], sp=0x00007f0708dccd20, free space=1015k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C [libc.so.6+0x7b51f]
>
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j org.rocksdb.RocksDB.open(JLjava/lang/String;Ljava/util/List;I)Ljava/util/List;+0
> j org.rocksdb.RocksDB.open(Lorg/rocksdb/DBOptions;Ljava/lang/String;Ljava/util/List;Ljava/util/List;)Lorg/rocksdb/RocksDB;+23
> j com.king.rbea.backend.state.rocksdb.RocksDBStateBackend.initializeForJob...
>
> And this happens fairly frequently when the jobs are restarting after
> failure.
>
> Cheers,
> Gyula
>
> Gyula Fóra <gyula.f...@gmail.com> wrote (on Thu, Aug 25, 2016, 19:07):
>
>> Yes, seems like that, I remember the fix in Flink. I apparently made a
>> mistake somewhere in our code :)
>>
>> Thanks,
>> Gyula
>>
>> On Thu, Aug 25, 2016, 18:59 Stephan Ewen <se...@apache.org> wrote:
>>
>>> We saw some crashes in earlier versions when native handles in RocksDB
>>> (even for config option objects) were manually and too eagerly released.
>>>
>>> Maybe you have a similar issue here?
>>>
>>> On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>> > Hi,
>>> > This seems to be a sneaky concurrency issue in our custom statebackend
>>> > implementation.
>>> >
>>> > I made some changes, will keep you posted.
>>> >
>>> > Cheers,
>>> > Gyula
>>> >
>>> > On Thu, Aug 25, 2016, 10:54 Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > Sure, I am sending the TM logs privately.
>>> > >
>>> > > Currently what I did was to bump the Rocks version to 4.9.0, let's see if
>>> > > that helps.
>>> > >
>>> > > Cheers,
>>> > > Gyula
>>> > >
>>> > > Till Rohrmann <trohrm...@apache.org> wrote (on Thu, Aug 25, 2016, 10:35):
>>> > >
>>> > >> Hi Gyula,
>>> > >>
>>> > >> I haven't seen this problem before. Do you have the logs of the failed TMs
>>> > >> so that we have some more context on what was going on?
>>> > >>
>>> > >> Cheers,
>>> > >> Till
>>> > >>
>>> > >> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra <gyf...@apache.org> wrote:
>>> > >>
>>> > >> > Hi guys,
>>> > >> >
>>> > >> > For quite some time now we have fairly frequently experienced task manager
>>> > >> > crashes around the time new streaming jobs are deployed. We use the RocksDB
>>> > >> > backend so this might be related.
>>> > >> >
>>> > >> > We tried changing the GC from G1 to CMS, but that didn't help.
>>> > >> >
>>> > >> > Yesterday for instance 6 task managers crashed one after the other with
>>> > >> > similar errors:
>>> > >> >
>>> > >> > *** Error in `java': double free or corruption (!prev): 0x00007fac0414d760 ***
>>> > >> > *** Error in `java': free(): invalid pointer: 0x00007f8dcc0026c0 ***
>>> > >> > *** Error in `java': double free or corruption (!prev): 0x00007f15247f9a90 ***
>>> > >> > ...
>>> > >> >
>>> > >> > Does anyone have any clue what might cause this or how to debug?
>>> > >> > This is a very critical issue :(
>>> > >> >
>>> > >> > Cheers,
>>> > >> > Gyula
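For reference, the delete-before-open step mentioned at the top of this message could look roughly like the sketch below. This is only an illustration under assumptions: `InstanceDirCleaner` and `prepareInstanceDir` are hypothetical names, not the actual backend or Flink code, and the RocksDB open call itself is left as a comment so the snippet runs with the JDK alone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: wipes a leftover instance directory before the DB is opened.
final class InstanceDirCleaner {

    // Deletes dir and everything below it; does nothing if dir is absent.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Assumed flow: clear stale state, fail loudly if anything survived.
    static void prepareInstanceDir(Path instanceDbPath) throws IOException {
        deleteRecursively(instanceDbPath);
        if (Files.exists(instanceDbPath)) {
            throw new IllegalStateException("Could not clear " + instanceDbPath);
        }
        // The RocksDB instance would be opened on instanceDbPath here.
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("rocks-demo");
        Path instance = root.resolve("instance");
        Files.createDirectories(instance.resolve("db"));
        Files.write(instance.resolve("db").resolve("CURRENT"), new byte[] {1});
        prepareInstanceDir(instance);
        System.out.println("instance dir present after cleanup: " + Files.exists(instance));
        Files.delete(root);
    }
}
```

Failing with an exception when the directory survives the delete makes a stale or half-written directory visible in the task manager log instead of letting the native open run against it.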