Re: Checkpointing SIGSEGV

2017-05-29 Thread Stefan Richter
FYI, I created this JIRA https://issues.apache.org/jira/browse/FLINK-6761 to track the problem of large merging state per key. I might also bring this to the RocksDB issue tracker and then figure out how to solve it. […]

Re: Checkpointing SIGSEGV

2017-05-27 Thread Stefan Richter
Hi, this is a known and currently "accepted" problem in Flink which can only happen when a task manager is already going down, e.g. on cancellation. It happens when the RocksDB object was already disposed (as part of the shutdown procedure) but there is still a pending timer firing, and in the p[…]
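The failure mode Stefan describes (a pending timer firing against an already-disposed RocksDB instance) is a classic close-vs-callback race. A minimal, RocksDB-free sketch of the pattern in plain Java — all names here are illustrative stand-ins, not Flink or RocksDB APIs; in real JNI code the use-after-dispose would be a SIGSEGV rather than a Java exception:

```java
import java.util.concurrent.*;

public class DisposeRace {
    // Stand-in for a native-backed resource like RocksDB: using it
    // after close() is undefined behavior (in JNI, typically a crash;
    // here we model it as an IllegalStateException).
    static class NativeResource implements AutoCloseable {
        private volatile boolean disposed = false;
        void access() {
            if (disposed) throw new IllegalStateException("used after dispose");
        }
        @Override public void close() { disposed = true; }
    }

    // Returns true if the timer fired against an already-disposed resource.
    public static boolean raceDetected() throws Exception {
        NativeResource db = new NativeResource();
        ScheduledExecutorService timers = Executors.newSingleThreadScheduledExecutor();
        // A timer is scheduled while the resource is still alive...
        Future<?> timer = timers.schedule(db::access, 50, TimeUnit.MILLISECONDS);
        // ...but the resource is disposed first, as happens on task cancellation.
        db.close();
        try {
            timer.get();
            return false;
        } catch (ExecutionException e) {
            return e.getCause() instanceof IllegalStateException;
        } finally {
            timers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("race detected: " + raceDetected());
    }
}
```

The fix in a real system is to fence the callback path: either cancel and drain pending timers before disposing the native object, or guard every native access with the disposal flag under a shared lock.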

Re: Checkpointing SIGSEGV

2017-05-26 Thread Stefan Richter
Flink’s version is hosted here: https://github.com/dataArtisans/frocksdb […]

Re: Checkpointing SIGSEGV

2017-05-26 Thread Jason Brelloch
Thanks for looking into this Stefan. We are moving forward with a different strategy for now. If I want to take a look at this, where do I go to get the Flink version of RocksDB? […]

Re: Checkpointing SIGSEGV

2017-05-26 Thread Stefan Richter
I forgot to mention that you need to run this with Flink’s version of RocksDB, as the stock version is already unable to perform the inserts because its implementation of the merge operator has a performance problem. Furthermore, I think a higher multiplier than *2 is required on num (and/or a[…]
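The performance problem Stefan alludes to is the classic quadratic cost of naive append-style merging: if every merge re-concatenates the full accumulated value, n merges copy O(n²) bytes in total, whereas an amortized-growth strategy copies only O(n). A plain-Java illustration of the asymptotics (this is not RocksDB code; the functions just count bytes copied under each strategy):

```java
public class AppendCost {
    // Naive append: each step re-copies the whole accumulated value,
    // so n appends cost O(n^2) bytes copied in total.
    public static long naiveCopyBytes(int n, int chunk) {
        long copied = 0;
        long len = 0;
        for (int i = 0; i < n; i++) {
            len += chunk;
            copied += len; // full re-copy of the accumulated value
        }
        return copied;
    }

    // Amortized append (StringBuilder-style capacity doubling):
    // O(n) bytes copied in total.
    public static long amortizedCopyBytes(int n, int chunk) {
        long copied = 0;
        long len = 0, cap = 16;
        for (int i = 0; i < n; i++) {
            len += chunk;
            while (cap < len) { cap *= 2; copied += len; } // occasional re-copy
            copied += chunk; // the new element itself
        }
        return copied;
    }

    public static void main(String[] args) {
        int n = 100_000, chunk = 100;
        System.out.println("naive bytes copied:     " + naiveCopyBytes(n, chunk));
        System.out.println("amortized bytes copied: " + amortizedCopyBytes(n, chunk));
    }
}
```

With 100,000 appends of 100 bytes each, the naive strategy copies on the order of hundreds of gigabytes while the amortized one stays in the tens of megabytes, which is why a merge operator that re-concatenates on every merge falls over long before the 2 GB value limit is reached.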

Re: Checkpointing SIGSEGV

2017-05-26 Thread Stefan Richter
I played around a bit with your info, and this now looks like a general problem in RocksDB to me, or more specifically, in the interaction between RocksDB and the JNI bridge. I could reproduce the issue with the following simple test code: File rocksDir = new File("/tmp/rocks"); final Options options = new Options([…]

Re: Checkpointing SIGSEGV

2017-05-26 Thread Jason Brelloch
~2 GB was the total state in the backend. The total number of keys in the test is 10, with an approximately even distribution of state across keys, and parallelism of 1, so all keys are on the same taskmanager. We are using ListState, and the number of elements per list would be about 50. […]

Re: Checkpointing SIGSEGV

2017-05-26 Thread Stefan Richter
Hi, what does "our state" mean in this context? The total state in the backend, or the state under one key? If you use, e.g., list state, I could see the state for one key growing above 2 GB, but once we retrieve the state back from RocksDB as Java arrays (in your stacktrace, when making a check[…]
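The 2 GB figure Stefan mentions comes from Java itself: the JNI bridge hands the merged value back as a single byte[], and Java arrays are indexed by int, so one key's serialized state cannot exceed Integer.MAX_VALUE (~2 GiB) bytes. A quick back-of-the-envelope check — the element counts and sizes below are illustrative, not taken from the thread:

```java
public class KeyStateLimit {
    // A single Java byte[] tops out at Integer.MAX_VALUE elements
    // (in practice slightly less, depending on the JVM).
    static final long MAX_BYTES_PER_KEY = Integer.MAX_VALUE; // ~2.1e9

    // Serialized size of one key's list state.
    public static long listStateBytes(long elements, long bytesPerElement) {
        return elements * bytesPerElement;
    }

    public static void main(String[] args) {
        // Illustrative numbers: 50,000 list entries of 50 KB each for one key.
        long perKey = listStateBytes(50_000, 50_000);
        System.out.println("per-key bytes: " + perKey);
        System.out.println("fits in one byte[]: " + (perKey <= MAX_BYTES_PER_KEY));
    }
}
```

Note that total state across many keys can be far larger than 2 GB without any problem; the limit only bites when a single key's merged value must cross the JNI boundary as one array.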

Re: Checkpointing SIGSEGV

2017-05-26 Thread Robert Metzger
Hi Jason, This error is unexpected. I don't think it's caused by insufficient memory. I'm including Stefan in the conversation; he's the RocksDB expert :) On Thu, May 25, 2017 at 4:15 PM, Jason Brelloch wrote: > Hey guys, > > We are running into a JVM crash on checkpointing when our rocksDB st[…]