FYI, I created this JIRA https://issues.apache.org/jira/browse/FLINK-6761 to track the problem of large merging state per key. I might also bring this to the RocksDB issue tracker and then figure out how to solve this.
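For context, here is a minimal sketch of the access pattern that produces such state; this is a hypothetical job, not code from this thread. With the RocksDB state backend, every ListState#add for the current key becomes a RocksDB merge, so a key whose list is only ever appended to accumulates one ever-growing merged value that eventually has to be handed back through JNI as a single Java byte[]:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical keyed function (applied on a KeyedStream): per-key list state that
// is only ever appended to. On the RocksDB backend each add() is a native merge,
// so the merged value under a hot key grows without bound.
public class GrowingListState extends RichFlatMapFunction<String, String> {

    private transient ListState<String> events;

    @Override
    public void open(Configuration parameters) {
        events = getRuntimeContext().getListState(
                new ListStateDescriptor<>("events", String.class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        events.add(value); // never cleared -> unbounded per-key merging state
    }
}

Nothing bounds that merged value per key, and on the Java side a single byte[] cannot exceed Integer.MAX_VALUE, which is where the trouble discussed below starts.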
> On 27.05.2017 at 20:28, Stefan Richter <s.rich...@data-artisans.com> wrote:
>
> Hi,
>
> this is a known and currently „accepted“ problem in Flink which can only happen when a task manager is already going down, e.g. on cancellation. It happens when the RocksDB object was already disposed (as part of the shutdown procedure) but there is still a pending timer firing and, in the process, accessing the released native resource.
>
> Background why it is like that: waiting for all timers to finish could potentially take time, and we want shutdown to be as fast as possible so that we can bring up the task again asap.
>
> Best,
> Stefan
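To make that failure mode concrete, here is a minimal stand-alone sketch of the race against the plain RocksJava API; it is not Flink's actual timer code, and the names and timings are made up. A scheduled task stands in for the pending timer and reads from a RocksDB handle that the "shutdown path" has already closed; depending on the RocksJava version this either fails with an exception or crashes the JVM inside native code, much like the SIGSEGVs reported below:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class DisposedHandleRace {

    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();

        final Options options = new Options().setCreateIfMissing(true);
        final RocksDB db = RocksDB.open(options, "/tmp/rocks-race");
        db.put("key".getBytes(StandardCharsets.UTF_8), "value".getBytes(StandardCharsets.UTF_8));

        // Stands in for a pending Flink timer that is still scheduled while the task shuts down.
        final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.schedule(() -> {
            try {
                // By the time this fires, the native handle has already been released below.
                db.get("key".getBytes(StandardCharsets.UTF_8));
            } catch (Exception e) {
                // Recent RocksJava versions may detect the closed handle and fail here
                // instead of crashing the whole JVM.
                e.printStackTrace();
            }
        }, 100, TimeUnit.MILLISECONDS);

        // "Shutdown path": dispose the native RocksDB object without waiting for the timer.
        db.close();
        options.close();

        Thread.sleep(500); // let the pending "timer" fire against the disposed handle
        timer.shutdownNow();
    }
}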
>> On 27.05.2017 at 16:19, Yassine MARZOUGUI <y.marzou...@mindlytix.com> wrote:
>>
>> Hi,
>>
>> This might be related: I'm experiencing a similar SIGSEGV causing the taskmanager to die each time my streaming job fails, or even if I cancel it manually.
>> I am using Flink 1.4-SNAPSHOT, Commit: 546e2ad.
>>
>> Here are some examples (see the full dump attached):
>>
>> On job failure:
>>
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0x00007f244f4cf5b3, pid=9532, tid=0x00007f24474fc700
>> #
>> # JRE version: OpenJDK Runtime Environment (8.0_121-b13) (build 1.8.0_121-b13)
>> # Java VM: OpenJDK 64-Bit Server VM (25.121-b13 mixed mode linux-amd64 compressed oops)
>> # Problematic frame:
>> # C  [librocksdbjni-linux64.so+0x1dd5b3]  rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*, bool*)+0xe3
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.java.com/bugreport/crash.jsp
>> # The crash happened outside the Java Virtual Machine in native code.
>> # See problematic frame for where to report the bug.
>> #
>>
>> On job cancel:
>>
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0x00007fdd2815c280, pid=4503, tid=0x00007fdc14d48700
>> #
>> # JRE version: OpenJDK Runtime Environment (8.0_121-b13) (build 1.8.0_121-b13)
>> # Java VM: OpenJDK 64-Bit Server VM (25.121-b13 mixed mode linux-amd64 compressed oops)
>> # Problematic frame:
>> # C  0x00007fdd2815c280
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.java.com/bugreport/crash.jsp
>> # The crash happened outside the Java Virtual Machine in native code.
>> # See problematic frame for where to report the bug.
>> #
>>
>> Side question: isn't the jobmanager supposed to restart the taskmanager if it is lost/killed?
>>
>> Best,
>> Yassine
>>
>>
>> 2017-05-27 1:32 GMT+02:00 Stefan Richter <s.rich...@data-artisans.com>:
>> Flink’s version is hosted here: https://github.com/dataArtisans/frocksdb
>>
>>> On 26.05.2017 at 19:59, Jason Brelloch <jb.bc....@gmail.com> wrote:
>>>
>>> Thanks for looking into this, Stefan. We are moving forward with a different strategy for now. If I want to take a look at this, where do I go to get the Flink version of RocksDB?
>>>
>>> On Fri, May 26, 2017 at 1:06 PM, Stefan Richter <s.rich...@data-artisans.com> wrote:
>>> I forgot to mention that you need to run this with Flink’s version of RocksDB, as the stock version is already unable to perform the inserts because its implementation of the merge operator has a performance problem.
>>>
>>> Furthermore, I think a higher multiplier than *2 is required on num (and/or a smaller modulo on the key bytes) to trigger the problem; I noticed that I ran it multiple times, so it added up to bigger sizes over the runs.
>>>
>>>> On 26.05.2017 at 18:42, Stefan Richter <s.rich...@data-artisans.com> wrote:
>>>>
>>>> I played around a bit with your info, and this now looks like a general problem in RocksDB to me, or more specifically, between RocksDB and the JNI bridge. I could reproduce the issue with the following simple test code:
>>>>
>>>> File rocksDir = new File("/tmp/rocks");
>>>>
>>>> final Options options = new Options()
>>>>     .setCreateIfMissing(true)
>>>>     .setMergeOperator(new StringAppendOperator())
>>>>     .setCompactionStyle(CompactionStyle.LEVEL)
>>>>     .setLevelCompactionDynamicLevelBytes(true)
>>>>     .setIncreaseParallelism(4)
>>>>     .setUseFsync(false)
>>>>     .setMaxOpenFiles(-1)
>>>>     .setAllowOsBuffer(true)
>>>>     .setDisableDataSync(true);
>>>>
>>>> final WriteOptions write_options = new WriteOptions()
>>>>     .setSync(false)
>>>>     .setDisableWAL(true);
>>>>
>>>> try (final RocksDB rocksDB = RocksDB.open(options, rocksDir.getAbsolutePath())) {
>>>>     final String key = "key";
>>>>     final String value = "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ7890654321";
>>>>
>>>>     byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
>>>>     keyBytes = Arrays.copyOf(keyBytes, keyBytes.length + 1);
>>>>     final byte[] valueBytes = value.getBytes(StandardCharsets.UTF_8);
>>>>     final int num = (Integer.MAX_VALUE / valueBytes.length) * 2;
>>>>
>>>>     System.out.println("begin insert");
>>>>
>>>>     final long beginInsert = System.nanoTime();
>>>>     for (int i = 0; i < num; i++) {
>>>>         keyBytes[keyBytes.length - 1] = (byte) (i % 9);
>>>>         rocksDB.merge(write_options, keyBytes, valueBytes);
>>>>     }
>>>>     final long endInsert = System.nanoTime();
>>>>     System.out.println("end insert - duration: " + ((endInsert - beginInsert) / 1_000_000) + " ms");
>>>>
>>>>     final long beginGet = System.nanoTime();
>>>>     try (RocksIterator iterator = rocksDB.newIterator()) {
>>>>         iterator.seekToFirst();
>>>>
>>>>         while (iterator.isValid()) {
>>>>             byte[] bytes = iterator.value();
>>>>             System.out.println(bytes.length + " " + bytes[bytes.length - 1]);
>>>>             iterator.next();
>>>>         }
>>>>     }
>>>>     final long endGet = System.nanoTime();
>>>>
>>>>     System.out.println("end get - duration: " + ((endGet - beginGet) / 1_000_000) + " ms");
>>>> }
>>>>
>>>> Depending on how smooth the 1.3 release is going, maybe I will find some time next week to take a closer look into this. If this is urgent, please also feel free to already report this problem to the RocksDB issue tracker.
>>>>
>>>> Best,
>>>> Stefan
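For scale, plugging in the constants from the snippet above: the value string is 72 bytes, so num = (Integer.MAX_VALUE / 72) * 2 is roughly 59.6 million merges, spread by i % 9 over 9 distinct keys, i.e. about 6.6 million merges per key. With the StringAppendOperator joining values with a one-byte delimiter, a single run therefore builds merged values of roughly 6.6M * 73 bytes, about 480 MB per key, still well below the Integer.MAX_VALUE cap on a Java byte[]; that is consistent with the remark above that a larger multiplier, a smaller modulo, or repeated runs over the same directory are needed before one key's value approaches the ~2 GB limit.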
>>>>> On 26.05.2017 at 16:40, Jason Brelloch <jb.bc....@gmail.com> wrote:
>>>>>
>>>>> ~2 GB was the total state in the backend. The total number of keys in the test is 10, with an approximately even distribution of state across keys, and parallelism of 1, so all keys are on the same taskmanager. We are using ListState, and the number of elements per list would be about 500,000.
>>>>>
>>>>> On Fri, May 26, 2017 at 10:20 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:
>>>>> Hi,
>>>>>
>>>>> what does „our state“ mean in this context? The total state in the backend, or the state under one key? If you use, e.g., list state, I could see the state for one key growing above 2 GB, but we retrieve the state back from RocksDB as Java byte arrays (in your stacktrace, when making a checkpoint), and those are bounded in size to a maximum of 2 GB (Integer.MAX_VALUE); maybe that is what happens in JNI if you try to go beyond that limit. Could that be a reason for your problem?
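For reference, plugging in the numbers above: ~2 GB of total state spread roughly evenly over 10 keys is about 200 MB per key (about 500,000 list elements of roughly 400 bytes each), which on its own is still well below the 2 GB (Integer.MAX_VALUE) byte-array cap described here, assuming those figures are exact.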
>>>>>> On 26.05.2017 at 15:50, Robert Metzger <rmetz...@apache.org> wrote:
>>>>>>
>>>>>> Hi Jason,
>>>>>>
>>>>>> This error is unexpected. I don't think it's caused by insufficient memory. I'm including Stefan in the conversation, he's the RocksDB expert :)
>>>>>>
>>>>>> On Thu, May 25, 2017 at 4:15 PM, Jason Brelloch <jb.bc....@gmail.com> wrote:
>>>>>> Hey guys,
>>>>>>
>>>>>> We are running into a JVM crash on checkpointing when our RocksDB state reaches a certain size on a taskmanager (about 2 GB). The issue happens with both a Hadoop backend and just writing to a local file.
>>>>>>
>>>>>> We are running on Flink 1.2.1.
>>>>>>
>>>>>> #
>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>> #
>>>>>> # SIGSEGV (0xb) at pc=0x00007febf4261b42, pid=1, tid=0x00007fead135f700
>>>>>> #
>>>>>> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
>>>>>> # Problematic frame:
>>>>>> # V  [libjvm.so+0x6d1b42]  jni_SetByteArrayRegion+0xc2
>>>>>> #
>>>>>> # Core dump written. Default location: //core or core.1
>>>>>> #
>>>>>> # An error report file with more information is saved as:
>>>>>> # /tmp/hs_err_pid1.log
>>>>>> #
>>>>>> # If you would like to submit a bug report, please visit:
>>>>>> #   http://bugreport.java.com/bugreport/crash.jsp
>>>>>> #
>>>>>>
>>>>>> Is this an issue with not enough memory? Or maybe not enough allocated to RocksDB?
>>>>>>
>>>>>> I have attached the taskmanager logs and the core dump. The jobmanager logs just say taskmanager lost/killed.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Jason Brelloch | Product Developer
>>>>>> 3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
>>>>>> Subscribe to the BetterCloud Monitor - Get IT delivered to your inbox
>>>>>
>>>>>
>>>>> --
>>>>> Jason Brelloch | Product Developer
>>>>> 3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
>>>>> Subscribe to the BetterCloud Monitor - Get IT delivered to your inbox
>>>
>>>
>>> --
>>> Jason Brelloch | Product Developer
>>> 3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
>>> Subscribe to the BetterCloud Monitor - Get IT delivered to your inbox
>>
>> <hs_err_pid4503.log><hs_err_pid9532.log>