@Xintong - out of curiosity, where do you see that this tries to fork a process? I must be overlooking something; I could only see the native method call.

On Fri, Apr 24, 2020 at 4:53 AM Xintong Song <tonysong...@gmail.com> wrote:

> @Stephan,
> I don't think so. If the JVM hits the direct memory limit, you should see the
> error message "OutOfMemoryError: Direct buffer memory".
>
> Thank you~
>
> Xintong Song
>
>
> On Thu, Apr 23, 2020 at 6:11 PM Stephan Ewen <se...@apache.org> wrote:
>
>> @Xintong and @Lasse could it be that the JVM hits the "Direct Memory"
>> limit here?
>> Would increasing the "taskmanager.memory.framework.off-heap.size" help?
>>
>> On Mon, Apr 20, 2020 at 11:02 AM Zahid Rahman <zahidr1...@gmail.com>
>> wrote:
>>
>>> As you can see from the task manager tab of the Flink web dashboard:
>>>
>>> Physical Memory: 3.80 GB
>>> JVM Heap Size: 1.78 GB
>>> Flink Managed Memory: 128 MB
>>>
>>> *Flink is only using 128 MB, which can easily cause an OOM error.*
>>>
>>> *These are DEFAULT settings.*
>>>
>>> *I dusted off an old laptop, so it only has 3.8 GB RAM.*
>>>
>>> What do your job metrics say?
>>>
>>> On Mon, 20 Apr 2020, 07:26 Xintong Song, <tonysong...@gmail.com> wrote:
>>>
>>>> Hi Lasse,
>>>>
>>>> From what I understand, your problem is that the JVM tries to fork some
>>>> native process (if you look at the exception stack, the root exception is
>>>> thrown from a native method) but there's not enough memory for doing that.
>>>> This could happen when either Mesos is using cgroup strict mode for memory
>>>> control, or there's no more memory on the machine. Flink cannot prevent
>>>> native processes from using more memory. It can only reserve a certain
>>>> amount of memory for such native usage when requesting worker memory from
>>>> the deployment environment (in your case Mesos) and allocating Java heap /
>>>> direct memory.
>>>>
>>>> My suggestion is to try increasing the JVM overhead configuration. You
>>>> can leverage the configuration options
>>>> 'taskmanager.memory.jvm-overhead.[min|max|fraction]'. See more details in
>>>> the documentation[1].
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html#taskmanager-memory-jvm-overhead-max
>>>>
>>>> On Sat, Apr 18, 2020 at 4:02 AM Zahid Rahman <zahidr1...@gmail.com>
>>>> wrote:
>>>>
>>>>> https://betsol.com/java-memory-management-for-java-virtual-machine-jvm/
>>>>>
>>>>> Backbutton.co.uk
>>>>> ¯\_(ツ)_/¯
>>>>> ♡۶Java♡۶RMI ♡۶
>>>>> Make Use Method {MUM}
>>>>> makeuse.org
>>>>> <http://www.backbutton.co.uk>
>>>>>
>>>>>
>>>>> On Fri, 17 Apr 2020 at 14:07, Lasse Nedergaard <
>>>>> lassenedergaardfl...@gmail.com> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> We have migrated to Flink 1.10 and are facing out-of-memory exceptions;
>>>>>> hopefully someone can point us in the right direction.
>>>>>>
>>>>>> We have a job that uses broadcast state, and we sometimes run out of
>>>>>> memory when it creates a savepoint. See the stack trace below.
>>>>>> We have assigned 2.2 GB per task manager and
>>>>>> configured taskmanager.memory.process.size: 2200m
>>>>>> In Flink 1.9 our container was terminated because of OOM, so 1.10 does a
>>>>>> better job, but it is still not working: the task manager leaks memory
>>>>>> on each OOM and is finally killed by Mesos.
>>>>>>
>>>>>>
>>>>>> Any idea what we can do to figure out what settings we need to change?
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> Lasse Nedergaard
>>>>>>
>>>>>>
>>>>>> WARN o.a.flink.runtime.state.filesystem.FsCheckpointStreamFactory - Could not close the state stream for s3://flinkstate/dcos-prod/checkpoints/fc9318cc236d09f0bfd994f138896d6c/chk-3509/cf0714dc-ad7c-4946-b44c-96d4a131a4fa.
>>>>>> java.io.IOException: Cannot allocate memory
>>>>>>     at java.io.FileOutputStream.writeBytes(Native Method)
>>>>>>     at java.io.FileOutputStream.write(FileOutputStream.java:326)
>>>>>>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>>>>>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>>>>>>     at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
>>>>>>     at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>>>>>>     at com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3OutputStream.close(PrestoS3FileSystem.java:995)
>>>>>>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
>>>>>>     at org.apache.flink.fs.s3presto.common.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
>>>>>>     at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
>>>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.close(FsCheckpointStreamFactory.java:277)
>>>>>>     at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:263)
>>>>>>     at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:250)
>>>>>>     at org.apache.flink.util.AbstractCloseableRegistry.close(AbstractCloseableRegistry.java:122)
>>>>>>     at org.apache.flink.runtime.state.AsyncSnapshotCallable.closeSnapshotIO(AsyncSnapshotCallable.java:167)
>>>>>>     at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:83)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>     at org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:458)
>>>>>>     at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:1143)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>
>>>>>> INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 3509 of job fc9318cc236d09f0bfd994f138896d6c.
>>>>>> org.apache.flink.util.SerializedThrowable: Could not materialize checkpoint 3509 for operator Feature extraction (8/12).
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:1238)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:1180)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: Cannot allocate memory
>>>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>>>     at org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:461)
>>>>>>     at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:1143)
>>>>>>     ... 3 common frames omitted
>>>>>> Caused by: org.apache.flink.util.SerializedThrowable: Cannot allocate memory
>>>>>>     at java.io.FileOutputStream.writeBytes(Native Method)
>>>>>>     at java.io.FileOutputStream.write(FileOutputStream.java:326)
>>>>>>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>>>>>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
>>>>>>     at java.io.FilterOutputStream.write(FilterOutputStream.java:77)
>>>>>>     at java.io.FilterOutputStream.write(FilterOutputStream.java:125)
>>>>>>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
>>>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>>>>>     at org.apache.flink.fs.s3presto.common.HadoopDataOutputStream.write(HadoopDataOutputStream.java:47)
>>>>>>     at org.apache.flink.core.fs.FSDataOutputStreamWrapper.write(FSDataOutputStreamWrapper.java:66)
>>>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.write(FsCheckpointStreamFactory.java:220)
>>>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>>>>>     at org.apache.flink.formats.avro.utils.DataOutputEncoder.writeBytes(DataOutputEncoder.java:92)
>>>>>>     at org.apache.flink.formats.avro.utils.DataOutputEncoder.writeString(DataOutputEncoder.java:113)
>>>>>>     at org.apache.avro.io.Encoder.writeString(Encoder.java:130)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:323)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:281)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:139)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:144)
>>>>>>     at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:98)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
>>>>>>     at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:83)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:234)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:136)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:144)
>>>>>>     at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:98)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
>>>>>>     at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:83)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>>>>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
>>>>>>     at org.apache.flink.formats.avro.typeutils.AvroSerializer.serialize(AvroSerializer.java:185)
>>>>>>     at org.apache.flink.runtime.state.HeapBroadcastState.write(HeapBroadcastState.java:109)
>>>>>>     at org.apache.flink.runtime.state.DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackendSnapshotStrategy.java:167)
>>>>>>     at org.apache.flink.runtime.state.DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackendSnapshotStrategy.java:108)
>>>>>>     at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:75)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>     at org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:458)
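
For reference on the 'taskmanager.memory.jvm-overhead.[min|max|fraction]' suggestion in the thread: as far as I understand the Flink 1.10 memory model, JVM overhead is derived as a fraction of the total process size and then clamped between the min and max options. Below is a rough sketch of that arithmetic for the 2200m process size mentioned above. The function name is hypothetical (it is not a Flink API), and the defaults used (fraction 0.1, min 192 MB, max 1 GB) are the documented 1.10 defaults as I recall them, so please check them against the documentation for your version.

```python
# Rough sketch of how JVM overhead is sized from the total process size.
# jvm_overhead_mb is an illustrative helper, not part of Flink.
# Default values are assumed from the Flink 1.10 configuration docs.

def jvm_overhead_mb(process_mb, fraction=0.1, min_mb=192, max_mb=1024):
    """Return fraction * process size, clamped to [min_mb, max_mb], in MB."""
    derived = process_mb * fraction
    return int(min(max(derived, min_mb), max_mb))


# With taskmanager.memory.process.size: 2200m and default settings
print(jvm_overhead_mb(2200))

# Raising the fraction reserves more room for native allocations
print(jvm_overhead_mb(2200, fraction=0.2))

# Raising the minimum has the same effect when the derived value is smaller
print(jvm_overhead_mb(2200, min_mb=512))
```

Note that with a fixed process size, any memory added to JVM overhead is taken away from the other components (heap, managed memory, and so on), so it may be necessary to raise taskmanager.memory.process.size as well.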