- I will increase the jvm-overhead
- I don't have any failovers or restarts before it starts happening
- If it happens again even with the changes, I'll post the NMT output
  (I've sketched the commands I plan to use at the bottom of this mail)
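For reference, this is roughly the change I plan to make in flink-conf.yaml, following your suggestion (a sketch only; I'm treating "around 12GB" as exactly 12g, and I'm assuming that setting both the min and max bounds is the way to pin jvm-overhead to a fixed size rather than a fraction):

taskmanager.memory.process.size: 112g
taskmanager.memory.managed.fraction: 0.4
# assumption: setting both bounds pins jvm-overhead to a fixed 12g
taskmanager.memory.jvm-overhead.min: 12g
taskmanager.memory.jvm-overhead.max: 12g

If I've misread how the overhead bounds interact with process.size, please correct me.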
On Fri, Oct 30, 2020 at 3:54 AM Xintong Song <tonysong...@gmail.com> wrote:

> Hi Ori,
>
> I'm not sure about where the problem comes from. There are several
> things that might be worth a try.
> - Further increase the `jvm-overhead`. Your `ps` result suggests that
> the Flink process uses 120+ GB, while `process.size` is configured to
> 112 GB. So I think 2 GB of `jvm-overhead` might not be enough. I would
> suggest tuning `managed.fraction` back to 0.4 and increasing
> `jvm-overhead` to around 12 GB. This should give you roughly the same
> `process.size` as before, while leaving more unmanaged native memory
> space.
> - During the 7-10 days the job is running, are there any
> failovers/restarts? If yes, you might want to look into this comment
> [1] in FLINK-18712.
> - If neither of the above actions helps, we might need to leverage
> tools (e.g., JVM NMT [2]) to track the native memory usage and see
> where exactly the leak comes from.
>
> Thank you~
>
> Xintong Song
>
> [1]
> https://issues.apache.org/jira/browse/FLINK-18712?focusedCommentId=17189138&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17189138
> [2]
> https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
>
> On Thu, Oct 29, 2020 at 7:51 PM Ori Popowski <ori....@gmail.com> wrote:
>
>> Hi Xintong,
>>
>> Unfortunately I cannot upgrade to 1.10.2, because EMR has either
>> 1.10.0 or 1.11.0.
>>
>> About the overhead - it turns out I had already configured
>> taskmanager.memory.jvm-overhead.max to 2 gb instead of the default
>> 1 gb. Should I increase it further?
>>
>> state.backend.rocksdb.memory.managed is already not explicitly
>> configured.
>>
>> Is there anything else I can do?
>>
>> On Thu, Oct 29, 2020 at 1:24 PM Xintong Song <tonysong...@gmail.com>
>> wrote:
>>
>>> Hi Ori,
>>>
>>> RocksDB also uses managed memory. If the memory overuse indeed comes
>>> from RocksDB, then increasing the managed memory fraction will not
>>> help. RocksDB will try to use as much memory as the configured
>>> managed memory size. Therefore, increasing the managed memory
>>> fraction also makes RocksDB try to use more memory. That is why I
>>> suggested increasing `jvm-overhead` instead.
>>>
>>> Please also make sure the configuration option
>>> `state.backend.rocksdb.memory.managed` is either not explicitly
>>> configured, or configured to `true`.
>>>
>>> In addition, I noticed that you are using Flink 1.10.0. You might
>>> want to upgrade to 1.10.2 to include the latest bug fixes of the
>>> 1.10 release.
>>>
>>> Thank you~
>>>
>>> Xintong Song
>>>
>>> On Thu, Oct 29, 2020 at 4:41 PM Ori Popowski <ori....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> PID 20331 is indeed the Flink process, specifically the TaskManager
>>>> process.
>>>>
>>>> - The workload is a streaming job reading from Kafka and writing to
>>>> S3 using a custom sink
>>>> - The RocksDB state backend is used with default settings
>>>> - My external dependencies are:
>>>> -- logback
>>>> -- jackson
>>>> -- flatbuffers
>>>> -- jaxb-api
>>>> -- scala-java8-compat
>>>> -- apache commons-io
>>>> -- apache commons-compress
>>>> -- software.amazon.awssdk s3
>>>> - What do you mean by UDFs? I've implemented several operators like
>>>> KafkaDeserializationSchema, FlatMap, Map, and ProcessFunction.
>>>>
>>>> We use a SessionWindow with a 30-minute gap and a watermark with a
>>>> 10-minute delay.
>>>>
>>>> We did confirm we have some keys in our job which keep receiving
>>>> records indefinitely, but I'm not sure why this would cause a
>>>> managed memory leak, since this state should be flushed to RocksDB,
>>>> freeing the memory used. We have a guard against this, where we
>>>> track the overall size of all the records for each key, and when it
>>>> reaches 300mb we stop moving the records downstream, which causes
>>>> them to create a session and go through the sink.
>>>>
>>>> About what you suggested - I kind of did this by increasing the
>>>> managed memory fraction to 0.5. It did postpone the occurrence of
>>>> the problem (meaning, the TMs started crashing after 10 days instead
>>>> of 7 days), but it looks like anything I do on that front will only
>>>> postpone the problem, not solve it.
>>>>
>>>> I am attaching the full job configuration.
>>>>
>>>> On Thu, Oct 29, 2020 at 10:09 AM Xintong Song <tonysong...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ori,
>>>>>
>>>>> It looks like Flink indeed uses more memory than expected. I assume
>>>>> the first item with PID 20331 is the Flink process, right?
>>>>>
>>>>> It would be helpful if you could briefly introduce your workload.
>>>>> - What kind of workload are you running? Streaming or batch?
>>>>> - Do you use the RocksDB state backend?
>>>>> - Any UDFs or 3rd-party dependencies that might allocate
>>>>> significant native memory?
>>>>>
>>>>> Moreover, if the metrics show only 20% heap usage, I would suggest
>>>>> configuring a smaller `task.heap.size`, leaving more memory to
>>>>> off-heap. The reduced heap size does not necessarily all go to the
>>>>> managed memory. You can also try increasing the `jvm-overhead`,
>>>>> simply to leave more native memory in the container in case there
>>>>> are other significant native memory usages.
>>>>>
>>>>> Thank you~
>>>>>
>>>>> Xintong Song
>>>>>
>>>>> On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski <ori....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Xintong,
>>>>>>
>>>>>> See here:
>>>>>>
>>>>>> # Top memory users
>>>>>> ps auxwww --sort -rss | head -10
>>>>>> USER    PID    %CPU %MEM  VSZ        RSS        TTY  STAT  START  TIME     COMMAND
>>>>>> yarn    20339  35.8 97.0  128600192  126672256  ?    Sl    Oct15  5975:47  /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max
>>>>>> root    5245    0.1  0.4    5580484     627436  ?    Sl    Jul30   144:39  /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
>>>>>> hadoop  5252    0.1  0.4    7376768     604772  ?    Sl    Jul30   153:22  /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
>>>>>> yarn    26857   0.3  0.2    4214784     341464  ?    Sl    Sep17   198:43  /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf
>>>>>> root    5519    0.0  0.2    5658624     269344  ?    Sl    Jul30    45:21  /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea
>>>>>> root    1781    0.0  0.0     172644       8096  ?    Ss    Jul30     2:06  /usr/lib/systemd/systemd-journald
>>>>>> root    4801    0.0  0.0    2690260       4776  ?    Ssl   Jul30     4:42  /usr/bin/amazon-ssm-agent
>>>>>> root    6566    0.0  0.0     164672       4116  ?    R     00:30     0:00  ps auxwww --sort -rss
>>>>>> root    6532    0.0  0.0     183124       3592  ?    S     00:30     0:00  /usr/sbin/CROND -n
>>>>>>
>>>>>> On Wed, Oct 28, 2020 at 11:34 AM Xintong Song <tonysong...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ori,
>>>>>>>
>>>>>>> The error message suggests that there's not enough physical
>>>>>>> memory on the machine to satisfy the allocation. This does not
>>>>>>> necessarily mean a managed memory leak.
>>>>>>> A managed memory leak is only one of the possibilities. There
>>>>>>> are other potential reasons, e.g., another process/container on
>>>>>>> the machine used more memory than expected, the Yarn NM is not
>>>>>>> configured with enough memory reserved for the system processes,
>>>>>>> etc.
>>>>>>>
>>>>>>> I would suggest first looking into the machine's memory usage, to
>>>>>>> see whether the Flink process indeed uses more memory than
>>>>>>> expected. This could be achieved via:
>>>>>>> - Running the `top` command
>>>>>>> - Looking into the `/proc/meminfo` file
>>>>>>> - Any container memory usage metrics that are available to your
>>>>>>> Yarn cluster
>>>>>>>
>>>>>>> Thank you~
>>>>>>>
>>>>>>> Xintong Song
>>>>>>>
>>>>>>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski <ori....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> After the job has been running for 10 days in production,
>>>>>>>> TaskManagers start failing with:
>>>>>>>>
>>>>>>>> Connection unexpectedly closed by remote task manager
>>>>>>>>
>>>>>>>> Looking at the machine logs, I can see the following error:
>>>>>>>>
>>>>>>>> ============= Java processes for user hadoop =============
>>>>>>>> OpenJDK 64-Bit Server VM warning: INFO:
>>>>>>>> os::commit_memory(0x00007fb4f4010000, 1006567424, 0) failed;
>>>>>>>> error='Cannot allocate memory' (err
>>>>>>>> #
>>>>>>>> # There is insufficient memory for the Java Runtime Environment
>>>>>>>> to continue.
>>>>>>>> # Native memory allocation (mmap) failed to map 1006567424 bytes
>>>>>>>> for committing reserved memory.
>>>>>>>> # An error report file with more information is saved as:
>>>>>>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log
>>>>>>>> =========== End java processes for user hadoop ===========
>>>>>>>>
>>>>>>>> In addition, the metrics for the TaskManager show very low heap
>>>>>>>> memory consumption (20% of Xmx).
>>>>>>>>
>>>>>>>> Hence, I suspect there is a memory leak in the TaskManager's
>>>>>>>> managed memory.
>>>>>>>>
>>>>>>>> This is my TaskManager's memory breakdown:
>>>>>>>> flink process            112g
>>>>>>>> framework.heap.size      0.2g
>>>>>>>> task.heap.size           50g
>>>>>>>> managed.size             54g
>>>>>>>> framework.off-heap.size  0.5g
>>>>>>>> task.off-heap.size       1g
>>>>>>>> network                  2g
>>>>>>>> XX:MaxMetaspaceSize      1g
>>>>>>>>
>>>>>>>> As you can see, the managed memory is 54g, so it's already high
>>>>>>>> (my managed.fraction is set to 0.5).
>>>>>>>>
>>>>>>>> I'm running Flink 1.10. Full job details attached.
>>>>>>>>
>>>>>>>> Can someone advise what could cause a managed memory leak?
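P.S. Regarding the NMT output: this is roughly how I plan to collect it, based on the Oracle doc in [2] above (a sketch; I'm assuming env.java.opts.taskmanager is the right place to pass the extra JVM flag in our Yarn/EMR setup, and <taskmanager-pid> is a placeholder for the TaskManager's actual pid):

# flink-conf.yaml: enable Native Memory Tracking on the TaskManager JVMs
env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary

# on the affected node, snapshot the TaskManager's native memory usage
jcmd <taskmanager-pid> VM.native_memory summary
# take a baseline shortly after startup and diff against it once memory has grown
jcmd <taskmanager-pid> VM.native_memory baseline
jcmd <taskmanager-pid> VM.native_memory summary.diff

Please let me know if a detailed NMT report or a different collection method would be more useful.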