Hi, I was looking into this more, and I have a couple of suspicions, but it’s still hard to tell which one is correct. Could you, for example, place a breakpoint (or add code there to print a stack trace) in org.apache.log4j.helpers.AppenderAttachableImpl#addAppender and check who is calling it? It seems like this method is responsible for the growing number of ConsoleAppenders.
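If attaching a debugger is inconvenient, the stack-trace approach can be sketched like this (a minimal standalone sketch; the class name and label are made up, and in practice you would add the equivalent line inside AppenderAttachableImpl#addAppender and rebuild log4j, or just set a breakpoint there):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraceProbe {

    // Capture the current call stack as a String without throwing anything.
    // Placed inside AppenderAttachableImpl#addAppender, this reveals who
    // keeps registering the extra ConsoleAppenders.
    public static String currentStack(String label) {
        StringWriter sw = new StringWriter();
        new Throwable("probe: " + label).printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }

    public static void main(String[] args) {
        // In the real patch you would route this to a logger or stderr.
        System.out.println(currentStack("addAppender called"));
    }
}
```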
Piotrek

> On 20 Jun 2018, at 09:20, Fabian Wollert <fab...@zalando.de> wrote:
>
> Hi Piotr, thanks for the hints. I checked the logs of the stack the previous heap dump was from; there are no error messages from the BlobServer, it seems to work properly.
>
> But I found another issue in my setup: my logging was not set up properly, so everything was going to the default console appender. I changed it now to:
>
> log4j.rootLogger=INFO, FILE
> log4j.logger.akka=INFO, FILE
> log4j.logger.org.apache.kafka=INFO, FILE
> log4j.logger.org.apache.hadoop=INFO, FILE
> log4j.logger.org.apache.zookeeper=INFO, FILE
>
> # Log all info in the given file
> log4j.appender.FILE=org.apache.log4j.RollingFileAppender
> log4j.appender.FILE.File=/opt/flink/log/flink.log
> log4j.appender.FILE.MaxFileSize=100MB
> log4j.appender.FILE.MaxBackupIndex=2
> log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
> log4j.appender.FILE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %c:%L - %m%n
>
> # suppress the irrelevant (wrong) warnings from the netty channel handler
> log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, FILE
>
> Though I have this set up now, I still see memory increasing, and this time it again looks like my first suspicion is valid:
>
> <image.png>
>
> <image.png>
>
> What I'm mostly wondering now: why is a ConsoleAppender still being used although I defined a RollingFileAppender?
>
> Sorry for the back and forth between different parts of the code, but your help is highly appreciated!
>
> Cheers
>
> --
>
> Fabian Wollert
> Zalando SE
>
> E-Mail: fabian.woll...@zalando.de <mailto:fabian.woll...@zalando.de>
>
> On Tue, 19 Jun 2018 at 11:55, Piotr Nowojski <pi...@data-artisans.com <mailto:pi...@data-artisans.com>> wrote:
> Hi,
>
> Can you search the logs / stderr / stdout for log entries like:
>
> log.warn("Failed to locally delete blob " ...) ?
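One log4j 1.x detail worth checking in the configuration above: loggers are additive by default, so `log4j.logger.akka=INFO, FILE` attaches FILE in addition to the appenders inherited from the root logger, and each event is then written twice. It also does not by itself remove a ConsoleAppender; one commonly appears when another log4j.properties on the classpath wins, or when something calls BasicConfigurator.configure() programmatically. A sketch of the usual cleanup (assuming log4j 1.x properties syntax):

```properties
# Either set only the level on child loggers (they inherit FILE from root) ...
log4j.logger.akka=INFO
# ... or keep the explicit appender but switch additivity off, so events
# are not delivered to both the child's and the root's appenders:
log4j.logger.org.apache.kafka=INFO, FILE
log4j.additivity.org.apache.kafka=false
```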
>
> I see in the code that if file deletion fails, for whatever reason, TransientBlobCleanupTask can loop indefinitely trying to remove it over and over again. That might be OK; however, it does so without any back-off time, as fast as possible.
>
> To confirm this, could you take a couple of thread dumps and check whether some thread is spinning in org.apache.flink.runtime.blob.TransientBlobCleanupTask#run?
>
> If that is indeed the case, the question would be why file deletion fails.
>
> Piotrek
>
>> On 18 Jun 2018, at 15:48, Fabian Wollert <fab...@zalando.de <mailto:fab...@zalando.de>> wrote:
>>
>> Hi Piotrek, thanks a lot for your answer and sorry for the late response. I ran some more tests, but I still get the same problem. I had already analyzed a heap dump with VisualVM, and that's how I got the impression it was some S3 logging, but it seems I was wrong. On the newer tests, the heap dump says the following (this time I used Eclipse Memory Analyzer):
>>
>> <image.png>
>> <image.png>
>> <image.png>
>>
>> Are you aware of problems with the BlobServer not cleaning up properly? I also tried using a bigger instance, but it never stabilizes, it just keeps increasing (I already gave it 10+ GB of heap) ...
>>
>> Cheers
>>
>> --
>>
>> Fabian Wollert
>> Zalando SE
>>
>> E-Mail: fabian.woll...@zalando.de <mailto:fabian.woll...@zalando.de>
>>
>> On Mon, 11 Jun 2018 at 10:46, Piotr Nowojski <pi...@data-artisans.com <mailto:pi...@data-artisans.com>> wrote:
>> Hi,
>>
>> What kind of messages are those "logs about S3 operations"? Did you try a Google search for them? Maybe it's a known S3 issue?
>>
>> Another approach is to use a heap space analyser with which you can backtrack the classes that are referencing those "memory leaks", and again search for any known memory issues.
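The thread dumps asked for above are usually taken externally with `jstack <pid>`, but the same information is available in-process; a minimal sketch (class name is illustrative) that renders every thread's stack, in which a spinning TransientBlobCleanupTask would show up with the same frames dump after dump:

```java
import java.util.Map;

public class ThreadDumpProbe {

    // Render a full thread dump of the current JVM as a String,
    // similar in spirit to what `jstack <pid>` prints.
    public static String fullThreadDump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(e.getKey().getName()).append("\"\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Taking a few dumps a couple of seconds apart and grepping them for
        // TransientBlobCleanupTask#run shows whether that thread is spinning.
        System.out.println(fullThreadDump());
    }
}
```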
>>
>> It could also just mean that it's not a memory leak, but that you simply need to allocate more heap space for your JVM (and memory consumption will stabilise at some point).
>>
>> Piotrek
>>
>>> On 8 Jun 2018, at 18:32, Fabian Wollert <fab...@zalando.de <mailto:fab...@zalando.de>> wrote:
>>>
>>> Hi, in this email thread <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-AWS-S3-integration-java-lang-NullPointerException-null-uri-host-td20413.html> I tried to set up S3 as a filesystem backend for checkpoints. Now everything is working (Flink 1.5.0), but the JobMaster is accumulating heap space, eventually killing itself with a heap space OOM after several hours. If I don't enable checkpointing, everything is fine. I'm using the Flink shaded S3 libs (tried both the Hadoop and the Presto lib, no difference in this regard) from the tutorial. My checkpoint settings are these (job level):
>>>
>>> env.enableCheckpointing(1000);
>>> env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>> env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
>>> env.getCheckpointConfig().setCheckpointTimeout(60000);
>>> env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
>>>
>>> Another clue why I suspect the S3 checkpointing is that the heap dump contains a lot of char[] objects with some logs about S3 operations.
>>>
>>> Does anyone have an idea where to look further on this?
>>>
>>> Cheers
>>>
>>> --
>>>
>>> Fabian Wollert
>>> Zalando SE
>>>
>>> E-Mail: fabian.woll...@zalando.de <mailto:fabian.woll...@zalando.de>