Re: Flink 1.1.3 OOME Permgen

Konstantin Knauf Wed, 30 Nov 2016 08:51:44 -0800

Hi Stefan,

unfortunately, I cannot share any heap dumps with you. I was able to
resolve some of the issues myself today; the root causes were different
for different jobs.

1) Jackson 2.7.2 (which comes with Flink) has a known class loading
issue (see https://github.com/FasterXML/jackson-databind/issues/1363).
Shipping a shaded version of Jackson 2.8.4 with our user code helped. I
recommend upgrading Flink's Jackson version soon.

2) We have a dependency on flink-table [1], which ships with
Calcite, including the Calcite JDBC driver. That driver cannot be
garbage-collected because of the known problem with
java.sql.DriverManager (it keeps a static reference to every registered
driver, which pins the user code class loader). Putting flink-table in
Flink's lib dir instead of shipping it with the user code helps. You
should update the documentation, because I think this will always happen
when using flink-table. So I wonder why this has not come up before.

3) Unresolved: Some threads in a custom source are not properly
shut down and keep references to the UserCodeClassLoader. I have not had
time to look into this issue so far.
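
For the kind of problem described in 3), a minimal sketch of a worker
whose thread is stopped and joined on shutdown, so that it no longer
references classes from the user code class loader afterwards. This is
not our actual source: the name PollingWorker, the 10 ms sleep and the
5 second timeout are assumptions for illustration. In a Flink source one
would presumably call close() from the source's cancel()/close() methods.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: owns one background thread and guarantees it is
// gone after close() returns (or the timeout has elapsed).
final class PollingWorker implements AutoCloseable {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private volatile boolean running = true;

    void start() {
        executor.submit(new Runnable() {
            @Override
            public void run() {
                while (running && !Thread.currentThread().isInterrupted()) {
                    try {
                        Thread.sleep(10); // placeholder for the real polling work
                    } catch (InterruptedException e) {
                        // restore the flag and leave the loop
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        });
    }

    @Override
    public void close() {
        running = false;
        executor.shutdownNow(); // interrupts the worker thread
        try {
            // wait so the thread is really finished before we return
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isTerminated() {
        return executor.isTerminated();
    }
}
```

The point is that the shutdown path both signals the thread and waits
for it; a thread that is only flagged but never joined can still be
alive when the job's class loader should be released.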

Cheers,

Konstantin

[1] Side note: We only need flink-table for the "Row" class used in the
JdbcOutputFormat, so it might make sense to move this class somewhere
else. Naturally, we also tried to exclude the "transitive" dependency on
org.apache.calcite, until we noticed that Calcite is packaged inside
flink-table, so that you cannot even exclude it. What is the reason
for this?
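
Regarding the DriverManager problem in 2): besides moving the jar to the
lib dir, a workaround often used for this class of leak is to deregister
the JDBC drivers that were loaded by the user code class loader when the
job shuts down. The following is only a sketch under that assumption; we
did not try it, and the class and method names are made up. It uses only
java.sql.DriverManager.getDrivers() and deregisterDriver(), and it has to
be called from code loaded by the same class loader, because
DriverManager only exposes drivers visible to the caller.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;

// Hypothetical helper, not part of Flink or Calcite.
final class JdbcDriverCleanup {

    private JdbcDriverCleanup() {
    }

    // Deregisters every JDBC driver whose class was loaded by the given
    // class loader and returns how many drivers were removed.
    static int deregisterDriversOf(ClassLoader loader) {
        int removed = 0;
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            if (driver.getClass().getClassLoader() == loader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed++;
                } catch (SQLException e) {
                    // could not deregister; the driver stays registered
                }
            }
        }
        return removed;
    }
}
```

A call such as
JdbcDriverCleanup.deregisterDriversOf(getClass().getClassLoader()) at the
end of the job would be the intended usage. Whether this is sufficient
for the Calcite driver in particular is an open question.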

On 30.11.2016 00:55, Stefan Richter wrote:
> Hi,
> 
> could you somehow provide us with a heap dump from a TM that has run for a while 
> (ideally, shortly before an OOME)? This would greatly help us to figure out 
> if there is a classloader leak that causes the problem.
> 
> Best,
> Stefan
> 
>> On 29.11.2016 at 18:39, Konstantin Knauf 
>> <konstantin.kn...@tngtech.com> wrote:
>>
>> Hi everyone, 
>>
>> since upgrading to Flink 1.1.3 we observe frequent TaskManager failures 
>> with OOME PermGen. Monitoring the PermGen size on one of the TaskManagers, 
>> you can see that each job (new jobs and restarts) adds a few MB which cannot 
>> be collected. Eventually, the OOME happens. This happens with all our jobs, 
>> streaming and batch, on YARN 2.4 as well as standalone. 
>>
>> On Flink 1.0.2 this was not a problem, but I will investigate it further.
>>
>> The assumption is that Flink is somehow using one of the classes that 
>> come with our jar and thereby prevents the GC of the whole class loader. 
>> Our jars do not include any Flink dependencies (compileOnly), but of 
>> course many others.
>>
>> Any ideas anyone? 
>>
>> Cheers and thank you, 
>>
>> Konstantin 
>>
>> sent from my phone. Plz excuse brevity and tpyos.
>> ---
>> Konstantin Knauf *konstantin.kn...@tngtech.com * +49-174-3413182
>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> 
> 

-- 
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
