Re: Flink 1.1.3 OOME Permgen

Konstantin Knauf Mon, 05 Dec 2016 03:34:09 -0800

Hi Robert,

to reproduce the issue, you need a job that actually uses Jackson. The
problematic field is a cache that is filled with every class Jackson has
serialized or deserialized.
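
To illustrate the mechanism, here is a minimal sketch using only the JDK.
It is not Jackson's actual code: the class StaticCacheLeakSketch, the CACHE
field, and the serialize/newUserObject/pins helpers are hypothetical
stand-ins. It assumes a library on the framework classpath keeps a static
cache keyed by Class objects. As soon as a class defined by a user-code
class loader lands in that cache, the cache keeps the loader strongly
reachable, so its classes cannot be unloaded.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaticCacheLeakSketch {

    // Stand-in for a static per-class cache inside a library that lives on the
    // framework classpath (hypothetical; this is not Jackson's actual code).
    static final Map<Class<?>, String> CACHE = new ConcurrentHashMap<>();

    // Simulates "serializing" a value: the library remembers metadata per class.
    static void serialize(Object value) {
        CACHE.putIfAbsent(value.getClass(), value.getClass().getName());
    }

    // Creates an object whose class is defined by the given loader, standing in
    // for a user-code class shipped in the job jar.
    static Object newUserObject(ClassLoader userLoader) {
        InvocationHandler noOp = (proxy, method, args) -> null;
        return Proxy.newProxyInstance(userLoader, new Class<?>[] {Runnable.class}, noOp);
    }

    // True if any cached class was defined by the loader, which means the
    // static cache keeps that loader strongly reachable.
    static boolean pins(ClassLoader loader) {
        for (Class<?> cached : CACHE.keySet()) {
            if (cached.getClassLoader() == loader) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = StaticCacheLeakSketch.class.getClassLoader();
        // Stand-in for the per-job user-code class loader.
        try (URLClassLoader userLoader = new URLClassLoader(new URL[0], parent)) {
            Object userObject = newUserObject(userLoader);
            System.out.println("pinned before use: " + pins(userLoader));
            serialize(userObject);
            System.out.println("pinned after use:  " + pins(userLoader));
        }
    }
}
```

A job that never passes its own classes through the library never
populates the cache, which would explain why WordCount does not trigger
the problem.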


Best,

Konstantin

On 05.12.2016 11:55, Robert Metzger wrote:
> I've submitted Wordcount 410 times to a testing cluster and a streaming
> job 290 times and I could not reproduce the issue with 1.1.3. Also, the
> heapdump of one of the TaskManagers looked pretty normal.
> 
> Do you have any ideas how to reproduce the issue?
> 
> On Fri, Dec 2, 2016 at 3:21 PM, Robert Metzger <rmetz...@apache.org
> <mailto:rmetz...@apache.org>> wrote:
> 
>     Thank you for reporting the issue Konstantin.
>     I've filed a JIRA for the jackson
>     issue: https://issues.apache.org/jira/browse/FLINK-5233
>     <https://issues.apache.org/jira/browse/FLINK-5233>.
>     As I said in the JIRA, I propose to upgrade to Jackson 2.7.8, as
>     this version contains the fix for the issue, but it's not a major
>     Jackson upgrade.
> 
>     Any chance you could try whether 2.7.8 fixes the issue as well?
> 
> 
>     On Fri, Dec 2, 2016 at 11:12 AM, Fabian Hueske <fhue...@gmail.com
>     <mailto:fhue...@gmail.com>> wrote:
> 
>         Hi Konstantin,
> 
>         Regarding 2): I've opened FLINK-5227 to update the documentation
>         [1].
> 
>         Regarding the Row type: The Row type was introduced for
>         flink-table and was later used by other modules. There is
>         FLINK-5186 to move Row and all the related TypeInfo (+serializer
>         and comparator) to flink-core [2]. That should solve your issue.
> 
>         Some of the connector modules which provide TableSource and
>         TableSinks have dependencies on flink-table as well. I'll check
>         that these are optional dependencies so that we do not pull in
>         Calcite through connectors for jobs that do not need it.
> 
>         Thanks,
>         Fabian
> 
>         [1] https://issues.apache.org/jira/browse/FLINK-5227
>         <https://issues.apache.org/jira/browse/FLINK-5227>
>         [2] https://issues.apache.org/jira/browse/FLINK-5186
>         <https://issues.apache.org/jira/browse/FLINK-5186>
> 
>         2016-11-30 17:51 GMT+01:00 Konstantin Knauf
>         <konstantin.kn...@tngtech.com
>         <mailto:konstantin.kn...@tngtech.com>>:
> 
>             Hi Stefan,
> 
>             unfortunately, I cannot share any heap dumps with you. I
>             was able to resolve some of the issues myself today; the
>             root causes were different for different jobs.
> 
>             1) Jackson 2.7.2 (which comes with Flink) has a known class
>             loading
>             issue (see
>             https://github.com/FasterXML/jackson-databind/issues/1363
>             <https://github.com/FasterXML/jackson-databind/issues/1363>).
>             Shipping a shaded version of Jackson 2.8.4 with our user
>             code helped. I
>             recommend upgrading Flink's Jackson version soon.
> 
>             2) We have a dependency on flink-table [1], which ships
>             with Calcite, including the Calcite JDBC driver, which
>             cannot be garbage collected because of the known problem
>             with java.sql.DriverManager. Putting flink-table in
>             Flink's lib dir instead of shipping it with the user code
>             helps. You should update the documentation, because I
>             think this will always happen when using flink-table. So
>             I actually wonder why this has not come up before.
> 
>             3) Unresolved: Some threads in a custom source are not
>             properly shut down and keep references to the
>             UserCodeClassLoader. I have not had time to look into
>             this issue so far.
> 
>             Cheers,
> 
>             Konstantin
> 
>             [1] Side note: We only need flink-table for the "Row" class
>             used in the JdbcOutputFormat, so it might make sense to
>             move this class somewhere else. Naturally, we also tried
>             to exclude the "transitive" dependency on
>             org.apache.calcite until we noticed that Calcite is
>             packaged with flink-table, so that you cannot even
>             exclude it. What is the reason for this?
> 
> 
> 
> 
>             On 30.11.2016 00:55, Stefan Richter wrote:
>             > Hi,
>             >
>             > could you somehow provide us a heap dump from a TM that
>             > has run for a while (ideally, shortly before an OOME)? This
>             > would greatly help us to figure out if there is a
>             > classloader leak that causes the problem.
>             >
>             > Best,
>             > Stefan
>             >
>             >> On 29.11.2016 at 18:39, Konstantin Knauf
>             >> <konstantin.kn...@tngtech.com
>             >> <mailto:konstantin.kn...@tngtech.com>> wrote:
>             >>
>             >> Hi everyone,
>             >>
>             >> since upgrading to Flink 1.1.3 we observe frequent OOME
>             >> PermGen TaskManager failures. Monitoring the PermGen size on
>             >> one of the TaskManagers, you can see that each job (new jobs
>             >> and restarts) adds a few MB, which cannot be collected.
>             >> Eventually, the OOME happens. This happens with all our
>             >> jobs, streaming and batch, on YARN 2.4 as well as standalone.
>             >>
>             >> On Flink 1.0.2 this was not a problem, but I will
>             >> investigate it further.
>             >>
>             >> The assumption is that Flink is somehow using one of the
>             >> classes that come with our jar and thereby prevents the
>             >> GC of the whole class loader. Our jars do not include any
>             >> Flink dependencies though (compileOnly), but of course many
>             >> others.
>             >>
>             >> Any ideas anyone?
>             >>
>             >> Cheers and thank you,
>             >>
>             >> Konstantin
>             >>
>             >> sent from my phone. Plz excuse brevity and tpyos.
>             >> ---
>             >> Konstantin Knauf *konstantin.kn...@tngtech.com
>             <mailto:konstantin.kn...@tngtech.com> * +49-174-3413182
>             <tel:%2B49-174-3413182>
>             >> TNG Technology Consulting GmbH, Betastr. 13a, 85774
>             Unterföhring
>             >> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr.
>             Robert Dahlke
>             >
>             >
> 
>             --
>             Konstantin Knauf * konstantin.kn...@tngtech.com
>             <mailto:konstantin.kn...@tngtech.com> * +49-174-3413182
>             <tel:%2B49-174-3413182>
>             TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>             Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert
>             Dahlke
>             Sitz: Unterföhring * Amtsgericht München * HRB 135082
> 
> 
> 
> 

-- 
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
