Hey Tim, what is your Flink job doing? Is it restarting from time to time? Is the JobManager crashing, or the TaskManager?
On Tue, Nov 10, 2020 at 6:01 PM Matthias Pohl <matth...@ververica.com> wrote:
> Hi Tim,
> I'm not aware of any memory-related issues specific to the deployment
> mode used. Have you checked the logs for hints? Additionally, you could
> try to extract a heap dump. That might help you analyze the cause of the
> memory consumption.
>
> The TaskManager and JobManager log the effective memory-related
> configuration during startup. You can look out for the "Preconfiguration"
> section in each of the log files to get a drill-down of how much memory
> is used per memory pool.
>
> Best,
> Matthias
>
> On Tue, Nov 10, 2020 at 3:37 PM Tim Eckhardt <tim.eckha...@uniberg.com> wrote:
>
>> Hi there,
>>
>> I have a problem with running a Flink job in job cluster mode using
>> Flink 1.11.1 (also tried 1.11.2).
>>
>> The same job runs well in session cluster mode, as well as with Flink
>> 1.10.0 in job cluster mode.
>>
>> The job starts and runs for quite some time, but it runs a lot slower
>> than in session cluster mode and crashes after about an hour. I can
>> observe in the Flink dashboard that the JVM heap stays constant at a
>> high level and slowly approaches the limit (4.13 GB in my case), which
>> it reaches shortly before the job crashes.
>>
>> There is also some G1_Old_Generation garbage collection going on, which
>> I do not observe in session mode.
>>
>> GC values after running for about 45 min:
>>
>> Collector            Count    Time
>> G1_Young_Generation  1,250    107,937
>> G1_Old_Generation    322      2,432,362
>>
>> Compared to the GC values of the same job in session cluster mode
>> (after the same runtime):
>>
>> Collector            Count    Time
>> G1_Young_Generation  1,920    20,575
>> G1_Old_Generation    0        0
>>
>> So my vague guess is that it has to be something memory-related, maybe
>> configuration-wise.
>>
>> To simplify the setup, only one JobManager and one TaskManager are
>> used. The TaskManager has a memory setting of
>> taskmanager.memory.process.size: 10000m, which should be totally fine
>> for the server. The JobManager has a defined heap_size of 1600m.
>>
>> Maybe somebody has experienced something like this before?
>>
>> Also, is there a way to export the currently loaded configuration
>> parameters of the job- and taskmanagers in a cluster? For example, I
>> can't see the current memory process size of the TaskManager in the
>> Flink dashboard. That way I could compare the running and crashing
>> setups more easily (I'm using Docker and environment variables for
>> configuration at the moment, which makes it a bit harder to debug).
>>
>> Thanks.
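
For the heap dump Matthias suggested, one option (just a sketch, assuming
you can pass extra JVM options to the containers via flink-conf.yaml) is to
let the TaskManager JVM dump its heap when it hits an OutOfMemoryError:

    # flink-conf.yaml: extra JVM options for the TaskManager process
    env.java.opts.taskmanager: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/taskmanager.hprof"

Alternatively, you can take a dump manually while the heap is still growing,
e.g. with jmap against the TaskManager's JVM pid inside the container:

    jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <taskmanager-jvm-pid>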
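
Regarding exporting the currently loaded configuration: if I remember
correctly, the JobManager's REST API exposes the cluster configuration, so
something along these lines (assuming the default REST port 8081 and that
the endpoint paths are unchanged in 1.11) should let you diff the running
and crashing setups:

    # cluster-wide configuration as seen by the JobManager
    curl http://<jobmanager-host>:8081/jobmanager/config

    # list TaskManagers and inspect one of them (detail view should include
    # memory information)
    curl http://<jobmanager-host>:8081/taskmanagers
    curl http://<jobmanager-host>:8081/taskmanagers/<taskmanager-id>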