[ 
https://issues.apache.org/jira/browse/FLINK-36290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884172#comment-17884172
 ] 

Matthias Pohl edited comment on FLINK-36290 at 9/24/24 8:21 AM:
----------------------------------------------------------------

{quote}
I had a look at all the OOM cases; they don't seem to be related. I thought I 
could find some tests that cause the OOM most often, but it looks like they all 
happen in different tests/projects:
{quote}

[~showuon] For OutOfMemoryErrors it's quite normal that multiple test stages 
are affected. We've seen this in the past with Alibaba VMs. What's new here is 
that actual Azure VMs are affected (which I would have assumed to have proper 
isolation). Hence, consider FLINK-36291 as well when investigating the OOM (use 
the log timestamps to check whether the OOMs might be related).

Usually, you want to look for a diff of the test stages between multiple failed 
builds (i.e., across all the test runs where OOMs occurred, which test stage 
was always affected). That would give you a hint as to which test stage is 
causing the issue. My bet would be on the Hive dependency, based on the CI runs.
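
To make that concrete, here is a rough sketch of the kind of cross-check I have 
in mind (the build IDs and stage names below are placeholders, not real data 
from the failed runs):

{code}
# Rough sketch: intersect the test stages affected by OOMs across several
# failed builds to spot a stage that is involved in every one of them.
# Build IDs and stage names are placeholders to be filled in from the logs.

affected_stages = {
    62173: {"connect_1", "misc"},
    62101: {"connect_1", "core"},
    62054: {"connect_1", "tests"},
}

common = set.intersection(*affected_stages.values())
print("stage(s) affected in every OOM build:", common or "none")
{code}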

The problem with the Azure VMs is that we do not have access to them, which 
makes it harder to get a heap dump from those machines. You might need to do 
some research to check whether there is a way to retrieve heap dumps from Azure.
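
If we can get the test JVMs to write dumps in the first place (e.g. via 
-XX:+HeapDumpOnOutOfMemoryError) and have the pipeline publish the dump 
directory as a build artifact, retrieving them should not require direct access 
to the VMs. Here is an untested sketch of that last step via the Azure DevOps 
REST API (the artifact name and the API version are assumptions on my side):

{code}
# Untested sketch: download a published build artifact (e.g. a heap-dump
# directory) from Azure DevOps. Assumes the pipeline publishes an artifact
# named "heap-dumps"; that name and the API version are guesses, and
# non-public resources would additionally need an auth token.
import urllib.request

ORG, PROJECT, BUILD_ID = "apache-flink", "apache-flink", 62173

url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/build/"
       f"builds/{BUILD_ID}/artifacts?artifactName=heap-dumps"
       f"&$format=zip&api-version=7.0")

with urllib.request.urlopen(url) as resp, open("heap-dumps.zip", "wb") as out:
    out.write(resp.read())
{code}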

Apache Infra is not involved with AzureCI. That's a Flink project-specific 
thing.



> OutOfMemoryError in connect test run
> ------------------------------------
>
>                 Key: FLINK-36290
>                 URL: https://issues.apache.org/jira/browse/FLINK-36290
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Tests
>    Affects Versions: 2.0-preview
>            Reporter: Matthias Pohl
>            Priority: Blocker
>
> We saw an OOM in the connect stage that caused a fatal error:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62173&view=logs&j=1c002d28-a73d-5309-26ee-10036d8476b4&t=d1c117a6-8f13-5466-55f0-d48dbb767fcd&l=12182
> {code}
> 03:19:59,975 [   flink-scheduler-1] ERROR org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: Thread 'flink-scheduler-1' produced an uncaught exception. Stopping the process...
> java.lang.OutOfMemoryError: Java heap space
> [...]
> 03:19:59,981 [jobmanager_62-main-scheduler-thread-1] ERROR org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: Thread 'jobmanager_62-main-scheduler-thread-1' produced an uncaught exception. Stopping the process...
> java.lang.OutOfMemoryError: Java heap space
> [...]
> {code}


