Hi Team,

I am currently running batch jobs on Flink 1.19.2, and the expected data volume for a certain step is above 1 TB. What I have observed is that every time my job gets close to 30% progress, it fails with the error below after a few retries. I am running this on Kubernetes with 6 CPU and 48 GB of memory for each TaskManager and 3 CPU and 32 GB of memory for the JobManager.
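For reference, those resources roughly correspond to the Flink configuration below. I am assuming a native Kubernetes deployment here; the key names are from the Flink documentation, and anything beyond the CPU/memory figures above is shown with illustrative default values, not my exact config:

    taskmanager.memory.process.size: 48g
    jobmanager.memory.process.size: 32g
    kubernetes.taskmanager.cpu: 6.0
    kubernetes.jobmanager.cpu: 3.0
    # Managed memory (which backs the batch sort buffers) is left at the default fraction.
    taskmanager.memory.managed.fraction: 0.4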
Caused by: java.io.IOException: Cannot write record to fresh sort buffer. Record too large.
    at org.apache.flink.runtime.operators.chaining.SynchronousChainedCombineDriver.collect(SynchronousChainedCombineDriver.java:190)
    at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
    at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:78)
    at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
    at org.apache.flink.api.java.operators.translation.PlanFilterOperator$FlatMapFilter.flatMap(PlanFilterOperator.java:58)
    at org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:113)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:516)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:359)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:751)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
    at java.base/java.lang.Thread.run(Unknown Source)

I have two questions here:

1. How can I handle this "Cannot write record to fresh sort buffer. Record too large" issue?
2. What is the recommended way to handle large records in Flink batch jobs?

Please help me out with this.

Regards,
Suraj
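P.S. The real job is too large to share, but the failing chain in the stack trace (flatMap -> filter -> map -> chained combine) corresponds roughly to a DataSet pipeline like the sketch below. All class names, fields, and the parsing/merging logic are placeholders I made up for illustration, not the actual job code; the point is that the merged payloads per key can grow quite large, which I believe is what overflows the combiner's sort buffer.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupCombineFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class LargeRecordJobSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; the real job reads well over 1 TB from external storage.
        DataSet<String> raw = env.fromElements("key1|payload", "key2|payload");

        DataSet<Tuple2<String, String>> keyed = raw
            // FlatMapDriver in the stack trace
            .flatMap(new FlatMapFunction<String, String>() {
                @Override
                public void flatMap(String line, Collector<String> out) {
                    out.collect(line);
                }
            })
            // PlanFilterOperator$FlatMapFilter in the stack trace
            .filter(line -> line.contains("|"))
            // ChainedMapDriver in the stack trace
            .map(new MapFunction<String, Tuple2<String, String>>() {
                @Override
                public Tuple2<String, String> map(String line) {
                    String[] parts = line.split("\\|", 2);
                    return Tuple2.of(parts[0], parts[1]);
                }
            });

        // The group reduce is combinable, so the optimizer inserts a pre-shuffle combiner.
        // I believe that combiner is the SynchronousChainedCombineDriver in my stack trace,
        // and its managed-memory sort buffer is what a single oversized record overflows.
        keyed.groupBy(0)
             .reduceGroup(new MergePayloads())
             .print();
    }

    /** Merges all payloads for a key; placeholder logic, but the merged payloads can grow large. */
    public static class MergePayloads
            implements GroupReduceFunction<Tuple2<String, String>, Tuple2<String, String>>,
                       GroupCombineFunction<Tuple2<String, String>, Tuple2<String, String>> {

        @Override
        public void reduce(Iterable<Tuple2<String, String>> values,
                           Collector<Tuple2<String, String>> out) {
            String key = null;
            StringBuilder merged = new StringBuilder();
            for (Tuple2<String, String> v : values) {
                key = v.f0;
                merged.append(v.f1);
            }
            out.collect(Tuple2.of(key, merged.toString()));
        }

        @Override
        public void combine(Iterable<Tuple2<String, String>> values,
                            Collector<Tuple2<String, String>> out) {
            reduce(values, out);
        }
    }
}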