Rui Fan created FLINK-33354:
-------------------------------

             Summary: Reuse the TaskInformation for multiple slots
                 Key: FLINK-33354
                 URL: https://issues.apache.org/jira/browse/FLINK-33354
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Task
    Affects Versions: 1.17.1, 1.18.0
            Reporter: Rui Fan
            Assignee: Rui Fan
The background is similar to FLINK-33315. A Hive table has a lot of data, and the HiveSource#partitionBytes is 281 MB. When slotPerTM = 4, one TM will run 4 HiveSources at the same time.

How does the TaskExecutor submit a large task?
# TaskExecutor#loadBigData reads all bytes from the file into a SerializedValue<TaskInformation>
** The SerializedValue<TaskInformation> holds a byte[]
** It costs heap memory
** It will be greater than 281 MB, because it stores not only HiveSource#partitionBytes but also the other fields of TaskInformation
# The TaskInformation is generated from the SerializedValue<TaskInformation>
** TaskExecutor#submitTask calls tdd.getSerializedTaskInformation().deserializeValue()
** tdd.getSerializedTaskInformation() is the SerializedValue<TaskInformation>
** Deserializing it produces the TaskInformation
** The TaskInformation includes the Configuration taskConfiguration
** The taskConfiguration includes StreamConfig#SERIALIZEDUDF

Based on the above process, TM memory holds 2 big byte arrays for each task:
* The SerializedValue<TaskInformation>
* The TaskInformation

When one TM runs 4 HiveSources at the same time, it holds 8 big byte arrays. In our production environment, this is also a situation that often leads to TM OOM.

h2. Solution:

These payloads are identical because the PermanentBlobKey is the same. We can add a cache for them to reduce the memory and CPU cost.
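As a rough sketch of the idea (the class and method names below are hypothetical and not existing Flink APIs; only PermanentBlobKey and TaskInformation are the real classes), the TaskExecutor could keep a reference-counted map so that all tasks created from the same PermanentBlobKey share one deserialized TaskInformation, instead of each task holding its own SerializedValue<TaskInformation> plus TaskInformation:

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.runtime.blob.PermanentBlobKey;
import org.apache.flink.runtime.executiongraph.TaskInformation;

/**
 * Hypothetical per-TaskExecutor cache: tasks whose task information comes from the
 * same PermanentBlobKey share a single deserialized TaskInformation instance.
 */
final class TaskInformationCache {

    /** One entry per blob key, reference-counted so it can be released. */
    private static final class Entry {
        final TaskInformation taskInformation;
        int referenceCount;

        Entry(TaskInformation taskInformation) {
            this.taskInformation = taskInformation;
        }
    }

    /** Loader supplied by the caller, e.g. reading the blob file and deserializing it. */
    interface Loader {
        TaskInformation load() throws Exception;
    }

    private final Map<PermanentBlobKey, Entry> cache = new HashMap<>();

    /** Returns the shared TaskInformation for the key, loading it only on first use. */
    synchronized TaskInformation getOrLoad(PermanentBlobKey key, Loader loader) throws Exception {
        Entry entry = cache.get(key);
        if (entry == null) {
            entry = new Entry(loader.load());
            cache.put(key, entry);
        }
        entry.referenceCount++;
        return entry.taskInformation;
    }

    /** Called when a task finishes; drops the entry once no task references it. */
    synchronized void release(PermanentBlobKey key) {
        Entry entry = cache.get(key);
        if (entry != null && --entry.referenceCount == 0) {
            cache.remove(key);
        }
    }
}
{code}

With something like this, TaskExecutor#submitTask would consult the cache before reading and deserializing the blob, and release the entry when the task reaches a terminal state, so the big byte array is read and deserialized once per blob rather than once per slot.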