Hi Oleg,
this does indeed sound like abnormal behavior. Are you sure that these large
checkpoints are related to the Kafka consumer only? Are there other
operators in the pipeline? Internally, the state kept by a Kafka consumer
is quite minimal and relates only to Kafka partition and offset management
(see the sketch below).
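To give a rough sense of what that state amounts to: it is essentially one
(topic, partition, offset) entry per subscribed partition, so for 32 partitions
it should be in the kilobyte range. Below is a simplified, self-contained sketch
of that pattern; it is not the actual connector code, and the class and state
names are made up for illustration:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

import java.util.HashMap;
import java.util.Map;

/**
 * Simplified sketch (not the real FlinkKafkaConsumerBase): the consumer's
 * checkpointed state is conceptually just a (topic, partition) -> offset map,
 * stored as a list state of small tuples. With 32 partitions this is on the
 * order of kilobytes, nowhere near gigabytes.
 */
public class PartitionOffsetTrackingSketch implements CheckpointedFunction {

    // current offsets, updated while records are consumed
    private final Map<Tuple2<String, Integer>, Long> currentOffsets = new HashMap<>();

    // Flink-managed state that ends up in the checkpoint
    private transient ListState<Tuple2<Tuple2<String, Integer>, Long>> offsetState;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore().getUnionListState(
                new ListStateDescriptor<>(
                        "topic-partition-offsets",
                        TypeInformation.of(new TypeHint<Tuple2<Tuple2<String, Integer>, Long>>() {})));

        if (context.isRestored()) {
            for (Tuple2<Tuple2<String, Integer>, Long> entry : offsetState.get()) {
                currentOffsets.put(entry.f0, entry.f1);
            }
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // one small tuple per partition -- this is all that should be checkpointed
        offsetState.clear();
        for (Map.Entry<Tuple2<String, Integer>, Long> entry : currentOffsets.entrySet()) {
            offsetState.add(Tuple2.of(entry.getKey(), entry.getValue()));
        }
    }
}
```

So unless something else ends up in that operator's state, 55 GB cannot come
from the offsets alone.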
If you are sure that the Kafka consumer really does produce such a state size,
I would recommend attaching a remote debugger and checking what is
checkpointed in the corresponding `FlinkKafkaConsumerBase#snapshotState`.
Regards,
Timo
On 15.04.20 03:37, Oleg Vysotsky wrote:
Hello,
Sometimes our Flink job starts creating large checkpoints in which the Kafka
source accounts for 55 GB (instead of 2 MB). Once the job has created the first
“abnormal” checkpoint, all subsequent checkpoints are “abnormal” as well. The
Flink job cannot be restored from such a checkpoint: the restore hangs or fails,
the Flink dashboard hangs, and the Flink cluster crashes while restoring from
the checkpoint. We did not catch a related error message, and we have not found
a clear way to reproduce the problem (i.e. to make the job create “abnormal”
checkpoints).
Configuration:
We are using Flink 1.8.1 on EMR (EMR 5.27).
Kafka: Confluent Kafka 5.4.1
Flink Kafka connector: org.apache.flink:flink-connector-kafka_2.11:1.8.1
(which brings in the org.apache.kafka:kafka-clients:2.0.1 dependency)
Our input Kafka topic has 32 partitions and the corresponding Flink source
runs with a parallelism of 32.
We use pretty much all the default Flink Kafka consumer settings; we only
specify
CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG,
ConsumerConfig.GROUP_ID_CONFIG, and
CommonClientConfigs.SECURITY_PROTOCOL_CONFIG.
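For reference, the consumer setup looks roughly like the sketch below (the
topic name, group id, deserialization schema, and security protocol value are
placeholders, not our actual values):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class KafkaSourceJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // only these three settings are set explicitly; everything else is left at its default
        Properties props = new Properties();
        props.setProperty(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "our-consumer-group");          // placeholder
        props.setProperty(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");      // placeholder

        // "input-topic" and SimpleStringSchema stand in for the real topic and schema
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props);

        // the topic has 32 partitions, so the source runs with parallelism 32
        DataStream<String> stream = env.addSource(consumer).setParallelism(32);

        stream.print(); // placeholder for the rest of the pipeline

        env.execute("kafka-source-job-sketch");
    }
}
```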
Thanks a lot in advance!
Oleg