[jira] [Commented] (FLINK-24539) ChangelogNormalize operator tooks too long time to INITIALIZING until failed

Piotr Nowojski (Jira) Mon, 25 Oct 2021 05:14:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-24539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433717#comment-17433717
 ]


Piotr Nowojski commented on FLINK-24539:
----------------------------------------

{quote}
(I presume it's busy recovering it's state, unless you are using unaligned 
checkpoints).
{quote}
I meant that your task is taking long time to recover. This means either 
recovering operator's state OR processing in-flight data from unaligned 
checkpoints. Enabling unaligned checkpoints will not help in your situation. In 
contrary, it can extend the recovery phase.

Maybe what you want to do is to just extend checkpoint timeout?

But answering your question, you can enabled unaligned checkpoints for example 
via 
[flink-conf.yaml|https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#execution-checkpointing-unaligned]
 or 
{{org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#getCheckpointConfig#enableUnalignedCheckpoints()}}
 call. Note, enabling unaligned checkpoints will most likely increase your 
state size. Please read more about it for example 
[here|https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/unaligned_checkpoints/].



> ChangelogNormalize operator tooks too long time to INITIALIZING until failed
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-24539
>                 URL: https://issues.apache.org/jira/browse/FLINK-24539
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Task, Table SQL / 
> Runtime
>    Affects Versions: 1.13.1
>         Environment: Flink version :1.13.1
> TaskManager memory:
> !image-2021-10-14-13-36-56-899.png|width=578,height=318!
> JobManager memory:
> !image-2021-10-14-13-37-51-445.png|width=578,height=229!
>            Reporter: vmaster.cc
>            Priority: Major
>         Attachments: image-2021-10-14-13-19-08-215.png, 
> image-2021-10-14-13-36-56-899.png, image-2021-10-14-13-37-51-445.png, 
> image-2021-10-14-14-13-13-370.png, image-2021-10-14-14-15-40-101.png, 
> image-2021-10-14-14-16-33-080.png, 
> taskmanager_container_e11_1631768043929_0012_01_000004_log.txt
>
>
> I'm using debezium to produce cdc from mysql, considering its at least one 
> delivery, so i must set the config 
> 'table.exec.source.cdc-events-duplicate=true'.
> But when some unknown case make my task down, flink task restart  failed 
> always. I found that ChangelogNormalize operator tooks too long time in 
> INITIALIZING stage.
>  
> screenshot and log fragment are as follows:
> !image-2021-10-14-13-19-08-215.png|width=567,height=293!
>  
> {code:java}
> 2021-10-14 12:32:33,660 INFO  
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder [] - 
> Finished building RocksDB keyed state-backend at 
> /data3/yarn/nm/usercache/flink/appcache/application_1631768043929_0012/flink-io-f31735c3-e726-4c49-89a5-916670809b7a/job_7734977994a6a10f7cc784d50e4a1a34_op_KeyedProcessOperator_dc2290bb6f8f5cd2bd425368843494fe__1_1__uuid_6cbbe6ae-f43e-4d2a-b1fb-f0cb71f257af.2021-10-14
>  12:32:33,662 INFO  org.apache.flink.runtime.taskmanager.Task                 
>    [] - GroupAggregate(groupBy=[teacher_id, create_day], select=[teacher_id, 
> create_day, SUM_RETRACT($f2) AS teacher_courseware_count]) -> 
> Calc(select=[teacher_id, create_day, CAST(teacher_courseware_count) AS 
> teacher_courseware_count]) -> NotNullEnforcer(fields=[teacher_id, 
> create_day]) (1/1)#143 (9cca3ef1293cc6364698381bbda93998) switched from 
> INITIALIZING to RUNNING.2021-10-14 12:38:07,581 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Ignoring 
> checkpoint aborted notification for non-running task 
> ChangelogNormalize(key=[c_id]) -> Calc(select=[c_author_id AS teacher_id, 
> DATE_FORMAT(c_create_time, _UTF-16LE'yyyy-MM-dd') AS create_day, IF((c_state 
> = 10), 1, 0) AS $f2], where=[((c_is_group = 0) AND (c_author_id <> 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"))]) 
> (1/1)#143.2021-10-14 12:38:07,581 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Attempting 
> to cancel task Sink: 
> Sink(table=[default_catalog.default_database.t_flink_school_teacher_courseware_count],
>  fields=[teacher_id, create_day, teacher_courseware_count]) (2/2)#143 
> (cc25f9ae49c4db01ab40ff103fae43fd).2021-10-14 12:38:07,581 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Sink: 
> Sink(table=[default_catalog.default_database.t_flink_school_teacher_courseware_count],
>  fields=[teacher_id, create_day, teacher_courseware_count]) (2/2)#143 
> (cc25f9ae49c4db01ab40ff103fae43fd) switched from RUNNING to 
> CANCELING.2021-10-14 12:38:07,581 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Triggering 
> cancellation of task code Sink: 
> Sink(table=[default_catalog.default_database.t_flink_school_teacher_courseware_count],
>  fields=[teacher_id, create_day, teacher_courseware_count]) (2/2)#143 
> (cc25f9ae49c4db01ab40ff103fae43fd).2021-10-14 12:38:07,583 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Attempting 
> to cancel task Sink: 
> Sink(table=[default_catalog.default_database.t_flink_school_teacher_courseware_count],
>  fields=[teacher_id, create_day, teacher_courseware_count]) (1/2)#143 
> (5419f41a3f0cc6c2f3f4c82c87f4ae22).2021-10-14 12:38:07,583 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Sink: 
> Sink(table=[default_catalog.default_database.t_flink_school_teacher_courseware_count],
>  fields=[teacher_id, create_day, teacher_courseware_count]) (1/2)#143 
> (5419f41a3f0cc6c2f3f4c82c87f4ae22) switched from RUNNING to CANCELING.
> {code}
>  
> attention:
> 1、The table has a large amount of data, up to 500 million. 
> 2、Because the amount of data is very large, the rocksdb state backend is used
> 3、More other env infos ,see next section and the full log see attachment.
> {code:java}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-24539) ChangelogNormalize operator tooks too long time to INITIALIZING until failed

Reply via email to