Re: [PR] [FLINK-36685][Kubernetes Operator] allow CREATE/UPDATE operation on flinkdeployments resource on webhook mutation endpoint [flink-kubernetes-operator]
gyfora merged PR #916: URL: https://github.com/apache/flink-kubernetes-operator/pull/916 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-35859) [flink-cdc] Fix: The assigner is not ready to offer finished split information, this should not be called
[ https://issues.apache.org/jira/browse/FLINK-35859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900028#comment-17900028 ] Xin Gong commented on FLINK-35859: -- [~loserwang1024] Users cannot immediately perceive task issues, so maybe we can make the fix more complete. I added a flag that triggers a restart when the status is NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED, so that newly added tables will be synchronized. {code:java} /** Assigner for snapshot split. */ public class SnapshotSplitAssigner implements SplitAssigner { private static final Logger LOG = LoggerFactory.getLogger(SnapshotSplitAssigner.class); private boolean flagExceptionAssignerStatusWhenCheckpoint; private void captureNewlyAddedTables() { if (sourceConfig.isScanNewlyAddedTableEnabled() && AssignerStatus.isAssigningFinished(assignerStatus)) { // ... existing newly-added-table capture logic ... } else if (AssignerStatus.isNewlyAddedAssigningSnapshotFinished(assignerStatus)) { // Remember that a checkpoint was taken while the assigner was in the intermediate status. flagExceptionAssignerStatusWhenCheckpoint = true; LOG.info("Set flagExceptionAssignerStatusWhenCheckpoint to true"); } } @Override public void notifyCheckpointComplete(long checkpointId) { if (AssignerStatus.isNewlyAddedAssigningFinished(assignerStatus) && flagExceptionAssignerStatusWhenCheckpoint) { throw new FlinkRuntimeException("The previous assigner status was NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED and " + "the newly added table would keep failing the task on checkpoint, so we " + "trigger a restart for the newly added table after the assigner returns to the normal status"); } } } {code} > [flink-cdc] Fix: The assigner is not ready to offer finished split > information, this should not be called > - > > Key: FLINK-35859 > URL: https://issues.apache.org/jira/browse/FLINK-35859 > Project: Flink > Issue Type: Bug > Components: Flink CDC >Affects Versions: cdc-3.1.1 >Reporter: Hongshun Wang >Assignee: Hongshun Wang >Priority: Minor > Fix For: cdc-3.2.0 > > > When using CDC with a newly added table, an error occurs: > {code:java} > The assigner is not ready to offer finished split information, this should > not be called. {code} > It's because: > 1. When the job is stopped and then restarted, the status is > NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED. > > 2. The Enumerator then sends each reader a > BinlogSplitUpdateRequestEvent to update the binlog split (see > org.apache.flink.cdc.connectors.mysql.source.enumerator.MySqlSourceEnumerator#syncWithReaders). > 3. The Reader suspends the binlog reader and then sends a > BinlogSplitMetaRequestEvent to the Enumerator. > 4. The Enumerator finds that the finished split info of some tables has not > been collected, so an error occurs: > {code:java} > private void sendBinlogMeta(int subTask, BinlogSplitMetaRequestEvent > requestEvent) { > // initialize once > if (binlogSplitMeta == null) { > final List<FinishedSnapshotSplitInfo> finishedSnapshotSplitInfos = > splitAssigner.getFinishedSplitInfos(); > if (finishedSnapshotSplitInfos.isEmpty()) { > LOG.error( > "The assigner offers empty finished split information, > this should not happen"); > throw new FlinkRuntimeException( > "The assigner offers empty finished split information, > this should not happen"); > } > binlogSplitMeta = > Lists.partition( > finishedSnapshotSplitInfos, > sourceConfig.getSplitMetaGroupSize()); > } > }{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-36535) Optimize the scale down logic based on historical parallelism
[ https://issues.apache.org/jira/browse/FLINK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900029#comment-17900029 ] Gyula Fora commented on FLINK-36535: I think this would be a nice improvement (y) > Optimize the scale down logic based on historical parallelism > - > > Key: FLINK-36535 > URL: https://issues.apache.org/jira/browse/FLINK-36535 > Project: Flink > Issue Type: Improvement > Components: Autoscaler >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > > This is a follow-up to FLINK-36018. FLINK-36018 supported the lazy scale > down to avoid frequent rescaling. > h1. Proposed Change > Treat scale-down.interval as a window: > * Record the scale down trigger time when the recommended parallelism < > current parallelism > ** When the recommended parallelism >= current parallelism, cancel the > triggered scale down > * The scale down will be executed when currentTime - triggerTime > > scale-down.interval > ** {color:#de350b}Change1{color}: Use the maximum parallelism within the > window instead of the latest parallelism when scaling down. > * {color:#de350b}Change2{color}: Never scale down when currentTime - > triggerTime < scale-down.interval > ** In FLINK-36018, the scale down may be executed when currentTime - > triggerTime < scale-down.interval. > ** For example: taskA may be scaled down when taskB needs to scale up. > h1. Background > Some critical Flink jobs need to scale up in time, but only scale down on a > daily basis. In other words, Flink users do not want Flink jobs to be scaled > down multiple times within 24 hours, and the jobs run at the same parallelism as > during the peak hours of each day. > Note: Users hope the scale down only happens when the parallelism during peak > hours still wastes resources. This is a trade-off between downtime and > resource waste for a critical job. > h1. Current solution > In general, this requirement could be met after setting > {color:#de350b}job.autoscaler.scale-down.interval = 24 hours{color}. Suppose taskA runs with > parallelism 100, and the recommended parallelism is 100 during the peak hours of each > day. We hope taskA never rescales, because the triggered scale down > will be canceled once the recommended parallelism >= current parallelism > within 24 hours (this is exactly what FLINK-36018 does). > h1. Unexpected Scenario & how to solve? > But I found that the critical production job is still rescaled about 10 times > every day (when scale-down.interval is set to 24 hours). > Root cause: There may be many sources in a job, and the traffic peaks of > these sources may occur at different times. When taskA triggers a scale down, > the scale down of taskA will not be actively executed within 24 hours, but it > may be executed when other tasks are scaled up. > For example: > * The scale down of sourceB and sourceC may be executed when SourceA scales > up. > * After a while, the scale down of sourceA and sourceC may be executed when > SourceB scales up. > * After a while, the scale down of sourceA and sourceB may be executed when > SourceC scales up. > * When there are many tasks, the above 3 steps will be executed repeatedly. > That's why the job is rescaled about 10 times every day. > {color:#de350b}Change2{color} of the proposed change could solve this issue: > Never scale down when currentTime - triggerTime < scale-down.interval. > > {color:#de350b}Change1{color}: Use the maximum parallelism within the > window instead of the latest parallelism when scaling down.
> * It can ensure that the parallelism after scaling down is the parallelism > at yesterday's peak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
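For reference, a minimal, illustrative sketch of the windowed scale-down decision described in the proposal above. The class, field, and method names are hypothetical and not the actual autoscaler code; it only captures Change1 (use the maximum recommendation observed inside the window) and Change2 (never scale down before the window elapses). {code:java}
import java.time.Duration;
import java.time.Instant;

/** Hypothetical per-vertex tracker for the proposed scale-down window. */
class DelayedScaleDownWindow {

    private Instant triggerTime;            // when the first scale-down recommendation arrived
    private int maxRecommendedParallelism;  // maximum recommendation seen inside the window

    /** Returns the parallelism to apply now, or -1 if no rescale should happen yet. */
    int onRecommendation(
            int currentParallelism, int recommendedParallelism, Instant now, Duration scaleDownInterval) {
        if (recommendedParallelism >= currentParallelism) {
            // Scale ups are applied immediately; any pending scale down is canceled (FLINK-36018 behaviour).
            triggerTime = null;
            return recommendedParallelism > currentParallelism ? recommendedParallelism : -1;
        }
        if (triggerTime == null) {
            // Open the window on the first scale-down recommendation.
            triggerTime = now;
            maxRecommendedParallelism = recommendedParallelism;
        } else {
            // Change1: remember the maximum parallelism recommended within the window.
            maxRecommendedParallelism = Math.max(maxRecommendedParallelism, recommendedParallelism);
        }
        // Change2: never scale down before the window has fully elapsed,
        // even if another vertex triggers a rescale in the meantime.
        if (Duration.between(triggerTime, now).compareTo(scaleDownInterval) < 0) {
            return -1;
        }
        triggerTime = null;
        return maxRecommendedParallelism; // scale down to the peak of the window, not the latest value
    }
}
{code}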
[jira] [Updated] (FLINK-36535) Optimize the scale down logic based on historical parallelism
[ https://issues.apache.org/jira/browse/FLINK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-36535: Description: This is a follow-up to FLINK-36018. FLINK-36018 supported the lazy scale down to avoid frequent rescaling. h1. Proposed Change Treat scale-down.interval as a window: * Record the scale down trigger time when the recommended parallelism < current parallelism ** When the recommended parallelism >= current parallelism, cancel the triggered scale down * The scale down will be executed when currentTime - triggerTime > scale-down.interval ** {color:#de350b}Change1{color}: Use the maximum parallelism within the window instead of the latest parallelism when scaling down. * {color:#de350b}Change2{color}: Never scale down when currentTime - triggerTime < scale-down.interval ** In FLINK-36018, the scale down may be executed when currentTime - triggerTime < scale-down.interval. ** For example: taskA may be scaled down when taskB needs to scale up. h1. Background Some critical Flink jobs need to scale up in time, but only scale down on a daily basis. In other words, Flink users do not want Flink jobs to be scaled down multiple times within 24 hours, and the jobs run at the same parallelism as during the peak hours of each day. Note: Users hope the scale down only happens when the parallelism during peak hours is still a waste of resources. This is a trade-off between downtime and resource waste for a critical job. h1. Current solution In general, this requirement could be met after setting {color:#de350b}job.autoscaler.scale-down.interval = 24 hours{color}. Suppose taskA runs with parallelism 100, and the recommended parallelism is 100 during the peak hours of each day. We hope taskA never rescales, because the triggered scale down will be canceled once the recommended parallelism >= current parallelism within 24 hours (this is exactly what FLINK-36018 does). h1. Unexpected Scenario & how to solve? But I found that the critical production job is still rescaled about 10 times every day (when scale-down.interval is set to 24 hours). Root cause: There may be many sources in a job, and the traffic peaks of these sources may occur at different times. When taskA triggers a scale down, the scale down of taskA will not be actively executed within 24 hours, but it may be executed when other tasks are scaled up. For example: * The scale down of sourceB and sourceC may be executed when SourceA scales up. * After a while, the scale down of sourceA and sourceC may be executed when SourceB scales up. * After a while, the scale down of sourceA and sourceB may be executed when SourceC scales up. * When there are many tasks, the above 3 steps will be executed repeatedly. That's why the job is rescaled about 10 times every day. {color:#de350b}Change2{color} of the proposed change could solve this issue: Never scale down when currentTime - triggerTime < scale-down.interval. {color:#de350b}Change1{color}: Use the maximum parallelism within the window instead of the latest parallelism when scaling down. * It can ensure that the parallelism after scaling down is the parallelism at yesterday's peak. was: This is a follow-up to FLINK-36018. FLINK-36018 supported the lazy scale down to avoid frequent rescaling. h1. Background Some critical Flink jobs need to scale up in time, but only scale down on a daily basis.
In other words, Flink users do not want Flink jobs to be scaled down multiple times within 24 hours, and the jobs run at the same parallelism as during the peak hours of each day. Note: Users hope the scale down only happens when the parallelism during peak hours is still a waste of resources. This is a trade-off between downtime and resource waste for a critical job. h1. Current solution In general, this requirement could be met after setting {color:#de350b}job.autoscaler.scale-down.interval = 24 hours{color}. For example, vertex1 runs with parallelism=100, and the following is the parallelism that the autoscaler recommends for vertex1: * 100 (2024-10-13 20:00:00, peak hour) * 90 (2024-10-13 21:00:00, trigger delayed scale down) * 80 (2024-10-13 22:00:00) * 70 (2024-10-14 00:00:00) * 60 (2024-10-14 01:00:00) * 50 (2024-10-14 02:00:00) * 40 (2024-10-14 04:00:00) * 50 (2024-10-14 06:00:00) * 60 (2024-10-14 08:00:00) * ... * 90 (2024-10-14 19:00:00) * 100 (2024-10-14 20:00:00, peak hour, the delayed scale down is canceled) All recommended parallelisms are delayed, and the recommended parallelism goes back to 100 within 24 hours, so the scale down request is canceled. It means that if the recommended parallelism for vertex1 during peak hours is 100 every day, vertex1 will never be scaled down or scaled up. This is very friendly to critical jobs, and reducing the rescale frequency can greatly reduce the downtime. h1. Some scenarios do no
[jira] [Updated] (FLINK-36535) Optimize the scale down logic based on historical parallelism
[ https://issues.apache.org/jira/browse/FLINK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-36535: Description: This is a follow-up to FLINK-36018. FLINK-36018 supported the lazy scale down to avoid frequent rescaling. h1. Proposed Change Treat scale-down.interval as a window: * Record the scale down trigger time when the recommended parallelism < current parallelism ** When the recommended parallelism >= current parallelism, cancel the triggered scale down * The scale down will be executed when currentTime - triggerTime > scale-down.interval ** {color:#de350b}Change1{color}: Use the maximum parallelism within the window instead of the latest parallelism when scaling down. * {color:#de350b}Change2{color}: Never scale down when currentTime - triggerTime < scale-down.interval ** In FLINK-36018, the scale down may be executed when currentTime - triggerTime < scale-down.interval. ** For example: taskA may be scaled down when taskB needs to scale up. h1. Background Some critical Flink jobs need to scale up in time, but only scale down on a daily basis. In other words, Flink users do not want Flink jobs to be scaled down multiple times within 24 hours, and the jobs run at the same parallelism as during the peak hours of each day. Note: Users hope the scale down only happens when the parallelism during peak hours still wastes resources. This is a trade-off between downtime and resource waste for a critical job. h1. Current solution In general, this requirement could be met after setting {color:#de350b}job.autoscaler.scale-down.interval = 24 hours{color}. Suppose taskA runs with parallelism 100, and the recommended parallelism is 100 during the peak hours of each day. We hope taskA never rescales, because the triggered scale down will be canceled once the recommended parallelism >= current parallelism within 24 hours (this is exactly what FLINK-36018 does). h1. Unexpected Scenario & how to solve? But I found that the critical production job is still rescaled about 10 times every day (when scale-down.interval is set to 24 hours). Root cause: There may be many sources in a job, and the traffic peaks of these sources may occur at different times. When taskA triggers a scale down, the scale down of taskA will not be actively executed within 24 hours, but it may be executed when other tasks are scaled up. For example: * The scale down of sourceB and sourceC may be executed when SourceA scales up. * After a while, the scale down of sourceA and sourceC may be executed when SourceB scales up. * After a while, the scale down of sourceA and sourceB may be executed when SourceC scales up. * When there are many tasks, the above 3 steps will be executed repeatedly. That's why the job is rescaled about 10 times every day. {color:#de350b}Change2{color} of the proposed change could solve this issue: Never scale down when currentTime - triggerTime < scale-down.interval. {color:#de350b}Change1{color}: Use the maximum parallelism within the window instead of the latest parallelism when scaling down. * It can ensure that the parallelism after scaling down is the parallelism at yesterday's peak. was: This is a follow-up to FLINK-36018. FLINK-36018 supported the lazy scale down to avoid frequent rescaling. h1.
Proposed Change Treat scale-down.interval as a window: * Record the scale down trigger time when the recommended parallelism < current parallelism ** When the recommended parallelism >= current parallelism, cancel the triggered scale down * The scale down will be executed when currentTime - triggerTime > scale-down.interval ** {color:#de350b}Change1{color}: Use the maximum parallelism within the window instead of the latest parallelism when scaling down. * {color:#de350b}Change2{color}: Never scale down when currentTime - triggerTime < scale-down.interval ** In FLINK-36018, the scale down may be executed when currentTime - triggerTime < scale-down.interval. ** For example: taskA may be scaled down when taskB needs to scale up. h1. Background Some critical Flink jobs need to scale up in time, but only scale down on a daily basis. In other words, Flink users do not want Flink jobs to be scaled down multiple times within 24 hours, and the jobs run at the same parallelism as during the peak hours of each day. Note: Users hope the scale down only happens when the parallelism during peak hours is still a waste of resources. This is a trade-off between downtime and resource waste for a critical job. h1. Current solution In general, this requirement could be met after setting {color:#de350b}job.autoscaler.scale-down.interval = 24 hours{color}. Suppose taskA runs with parallelism 100, and the recommended parallelism is 100 during the peak hours of each day. We hope taskA never rescales, because the triggered scale down will be canceled once the recomm
[jira] [Commented] (FLINK-36535) Optimize the scale down logic based on historical parallelism
[ https://issues.apache.org/jira/browse/FLINK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900024#comment-17900024 ] Rui Fan commented on FLINK-36535: - cc [~gyfora] [~mxm] > Optimize the scale down logic based on historical parallelism > - > > Key: FLINK-36535 > URL: https://issues.apache.org/jira/browse/FLINK-36535 > Project: Flink > Issue Type: Improvement > Components: Autoscaler >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > > This is a follow-up to FLINK-36018. FLINK-36018 supported the lazy scale > down to avoid frequent rescaling. > h1. Proposed Change > Treat scale-down.interval as a window: > * Record the scale down trigger time when the recommended parallelism < > current parallelism > ** When the recommended parallelism >= current parallelism, cancel the > triggered scale down > * The scale down will be executed when currentTime - triggerTime > > scale-down.interval > ** {color:#de350b}Change1{color}: Use the maximum parallelism within the > window instead of the latest parallelism when scaling down. > * {color:#de350b}Change2{color}: Never scale down when currentTime - > triggerTime < scale-down.interval > ** In FLINK-36018, the scale down may be executed when currentTime - > triggerTime < scale-down.interval. > ** For example: taskA may be scaled down when taskB needs to scale up. > h1. Background > Some critical Flink jobs need to scale up in time, but only scale down on a > daily basis. In other words, Flink users do not want Flink jobs to be scaled > down multiple times within 24 hours, and the jobs run at the same parallelism as > during the peak hours of each day. > Note: Users hope the scale down only happens when the parallelism during peak > hours still wastes resources. This is a trade-off between downtime and > resource waste for a critical job. > h1. Current solution > In general, this requirement could be met after setting > {color:#de350b}job.autoscaler.scale-down.interval = 24 hours{color}. Suppose taskA runs with > parallelism 100, and the recommended parallelism is 100 during the peak hours of each > day. We hope taskA never rescales, because the triggered scale down > will be canceled once the recommended parallelism >= current parallelism > within 24 hours (this is exactly what FLINK-36018 does). > h1. Unexpected Scenario & how to solve? > But I found that the critical production job is still rescaled about 10 times > every day (when scale-down.interval is set to 24 hours). > Root cause: There may be many sources in a job, and the traffic peaks of > these sources may occur at different times. When taskA triggers a scale down, > the scale down of taskA will not be actively executed within 24 hours, but it > may be executed when other tasks are scaled up. > For example: > * The scale down of sourceB and sourceC may be executed when SourceA scales > up. > * After a while, the scale down of sourceA and sourceC may be executed when > SourceB scales up. > * After a while, the scale down of sourceA and sourceB may be executed when > SourceC scales up. > * When there are many tasks, the above 3 steps will be executed repeatedly. > That's why the job is rescaled about 10 times every day. > {color:#de350b}Change2{color} of the proposed change could solve this issue: > Never scale down when currentTime - triggerTime < scale-down.interval. > > {color:#de350b}Change1{color}: Use the maximum parallelism within the > window instead of the latest parallelism when scaling down.
> * It can ensure that the parallelism after scaling down is the parallelism > at yesterday's peak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36743) Rescale from unaligend checkpoint failed
[ https://issues.apache.org/jira/browse/FLINK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feifan Wang updated FLINK-36743: Attachment: image-2024-11-21-20-20-20-536.png > Rescale from unaligend checkpoint failed > > > Key: FLINK-36743 > URL: https://issues.apache.org/jira/browse/FLINK-36743 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Reporter: Feifan Wang >Priority: Major > Attachments: > Allow-user-to-set-whether-restore-forward-rescale-broadcast-from-unaligned-checkpoint-with-parallelism-change.patch, > image-2024-11-19-14-58-22-975.png, image-2024-11-19-17-27-55-387.png, > image-2024-11-19-17-30-14-816.png, image-2024-11-21-20-20-20-536.png, > image-2024-11-21-20-20-41-644.png > > > We encountered the following exception when scaling down a job from 5600 to > 4200: > {code:java} > 2024-11-12 19:20:54,308 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: > xx (1358/1400) > (80ea0855521cb3249d011e3166823e47_56a38c81905da002db3a9d8f9d395f2b_1357_0) > switched from RUNNING to FAILED on > container_e33_1725519807238_6894116_01_000825 @ yg- > java.lang.IllegalStateException: Cannot select > SubtaskConnectionDescriptor{inputSubtaskIndex=0, outputSubtaskIndex=4071}; > known channels are [SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=0}, SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=4200}] > at > org.apache.flink.streaming.runtime.io.recovery.DemultiplexingRecordDeserializer.select(DemultiplexingRecordDeserializer.java:121) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.recovery.RescalingStreamTaskNetworkInput.processEvent(RescalingStreamTaskNetworkInput.java:181) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:937) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:916) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:730) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312] {code} > * Flink version : 1.16.1 > * unaligned checkpoint : enabled > 
* log-based checkpoint : enabled > The exception was encountered when restoring from chk-2718336, while the job can > successfully restore from chk-2718333. I checked the metadata files of > chk-2718336 and chk-2718333, and both of them have in-flight data. It looks like > there is something wrong with how the unaligned checkpoint reassigns > in-flight data. Could you please take a look? [~arvid] , [~pnowojski] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-36743) Rescale from unaligend checkpoint failed
[ https://issues.apache.org/jira/browse/FLINK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900025#comment-17900025 ] Feifan Wang commented on FLINK-36743: - Thanks [~arvid] for helping investigate the issue. I cherry-picked the fix of FLINK-31963 and tested it, but it doesn't work. The exception is the same. !image-2024-11-21-20-20-20-536.png|width=1566,height=453! !image-2024-11-21-20-20-41-644.png|width=981,height=226! So I am reopening the issue. > Rescale from unaligend checkpoint failed > > > Key: FLINK-36743 > URL: https://issues.apache.org/jira/browse/FLINK-36743 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Reporter: Feifan Wang >Priority: Major > Attachments: > Allow-user-to-set-whether-restore-forward-rescale-broadcast-from-unaligned-checkpoint-with-parallelism-change.patch, > image-2024-11-19-14-58-22-975.png, image-2024-11-19-17-27-55-387.png, > image-2024-11-19-17-30-14-816.png, image-2024-11-21-20-20-20-536.png, > image-2024-11-21-20-20-41-644.png > > > We encountered the following exception when scaling down a job from 5600 to > 4200: > {code:java} > 2024-11-12 19:20:54,308 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: > xx (1358/1400) > (80ea0855521cb3249d011e3166823e47_56a38c81905da002db3a9d8f9d395f2b_1357_0) > switched from RUNNING to FAILED on > container_e33_1725519807238_6894116_01_000825 @ yg- > java.lang.IllegalStateException: Cannot select > SubtaskConnectionDescriptor{inputSubtaskIndex=0, outputSubtaskIndex=4071}; > known channels are [SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=0}, SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=4200}] > at > org.apache.flink.streaming.runtime.io.recovery.DemultiplexingRecordDeserializer.select(DemultiplexingRecordDeserializer.java:121) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.recovery.RescalingStreamTaskNetworkInput.processEvent(RescalingStreamTaskNetworkInput.java:181) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:937) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:916) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:730) >
~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312] {code} > * Flink version : 1.16.1 > * unaligned checkpoint : enabled > * log-based checkpoint : enabled > The exception was encountered when restoring from chk-2718336, while the job can > successfully restore from chk-2718333. I checked the metadata files of > chk-2718336 and chk-2718333, and both of them have in-flight data. It looks like > there is something wrong with how the unaligned checkpoint reassigns > in-flight data. Could you please take a look? [~arvid] , [~pnowojski] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (FLINK-36743) Rescale from unaligend checkpoint failed
[ https://issues.apache.org/jira/browse/FLINK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feifan Wang reopened FLINK-36743: - > Rescale from unaligend checkpoint failed > > > Key: FLINK-36743 > URL: https://issues.apache.org/jira/browse/FLINK-36743 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Reporter: Feifan Wang >Priority: Major > Attachments: > Allow-user-to-set-whether-restore-forward-rescale-broadcast-from-unaligned-checkpoint-with-parallelism-change.patch, > image-2024-11-19-14-58-22-975.png, image-2024-11-19-17-27-55-387.png, > image-2024-11-19-17-30-14-816.png, image-2024-11-21-20-20-20-536.png, > image-2024-11-21-20-20-41-644.png > > > We encountered the following exception when scaling down a job from 5600 to > 4200: > {code:java} > 2024-11-12 19:20:54,308 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: > xx (1358/1400) > (80ea0855521cb3249d011e3166823e47_56a38c81905da002db3a9d8f9d395f2b_1357_0) > switched from RUNNING to FAILED on > container_e33_1725519807238_6894116_01_000825 @ yg- > java.lang.IllegalStateException: Cannot select > SubtaskConnectionDescriptor{inputSubtaskIndex=0, outputSubtaskIndex=4071}; > known channels are [SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=0}, SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=4200}] > at > org.apache.flink.streaming.runtime.io.recovery.DemultiplexingRecordDeserializer.select(DemultiplexingRecordDeserializer.java:121) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.recovery.RescalingStreamTaskNetworkInput.processEvent(RescalingStreamTaskNetworkInput.java:181) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:937) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:916) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:730) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312] {code} > * Flink version : 1.16.1 > * unaligned checkpoint : enabled > * log-based checkpoint : enabled > The 
exception was encountered when restoring from chk-2718336, while the job can > successfully restore from chk-2718333. I checked the metadata files of > chk-2718336 and chk-2718333, and both of them have in-flight data. It looks like > there is something wrong with how the unaligned checkpoint reassigns > in-flight data. Could you please take a look? [~arvid] , [~pnowojski] -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36739] [WebFrontend] Update the NodeJS to v22.11.0 (LTS) [flink]
davidradl commented on PR #25670: URL: https://github.com/apache/flink/pull/25670#issuecomment-2490393427 Reviewed by Chi on 21/11/24. Asked submitter questions @mehdid93 Looks good - but why are the CI tests failing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36745][docs] Make FixedSizeSplitFetcherManager and FixedFetcherSizeSourceReader examples match Flink 2.0-preview API [flink]
davidradl commented on PR #25667: URL: https://github.com/apache/flink/pull/25667#issuecomment-2490395429 Reviewed by Chi on 21/11/24 Need a committer to review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36745][docs] Make FixedSizeSplitFetcherManager and FixedFetcherSizeSourceReader examples match Flink 2.0-preview API [flink]
davidradl commented on PR #25667: URL: https://github.com/apache/flink/pull/25667#issuecomment-2490395974 Reviewed by Chi on 21/11/24 Need a committer to review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [hotfix] [docs] Update table environment variable name in common.md [flink]
davidradl commented on PR #25669: URL: https://github.com/apache/flink/pull/25669#issuecomment-2490393823 Reviewed by Chi on 21/11/24. Asked submitter questions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Event latency does not equal window triggering [flink]
davidradl commented on PR #25664: URL: https://github.com/apache/flink/pull/25664#issuecomment-2490402911 Reviewed by Chi on 21/11/24. Approve - looking for committer to merge. Notice the tests are failing - but this is a docs change! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36704] Update TypeInference with StaticArgument and StateTypeStrategy [flink]
davidradl commented on PR #25665: URL: https://github.com/apache/flink/pull/25665#issuecomment-2490399670 Reviewed by Chi on 21/11/24 Need a committer / subject area expert to review. Notice that the Tests are failing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36066][runtime] Introducing the AdaptiveGraphManager component [flink]
noorall commented on code in PR #25414: URL: https://github.com/apache/flink/pull/25414#discussion_r1851601660 ## flink-runtime/src/main/java/org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java: ## @@ -1969,8 +2032,12 @@ private static void setManagedMemoryFractionForSlotSharingGroup( final StreamGraph streamGraph = jobVertexBuildContext.getStreamGraph(); final Map> vertexChainedConfigs = jobVertexBuildContext.getChainedConfigs(); -final Set groupOperatorIds = +final Set jobVertexIds = slotSharingGroup.getJobVertexIds().stream() +.filter(vertexOperators::containsKey) Review Comment: > In which case the `slotSharingGroup` will contain a job vertex which is not included in the `vertexOperators`? Could you add some comments to explain it as it may not be obvious to other developers. In the progressive job graph generation algorithm, if the user specified the `SlotSharingGroupResource` or the `AllVerticesInSameSlotSharingGroupByDefault` is set to true, job vertices generated in different phases may be assigned to the same `slotSharingGroup`. Therefore, we need to keep only the job vertices that belong to the current phase. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] fix:Event latency does not equal window triggering [flink]
davidradl commented on PR #25663: URL: https://github.com/apache/flink/pull/25663#issuecomment-2490406949 Reviewed by Chi on 21/11/24 Close if duplicate -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36455] Sinks retry synchronously [1.20] [flink]
davidradl commented on PR #25661: URL: https://github.com/apache/flink/pull/25661#issuecomment-2490409340 Reviewed by Chi on 21/11/24. Looks in hand, code conflicts and test failures currently -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36379] Improve (Global)Committer with UC disabled [1.20] [flink]
davidradl commented on PR #25660: URL: https://github.com/apache/flink/pull/25660#issuecomment-2490412290 Reviewed by Chi on 21/11/24. Looks in hand, test failures currently -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Add release announcement for Flink CDC 3.2.1 [flink-web]
ruanhang1993 opened a new pull request, #764: URL: https://github.com/apache/flink-web/pull/764 This PR adds release announcement for Flink CDC 3.2.1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36764) Add checkpoint type to checkpoint trace
[ https://issues.apache.org/jira/browse/FLINK-36764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36764: --- Labels: pull-request-available (was: ) > Add checkpoint type to checkpoint trace > --- > > Key: FLINK-36764 > URL: https://issues.apache.org/jira/browse/FLINK-36764 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / Metrics >Reporter: Piotr Nowojski >Assignee: Piotr Nowojski >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > > Currently it's impossible to distinguish checkpoints from savepoints. Also it > would be handy to distinguish aligned and unaligned checkpoints. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36704] Update TypeInference with StaticArgument and StateTypeStrategy [flink]
davidradl commented on PR #25665: URL: https://github.com/apache/flink/pull/25665#issuecomment-2490397302 Reviewed by Chi on 21/11/24 Need a committer/ subject area expert to review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-35825][hive] HiveTableSource supports report statistics for text file [flink]
reswqa commented on PR #25078: URL: https://github.com/apache/flink/pull/25078#issuecomment-2490379694 Thanks @xuyangzhong for the review, updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-28897] [TABLE-SQL] Fail to use udf in added jar when enabling checkpoint [flink]
davidradl commented on code in PR #25656: URL: https://github.com/apache/flink/pull/25656#discussion_r1851611252 ## flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/api/internal/TableEnvironmentImpl.java: ## @@ -1029,6 +1029,8 @@ private TableResultInternal executeInternal( defaultJobName, jobStatusHookList); try { +ClassLoader userClassLoader = Thread.currentThread().getContextClassLoader(); Review Comment: Please add comments, refer to the v2 implementation in the comments, and note that the v2 refactor is not going to be backported to 1.20. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-28897] [TABLE-SQL] Fail to use udf in added jar when enabling checkpoint [flink]
davidradl commented on code in PR #25656: URL: https://github.com/apache/flink/pull/25656#discussion_r1851611671 ## flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/api/internal/TableEnvironmentImpl.java: ## @@ -1069,8 +1072,11 @@ private TableResultInternal executeQueryOperation( Pipeline pipeline = generatePipelineFromQueryOperation(operation, transformations); try { +ClassLoader userClassLoader = Thread.currentThread().getContextClassLoader(); Review Comment: please add unit tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-16077][docs] Translate "Custom State Serialization" page into Chinese [flink]
davidradl commented on PR #25648: URL: https://github.com/apache/flink/pull/25648#issuecomment-2490427387 Reviewed by Chi on 21/11/24. Unable to review the translation. Notice the tests are failing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-36764) Add checkpoint type to checkpoint trace
Piotr Nowojski created FLINK-36764: -- Summary: Add checkpoint type to checkpoint trace Key: FLINK-36764 URL: https://issues.apache.org/jira/browse/FLINK-36764 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing, Runtime / Metrics Reporter: Piotr Nowojski Fix For: 2.0.0 Currently it's impossible to distinguish checkpoints from savepoints. Also it would be handy to distinguish aligned and unaligned checkpoints. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-28897] [TABLE-SQL] Fail to use udf in added jar when enabling checkpoint [flink]
davidradl commented on code in PR #25656: URL: https://github.com/apache/flink/pull/25656#discussion_r1851610106 ## flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/api/internal/TableEnvironmentImpl.java: ## @@ -1029,6 +1029,8 @@ private TableResultInternal executeInternal( defaultJobName, jobStatusHookList); try { +ClassLoader userClassLoader = Thread.currentThread().getContextClassLoader(); Review Comment: please could you change the variable name to be something like originalContextClassLoader -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
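For context, the review comments above are about saving and restoring the thread context ClassLoader around job submission. A minimal sketch of that pattern follows; it is illustrative only and not the code in this PR, the helper class name is hypothetical, and the variable name follows the review suggestion. {code:java}
import java.util.concurrent.Callable;

public final class ContextClassLoaderExample {

    /** Runs the action with the given user ClassLoader and always restores the original one. */
    public static <T> T runWithUserClassLoader(ClassLoader userClassLoader, Callable<T> action) throws Exception {
        final Thread thread = Thread.currentThread();
        final ClassLoader originalContextClassLoader = thread.getContextClassLoader();
        try {
            thread.setContextClassLoader(userClassLoader);
            return action.call();
        } finally {
            // Restore the original loader even if the action throws,
            // so later operations on this thread are not affected.
            thread.setContextClassLoader(originalContextClassLoader);
        }
    }
}
{code}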
Re: [PR] [FLINK-28897] [TABLE-SQL] Fail to use udf in added jar when enabling checkpoint [flink]
davidradl commented on PR #25656: URL: https://github.com/apache/flink/pull/25656#issuecomment-2490421787 Reviewed by Chi on 21/11/24. Asked submitter questions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Upgrade com.squareup.okio:okio [flink]
davidradl commented on PR #25649: URL: https://github.com/apache/flink/pull/25649#issuecomment-2490425525 Reviewed by Chi on 21/11/24. Asked submitter questions. Please could you raise a Jira detailing the reason you want to upgrade this component (e.g. is there a particular bug that this would fix)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-35966][runtime] Introduce the TASKS for TaskManagerLoadBalanceMode enum [flink]
davidradl commented on code in PR #25647: URL: https://github.com/apache/flink/pull/25647#discussion_r1851618895 ## flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOptions.java: ## @@ -707,6 +707,12 @@ public class TaskManagerOptions { "The %s mode tries to spread out the slots evenly across all available %s.", code(TaskManagerLoadBalanceMode.SLOTS.name()), code("TaskManagers")), +text( Review Comment: NOT: its' -> it's ## flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOptions.java: ## @@ -707,6 +707,12 @@ public class TaskManagerOptions { "The %s mode tries to spread out the slots evenly across all available %s.", code(TaskManagerLoadBalanceMode.SLOTS.name()), code("TaskManagers")), +text( Review Comment: NIT: its' -> it's -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (FLINK-36764) Add checkpoint type to checkpoint trace
[ https://issues.apache.org/jira/browse/FLINK-36764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Nowojski reassigned FLINK-36764: -- Assignee: Piotr Nowojski > Add checkpoint type to checkpoint trace > --- > > Key: FLINK-36764 > URL: https://issues.apache.org/jira/browse/FLINK-36764 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / Metrics >Reporter: Piotr Nowojski >Assignee: Piotr Nowojski >Priority: Major > Fix For: 2.0.0 > > > Currently it's impossible to distinguish checkpoints from savepoints. Also it > would be handy to distinguish aligned and unaligned checkpoints. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-35966][runtime] Introduce the TASKS for TaskManagerLoadBalanceMode enum [flink]
davidradl commented on code in PR #25647: URL: https://github.com/apache/flink/pull/25647#discussion_r1851620823 ## flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOptions.java: ## @@ -707,6 +707,12 @@ public class TaskManagerOptions { "The %s mode tries to spread out the slots evenly across all available %s.", code(TaskManagerLoadBalanceMode.SLOTS.name()), code("TaskManagers")), +text( +"The %s mode tries to schedule evenly all tasks based on its' number across all available %s. " Review Comment: I don't understand this sentence. I am not sure what "schedule evenly all tasks based on its' number" means. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36066][runtime] Introducing the AdaptiveGraphManager component [flink]
zhuzhurk commented on code in PR #25414: URL: https://github.com/apache/flink/pull/25414#discussion_r1851265617 ## flink-runtime/src/main/java/org/apache/flink/streaming/api/graph/AdaptiveGraphManager.java: ## @@ -0,0 +1,697 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.streaming.api.graph; + +import org.apache.flink.annotation.Internal; +import org.apache.flink.api.common.ExecutionConfig; +import org.apache.flink.runtime.jobgraph.IntermediateDataSet; +import org.apache.flink.runtime.jobgraph.IntermediateDataSetID; +import org.apache.flink.runtime.jobgraph.JobGraph; +import org.apache.flink.runtime.jobgraph.JobVertex; +import org.apache.flink.runtime.jobgraph.JobVertexID; +import org.apache.flink.runtime.jobgraph.forwardgroup.ForwardGroupComputeUtil; +import org.apache.flink.runtime.jobgraph.forwardgroup.StreamNodeForwardGroup; +import org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup; +import org.apache.flink.streaming.api.graph.util.JobVertexBuildContext; +import org.apache.flink.streaming.api.graph.util.OperatorChainInfo; +import org.apache.flink.streaming.runtime.partitioner.ForwardPartitioner; +import org.apache.flink.util.Preconditions; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; +import java.util.Iterator; +import java.util.LinkedHashMap; +import java.util.LinkedHashSet; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.Set; +import java.util.TreeMap; +import java.util.concurrent.Executor; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.stream.Collectors; + +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.addVertexIndexPrefixInVertexName; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.connect; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createAndInitializeJobGraph; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createChain; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createSourceChainInfo; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.isChainable; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.isSourceChainable; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.markSupportingConcurrentExecutionAttempts; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.preValidate; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.serializeOperatorCoordinatorsAndStreamConfig; 
+import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setAllOperatorNonChainedOutputsConfigs; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setManagedMemoryFraction; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setPhysicalEdges; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setSlotSharingAndCoLocation; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setVertexDescription; +import static org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.validateHybridShuffleExecuteInBatchMode; + +/** Default implementation for {@link AdaptiveGraphGenerator}. */ +@Internal +public class AdaptiveGraphManager implements AdaptiveGraphGenerator { + +private final StreamGraph streamGraph; + +private final JobGraph jobGraph; + +private final StreamGraphHasher defaultStreamGraphHasher; + +private final List legacyStreamGraphHasher; + +private final Executor serializationExecutor; + +private final AtomicInteger vertexIndexId; + +private final StreamGraphContext streamGraphContext; + +private final Map hashes; + +private final List> legacyHashes; + +// Records the id of stream node which job vertex is created. +private final Map frozenNodeToStartNodeMap; + +// When the downstream vertex is not cr
Re: [PR] [FLINK-34545][cdc-pipeline-connector]Add OceanBase pipeline connector to Flink CDC [flink-cdc]
yuanoOo commented on PR #3360: URL: https://github.com/apache/flink-cdc/pull/3360#issuecomment-2490158237 @lvyanquan I fixed the above comments, please take a look again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (FLINK-36763) Support schema inference and evolution with single-table-mutliple-partition sources
[ https://issues.apache.org/jira/browse/FLINK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leonard Xu reassigned FLINK-36763: -- Assignee: yux > Support schema inference and evolution with single-table-mutliple-partition > sources > --- > > Key: FLINK-36763 > URL: https://issues.apache.org/jira/browse/FLINK-36763 > Project: Flink > Issue Type: Bug > Components: Flink CDC >Reporter: yux >Assignee: yux >Priority: Major > > The current schema evolution implementation implicitly assumes that there can't > be two partitions with the same TableID being consumed on different subTasks. > This is not always true for some weakly structured sources like Kafka and > MongoDB, where table schemas are not centralized and might evolve > independently in various parallelized tasks. > It's unlikely to be a trivial change, and might involve modifying most > parts of the pipeline implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36762) Add E2ECase to run sql client in application mode
Shengkai Fang created FLINK-36762: - Summary: Add E2ECase to run sql client in application mode Key: FLINK-36762 URL: https://issues.apache.org/jira/browse/FLINK-36762 Project: Flink Issue Type: Sub-task Components: Table SQL / Gateway Reporter: Shengkai Fang -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-35282][python] Upgrade Apache Beam > 2.54 [flink]
snuyanzin commented on PR #25541: URL: https://github.com/apache/flink/pull/25541#issuecomment-2490142517 @dianfu , @HuangXingBo could you please have a look here, I think you are more experienced with python in Flink -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-36760) Support to deploy script via sql client
Shengkai Fang created FLINK-36760: - Summary: Support to deploy script via sql client Key: FLINK-36760 URL: https://issues.apache.org/jira/browse/FLINK-36760 Project: Flink Issue Type: Sub-task Components: Table SQL / Client Reporter: Shengkai Fang -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-35966][runtime] Introduce the TASKS for TaskManagerLoadBalanceMode enum [flink]
davidradl commented on code in PR #25647: URL: https://github.com/apache/flink/pull/25647#discussion_r1851640555 ## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SlotSharingExecutionSlotAllocatorFactory.java: ## @@ -40,13 +41,16 @@ public SlotSharingExecutionSlotAllocatorFactory( PhysicalSlotProvider slotProvider, boolean slotWillBeOccupiedIndefinitely, PhysicalSlotRequestBulkChecker bulkChecker, -Duration allocationTimeout) { +Duration allocationTimeout, +TaskManagerOptions.TaskManagerLoadBalanceMode taskManagerLoadBalanceMode) { this( slotProvider, slotWillBeOccupiedIndefinitely, bulkChecker, allocationTimeout, -new LocalInputPreferredSlotSharingStrategy.Factory()); +taskManagerLoadBalanceMode == TaskManagerOptions.TaskManagerLoadBalanceMode.TASKS +? new TaskBalancedPreferredSlotSharingStrategy.Factory() Review Comment: I see that specifying TASKS means this factory will be used to load balance. I do not see the works task number in the factory implementation. It seems to be trying to colocate tasks in a slot. I am not sure when this would be preferable or not. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-36688) table.optimizer.reuse-source-enabled may cause disordered metadata columns when reading from Kafka.
[ https://issues.apache.org/jira/browse/FLINK-36688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899969#comment-17899969 ] xuyang commented on FLINK-36688: You're right. There is a bug in ScanReuser. I'll fix it. > table.optimizer.reuse-source-enabled may cause disordered metadata columns > when reading from Kafka. > - > > Key: FLINK-36688 > URL: https://issues.apache.org/jira/browse/FLINK-36688 > Project: Flink > Issue Type: Bug > Components: Table SQL / Planner >Affects Versions: 1.20.0, 1.19.1, 2.0-preview >Reporter: Yanquan Lv >Priority: Major > > Metadata columns in Kafka need to maintain a fixed order: the metadata for > the format needs to be at the beginning, while the metadata for Kafka > itself (partition/offset and so on) needs to be at the end. The Kafka connector > adds the format's fields first, and then adds Kafka's own fields. > However, the reused source does not maintain this order, which may cause a > ClassCastException. > How to reproduce: > {code:java} > create temporary table `message_channel_task_record` > ( > origin_ts TIMESTAMP(3) METADATA FROM 'value.ingestion-timestamp' VIRTUAL > ,`partition` INT METADATA VIRTUAL > ,`offset` BIGINT METADATA VIRTUAL > ,id BIGINT comment '自增ID' > ,PRIMARY KEY (`id`) NOT ENFORCED > ) > with ( > 'connector'='kafka' > xxx > ) > ; > create temporary table `sink` > ( > origin_ts TIMESTAMP(3) > ,`partition` INT > ,`offset` BIGINT > ,id BIGINT > ) > WITH ( > 'connector'='print' > ) > ; > create temporary table `sr_sink` > ( > id BIGINT comment '自增ID' > ) > WITH ( > 'connector'='print' > ) > ; > -- EXPLAIN STATEMENT SET BEGIN > BEGIN STATEMENT SET; > INSERT INTO sink > SELECT > origin_ts > ,`partition` > ,`offset` > ,id > FROM message_channel_task_record > ; > INSERT INTO `sr_sink` > SELECT > id > FROM `message_channel_task_record` > ; > END > ; > {code} > Explained plan: > {code:java} > [558]:TableSourceScan(table=[[vvp, default, message_channel_task_record, > project=[id], metadata=[partition, value.ingestion-timestamp, offset]]], > fields=[id, partition, origin_ts, offset]) > :- [559]:Calc(select=[CAST(origin_ts AS TIMESTAMP(3)) AS origin_ts, > CAST(partition AS INTEGER) AS partition, CAST(offset AS BIGINT) AS offset, > id]) > : +- [560]:Sink(table=[vvp.default.sink], fields=[origin_ts, partition, > offset, id]) >+- [562]:Sink(table=[vvp.default.sr_sink], fields=[id]) {code} > Expected metadata column order is: value.ingestion-timestamp(format), > partition, offset; > The actual metadata column order is: partition, > value.ingestion-timestamp(format), offset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
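To illustrate the ordering convention described in FLINK-36688 above, here is a minimal, hypothetical Java sketch (the class name and metadata key list are invented for the example; this is not the ScanReuser fix): metadata keys contributed by the value format are expected to come before the Kafka connector's own metadata keys when the projected schema is rebuilt for a reused source.

{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of the ordering rule from FLINK-36688:
// metadata provided by the value format must precede the Kafka connector's
// own metadata (partition, offset, ...) in the projected schema.
public class MetadataOrderSketch {

    private static final List<String> CONNECTOR_METADATA =
            Arrays.asList("partition", "offset", "timestamp");

    // Format metadata (e.g. value.ingestion-timestamp) sorts before connector metadata;
    // the sort is stable, so the relative order inside each group is preserved.
    static Comparator<String> formatMetadataFirst() {
        return Comparator.comparingInt(key -> CONNECTOR_METADATA.contains(key) ? 1 : 0);
    }

    public static void main(String[] args) {
        // Order produced by the reused source in the explained plan above.
        List<String> reusedOrder =
                Arrays.asList("partition", "value.ingestion-timestamp", "offset");
        List<String> expectedOrder =
                reusedOrder.stream().sorted(formatMetadataFirst()).collect(Collectors.toList());
        // Prints [value.ingestion-timestamp, partition, offset], the order the connector expects.
        System.out.println(expectedOrder);
    }
}
{code}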
Re: [PR] [FLINK-35966][runtime] Introduce the TASKS for TaskManagerLoadBalanceMode enum [flink]
davidradl commented on code in PR #25647: URL: https://github.com/apache/flink/pull/25647#discussion_r1851640826 ## docs/layouts/shortcodes/generated/all_taskmanager_section.html: ## @@ -90,7 +90,7 @@ taskmanager.load-balance.mode NONE Enum -Mode for the load-balance allocation strategy across all available TaskManagers.The SLOTS mode tries to spread out the slots evenly across all available TaskManagers.The NONE mode is the default mode without any specified strategy.Possible values:"NONE""SLOTS" +Mode for the load-balance allocation strategy across all available TaskManagers.The SLOTS mode tries to spread out the slots evenly across all available TaskManagers.The TASKS mode tries to schedule evenly all tasks based on its' number across all available TaskManagers. Note: Currently, enabling this parameter can only achieve the balancing effect of the slot level dimension of DefaultScheduler.The NONE mode is the default mode without any specified strategy.Possible values:"NONE""SLOTS""TASKS" Review Comment: I am not sure what this sentence means - it would be really useful to have a diagram, and maybe an example showing the SLOT and TASK modes and how they differ. It would also be useful to detail why you would choose TASK mode or SLOT mode. I was looking for more information in the Jira - it should point to the FLIP. I see some of the conversations in the PR as to why this might be needed. Could we summarize these considerations in the docs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [FLINK-36764] Add checkpoint type and unaligned flag to the checkpoint trace [flink]
pnowojski opened a new pull request, #25671: URL: https://github.com/apache/flink/pull/25671 ## What is the purpose of the change Add checkpoint type and unaligned flag to the checkpoint trace ## Verifying this change Expanded unit test. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (**yes** / no) - If yes, how is the feature documented? (not applicable / **docs** / JavaDocs / not documented) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36764] Add checkpoint type and unaligned flag to the checkpoint trace [flink]
flinkbot commented on PR #25671: URL: https://github.com/apache/flink/pull/25671#issuecomment-2490481251 ## CI report: * fe598c5420a55cc01a1e8fc92f6cbb4907ac9d49 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-35966][runtime] Introduce the TASKS for TaskManagerLoadBalanceMode enum [flink]
davidradl commented on PR #25647: URL: https://github.com/apache/flink/pull/25647#issuecomment-2490489348 Reviewed by Chi on 21/11/24. Asked the submitter questions, mostly to make it clear when to configure this new option by bringing appropriate FLIP content and reasoning into the docs, ideally including diagrams to show the need for the new option. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36685) Enable update/create operation on flinkdeployment resource in mutation webhook
[ https://issues.apache.org/jira/browse/FLINK-36685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gyula Fora updated FLINK-36685: --- Priority: Minor (was: Blocker) > Enable update/create operation on flinkdeployment resource in mutation webhook > -- > > Key: FLINK-36685 > URL: https://issues.apache.org/jira/browse/FLINK-36685 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.9.0 >Reporter: Shuyi Chen >Assignee: Shuyi Chen >Priority: Minor > Labels: pull-request-available > > In mutation webhook yaml of the helm chart, UPDATE/CREATE operation is not > allowed on > flinkdeployments. We use mutation webhook to inject platform secrets to the > flink pipeline CRD. Planned to add a PR to enable UPDATE/CREATE operation on > flinkdeployments resource. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [FLINK-36525] Support for AI Model Integration for Data Processing [flink-cdc]
lvyanquan opened a new pull request, #3753: URL: https://github.com/apache/flink-cdc/pull/3753 The goal is to extend [flink-cdc](https://issues.apache.org/jira/browse/FLINK-cdc) with the capability to invoke AI models during the data stream processing workflow, with a particular focus on supporting array data structures. This feature will allow users to easily integrate embedding models to handle array-based data, such as lists of text, numeric data, or other multi-dimensional arrays, within their Flink jobs. Co-authored with @proletarians. Based on https://github.com/apache/flink-cdc/pull/3642 and https://github.com/apache/flink-cdc/pull/3434 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36646] Test different versions of the JDK in the Flink image [flink-kubernetes-operator]
gyfora commented on PR #910: URL: https://github.com/apache/flink-kubernetes-operator/pull/910#issuecomment-2490732713 Closing this as it was merged in another PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36646] Test different versions of the JDK in the Flink image [flink-kubernetes-operator]
gyfora closed pull request #910: [FLINK-36646] Test different versions of the JDK in the Flink image URL: https://github.com/apache/flink-kubernetes-operator/pull/910 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (FLINK-36765) How to Handle Multi-Type Maps in Avro Schema with Flink Table API?
[ https://issues.apache.org/jira/browse/FLINK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900049#comment-17900049 ] david radley edited comment on FLINK-36765 at 11/21/24 2:07 PM: I was trying to find where the Avro specification says that multiple values are allowed for the map values, do you know where this is documented as part of the specification. Could you use Record here instead of a map? was (Author: JIRAUSER300523): I was trying to find where the Avro specification says that multiple values are allowed for the map values, do you know where this is documented as part of the specification. > How to Handle Multi-Type Maps in Avro Schema with Flink Table API? > -- > > Key: FLINK-36765 > URL: https://issues.apache.org/jira/browse/FLINK-36765 > Project: Flink > Issue Type: Bug > Components: API / Type Serialization System, Formats (JSON, Avro, > Parquet, ORC, SequenceFile) >Reporter: Maneendra >Priority: Major > > > I have a Map with multiple data types in my Avro schema, which I am trying to > use in the Flink Table API to read data from Kafka. However, I’m encountering > the following exception because the Flink AvroSchemaConverter does not > support Maps with mixed data types. Could someone assist me in parsing this > schema using the Table API? > FLink Code: String avroSchema=""; > DataType s = AvroSchemaConverter.convertToDataType(avroSchema); > Schema schema1 = Schema.newBuilder().fromRowDataType(s).build(); > > TableDescriptor descriptor = TableDescriptor.forConnector("kafka") > .schema(schema) > .comment("simple comment") > .option("topic", "") > .option("properties.application.id", "") > .option("properties.security.protocol", "") > .option("properties.bootstrap.servers", "") > .option("properties.group.id", "") > .option("properties.auto.offset.reset", "earliest") > .option("format", "avro") > .build(); > Avro Schema: > { > "name":"standByProperties", > "type":[ > "null", > { > "type":"map", > "values":[ > "null", > "boolean", > "int" > ] > } > ] > }, > Output: standByProperties MAP NULL> Exception: Exception in thread "main" > java.lang.UnsupportedOperationException: Unsupported to derive Schema for > type: RAW('java.lang.Object', ?) NOT NULL at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:580) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:568) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:549) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) > What I Tried: I defined an Avro schema that includes a Map field with values > of mixed data types. Used the Flink Table API to read data from Kafka and > attempted to use AvroSchemaConverter to map the schema to a Flink table. > During execution, I encountered an exception because the AvroSchemaConverter > does not support Maps with multiple value types. What I Was Expecting: I was > expecting Flink to handle the Map field and correctly parse the data into a > table format, with proper support for the mixed data types within the Map. -- This message was sent by Atlassian Jira (v8.20.10#820010)
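As a concrete illustration of the record question above, a hedged sketch (the record and field names are invented, and it assumes the map keys are known up front, which may not hold for the reporter's source): a record with nullable fields converts cleanly via the same AvroSchemaConverter call used in the issue, whereas a map whose values are a union of boolean and int falls back to a RAW type.

{code:java}
import org.apache.avro.Schema;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.table.types.DataType;

public class RecordInsteadOfMapSketch {
    public static void main(String[] args) {
        // Hypothetical alternative schema: the known keys become optional record fields
        // instead of map values typed as a union of boolean and int.
        String recordSchema =
                "{\"type\":\"record\",\"name\":\"StandByProperties\",\"fields\":["
                        + "{\"name\":\"enabled\",\"type\":[\"null\",\"boolean\"],\"default\":null},"
                        + "{\"name\":\"retryCount\",\"type\":[\"null\",\"int\"],\"default\":null}]}";

        // Validate the Avro schema first, then derive the Flink type as in the reporter's snippet.
        Schema parsed = new Schema.Parser().parse(recordSchema);
        DataType dataType = AvroSchemaConverter.convertToDataType(parsed.toString());

        // Expected: a ROW type with nullable BOOLEAN and INT fields, rather than a
        // MAP whose value type degenerates to RAW('java.lang.Object', ?).
        System.out.println(dataType);
    }
}
{code}

Whether the keys can be enumerated ahead of time is an assumption; if the producing schema cannot change, this sketch does not apply and the conversion limitation remains on the Flink side.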
Re: [PR] [FLINK-36769] support fury serializer for pyflink [flink]
flinkbot commented on PR #25672: URL: https://github.com/apache/flink/pull/25672#issuecomment-2491305074 ## CI report: * c8a1d5309bca19c7d3880fd8b86b1b6b65bd4a07 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-35966][runtime] Introduce the TASKS for TaskManagerLoadBalanceMode enum [flink]
RocMarshal commented on code in PR #25647: URL: https://github.com/apache/flink/pull/25647#discussion_r1852202449 ## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SlotSharingExecutionSlotAllocatorFactory.java: ## @@ -40,13 +41,16 @@ public SlotSharingExecutionSlotAllocatorFactory( PhysicalSlotProvider slotProvider, boolean slotWillBeOccupiedIndefinitely, PhysicalSlotRequestBulkChecker bulkChecker, -Duration allocationTimeout) { +Duration allocationTimeout, +TaskManagerOptions.TaskManagerLoadBalanceMode taskManagerLoadBalanceMode) { this( slotProvider, slotWillBeOccupiedIndefinitely, bulkChecker, allocationTimeout, -new LocalInputPreferredSlotSharingStrategy.Factory()); +taskManagerLoadBalanceMode == TaskManagerOptions.TaskManagerLoadBalanceMode.TASKS +? new TaskBalancedPreferredSlotSharingStrategy.Factory() Review Comment: In fact, the phenomenon of uneven task scheduling results is mainly caused by two stages: 1. tasks->slot 2. slots->taskmanager Because prior to the initial feature freeze, we anticipated that the entire FLIP would be at risk. Our original intention in initiating this PR was to release this change as early as possible to address the imbalance issue in the task ->slot dimension. But now there is no time risk with this feature, so we plan to release both dimensions of functionality together. BTW, the optimizations and changes mentioned above are highly likely not to be fixed in the current PR as it may be abandoned. But we will place the corresponding optimization in the appropriate PR position in the future. If you are willing to help with the review, I would greatly appreciate it! :) ## docs/layouts/shortcodes/generated/all_taskmanager_section.html: ## @@ -90,7 +90,7 @@ taskmanager.load-balance.mode NONE Enum -Mode for the load-balance allocation strategy across all available TaskManagers.The SLOTS mode tries to spread out the slots evenly across all available TaskManagers.The NONE mode is the default mode without any specified strategy.Possible values:"NONE""SLOTS" +Mode for the load-balance allocation strategy across all available TaskManagers.The SLOTS mode tries to spread out the slots evenly across all available TaskManagers.The TASKS mode tries to schedule evenly all tasks based on its' number across all available TaskManagers. Note: Currently, enabling this parameter can only achieve the balancing effect of the slot level dimension of DefaultScheduler.The NONE mode is the default mode without any specified strategy.Possible values:"NONE""SLOTS""TASKS" Review Comment: Good idea~, we do need a more detailed and precise description, as well as traceable design documents. How about making the following changes based on your suggestion? 1. Optimize the configuration description section in the code, including adding FLIP addresses 2. The explanation of how to use and the differences between it and other configurations, as well as the creation of charts, can be placed in https://issues.apache.org/jira/browse/FLINK-33392 Completed in the middle ## flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOptions.java: ## @@ -707,6 +707,12 @@ public class TaskManagerOptions { "The %s mode tries to spread out the slots evenly across all available %s.", code(TaskManagerLoadBalanceMode.SLOTS.name()), code("TaskManagers")), +text( +"The %s mode tries to schedule evenly all tasks based on its' number across all available %s. 
" Review Comment: The meaning expressed here is that the scheduler will try its best to ensure that the scheduling results are concentrated and the number of tasks on each Taskmanager is similar. I will add a more detailed description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36769) Suport fury Serializer for pyflink
[ https://issues.apache.org/jira/browse/FLINK-36769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36769: --- Labels: pull-request-available (was: ) > Suport fury Serializer for pyflink > -- > > Key: FLINK-36769 > URL: https://issues.apache.org/jira/browse/FLINK-36769 > Project: Flink > Issue Type: New Feature > Components: API / Python >Affects Versions: 1.16.1 > Environment: flink 1.16.1 >Reporter: xingyuan cheng >Priority: Major > Labels: pull-request-available > > Hi, community. Currently, in the batch verification scenario of our algorithm > data, we use pyflink and encounter low transmission efficiency caused by low > performance of pickle4-based encoding. After research, we decided to adopt > Apache fury, a serialization framework based on pickle5 encoding. The > implementation of fury in python will define the transmission buffer size in > the protocol for transmission to improve the performance of large data > transmission. To this end, I Prepared a draft pull request. What do friends > in the community think about this? > > Pickle protocol 5 with out-of-band data: https://peps.python.org/pep-0574/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-36765) How to Handle Multi-Type Maps in Avro Schema with Flink Table API?
[ https://issues.apache.org/jira/browse/FLINK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900079#comment-17900079 ] Maneendra commented on FLINK-36765: --- [~davidradl] The existing schema, which contains map values with multiple data types, works with the Flink DataStream API. However, while migrating the codebase to the Flink Table API, we encountered an issue with this schema. We need to resolve this issue because multiple teams consume the data, making it impractical to modify the source schema. > How to Handle Multi-Type Maps in Avro Schema with Flink Table API? > -- > > Key: FLINK-36765 > URL: https://issues.apache.org/jira/browse/FLINK-36765 > Project: Flink > Issue Type: Bug > Components: API / Type Serialization System, Formats (JSON, Avro, > Parquet, ORC, SequenceFile) >Reporter: Maneendra >Priority: Major > > > I have a Map with multiple data types in my Avro schema, which I am trying to > use in the Flink Table API to read data from Kafka. However, I’m encountering > the following exception because the Flink AvroSchemaConverter does not > support Maps with mixed data types. Could someone assist me in parsing this > schema using the Table API? > FLink Code: String avroSchema=""; > DataType s = AvroSchemaConverter.convertToDataType(avroSchema); > Schema schema1 = Schema.newBuilder().fromRowDataType(s).build(); > > TableDescriptor descriptor = TableDescriptor.forConnector("kafka") > .schema(schema) > .comment("simple comment") > .option("topic", "") > .option("properties.application.id", "") > .option("properties.security.protocol", "") > .option("properties.bootstrap.servers", "") > .option("properties.group.id", "") > .option("properties.auto.offset.reset", "earliest") > .option("format", "avro") > .build(); > Avro Schema: > { > "name":"standByProperties", > "type":[ > "null", > { > "type":"map", > "values":[ > "null", > "boolean", > "int" > ] > } > ] > }, > Output: standByProperties MAP NULL> Exception: Exception in thread "main" > java.lang.UnsupportedOperationException: Unsupported to derive Schema for > type: RAW('java.lang.Object', ?) NOT NULL at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:580) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:568) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:549) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) > What I Tried: I defined an Avro schema that includes a Map field with values > of mixed data types. Used the Flink Table API to read data from Kafka and > attempted to use AvroSchemaConverter to map the schema to a Flink table. > During execution, I encountered an exception because the AvroSchemaConverter > does not support Maps with multiple value types. What I Was Expecting: I was > expecting Flink to handle the Map field and correctly parse the data into a > table format, with proper support for the mixed data types within the Map. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36766) Use pyflink to create remote env
William Que created FLINK-36766: --- Summary: Use pyflink to create remote env Key: FLINK-36766 URL: https://issues.apache.org/jira/browse/FLINK-36766 Project: Flink Issue Type: Bug Components: API / Python Affects Versions: 1.19.1, 1.20.0 Environment: Ubuntu 24 LTSC Flink : 1.19.1 or 1.20.0 Reporter: William Que I use the following codes to connect remote flink cluster and then create a remote flink env. After adding jar files to the streamExecutionEnvironment, evary time when executing flink sql, error will be reported, something like error of parsing yaml file. {code:java} import os from pyflink.datastream import StreamExecutionEnvironment from pyflink.java_gateway import get_gateway from pyflink.table import StreamTableEnvironment gateway = get_gateway() string_class = gateway.jvm.String string_array = gateway.new_array(string_class, 0) stream_env = gateway.jvm.org.apache.flink.streaming.api.environment.StreamExecutionEnvironment j_stream_exection_environment = stream_env.createRemoteEnvironment("master",8081,string_array) env = StreamExecutionEnvironment(j_stream_exection_environment) jars_path = "F:/jars/flink-1.19.1/" jar_files = ["file:///" + jars_path + f for f in os.listdir(jars_path) if f.endswith('.jar')] jar_files_str = ';'.join(jar_files) env.add_jars(*jar_files) ## Cause Error t_env = StreamTableEnvironment.create(env) {code} Then I trace the error, and find it caused by a static method in configuration.py of pyflink package. after env.add_jars(*jar_files) , the value parameter will be like this, which caused the above error. value = '{color:#FF}[];{color}file:///F:/software/jars-flink3/flink-clients-1.20.0.jar;file:///F:/software/jars-flink3/flink-connector-jdbc-3.2.0-1.19.jar;..' {code:java} @staticmethod def parse_jars_value(value: str, jvm): is_standard_yaml = jvm.org.apache.flink.configuration.GlobalConfiguration.isStandardYaml() if is_standard_yaml: from ruamel.yaml import YAML yaml = YAML(typ='safe') jar_urls_list = yaml.load(value) # ERROR if isinstance(jar_urls_list, list): return jar_urls_list return value.split(";") {code} I once tried to fix it by the way of removing "[];" part from the value of value parameter, one problem solved but another then came out, it seems jar files have been added twice in classpath at some place. 
{code:java} Caused by: java.lang.IllegalStateException: The library registration references a different set of library BLOBs than previous registrations for this job: old:[file:/F:/software/jars-flink3/flink-connector-jdbc-3.2.0-1.19.jar, file:/F:/software/jars-flink3/flink-clients-1.20.0.jar, file:/F:/software/jars-flink3/mysql-connector-java-8.0.28.jar, file:/F:/software/jars-flink3/flink-json-1.20.0.jar, file:/F:/software/jars-flink3/flink-connector-kafka-3.3.0-1.20.jar, file:/F:/software/jars-flink3/kafka-clients-3.6.2.jar] new:[file:/F:/software/jars-flink3/flink-clients-1.20.0.jar, file:/F:/software/jars-flink3/flink-connector-jdbc-3.2.0-1.19.jar, file:/F:/software/jars-flink3/flink-connector-kafka-3.3.0-1.20.jar, file:/F:/software/jars-flink3/flink-json-1.20.0.jar, file:/F:/software/jars-flink3/kafka-clients-3.6.2.jar, file:/F:/software/jars-flink3/mysql-connector-java-8.0.28.jar, file:/F:/software/jars-flink3/flink-clients-1.20.0.jar, file:/F:/software/jars-flink3/flink-connector-jdbc-3.2.0-1.19.jar, file:/F:/software/jars-flink3/flink-connector-kafka-3.3.0-1.20.jar, file:/F:/software/jars-flink3/flink-json-1.20.0.jar, file:/F:/software/jars-flink3/kafka-clients-3.6.2.jar, file:/F:/software/jars-flink3/mysql-connector-java-8.0.28.jar]{code} Please check it carefullly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36767) Bump cyclonedx-maven-plugin from 2.7.9 to 2.9.0
Siddharth R created FLINK-36767: --- Summary: Bump cyclonedx-maven-plugin from 2.7.9 to 2.9.0 Key: FLINK-36767 URL: https://issues.apache.org/jira/browse/FLINK-36767 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.11.0 Reporter: Siddharth R Bump cyclonedx-maven-plugin from 2.7.9 to 2.9.0 to remediate the findings in the dependent packages. Vulnerabilities from dependencies: [CVE-2024-38374|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2024-38374] Package details: [https://mvnrepository.com/artifact/org.cyclonedx/cyclonedx-maven-plugin/2.9.0] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36686) allow customizing env variables for flink-webhook container
[ https://issues.apache.org/jira/browse/FLINK-36686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36686: --- Labels: pull-request-available (was: ) > allow customizing env variables for flink-webhook container > --- > > Key: FLINK-36686 > URL: https://issues.apache.org/jira/browse/FLINK-36686 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.9.0 >Reporter: Shuyi Chen >Assignee: Shuyi Chen >Priority: Blocker > Labels: pull-request-available > > In current helm chart of the operator, there is no way to pass in custom env > variables into the flink-webhook container. I can see a few options to fix > the problem: > * make Values.operatorPod.env pod-level envs in flink-operator.yaml > * create a new webhoodPod.env in values.yaml and apply it only to the > flink-webhook container env setup > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36764] Add checkpoint type and unaligned flag to the checkpoint trace [flink]
davidradl commented on code in PR #25671: URL: https://github.com/apache/flink/pull/25671#discussion_r1851895450 ## flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/DefaultCheckpointStatsTrackerTest.java: ## @@ -385,11 +384,10 @@ public void addSpan(SpanBuilder spanBuilder) { tracker.reportCompletedCheckpoint(pending.toCompletedCheckpointStats(null)); assertThat(reportedSpans.size()).isEqualTo(1); -assertThat( -reportedSpans.stream() -.map(span -> span.getAttributes().get("checkpointId")) -.collect(Collectors.toList())) -.containsExactly(42L); +Span reportedSpan = Iterables.getOnlyElement(reportedSpans); + assertThat(reportedSpan.getAttributes().get("checkpointId")).isEqualTo(42L); + assertThat(reportedSpan.getAttributes().get("checkpointType")).isEqualTo("Checkpoint"); + assertThat(reportedSpan.getAttributes().get("isUnaligned")).isEqualTo("false"); Review Comment: can we add a test for` isUnaligned ` true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36743) Rescale from unaligend checkpoint failed
[ https://issues.apache.org/jira/browse/FLINK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feifan Wang updated FLINK-36743: Attachment: image-2024-11-21-20-20-41-644.png > Rescale from unaligend checkpoint failed > > > Key: FLINK-36743 > URL: https://issues.apache.org/jira/browse/FLINK-36743 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Reporter: Feifan Wang >Priority: Major > Attachments: > Allow-user-to-set-whether-restore-forward-rescale-broadcast-from-unaligned-checkpoint-with-parallelism-change.patch, > image-2024-11-19-14-58-22-975.png, image-2024-11-19-17-27-55-387.png, > image-2024-11-19-17-30-14-816.png, image-2024-11-21-20-20-20-536.png, > image-2024-11-21-20-20-41-644.png > > > We encountered the following exception when scaling down a job from 5600 to > 4200: > {code:java} > 2024-11-12 19:20:54,308 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: > xx (1358/1400) > (80ea0855521cb3249d011e3166823e47_56a38c81905da002db3a9d8f9d395f2b_1357_0) > switched from RUNNING to FAILED on > container_e33_1725519807238_6894116_01_000825 @ yg- > java.lang.IllegalStateException: Cannot select > SubtaskConnectionDescriptor{inputSubtaskIndex=0, outputSubtaskIndex=4071}; > known channels are [SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=0}, SubtaskConnectionDescriptor{inputSubtaskIndex=1357, > outputSubtaskIndex=4200}] > at > org.apache.flink.streaming.runtime.io.recovery.DemultiplexingRecordDeserializer.select(DemultiplexingRecordDeserializer.java:121) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.recovery.RescalingStreamTaskNetworkInput.processEvent(RescalingStreamTaskNetworkInput.java:181) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:937) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:916) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:730) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT] > at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312] {code} > * Flink version : 1.16.1 > * unaligned checkpoint : enabled > 
* log-based checkpoint : enabled > The exception was encountered when restoring from chk-2718336, and the job can > successfully restore from chk-2718333. I checked the metadata files of > chk-2718336 and chk-2718333; both of them contain in-flight data. It looks like > there is something wrong with how the unaligned checkpoint reassigns > in-flight data. Could you please take a look? [~arvid], [~pnowojski] -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36066][runtime] Introducing the AdaptiveGraphManager component [flink]
zhuzhurk commented on code in PR #25414: URL: https://github.com/apache/flink/pull/25414#discussion_r1851918943 ## flink-streaming-java/src/test/java/org/apache/flink/streaming/api/graph/AdaptiveGraphManagerTest.java: ## @@ -0,0 +1,354 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.streaming.api.graph; + +import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.api.common.operators.ResourceSpec; +import org.apache.flink.api.common.typeinfo.Types; +import org.apache.flink.api.connector.source.lib.NumberSequenceSource; +import org.apache.flink.api.dag.Transformation; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.configuration.TaskManagerOptions; +import org.apache.flink.core.memory.ManagedMemoryUseCase; +import org.apache.flink.runtime.jobgraph.JobGraph; +import org.apache.flink.runtime.jobgraph.JobVertex; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.functions.sink.v2.DiscardingSink; +import org.apache.flink.streaming.api.operators.ChainingStrategy; +import org.apache.flink.streaming.api.transformations.MultipleInputTransformation; + +import org.apache.flink.shaded.guava32.com.google.common.collect.Iterables; + +import org.junit.jupiter.api.Test; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +import static org.assertj.core.api.Assertions.assertThat; + +/** Tests for {@link AdaptiveGraphManager}. 
*/ +public class AdaptiveGraphManagerTest extends JobGraphGeneratorTestBase { +@Override +JobGraph createJobGraph(StreamGraph streamGraph) { +return generateJobGraphInLazilyMode(streamGraph); +} + +@Test +@Override +void testManagedMemoryFractionForUnknownResourceSpec() throws Exception { +final ResourceSpec resource = ResourceSpec.UNKNOWN; +final List resourceSpecs = +Arrays.asList(resource, resource, resource, resource); + +final Configuration taskManagerConfig = +new Configuration() { +{ +set( + TaskManagerOptions.MANAGED_MEMORY_CONSUMER_WEIGHTS, +new HashMap() { +{ +put( +TaskManagerOptions + .MANAGED_MEMORY_CONSUMER_NAME_OPERATOR, +"6"); +put( +TaskManagerOptions + .MANAGED_MEMORY_CONSUMER_NAME_PYTHON, +"4"); +} +}); +} +}; + +final List> operatorScopeManagedMemoryUseCaseWeights = +new ArrayList<>(); +final List> slotScopeManagedMemoryUseCases = new ArrayList<>(); + +// source: batch +operatorScopeManagedMemoryUseCaseWeights.add( +Collections.singletonMap(ManagedMemoryUseCase.OPERATOR, 1)); +slotScopeManagedMemoryUseCases.add(Collections.emptySet()); + +// map1: batch, python +operatorScopeManagedMemoryUseCaseWeights.add( +Collections.singletonMap(ManagedMemoryUseCase.OPERATOR, 1)); + slotScopeManagedMemoryUseCases.add(Collections.singleton(ManagedMemoryUseCase.PYTHON)); + +// map3: python +operatorScopeManagedMemoryUseCaseWeights.add(Collections.emptyMap()); + slotScopeManagedMemoryUseCases.add(Collections.singleton(ManagedMemoryUseCase.PYTHON)); + +
Re: [PR] [FLINK-36764] Add checkpoint type and unaligned flag to the checkpoint trace [flink]
pnowojski commented on code in PR #25671: URL: https://github.com/apache/flink/pull/25671#discussion_r1851984375 ## flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/DefaultCheckpointStatsTracker.java: ## @@ -256,7 +256,13 @@ private void logCheckpointStatistics(AbstractCheckpointStats checkpointStats) { .setAttribute("checkpointId", checkpointStats.getCheckpointId()) .setAttribute("fullSize", checkpointStats.getStateSize()) .setAttribute("checkpointedSize", checkpointStats.getCheckpointedSize()) -.setAttribute("checkpointStatus", checkpointStats.getStatus().name())); +.setAttribute("checkpointStatus", checkpointStats.getStatus().name()) +.setAttribute( +"isUnaligned", + Boolean.toString(checkpointStats.isUnalignedCheckpoint())) +.setAttribute( +"checkpointType", + checkpointStats.getProperties().getCheckpointType().getName())); Review Comment: The potential values here are: ``` public static final CheckpointType CHECKPOINT = new CheckpointType("Checkpoint", SharingFilesStrategy.FORWARD_BACKWARD); public static final CheckpointType FULL_CHECKPOINT = new CheckpointType("Full Checkpoint", SharingFilesStrategy.FORWARD); public static SavepointType savepoint(SavepointFormatType formatType) { return new SavepointType("Savepoint", PostCheckpointAction.NONE, formatType); } public static SavepointType terminate(SavepointFormatType formatType) { return new SavepointType("Terminate Savepoint", PostCheckpointAction.TERMINATE, formatType); } public static SavepointType suspend(SavepointFormatType formatType) { return new SavepointType("Suspend Savepoint", PostCheckpointAction.SUSPEND, formatType); } ``` Savepoints do end up here, and unfortunately the `Checkpoint` name here is a bit overloaded. However this is also consistent with other places in the code, especially metrics/logs. There savepoints are also reported/logged as checkpoints. So I'm not sure that I would be fine the to change the trace's `Checkpoint` name here to `Snapshot`. I think `SharingFilesStrategy` is superfluous. Each snapshot type is named uniquely, so you can deduce sharing strategy based on the name/type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-36768) Empty Checkpoint Directory created when older checkpoint fails due to timeout and the checkpoint interval is same
Eaugene Thomas created FLINK-36768: -- Summary: Empty Checkpoint Directory created when older checkpoint fails due to timeout and the checkpoint interval is same Key: FLINK-36768 URL: https://issues.apache.org/jira/browse/FLINK-36768 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.12.1 Reporter: Eaugene Thomas When the checkpoint interval equals the checkpoint timeout, and the failure threshold is reached, the asynchronous checkpoint trigger starts. However, due to an internal restart, the tombstone directory ({{{}chk-{}}}) is not cleared. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-31836][table] Upgrade Calcite version to 1.34.0 [flink]
twalthr commented on code in PR #24256: URL: https://github.com/apache/flink/pull/24256#discussion_r1852228055 ## docs/content.zh/docs/dev/table/sql/queries/deduplication.md: ## @@ -32,13 +32,10 @@ Flink 使用 `ROW_NUMBER()` 去除重复数据,就像 Top-N 查询一样。其 下面的例子展示了去重语句的语法: ```sql -SELECT [column_list] -FROM ( - SELECT [column_list], - ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] - ORDER BY time_attr [asc|desc]) AS rownum - FROM table_name) -WHERE rownum = 1 +SELECT [column_list], Review Comment: Can we use this instead for deduplication? The Top-N makes sense. ``` SELECT [column_list] FROM table_name QUALIFY ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] ORDER BY time_attr [asc|desc]) = 1 ``` ## docs/content/docs/dev/table/concepts/versioned_tables.md: ## @@ -163,13 +163,10 @@ table usable in subsequent queries. In general, the results of a query with the following format produces a versioned table: ```sql -SELECT [column_list] -FROM ( - SELECT [column_list], - ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] - ORDER BY time_attr DESC) AS rownum - FROM table_name) -WHERE rownum = 1 +SELECT [column_list], +ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] ORDER BY time_attr [asc|desc]) AS rownum +FROM table_name) +QUALIFY rownum = 1 Review Comment: use same as for dedup ## docs/content.zh/docs/dev/table/sql/queries/topn.md: ## @@ -32,13 +32,11 @@ Flink 使用 `OVER` 窗口子句和过滤条件的组合来表达一个 Top-N 下面展示了 Top-N 的语法: ```sql -SELECT [column_list] -FROM ( - SELECT [column_list], - ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] - ORDER BY col1 [asc|desc][, col2 [asc|desc]...]) AS rownum - FROM table_name) -WHERE rownum <= N [AND conditions] +SELECT [column_list], +ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] ORDER BY time_attr [asc|desc]) AS rownum +FROM table_name) Review Comment: ``` FROM table_name ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36764] Add checkpoint type and unaligned flag to the checkpoint trace [flink]
davidradl commented on code in PR #25671: URL: https://github.com/apache/flink/pull/25671#discussion_r1851894006 ## flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/DefaultCheckpointStatsTracker.java: ## @@ -256,7 +256,13 @@ private void logCheckpointStatistics(AbstractCheckpointStats checkpointStats) { .setAttribute("checkpointId", checkpointStats.getCheckpointId()) .setAttribute("fullSize", checkpointStats.getStateSize()) .setAttribute("checkpointedSize", checkpointStats.getCheckpointedSize()) -.setAttribute("checkpointStatus", checkpointStats.getStatus().name())); +.setAttribute("checkpointStatus", checkpointStats.getStatus().name()) +.setAttribute( +"isUnaligned", + Boolean.toString(checkpointStats.isUnalignedCheckpoint())) +.setAttribute( +"checkpointType", + checkpointStats.getProperties().getCheckpointType().getName())); Review Comment: in this case I assume that the name will always be "Checkpoint". I assume savepoints do not come through here. If save points can come through here then the span.builder first line should probable be "Snapshot" rather than "Checkpoint". I am thinking that the SnapshotType's SharingFilesStrategy - would be good to include here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-36535) Optimize the scale down logic based on historical parallelism
[ https://issues.apache.org/jira/browse/FLINK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900022#comment-17900022 ] Rui Fan commented on FLINK-36535: - Thanks [~heigebupahei] for the comment! {quote} I think we only need to change the strategy to max of window and abandon the previous last value of window. {quote} I think about it again, your suggestion is reasonable, I have updated the Jira description. Also I have finished the POC in our production environment. It works as expected: * The rescale frequency is extremely low: Previously, the job was rescaled 10 times a day, but now it is rescaled less than once a day on average. * Jobs run at the same parallelism as during the peak hours of each day. * Scale down only happens when the parallelism during peak hours still wastes resources. > Optimize the scale down logic based on historical parallelism > - > > Key: FLINK-36535 > URL: https://issues.apache.org/jira/browse/FLINK-36535 > Project: Flink > Issue Type: Improvement > Components: Autoscaler >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > > This is a follow-up to FLINK-36018 . FLINK-36018 supported the lazy scale > down to avoid frequent rescaling. > h1. Proposed Change > Treat scale-down.interval as a window: > * Recording the scale down trigger time when the recommended parallelism < > current parallelism > ** When the recommended parallelism >= current parallelism, cancel the > triggered scale down > * The scale down will be executed when currentTime - triggerTime > > scale-down.interval > ** {color:#de350b}Change1{color}: Using the maximum parallelism within the > window instead of the latest parallelism when scaling down. > * {color:#de350b}Change2{color}: Never scale down when currentTime - > triggerTime < scale-down.interval > ** In the FLINK-36018, the scale down may be executed when currentTime - > triggerTime < scale-down.interval. > ** For example: the taskA may scale down when taskB needs to scale up. > h1. Background > Some critical Flink jobs need to scale up in time, but only scale down on a > daily basis. In other words, Flink users do not want Flink jobs to be scaled > down multiple times within 24 hours, and the jobs run at the same parallelism > as during the peak hours of each day. > Note: Users hope to scale down only happens when the parallelism during peak > hours is still a waste of resources. This is a trade-off between downtime and > resource waste for a critical job. > h1. Current solution > In general, this requirement could be met after setting{color:#de350b} > job.autoscaler.scale-down.interval= 24 hour{color}. When taskA runs with 100 > parallelism, and recommended parallelism is 100 during the peak hours of each > day. We hope taskA doesn't rescale forever, because the triggered scale down > will be canceled once the recommended parallelism >= current parallelism > within 24 hours (It‘s exactly what FLINK-36018 does). > h1. Unexpected Scenario & how to solve? > But I found the critical production job is still rescaled about 10 times > every day (when scale-down.interval is set to 24 hours). > Root cause: There may be many sources in a job, and the traffic peaks of > these sources may occur at different times. When taskA triggers scale down, > the scale down of taskA will not be actively executed within 24 hours, but it > may be executed when other tasks are scaled up. > For example: > * The scale down of sourceB and sourceC may be executed when SourceA scales > up. 
> * After a while, the scale down of sourceA and sourceC may be executed when > SourceB scales up. > * After a while, the scale down of sourceA and sourceB may be executed when > SourceC scales up. > * When there are many tasks, the above 3 steps will be executed repeatedly. > That's why the job is rescaled about 10 times every day, the > {color:#de350b}change2{color} of proposed change could solve this issue: > Never scale down when currentTime - triggerTime < scale-down.interval. > > {color:#de350b}Change1{color}: Using the maximum parallelism within the > window instead of the latest parallelism when scaling down. > * It can ensure that the parallelism after scaling down is the parallelism > at yesterday's peak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
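To make the proposed window semantics concrete, here is a minimal sketch of the bookkeeping described above (Change1: scale down to the maximum recommendation observed inside the window; Change2: never scale down before the window elapses). The class and method names are illustrative assumptions, not the actual autoscaler code.

{code:java}
import java.time.Duration;
import java.time.Instant;

/** Illustrative sketch of the proposed scale-down window; not the actual autoscaler classes. */
class ScaleDownWindow {

    private final Duration scaleDownInterval; // e.g. 24h, job.autoscaler.scale-down.interval
    private Instant triggerTime;               // set when recommended < current parallelism
    private int maxRecommendedParallelism;     // Change1: max recommendation seen in the window

    ScaleDownWindow(Duration scaleDownInterval) {
        this.scaleDownInterval = scaleDownInterval;
    }

    /** Returns the parallelism to scale down to, or null if no scale down should happen yet. */
    Integer onEvaluation(Instant now, int currentParallelism, int recommendedParallelism) {
        if (recommendedParallelism >= currentParallelism) {
            // The vertex still needs its current parallelism: cancel any triggered scale down.
            triggerTime = null;
            return null;
        }
        if (triggerTime == null) {
            // First recommendation below the current parallelism: open the window.
            triggerTime = now;
            maxRecommendedParallelism = recommendedParallelism;
            return null;
        }
        maxRecommendedParallelism = Math.max(maxRecommendedParallelism, recommendedParallelism);
        if (Duration.between(triggerTime, now).compareTo(scaleDownInterval) < 0) {
            // Change2: never scale down before the full interval has elapsed.
            return null;
        }
        // Change1: scale down to the maximum recommendation seen in the window,
        // i.e. roughly the parallelism needed at the previous peak.
        triggerTime = null;
        return maxRecommendedParallelism;
    }
}
{code}

With a 24-hour interval, a vertex whose recommendation dips during off-peak hours is rescaled at most once the full window has passed, and then only down to the peak-hour recommendation observed within that window.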
[jira] [Updated] (FLINK-36739) Update NodeJS to v22 (LTS)
[ https://issues.apache.org/jira/browse/FLINK-36739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mehdi updated FLINK-36739: -- Summary: Update NodeJS to v22 (LTS) (was: Update NodeJS to v18.20.5 (LTS)) > Update NodeJS to v22 (LTS) > -- > > Key: FLINK-36739 > URL: https://issues.apache.org/jira/browse/FLINK-36739 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Web Frontend >Reporter: Mehdi >Assignee: Mehdi >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36765) How to Handle Multi-Type Maps in Avro Schema with Flink Table API?
Maneendra created FLINK-36765: - Summary: How to Handle Multi-Type Maps in Avro Schema with Flink Table API? Key: FLINK-36765 URL: https://issues.apache.org/jira/browse/FLINK-36765 Project: Flink Issue Type: Bug Components: API / Type Serialization System, Formats (JSON, Avro, Parquet, ORC, SequenceFile) Reporter: Maneendra I have a Map with multiple data types in my Avro schema, which I am trying to use in the Flink Table API to read data from Kafka. However, I’m encountering the following exception because the Flink AvroSchemaConverter does not support Maps with mixed data types. Could someone assist me in parsing this schema using the Table API? FLink Code: String avroSchema=""; DataType s = AvroSchemaConverter.convertToDataType(avroSchema); Schema schema1 = Schema.newBuilder().fromRowDataType(s).build(); TableDescriptor descriptor = TableDescriptor.forConnector("kafka") .schema(schema) .comment("simple comment") .option("topic", "") .option("properties.application.id", "") .option("properties.security.protocol", "") .option("properties.bootstrap.servers", "") .option("properties.group.id", "") .option("properties.auto.offset.reset", "earliest") .option("format", "avro") .build(); Avro Schema: { "name":"standByProperties", "type":[ "null", { "type":"map", "values":[ "null", "boolean", "int" ] } ] }, Output: standByProperties MAP Exception: Exception in thread "main" java.lang.UnsupportedOperationException: Unsupported to derive Schema for type: RAW('java.lang.Object', ?) NOT NULL at org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:580) at org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) at org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:568) at org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:549) at org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) What I Tried: I defined an Avro schema that includes a Map field with values of mixed data types. Used the Flink Table API to read data from Kafka and attempted to use AvroSchemaConverter to map the schema to a Flink table. During execution, I encountered an exception because the AvroSchemaConverter does not support Maps with multiple value types. What I Was Expecting: I was expecting Flink to handle the Map field and correctly parse the data into a table format, with proper support for the mixed data types within the Map. -- This message was sent by Atlassian Jira (v8.20.10#820010)
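For background on the schema itself: Avro allows a map's values to be declared as a union of several types, and plain Avro deserializes such values as java.lang.Object, which is why AvroSchemaConverter can only model them as a RAW type. Below is a small, hedged illustration of building the schema above with the standard Avro Java API; it explains the behavior rather than working around the converter limitation.

{code:java}
import java.util.Arrays;

import org.apache.avro.Schema;

public class UnionValuedMapSchema {

    public static void main(String[] args) {
        // Equivalent to: ["null", {"type": "map", "values": ["null", "boolean", "int"]}]
        Schema valueUnion =
                Schema.createUnion(
                        Arrays.asList(
                                Schema.create(Schema.Type.NULL),
                                Schema.create(Schema.Type.BOOLEAN),
                                Schema.create(Schema.Type.INT)));
        Schema standByProperties =
                Schema.createUnion(
                        Arrays.asList(
                                Schema.create(Schema.Type.NULL), Schema.createMap(valueUnion)));

        System.out.println(standByProperties.toString(true));
        // A GenericDatumReader deserializes such a field as Map<Utf8, Object>, where each value
        // is null, Boolean or Integer. There is no single SQL type covering that union, which is
        // why AvroSchemaConverter falls back to RAW('java.lang.Object', ?) and schema derivation fails.
    }
}
{code}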
[jira] (FLINK-36535) Optimize the scale down logic based on historical parallelism
[ https://issues.apache.org/jira/browse/FLINK-36535 ] Rui Fan deleted comment on FLINK-36535: - was (Author: fanrui): Thanks [~heigebupahei] for the comment! {quote}that we need to treat job.autoscaler.scale-down.interval as a window. The default is last value of window, but in some cases it is max in window(>24hour)? {quote} For solution2, yes. For solution1, the default is last value of window. And users could set an option to choose the max as the result. {quote}But it seems to me like we just need to change the default behavior to max in window? Just imagine, if you follow the default 1hour configuration, the result of using max in window is not much different from before, so I think we only need to change the strategy to max of window and abandon the previous last value of window. {quote} It's a good point. As I understand, you mean: * We choose the max as the result by default * Don't need any additional option, and autoscaler doesn't support choose the latest parallelism as the result for scale down Is my understanding correct? If yes, I think it makes sense for most of scenarios. However I'm afraid it could not meet all requirements. For example: * when the job owners are very familiar with the daily traffic changes of the job, they could set a appropriate job.autoscaler.scale-down.interval, and job will only scale down once a day when autoscaler choose the latest parallelism. * IIUC, autoscaler needs at least 2 scales down a day if autoscaler choose the max parallelism in window. ** Because the max is not the latest parallelism, and after scaling down, max is most likely not the latest appropriate parallelism for vertex. So job needs scale down again later. Please correct me if I misunderstand anything, thanks~ > Optimize the scale down logic based on historical parallelism > - > > Key: FLINK-36535 > URL: https://issues.apache.org/jira/browse/FLINK-36535 > Project: Flink > Issue Type: Improvement > Components: Autoscaler >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > > This is a follow-up to FLINK-36018 . FLINK-36018 supported the lazy scale > down to avoid frequent rescaling. > h1. Proposed Change > Treat scale-down.interval as a window: > * Recording the scale down trigger time when the recommended parallelism < > current parallelism > ** When the recommended parallelism >= current parallelism, cancel the > triggered scale down > * The scale down will be executed when currentTime - triggerTime > > scale-down.interval > ** {color:#de350b}Change1{color}: Using the maximum parallelism within the > window instead of the latest parallelism when scaling down. > * {color:#de350b}Change2{color}: Never scale down when currentTime - > triggerTime < scale-down.interval > ** In the FLINK-36018, the scale down may be executed when currentTime - > triggerTime < scale-down.interval. > ** For example: the taskA may scale down when taskB needs to scale up. > h1. Background > Some critical Flink jobs need to scale up in time, but only scale down on a > daily basis. In other words, Flink users do not want Flink jobs to be scaled > down multiple times within 24 hours, and the jobs run at the same parallelism > as during the peak hours of each day. > Note: Users hope to scale down only happens when the parallelism during peak > hours is still a waste of resources. This is a trade-off between downtime and > resource waste for a critical job. > h1. Current solution > In general, this requirement could be met after setting{color:#de350b} > job.autoscaler.scale-down.interval= 24 hour{color}. 
When taskA runs with 100 > parallelism, and recommended parallelism is 100 during the peak hours of each > day. We hope taskA doesn't rescale forever, because the triggered scale down > will be canceled once the recommended parallelism >= current parallelism > within 24 hours (It‘s exactly what FLINK-36018 does). > h1. Unexpected Scenario & how to solve? > But I found the critical production job is still rescaled about 10 times > every day (when scale-down.interval is set to 24 hours). > Root cause: There may be many sources in a job, and the traffic peaks of > these sources may occur at different times. When taskA triggers scale down, > the scale down of taskA will not be actively executed within 24 hours, but it > may be executed when other tasks are scaled up. > For example: > * The scale down of sourceB and sourceC may be executed when SourceA scales > up. > * After a while, the scale down of sourceA and sourceC may be executed when > SourceB scales up. > * After a while, the scale down of sourceA and sourceB may be executed when > SourceC scales up. > * When there are many tasks, the above 3 steps will be executed repeatedly. > That's why the job is re
[jira] [Updated] (FLINK-36769) Suport fury Serializer for pyflink
[ https://issues.apache.org/jira/browse/FLINK-36769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xingyuan cheng updated FLINK-36769: --- Description: Hi, community. Currently, in the batch verification scenario of our algorithm data, we use pyflink and encounter low transmission efficiency caused by low performance of pickle4-based encoding. After research, we decided to adopt Apache fury, a serialization framework based on pickle5 encoding. The implementation of fury in python will define the transmission buffer size in the protocol for transmission to improve the performance of large data transmission. To this end, I Prepared a draft pull request. What do friends in the community think about this? Pickle protocol 5 with out-of-band data: https://peps.python.org/pep-0574/ was:Hi, community. Currently, in the batch verification scenario of our algorithm data, we use pyflink and encounter low transmission efficiency caused by low performance of pickle4-based encoding. After research, we decided to adopt Apache fury, a serialization framework based on pickle5 encoding. The implementation of fury in python will define the transmission buffer size in the protocol for transmission to improve the performance of large data transmission. To this end, I Prepared a draft pull request. What do friends in the community think about this? > Suport fury Serializer for pyflink > -- > > Key: FLINK-36769 > URL: https://issues.apache.org/jira/browse/FLINK-36769 > Project: Flink > Issue Type: New Feature > Components: API / Python >Affects Versions: 1.16.1 > Environment: flink 1.16.1 >Reporter: xingyuan cheng >Priority: Major > > Hi, community. Currently, in the batch verification scenario of our algorithm > data, we use pyflink and encounter low transmission efficiency caused by low > performance of pickle4-based encoding. After research, we decided to adopt > Apache fury, a serialization framework based on pickle5 encoding. The > implementation of fury in python will define the transmission buffer size in > the protocol for transmission to improve the performance of large data > transmission. To this end, I Prepared a draft pull request. What do friends > in the community think about this? > > Pickle protocol 5 with out-of-band data: https://peps.python.org/pep-0574/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36690][runtime] Fix schema operator hanging under extreme parallelized pressure [flink-cdc]
yuxiqian commented on PR #3680: URL: https://github.com/apache/flink-cdc/pull/3680#issuecomment-2490188267 @leonardBang Will this PR be reviewed soon? I'm planning to implement FLINK-36763 based on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-36765) How to Handle Multi-Type Maps in Avro Schema with Flink Table API?
[ https://issues.apache.org/jira/browse/FLINK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900049#comment-17900049 ] david radley commented on FLINK-36765: -- I was trying to find where the Avro specification says that multiple values are allowed for the map values, do you know where this is documented as part of the specification. > How to Handle Multi-Type Maps in Avro Schema with Flink Table API? > -- > > Key: FLINK-36765 > URL: https://issues.apache.org/jira/browse/FLINK-36765 > Project: Flink > Issue Type: Bug > Components: API / Type Serialization System, Formats (JSON, Avro, > Parquet, ORC, SequenceFile) >Reporter: Maneendra >Priority: Major > > > I have a Map with multiple data types in my Avro schema, which I am trying to > use in the Flink Table API to read data from Kafka. However, I’m encountering > the following exception because the Flink AvroSchemaConverter does not > support Maps with mixed data types. Could someone assist me in parsing this > schema using the Table API? > FLink Code: String avroSchema=""; > DataType s = AvroSchemaConverter.convertToDataType(avroSchema); > Schema schema1 = Schema.newBuilder().fromRowDataType(s).build(); > > TableDescriptor descriptor = TableDescriptor.forConnector("kafka") > .schema(schema) > .comment("simple comment") > .option("topic", "") > .option("properties.application.id", "") > .option("properties.security.protocol", "") > .option("properties.bootstrap.servers", "") > .option("properties.group.id", "") > .option("properties.auto.offset.reset", "earliest") > .option("format", "avro") > .build(); > Avro Schema: > { > "name":"standByProperties", > "type":[ > "null", > { > "type":"map", > "values":[ > "null", > "boolean", > "int" > ] > } > ] > }, > Output: standByProperties MAP NULL> Exception: Exception in thread "main" > java.lang.UnsupportedOperationException: Unsupported to derive Schema for > type: RAW('java.lang.Object', ?) NOT NULL at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:580) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:568) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:549) > at > org.apache.flink.formats.avro.typeutils.AvroSchemaConverter.convertToSchema(AvroSchemaConverter.java:416) > What I Tried: I defined an Avro schema that includes a Map field with values > of mixed data types. Used the Flink Table API to read data from Kafka and > attempted to use AvroSchemaConverter to map the schema to a Flink table. > During execution, I encountered an exception because the AvroSchemaConverter > does not support Maps with multiple value types. What I Was Expecting: I was > expecting Flink to handle the Map field and correctly parse the data into a > table format, with proper support for the mixed data types within the Map. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36769) Suport fury Serializer for pyflink
xingyuan cheng created FLINK-36769: -- Summary: Suport fury Serializer for pyflink Key: FLINK-36769 URL: https://issues.apache.org/jira/browse/FLINK-36769 Project: Flink Issue Type: New Feature Components: API / Python Affects Versions: 1.16.1 Reporter: xingyuan cheng Hi, community. Currently, in the batch verification scenario of our algorithm data, we use pyflink and encounter low transmission efficiency caused by low performance of pickle4-based encoding. After research, we decided to adopt Apache fury, a serialization framework based on pickle5 encoding. The implementation of fury in python will define the transmission buffer size in the protocol for transmission to improve the performance of large data transmission. To this end, I Prepared a draft pull request. What do friends in the community think about this? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36769) Suport fury Serializer for pyflink
[ https://issues.apache.org/jira/browse/FLINK-36769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xingyuan cheng updated FLINK-36769: --- External issue URL: https://github.com/apache/flink/pull/25672 > Suport fury Serializer for pyflink > -- > > Key: FLINK-36769 > URL: https://issues.apache.org/jira/browse/FLINK-36769 > Project: Flink > Issue Type: New Feature > Components: API / Python >Affects Versions: 1.16.1 >Reporter: xingyuan cheng >Priority: Major > > Hi, community. Currently, in the batch verification scenario of our algorithm > data, we use pyflink and encounter low transmission efficiency caused by low > performance of pickle4-based encoding. After research, we decided to adopt > Apache fury, a serialization framework based on pickle5 encoding. The > implementation of fury in python will define the transmission buffer size in > the protocol for transmission to improve the performance of large data > transmission. To this end, I Prepared a draft pull request. What do friends > in the community think about this? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [ISSUE#36769] support fury serializer for pyflink [flink]
kaori-seasons opened a new pull request, #25672: URL: https://github.com/apache/flink/pull/25672 ## What is the purpose of the change Hi, community. Currently, in the batch verification scenario of our algorithm data, we use pyflink and encounter low transmission efficiency caused by low performance of pickle4-based encoding. After research, we decided to adopt Apache fury, a serialization framework based on pickle5 encoding. The implementation of fury in python will define the transmission buffer size in the protocol for transmission to improve the performance of large data transmission. Related communications with fury community members can be found [here](https://github.com/apache/fury/issues/1919) ## Brief change log *(for example:)* - *The TaskInfo is stored in the blob store on job creation time as a persistent artifact* - *Deployments RPC transmits only the blob storage reference* - *TaskManagers retrieve the TaskInfo from the blob cache* ## Verifying this change Please make sure both new and modified tests in this PR follow [the conventions for tests defined in our code quality guide](https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing). *(Please pick either of the following options)* This change is a trivial rework / code cleanup without any test coverage. *(or)* This change is already covered by existing tests, such as *(please describe tests)*. *(or)* This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end deployment with large payloads (100MB)* - *Extended integration test for recovery after master (JobManager) failure* - *Added test that validates that TaskInfo is transferred only once across recoveries* - *Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.* ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / no) - The serializers: (yes / no / don't know) - The runtime per-record code paths (performance sensitive): (yes / no / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know) - The S3 file system connector: (yes / no / don't know) ## Documentation - Does this pull request introduce a new feature? (yes / no) - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36769) Suport fury Serializer for pyflink
[ https://issues.apache.org/jira/browse/FLINK-36769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xingyuan cheng updated FLINK-36769: --- Docs Text: https://github.com/apache/flink/pull/25672 External issue URL: (was: https://github.com/apache/flink/pull/25672) Environment: flink 1.16.1 > Suport fury Serializer for pyflink > -- > > Key: FLINK-36769 > URL: https://issues.apache.org/jira/browse/FLINK-36769 > Project: Flink > Issue Type: New Feature > Components: API / Python >Affects Versions: 1.16.1 > Environment: flink 1.16.1 >Reporter: xingyuan cheng >Priority: Major > > Hi, community. Currently, in the batch verification scenario of our algorithm > data, we use pyflink and encounter low transmission efficiency caused by low > performance of pickle4-based encoding. After research, we decided to adopt > Apache fury, a serialization framework based on pickle5 encoding. The > implementation of fury in python will define the transmission buffer size in > the protocol for transmission to improve the performance of large data > transmission. To this end, I Prepared a draft pull request. What do friends > in the community think about this? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-36686) allow customizing env variables for flink-webhook container
[ https://issues.apache.org/jira/browse/FLINK-36686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gyula Fora closed FLINK-36686. -- Fix Version/s: kubernetes-operator-1.11.0 Resolution: Fixed merged to main 14ded7e8b9f8aa12f54adeec9d777f8e2253d814 > allow customizing env variables for flink-webhook container > --- > > Key: FLINK-36686 > URL: https://issues.apache.org/jira/browse/FLINK-36686 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.9.0 >Reporter: Shuyi Chen >Assignee: Shuyi Chen >Priority: Minor > Labels: pull-request-available > Fix For: kubernetes-operator-1.11.0 > > > In current helm chart of the operator, there is no way to pass in custom env > variables into the flink-webhook container. I can see a few options to fix > the problem: > * make Values.operatorPod.env pod-level envs in flink-operator.yaml > * create a new webhoodPod.env in values.yaml and apply it only to the > flink-webhook container env setup > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36686) allow customizing env variables for flink-webhook container
[ https://issues.apache.org/jira/browse/FLINK-36686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gyula Fora updated FLINK-36686: --- Priority: Minor (was: Blocker) > allow customizing env variables for flink-webhook container > --- > > Key: FLINK-36686 > URL: https://issues.apache.org/jira/browse/FLINK-36686 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.9.0 >Reporter: Shuyi Chen >Assignee: Shuyi Chen >Priority: Minor > Labels: pull-request-available > > In current helm chart of the operator, there is no way to pass in custom env > variables into the flink-webhook container. I can see a few options to fix > the problem: > * make Values.operatorPod.env pod-level envs in flink-operator.yaml > * create a new webhoodPod.env in values.yaml and apply it only to the > flink-webhook container env setup > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-36685) Enable update/create operation on flinkdeployment resource in mutation webhook
[ https://issues.apache.org/jira/browse/FLINK-36685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gyula Fora closed FLINK-36685. -- Fix Version/s: kubernetes-operator-1.11.0 Resolution: Fixed merged to main 5d29554a179632028ccd182d48da5f825b52b976 > Enable update/create operation on flinkdeployment resource in mutation webhook > -- > > Key: FLINK-36685 > URL: https://issues.apache.org/jira/browse/FLINK-36685 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.9.0 >Reporter: Shuyi Chen >Assignee: Shuyi Chen >Priority: Minor > Labels: pull-request-available > Fix For: kubernetes-operator-1.11.0 > > > In mutation webhook yaml of the helm chart, UPDATE/CREATE operation is not > allowed on > flinkdeployments. We use mutation webhook to inject platform secrets to the > flink pipeline CRD. Planned to add a PR to enable UPDATE/CREATE operation on > flinkdeployments resource. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [FLINK-36469] Bump commons-io from 2.11.0 to 2.17.0 [flink-kubernetes-operator]
r-sidd opened a new pull request, #917: URL: https://github.com/apache/flink-kubernetes-operator/pull/917 ## What is the purpose of the change Bump commons-io from 2.11.0 to 2.17.0 ## Brief change log Bump cyclonedx-maven-plugin from 2.7.9 to 2.9.0 to remediate the findings in the dependant packages. **Vulnerabilities from dependencies:** [CVE-2024-38374](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2024-38374) **Package details:** https://mvnrepository.com/artifact/org.cyclonedx/cyclonedx-maven-plugin/2.9.0 ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): yes - The public API, i.e., is any changes to the `CustomResourceDescriptors`: no - Core observer or reconciler logic that is regularly executed: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-36770) Support Request Timeout for AWS sinks
[ https://issues.apache.org/jira/browse/FLINK-36770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900114#comment-17900114 ] Ahmed Hamdy commented on FLINK-36770: - [~hlteoh37] I would love your feedback here, also could you assign this to me? > Support Request Timeout for AWS sinks > - > > Key: FLINK-36770 > URL: https://issues.apache.org/jira/browse/FLINK-36770 > Project: Flink > Issue Type: Improvement > Components: Connectors / AWS >Reporter: Ahmed Hamdy >Priority: Major > Fix For: aws-connector-5.1.0 > > > ## Description > in > [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] > we introduced request timeout for Async Sink which was released in 1.20, > Ideally we want to support that in AWS connectors. > ## Acceptance Criteria > - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated > ## Notes > This is a breaking change which means we expect the next version (5.1) to not > be compatible with 1.19 anymore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36770) Support Request Timeout for AWS sinks
Ahmed Hamdy created FLINK-36770: --- Summary: Support Request Timeout for AWS sinks Key: FLINK-36770 URL: https://issues.apache.org/jira/browse/FLINK-36770 Project: Flink Issue Type: Improvement Components: Connectors / AWS Reporter: Ahmed Hamdy Fix For: aws-connector-5.1.0 ## Description in [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] we introduced request timeout for Async Sink which was released in 1.20, Ideally we want to support that in AWS connectors. ## Acceptance Criteria - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated ## Notes This is a breaking change which means we expect the next version (5.1) to not be compatible with 1.19 anymore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36770) Support Request Timeout for AWS sinks
[ https://issues.apache.org/jira/browse/FLINK-36770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hamdy updated FLINK-36770: Description: h2. Description in [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] we introduced request timeout for Async Sink which was released in 1.20, Ideally we want to support that in AWS connectors. h2. Acceptance Criteria - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated h2. Notes This is a breaking change regarding flink version compatibility which means we expect the next version (5.1) to not be compatible with 1.19 anymore. was: h2. Description in [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] we introduced request timeout for Async Sink which was released in 1.20, Ideally we want to support that in AWS connectors. h2. Acceptance Criteria - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated h2. Notes This is a breaking change which means we expect the next version (5.1) to not be compatible with 1.19 anymore. > Support Request Timeout for AWS sinks > - > > Key: FLINK-36770 > URL: https://issues.apache.org/jira/browse/FLINK-36770 > Project: Flink > Issue Type: Improvement > Components: Connectors / AWS >Reporter: Ahmed Hamdy >Priority: Major > Fix For: aws-connector-5.1.0 > > > h2. Description > in > [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] > we introduced request timeout for Async Sink which was released in 1.20, > Ideally we want to support that in AWS connectors. > h2. Acceptance Criteria > - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated > h2. Notes > This is a breaking change regarding flink version compatibility which means > we expect the next version (5.1) to not be compatible with 1.19 anymore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36770) Support Request Timeout for AWS sinks
[ https://issues.apache.org/jira/browse/FLINK-36770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hamdy updated FLINK-36770: Description: h2. Description in [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] we introduced request timeout for Async Sink which was released in 1.20, Ideally we want to support that in AWS connectors. h2. Acceptance Criteria - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated h2. Notes This is a breaking change which means we expect the next version (5.1) to not be compatible with 1.19 anymore. was: ## Description in [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] we introduced request timeout for Async Sink which was released in 1.20, Ideally we want to support that in AWS connectors. ## Acceptance Criteria - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated ## Notes This is a breaking change which means we expect the next version (5.1) to not be compatible with 1.19 anymore. > Support Request Timeout for AWS sinks > - > > Key: FLINK-36770 > URL: https://issues.apache.org/jira/browse/FLINK-36770 > Project: Flink > Issue Type: Improvement > Components: Connectors / AWS >Reporter: Ahmed Hamdy >Priority: Major > Fix For: aws-connector-5.1.0 > > > h2. Description > in > [FLIP-451|https://cwiki.apache.org/confluence/display/FLINK/FLIP-451%3A+Introduce+timeout+configuration+to+AsyncSink+API] > we introduced request timeout for Async Sink which was released in 1.20. > Ideally we want to support that in the AWS connectors. > h2. Acceptance Criteria > - Kinesis Sink, Firehose Sink, DDB Sink, SQS Sink updated > h2. Notes > This is a breaking change which means we expect the next version (5.1) to not > be compatible with 1.19 anymore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
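As a rough, hedged illustration of the request-timeout semantics FLIP-451 describes (this is not the actual AsyncSink or AWS connector API; the class, fields, and retry behaviour below are assumptions): a per-request deadline either retries or fails the in-flight batch once it expires.

{code:java}
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Conceptual sketch only: how a request timeout around an async batch write could behave. */
class TimedBatchWriter<T> {

    private final Duration requestTimeout; // analogous to the FLIP-451 request timeout
    private final boolean failOnTimeout;   // assumed knob: fail the sink or retry the batch

    TimedBatchWriter(Duration requestTimeout, boolean failOnTimeout) {
        this.requestTimeout = requestTimeout;
        this.failOnTimeout = failOnTimeout;
    }

    CompletableFuture<Void> write(List<T> batch, Function<List<T>, CompletableFuture<Void>> client) {
        return client.apply(batch)
                // Complete exceptionally with TimeoutException if the service does not answer in time.
                .orTimeout(requestTimeout.toMillis(), TimeUnit.MILLISECONDS)
                .exceptionallyCompose(
                        err -> {
                            Throwable cause =
                                    err instanceof CompletionException && err.getCause() != null
                                            ? err.getCause()
                                            : err;
                            if (cause instanceof TimeoutException && !failOnTimeout) {
                                // Retry the whole batch instead of failing the sink.
                                return write(batch, client);
                            }
                            return CompletableFuture.<Void>failedFuture(err);
                        });
    }
}
{code}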
[PR] [FLINK-36529] Allow flink version configs to be set to greater than given version [flink-kubernetes-operator]
tomncooper opened a new pull request, #918: URL: https://github.com/apache/flink-kubernetes-operator/pull/918 ## What is the purpose of the change The operator currently allows the following syntax for defining flink version specific defaults: `kubernetes.operator.default-configuration.flink-version.v1_18.key: value` The problem with this is that, in many cases, these defaults should be applied to newer Flink versions as well, forcing config duplications. This PR introduces a new "greater than" syntax for config defaults, indicating that they should be applied to a given version and above: `kubernetes.operator.default-configuration.flink-version.v1_18+.key: value` In this case key:value would be applied to all Flink version greater or equal to 1.18, unless overridden for specific versions. ## Brief change log - Adds a new method `getRelevantVersionPrefixes`, to the `org.apache.flink.kubernetes.operator.config.FlinkConfigManager`, which identifies all Flink version default config prefix which are relevant to the currently specified Flink version. - Refactored the `FlinkVersion` enum to specify major and minor semver integers to facilitate quick look up of relevant Flink versions when parsing version strings. ## Verifying this change This change added tests and can be verified as follows: - Added additional tests in the `FlinkConfigManagerTest` class to cover the new Regex and `getRelevantVersionPrefixes` methods. - Updated the `testVersionNamespaceDefaultConfs` test in `FlinkConfigManagerTest` to test the greater than version behaviour. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changes to the `CustomResourceDescriptors`: no - Core observer or reconciler logic that is regularly executed: yes ## Documentation - Does this pull request introduce a new feature? yes - If yes, how is the feature documented? Update the configuration docs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
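A minimal sketch of the proposed "greater than or equal" matching follows (the real logic lives in FlinkConfigManager's getRelevantVersionPrefixes; the regex and helper here are illustrative assumptions, not the PR's code).

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of the "v1_18+" matching; the real logic lives in FlinkConfigManager. */
class VersionPrefixMatcher {

    // Matches version tokens such as "v1_18" or "v1_18+" inside keys like
    // kubernetes.operator.default-configuration.flink-version.v1_18+.key
    private static final Pattern VERSION_TOKEN = Pattern.compile("v(\\d+)_(\\d+)(\\+)?");

    /** Should a default keyed with this version token apply to the deployed Flink version? */
    static boolean applies(String token, int deployedMajor, int deployedMinor) {
        Matcher m = VERSION_TOKEN.matcher(token);
        if (!m.matches()) {
            return false;
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        boolean greaterOrEqual = m.group(3) != null;
        if (greaterOrEqual) {
            return deployedMajor > major || (deployedMajor == major && deployedMinor >= minor);
        }
        return deployedMajor == major && deployedMinor == minor;
    }
}
{code}

Under these assumptions, applies("v1_18+", 1, 20) is true while applies("v1_18", 1, 20) is false, and exact-version keys can still override the "+" defaults, matching the semantics described in the PR.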
Re: [PR] [FLINK-36529] Allow flink version configs to be set to greater than given version [flink-kubernetes-operator]
tomncooper commented on PR #918: URL: https://github.com/apache/flink-kubernetes-operator/pull/918#issuecomment-2491802004 I probably need to look at adding an end to end test for this but want to make sure I am on the right track before doing that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36529] Allow flink version configs to be set to greater than given version [flink-kubernetes-operator]
tomncooper commented on PR #918: URL: https://github.com/apache/flink-kubernetes-operator/pull/918#issuecomment-2491803380 CC @gyfora -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36529) Support greater or equals logic for operator flink version default configs
[ https://issues.apache.org/jira/browse/FLINK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36529: --- Labels: pull-request-available (was: ) > Support greater or equals logic for operator flink version default configs > -- > > Key: FLINK-36529 > URL: https://issues.apache.org/jira/browse/FLINK-36529 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Labels: pull-request-available > > The operator currently allows the following syntax for defining flink version > specific defaults: > kubernetes.operator.default-configuration.flink-version.v1_18.key: value > The problem with this is that usually these defaults should be applied to > newer Flink versions as well in many cases, forcing config duplications. > We should introduce a + syntax for configs applied to a version and above: > kubernetes.operator.default-configuration.flink-version.v1_18+.key: value > in this case key:value would be applied to all Flink version greater or equal > to 1.18 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [hotfix] [docs]Fix miss semicolon on SELECT & WHERE clause example sql [flink]
camilesing opened a new pull request, #25673: URL: https://github.com/apache/flink/pull/25673 ## What is the purpose of the change *(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)* Fix missing semicolons in the SELECT & WHERE clause example SQL. ## Brief change log Add the missing semicolons to the example SQL statements. ## Verifying this change Since the changes only affect the doc examples, there are no tests; I copied the snippets and verified (compiled) them locally. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [hotfix] [docs]Fix miss semicolon on SELECT & WHERE clause example sql [flink]
flinkbot commented on PR #25673: URL: https://github.com/apache/flink/pull/25673#issuecomment-2492789712 ## CI report: * 4740ab5775e6045fbc62736e2a60f75ea4463812 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (FLINK-35859) [flink-cdc] Fix: The assigner is not ready to offer finished split information, this should not be called
[ https://issues.apache.org/jira/browse/FLINK-35859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900028#comment-17900028 ] Xin Gong edited comment on FLINK-35859 at 11/22/24 3:19 AM: [~loserwang1024] This fix just ignore some newly added table. It can cause some UT timeout fail. For example, mongo NewlyAddedTableITCase#testRemoveAndAddCollectionsOneByOne, pg NewlyAddedTableITCase#testRemoveAndAddTablesOneByOne. Users cannot immediately perceive task issues in production applications. Maybe we can fix it to more perfect. I add a flag to trigger restart when status is NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED. newly table will be synchronized. {code:java} // code placeholder /** Assigner for snapshot split. */ public class SnapshotSplitAssigner implements SplitAssigner { private static final Logger LOG = LoggerFactory.getLogger(SnapshotSplitAssigner.class); private boolean flagExceptionAssignerStatusWhenCheckpoint; private void captureNewlyAddedTables() { if (sourceConfig.isScanNewlyAddedTableEnabled() && !AssignerStatus.isNewlyAddedAssigningSnapshotFinished(assignerStatus)) { .. } else if (AssignerStatus.isNewlyAddedAssigningSnapshotFinished(assignerStatus)) { flagExceptionAssignerStatusWhenCheckpoint = true; LOG.info("exceptionAssignerStatusCheckpointFlag to true"); } } @Override public void notifyCheckpointComplete(long checkpointId) { if (AssignerStatus.isNewlyAddedAssigningFinished(assignerStatus) && flagExceptionAssignerStatusWhenCheckpoint) { throw new FlinkRuntimeException("Previous assigner status is NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED and " + "newly add table will cause task always be exception from checkpoint, so we " + "trigger restart for newly table after assigner to normal status"); } } }{code} was (Author: JIRAUSER292212): [~loserwang1024] Users cannot immediately perceive task issues. Maybe we can fix it to more perfect. I add a flag to trigger restart when status is NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED. newly table will be synchronized. {code:java} // code placeholder /** Assigner for snapshot split. */ public class SnapshotSplitAssigner implements SplitAssigner { private static final Logger LOG = LoggerFactory.getLogger(SnapshotSplitAssigner.class); private boolean flagExceptionAssignerStatusWhenCheckpoint; private void captureNewlyAddedTables() { if (sourceConfig.isScanNewlyAddedTableEnabled() && !AssignerStatus.isNewlyAddedAssigningSnapshotFinished(assignerStatus)) { .. 
} else if (AssignerStatus.isNewlyAddedAssigningSnapshotFinished(assignerStatus)) { flagExceptionAssignerStatusWhenCheckpoint = true; LOG.info("exceptionAssignerStatusCheckpointFlag to true"); } } @Override public void notifyCheckpointComplete(long checkpointId) { if (AssignerStatus.isNewlyAddedAssigningFinished(assignerStatus) && flagExceptionAssignerStatusWhenCheckpoint) { throw new FlinkRuntimeException("Previous assigner status is NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED and " + "newly add table will cause task always be exception from checkpoint, so we " + "trigger restart for newly table after assigner to normal status"); } } } {code} > [flink-cdc] Fix: The assigner is not ready to offer finished split > information, this should not be called > - > > Key: FLINK-35859 > URL: https://issues.apache.org/jira/browse/FLINK-35859 > Project: Flink > Issue Type: Bug > Components: Flink CDC >Affects Versions: cdc-3.1.1 >Reporter: Hongshun Wang >Assignee: Hongshun Wang >Priority: Minor > Fix For: cdc-3.2.0 > > > When use CDC with newly added table, an error occurs: > {code:java} > The assigner is not ready to offer finished split information, this should > not be called. {code} > It's because: > 1. when stop then restart the job , the status is > NEWLY_ADDED_ASSIGNING_SNAPSHOT_FINISHED. > > 2. Then Enumerator will send each reader with > BinlogSplitUpdateRequestEvent to update binlog. (see > org.apache.flink.cdc.connectors.mysql.source.enumerator.MySqlSourceEnumerator#syncWithReaders). > 3. The Reader will suspend binlog reader then send > BinlogSplitMetaRequestEvent to Enumerator. > 4. The Enumerator found that some tables are not sent, an error will occur > {code:java} > private void sendBinlogMeta(int subTask, BinlogSplitMetaRequestEvent > requestEvent) { > // initialize once >
[jira] [Commented] (FLINK-36768) Empty Checkpoint Directory created when older checkpoint fails due to timeout and the checkpoint interval is same
[ https://issues.apache.org/jira/browse/FLINK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900228#comment-17900228 ] Rui Xia commented on FLINK-36768: - Hi Thomas, thank you for reporting this. From your description, I am not sure which chk- directory is not properly deleted. Could you provide a more concrete example of your situation? For example: chk-1 triggers -> chk-1 times out -> at the same time chk-2 triggers, and the chk-xxx checkpoint directory is not cleaned up. BTW, 1.12.1 is an old Flink version (2 years ago). You may want to try the latest version and check whether the same problem still exists. > Empty Checkpoint Directory created when older checkpoint fails due to timeout > and the checkpoint interval is same > -- > > Key: FLINK-36768 > URL: https://issues.apache.org/jira/browse/FLINK-36768 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.1 >Reporter: Eaugene Thomas >Priority: Minor > > When the checkpoint interval equals the checkpoint timeout, and the failure > threshold is reached, the asynchronous checkpoint trigger starts. However, > due to an internal restart, the tombstone directory ({{{}chk-{}}}) is not > cleared. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-36773) Introduce new Group Agg Operator with Async State API
[ https://issues.apache.org/jira/browse/FLINK-36773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lincoln lee reassigned FLINK-36773: --- Assignee: Yang Xu > Introduce new Group Agg Operator with Async State API > - > > Key: FLINK-36773 > URL: https://issues.apache.org/jira/browse/FLINK-36773 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Runtime >Reporter: xuyang >Assignee: Yang Xu >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36772][mysql][cdc-base] Fix error placeholder for errorMessageTemplate of Preconditions [flink-cdc]
gong commented on PR #3754: URL: https://github.com/apache/flink-cdc/pull/3754#issuecomment-2492953554 > LGTM, could you also check all Preconditions.checkState method call? @leonardBang I checked all Preconditions.checkState method calls. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (FLINK-36773) Introduce new Group Agg Operator with Async State API
[ https://issues.apache.org/jira/browse/FLINK-36773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lincoln lee reassigned FLINK-36773: --- Assignee: xuyang (was: Yang Xu) > Introduce new Group Agg Operator with Async State API > - > > Key: FLINK-36773 > URL: https://issues.apache.org/jira/browse/FLINK-36773 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Runtime >Reporter: xuyang >Assignee: xuyang >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36774) SQL gateway should support executing multiple SQL statements once
tim yu created FLINK-36774: -- Summary: SQL gateway should support executing multiple SQL statements once Key: FLINK-36774 URL: https://issues.apache.org/jira/browse/FLINK-36774 Project: Flink Issue Type: New Feature Components: Table SQL / Gateway Reporter: tim yu I have a Flink SQL file example.sql that contains the following SQL statements: {code:java} set 'state.checkpoints.dir' = 'hdfs:///tmp/flink-checkpoints/datagen_blackhole'; set 'execution.checkpointing.interval' = '3min'; set 'execution.checkpointing.timeout' = '20min'; set 'parallelism.default' = '1'; CREATE TABLE t1 (id BIGINT, info STRING) with ('connector' = 'datagen', 'rows-per-second' = '1', 'fields.id.min' = '1', 'fields.id.max' = '10'); CREATE TABLE black_hole_1 WITH ('connector' = 'blackhole') LIKE t1 (EXCLUDING ALL); insert into black_hole_1 select * from t1;{code} I hope to submit this file to the SQL gateway using the following command: {code:java} curl -d '@example.sql' -H 'Content-Type: application/json' http://gateway:8083/v1/sessions/{session_id}/statements {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
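Until such script submission is supported natively, one hedged client-side workaround is to split the file on statement-terminating semicolons and submit each statement to the existing per-statement endpoint quoted above. The JSON field name ("statement") and the naive splitting are assumptions; check the SQL Gateway REST API docs before relying on them.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hedged sketch: split example.sql on semicolons and submit each statement separately. */
public class SubmitSqlScript {

    public static void main(String[] args) throws Exception {
        String sessionHandle = args[0]; // an already opened SQL Gateway session
        String statementsUrl = "http://gateway:8083/v1/sessions/" + sessionHandle + "/statements";
        String script = Files.readString(Path.of("example.sql"));

        HttpClient client = HttpClient.newHttpClient();
        // Naive split: assumes no semicolons inside string literals or comments.
        for (String statement : script.split(";")) {
            if (statement.isBlank()) {
                continue;
            }
            String sql =
                    statement.trim().replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", " ");
            String body = "{\"statement\": \"" + sql + "\"}"; // field name is an assumption
            HttpRequest request =
                    HttpRequest.newBuilder(URI.create(statementsUrl))
                            .header("Content-Type", "application/json")
                            .POST(HttpRequest.BodyPublishers.ofString(body))
                            .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }
}
{code}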
[PR] Add blogpost for new KinesisStreamsSource and DynamoDbStreamsSource [flink-web]
hlteoh37 opened a new pull request, #765: URL: https://github.com/apache/flink-web/pull/765 Add a blogpost illustrating in detail the KDS and DDB streams sources. Image of blogpost included:  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-36535) Optimize the scale down logic based on historical parallelism
[ https://issues.apache.org/jira/browse/FLINK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900226#comment-17900226 ] yuanfenghu commented on FLINK-36535: nice improvement! > Optimize the scale down logic based on historical parallelism > - > > Key: FLINK-36535 > URL: https://issues.apache.org/jira/browse/FLINK-36535 > Project: Flink > Issue Type: Improvement > Components: Autoscaler >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > > This is a follow-up to FLINK-36018 . FLINK-36018 supported the lazy scale > down to avoid frequent rescaling. > h1. Proposed Change > Treat scale-down.interval as a window: > * Recording the scale down trigger time when the recommended parallelism < > current parallelism > ** When the recommended parallelism >= current parallelism, cancel the > triggered scale down > * The scale down will be executed when currentTime - triggerTime > > scale-down.interval > ** {color:#de350b}Change1{color}: Using the maximum parallelism within the > window instead of the latest parallelism when scaling down. > * {color:#de350b}Change2{color}: Never scale down when currentTime - > triggerTime < scale-down.interval > * > ** In the FLINK-36018, the scale down may be executed when currentTime - > triggerTime < scale-down.interval. > ** For example: the taskA may scale down when taskB needs to scale up. > h1. Background > Some critical Flink jobs need to scale up in time, but only scale down on a > daily basis. In other words, Flink users do not want Flink jobs to be scaled > down multiple times within 24 hours, and jobs run at the same parallelism as > during the peak hours of each day. > Note: Users hope to scale down only happens when the parallelism during peak > hours still wastes resources. This is a trade-off between downtime and > resource waste for a critical job. > h1. Current solution > In general, this requirement could be met after setting{color:#de350b} > job.autoscaler.scale-down.interval= 24 hour{color}. When taskA runs with 100 > parallelism, and recommended parallelism is 100 during the peak hours of each > day. We hope taskA doesn't rescale forever, because the triggered scale down > will be canceled once the recommended parallelism >= current parallelism > within 24 hours (It‘s exactly what FLINK-36018 does). > h1. Unexpected Scenario & how to solve? > But I found the critical production job is still rescaled about 10 times > every day (when scale-down.interval is set to 24 hours). > Root cause: There may be many sources in a job, and the traffic peaks of > these sources may occur at different times. When taskA triggers scale down, > the scale down of taskA will not be actively executed within 24 hours, but it > may be executed when other tasks are scaled up. > For example: > * The scale down of sourceB and sourceC may be executed when SourceA scales > up. > * After a while, the scale down of sourceA and sourceC may be executed when > SourceB scales up. > * After a while, the scale down of sourceA and sourceB may be executed when > SourceC scales up. > * When there are many tasks, the above 3 steps will be executed repeatedly. > That's why the job is rescaled about 10 times every day, the > {color:#de350b}change2{color} of proposed change could solve this issue: > Never scale down when currentTime - triggerTime < scale-down.interval. > > {color:#de350b}Change1{color}: Using the maximum parallelism within the > window instead of the latest parallelism when scaling down. 
> * It can ensure that the parallelism after scaling down is the parallelism > at yesterday's peak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1853230011 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/assigner/SnapshotSplitAssigner.java: ## @@ -397,6 +491,27 @@ && allSnapshotSplitsFinished()) { } LOG.info("Snapshot split assigner is turn into finished status."); } + +if (splitFinishedCheckpointIds != null && !splitFinishedCheckpointIds.isEmpty()) { +Iterator> iterator = +splitFinishedCheckpointIds.entrySet().iterator(); +while (iterator.hasNext()) { +Map.Entry splitFinishedCheckpointId = iterator.next(); +String splitId = splitFinishedCheckpointId.getKey(); +Long splitCheckpointId = splitFinishedCheckpointId.getValue(); +if (splitCheckpointId != UNDEFINED_CHECKPOINT_ID +&& checkpointId >= splitCheckpointId) { +// record table-level splits metrics +TableId tableId = SnapshotSplit.parseTableId(splitId); + enumeratorMetrics.getTableMetrics(tableId).addFinishedSplit(splitId); +iterator.remove(); +} +} +LOG.info( Review Comment: Adjustment completed ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/assigner/SnapshotSplitAssigner.java: ## @@ -359,11 +432,31 @@ public void addSplits(Collection splits) { // because they are failed assignedSplits.remove(split.splitId()); splitFinishedOffsets.remove(split.splitId()); + +enumeratorMetrics +.getTableMetrics(split.asSnapshotSplit().getTableId()) +.reprocessSplit(split.splitId()); +TableId tableId = split.asSnapshotSplit().getTableId(); + + enumeratorMetrics.getTableMetrics(tableId).removeFinishedSplit(split.splitId()); } } @Override public SnapshotPendingSplitsState snapshotState(long checkpointId) { +if (splitFinishedCheckpointIds != null && !splitFinishedCheckpointIds.isEmpty()) { +for (Map.Entry splitFinishedCheckpointId : +splitFinishedCheckpointIds.entrySet()) { +if (splitFinishedCheckpointId.getValue() == UNDEFINED_CHECKPOINT_ID) { +splitFinishedCheckpointId.setValue(checkpointId); +} +} +} +LOG.info( +"SnapshotSplitAssigner snapshotState on checkpoint {} with splitFinishedCheckpointIds size {}.", +checkpointId, +splitFinishedCheckpointIds == null ? 0 : splitFinishedCheckpointIds.size()); Review Comment: Adjustment completed ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/test/java/org/apache/flink/cdc/connectors/base/source/assigner/state/PendingSplitsStateSerializerTest.java: ## @@ -55,14 +55,14 @@ public void testPendingSplitsStateSerializerAndDeserialize() throws IOException new PendingSplitsStateSerializer(constructSourceSplitSerializer()); PendingSplitsState streamSplitsStateAfter = pendingSplitsStateSerializer.deserializePendingSplitsState( -6, pendingSplitsStateSerializer.serialize(streamPendingSplitsStateBefore)); +7, pendingSplitsStateSerializer.serialize(streamPendingSplitsStateBefore)); Assert.assertEquals(streamPendingSplitsStateBefore, streamSplitsStateAfter); SnapshotPendingSplitsState snapshotPendingSplitsStateBefore = constructSnapshotPendingSplitsState(AssignerStatus.NEWLY_ADDED_ASSIGNING); PendingSplitsState snapshotPendingSplitsStateAfter = pendingSplitsStateSerializer.deserializePendingSplitsState( -6, Review Comment: Adjustment completed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org