[GitHub] [flink] ljz2051 opened a new pull request, #23407: [FLINK-32953][docs]Add notes about changing state ttl value

2023-09-13 Thread via GitHub


ljz2051 opened a new pull request, #23407:
URL: https://github.com/apache/flink/pull/23407

   ## What is the purpose of the change
   
   This pull request add notes about changing state TTL value when restore 
checkpoint.
   
   ## Brief change log
   
   - Add a note in "state.md" document.
   
   ## Verifying this change
   
   This change is a trivial work about document without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers: no
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
 - The S3 file system connector: no
   
   ## Documentation
 - Does this pull request introduce a new feature? no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-32953) [State TTL]resolve data correctness problem after ttl was changed

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32953:
---
Labels: pull-request-available  (was: )

> [State TTL]resolve data correctness problem after ttl was changed 
> --
>
> Key: FLINK-32953
> URL: https://issues.apache.org/jira/browse/FLINK-32953
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Reporter: Jinzhong Li
>Assignee: Jinzhong Li
>Priority: Major
>  Labels: pull-request-available
>
> Because expired data is cleaned up in background on a best effort basis 
> (hashmap use INCREMENTAL_CLEANUP strategy, rocksdb use 
> ROCKSDB_COMPACTION_FILTER strategy), some expired state is often persisted 
> into snapshots.
>  
> In some scenarios, user changes the state ttl of the job and then restore job 
> from the old state. If the user adjust the state ttl from a short value to a 
> long value (eg, from 12 hours to 24 hours),  some expired data that was not 
> cleaned up will be alive after restore. Obviously this is unreasonable, and 
> may break data regulatory requirements. 
>  
> Particularly, rocksdb stateBackend may cause data correctness problems due to 
> level compaction in this case.(eg. One key has two versions at level-1 and 
> level-2,both of which are ttl expired. Then level-1 version is cleaned up by 
> compaction,  and level-2 version isn't.  If we adjust state ttl and restart 
> job, the incorrect data of level-2 will become valid after restore)
>  
> To solve this problem, I think we can
> 1) persist old state ttl into snapshot meta info; (eg. 
> RegisteredKeyValueStateBackendMetaInfo or others)
> 2) During state restore, check the size between the current ttl and old ttl;
> 3) If current ttl is longer than old ttl, we need to iterate over all data, 
> filter out expired data with old ttl, and wirte valid data into stateBackend.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33080) The checkpoint storage configured in the job level by config option will not take effect

2023-09-13 Thread Junrui Li (Jira)
Junrui Li created FLINK-33080:
-

 Summary: The checkpoint storage configured in the job level by 
config option will not take effect
 Key: FLINK-33080
 URL: https://issues.apache.org/jira/browse/FLINK-33080
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Junrui Li
 Fix For: 1.19.0


When we configure the checkpoint storage at the job level, it can only be done 
through the following method:
{code:java}
StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
env.getCheckpointConfig().setCheckpointStorage(xxx); {code}
However, configure the checkpoint storage by the job-side configuration like 
the following will not take effect:
{code:java}
Configuration configuration = new Configuration();
StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment(configuration);
configuration.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
{code}
This behavior is unexpected, we should allow this way will take effect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-33080) The checkpoint storage configured in the job level by config option will not take effect

2023-09-13 Thread Lijie Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijie Wang reassigned FLINK-33080:
--

Assignee: Junrui Li

> The checkpoint storage configured in the job level by config option will not 
> take effect
> 
>
> Key: FLINK-33080
> URL: https://issues.apache.org/jira/browse/FLINK-33080
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
> Fix For: 1.19.0
>
>
> When we configure the checkpoint storage at the job level, it can only be 
> done through the following method:
> {code:java}
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment();
> env.getCheckpointConfig().setCheckpointStorage(xxx); {code}
> However, configure the checkpoint storage by the job-side configuration like 
> the following will not take effect:
> {code:java}
> Configuration configuration = new Configuration();
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);
> configuration.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
> configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
> {code}
> This behavior is unexpected, we should allow this way will take effect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] flinkbot commented on pull request #23407: [FLINK-32953][docs]Add notes about changing state ttl value

2023-09-13 Thread via GitHub


flinkbot commented on PR #23407:
URL: https://github.com/apache/flink/pull/23407#issuecomment-1717067773

   
   ## CI report:
   
   * f59f7f7004c10ba16d2cdbb1dec363964414880c UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-32953) [State TTL]resolve data correctness problem after ttl was changed

2023-09-13 Thread Jinzhong Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764518#comment-17764518
 ] 

Jinzhong Li commented on FLINK-32953:
-

[~masteryhx]  I publish a pr which add some notes about this issue in document. 
Could you please help review the pr?

> [State TTL]resolve data correctness problem after ttl was changed 
> --
>
> Key: FLINK-32953
> URL: https://issues.apache.org/jira/browse/FLINK-32953
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Reporter: Jinzhong Li
>Assignee: Jinzhong Li
>Priority: Major
>  Labels: pull-request-available
>
> Because expired data is cleaned up in background on a best effort basis 
> (hashmap use INCREMENTAL_CLEANUP strategy, rocksdb use 
> ROCKSDB_COMPACTION_FILTER strategy), some expired state is often persisted 
> into snapshots.
>  
> In some scenarios, user changes the state ttl of the job and then restore job 
> from the old state. If the user adjust the state ttl from a short value to a 
> long value (eg, from 12 hours to 24 hours),  some expired data that was not 
> cleaned up will be alive after restore. Obviously this is unreasonable, and 
> may break data regulatory requirements. 
>  
> Particularly, rocksdb stateBackend may cause data correctness problems due to 
> level compaction in this case.(eg. One key has two versions at level-1 and 
> level-2,both of which are ttl expired. Then level-1 version is cleaned up by 
> compaction,  and level-2 version isn't.  If we adjust state ttl and restart 
> job, the incorrect data of level-2 will become valid after restore)
>  
> To solve this problem, I think we can
> 1) persist old state ttl into snapshot meta info; (eg. 
> RegisteredKeyValueStateBackendMetaInfo or others)
> 2) During state restore, check the size between the current ttl and old ttl;
> 3) If current ttl is longer than old ttl, we need to iterate over all data, 
> filter out expired data with old ttl, and wirte valid data into stateBackend.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-30400) Stop bundling connector-base in externalized connectors

2023-09-13 Thread Leonard Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Xu reassigned FLINK-30400:
--

Assignee: Hang Ruan

> Stop bundling connector-base in externalized connectors
> ---
>
> Key: FLINK-30400
> URL: https://issues.apache.org/jira/browse/FLINK-30400
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Common
>Reporter: Chesnay Schepler
>Assignee: Hang Ruan
>Priority: Major
>
> Check that none of the externalized connectors bundle connector-base; if so 
> remove the bundling and schedule a new minor release.
> Bundling this module is highly problematic w.r.t. binary compatibility, since 
> bundled classes may rely on internal APIs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33080) The checkpoint storage configured in the job level by config option will not take effect

2023-09-13 Thread Junrui Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junrui Li updated FLINK-33080:
--
Description: 
When we configure the checkpoint storage at the job level, it can only be done 
through the following method:
{code:java}
StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
env.getCheckpointConfig().setCheckpointStorage(xxx); {code}
or configure filesystem storage by config option 
CheckpointingOptions.CHECKPOINTS_DIRECTORY through the following method:
{code:java}
Configuration configuration = new Configuration();
configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment(configuration);{code}
However, configure the other type checkpoint storage by the job-side 
configuration like the following will not take effect:
{code:java}
Configuration configuration = new Configuration();
StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment(configuration);
configuration.set(CheckpointingOptions.CHECKPOINT_STORAGE, 
"aaa.bbb.ccc.CustomCheckpointStorage");
configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
{code}
This behavior is unexpected, we should allow this way will take effect.

  was:
When we configure the checkpoint storage at the job level, it can only be done 
through the following method:
{code:java}
StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
env.getCheckpointConfig().setCheckpointStorage(xxx); {code}
However, configure the checkpoint storage by the job-side configuration like 
the following will not take effect:
{code:java}
Configuration configuration = new Configuration();
StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment(configuration);
configuration.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
{code}
This behavior is unexpected, we should allow this way will take effect.


> The checkpoint storage configured in the job level by config option will not 
> take effect
> 
>
> Key: FLINK-33080
> URL: https://issues.apache.org/jira/browse/FLINK-33080
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
> Fix For: 1.19.0
>
>
> When we configure the checkpoint storage at the job level, it can only be 
> done through the following method:
> {code:java}
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment();
> env.getCheckpointConfig().setCheckpointStorage(xxx); {code}
> or configure filesystem storage by config option 
> CheckpointingOptions.CHECKPOINTS_DIRECTORY through the following method:
> {code:java}
> Configuration configuration = new Configuration();
> configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);{code}
> However, configure the other type checkpoint storage by the job-side 
> configuration like the following will not take effect:
> {code:java}
> Configuration configuration = new Configuration();
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);
> configuration.set(CheckpointingOptions.CHECKPOINT_STORAGE, 
> "aaa.bbb.ccc.CustomCheckpointStorage");
> configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
> {code}
> This behavior is unexpected, we should allow this way will take effect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-ml] lindong28 commented on a diff in pull request #248: [FLINK-32704] Supports spilling to disk when feedback channel memory buffer is full

2023-09-13 Thread via GitHub


lindong28 commented on code in PR #248:
URL: https://github.com/apache/flink-ml/pull/248#discussion_r1299531846


##
flink-ml-iteration/flink-ml-iteration-common/src/main/java/org/apache/flink/iteration/operator/feedback/SpillableFeedbackQueue.java:
##
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.iteration.operator.feedback;
+
+import org.apache.flink.annotation.Internal;
+import org.apache.flink.api.common.typeutils.TypeSerializer;
+import org.apache.flink.core.memory.DataInputView;
+import org.apache.flink.core.memory.DataOutputSerializer;
+import org.apache.flink.core.memory.MemorySegment;
+import org.apache.flink.runtime.io.disk.InputViewIterator;
+import org.apache.flink.runtime.io.disk.SpillingBuffer;
+import org.apache.flink.runtime.io.disk.iomanager.IOManager;
+import org.apache.flink.runtime.memory.MemoryAllocationException;
+import org.apache.flink.runtime.memory.MemoryManager;
+import org.apache.flink.table.runtime.operators.sort.ListMemorySegmentPool;
+import org.apache.flink.util.MutableObjectIterator;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Objects;
+
+/**
+ * * A queue that can spill the items to disks automatically when the memory 
buffer is full.
+ *
+ * @param  The element type.
+ */
+@Internal
+final class SpillableFeedbackQueue {
+private final DataOutputSerializer output = new DataOutputSerializer(256);
+private final TypeSerializer serializer;
+private final IOManager ioManager;
+private final MemoryManager memoryManager;
+private final int numPages;
+
+private List segments;
+private ListMemorySegmentPool segmentPool;
+
+private SpillingBuffer target;
+private long size = 0L;
+
+SpillableFeedbackQueue(
+IOManager ioManager,
+MemoryManager memoryManager,
+TypeSerializer serializer,
+long inMemoryBufferSize,
+long pageSize)
+throws MemoryAllocationException {
+this.serializer = Objects.requireNonNull(serializer);
+this.ioManager = Objects.requireNonNull(ioManager);
+this.memoryManager = Objects.requireNonNull(memoryManager);
+
+this.numPages = (int) (inMemoryBufferSize / pageSize);
+resetSpillingBuffer();
+}
+
+void add(T item) {
+try {
+output.clear();
+serializer.serialize(item, output);
+target.write(output.getSharedBuffer(), 0, output.length());
+size++;
+} catch (IOException e) {
+throw new IllegalStateException(e);
+}
+}
+
+MutableObjectIterator iterate() {
+try {
+DataInputView input = target.flip();
+return new InputViewIterator<>(input, this.serializer);
+} catch (IOException e) {
+throw new RuntimeException(e);
+}
+}
+
+long size() {
+return size;
+}
+
+void reset() throws Exception {
+size = 0;
+close();
+resetSpillingBuffer();

Review Comment:
   Just to summarize our offline discussions:
   - Currently MemoryManager#allocatePages will allocate memory from JVM rather 
than re-using allocated memory from a pool. We should avoid this repeated 
memory allocation.
   - AbstractPagedOutputView, the parent class of SpillingBuffer, provides 
clear() and advance() to support writing to a buffer after it has been read. It 
might be possible to let `SpillingBuffer` implement these two APIs so that we 
can re-use it for writing after it is read.
   - NormalizedKeySorter#reset supports re-using a buffer after it is read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan Mei reassigned FLINK-33052:


Assignee: Zakelly Lan  (was: Yuan Mei)

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Major
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] JunRuiLee opened a new pull request, #23408: [FLINK-33080][runtime] Make the configured checkpoint storage in the job-side configuration will take effect.

2023-09-13 Thread via GitHub


JunRuiLee opened a new pull request, #23408:
URL: https://github.com/apache/flink/pull/23408

   
   
   
   
   ## What is the purpose of the change
   
   Make the configured checkpoint storage in the job-side configuration will 
take effect.
   
   
   ## Brief change log
   
   Load checkpoint storage from configuration when configure checkpointConfig
   
   
   ## Verifying this change
   
   
   This change added tests and can be verified as follows:
   StreamContextEnvironmentTest#testDisallowCheckpointStorageByConfiguration
   StreamExecutionEnvironmentTest#testConfigureCheckpointStorage
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't 
know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / **no**)
 - If yes, how is the feature documented? (not applicable / docs / JavaDocs 
/ **not documented**)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-33080) The checkpoint storage configured in the job level by config option will not take effect

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-33080:
---
Labels: pull-request-available  (was: )

> The checkpoint storage configured in the job level by config option will not 
> take effect
> 
>
> Key: FLINK-33080
> URL: https://issues.apache.org/jira/browse/FLINK-33080
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> When we configure the checkpoint storage at the job level, it can only be 
> done through the following method:
> {code:java}
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment();
> env.getCheckpointConfig().setCheckpointStorage(xxx); {code}
> or configure filesystem storage by config option 
> CheckpointingOptions.CHECKPOINTS_DIRECTORY through the following method:
> {code:java}
> Configuration configuration = new Configuration();
> configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);{code}
> However, configure the other type checkpoint storage by the job-side 
> configuration like the following will not take effect:
> {code:java}
> Configuration configuration = new Configuration();
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);
> configuration.set(CheckpointingOptions.CHECKPOINT_STORAGE, 
> "aaa.bbb.ccc.CustomCheckpointStorage");
> configuration.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, checkpointDir); 
> {code}
> This behavior is unexpected, we should allow this way will take effect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] flinkbot commented on pull request #23408: [FLINK-33080][runtime] Make the configured checkpoint storage in the job-side configuration will take effect.

2023-09-13 Thread via GitHub


flinkbot commented on PR #23408:
URL: https://github.com/apache/flink/pull/23408#issuecomment-1717151891

   
   ## CI report:
   
   * 9960939d8f00229a2cd201c2a9a87db4e90a5c10 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] gaborgsomogyi commented on pull request #23359: [FLINK-33029][python] Drop python 3.7 support

2023-09-13 Thread via GitHub


gaborgsomogyi commented on PR #23359:
URL: https://github.com/apache/flink/pull/23359#issuecomment-1717171777

   Asked in the dev discussion thread whether somebody has addition, nothing 
arrived so I would say yes. I'm intended to merge it soon...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (FLINK-32863) Improve Flink UI's time precision from second level to millisecond level

2023-09-13 Thread Yangze Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangze Guo reassigned FLINK-32863:
--

Assignee: Jufang He

> Improve Flink UI's time precision from second level to millisecond level
> 
>
> Key: FLINK-32863
> URL: https://issues.apache.org/jira/browse/FLINK-32863
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.17.1
>Reporter: Runkang He
>Assignee: Jufang He
>Priority: Major
>  Labels: pull-request-available
>
> This an UI improvement for OLAP jobs.
> OLAP queries are generally small queries which will finish at the seconds or 
> milliseconds, but currently the time precision displayed is second level and 
> not enough for OLAP queries. Millisecond part of time is very important for 
> users and developers, to see accurate time, for performance measurement and 
> optimization. The displayed time includes job duration, task duration, task 
> start time, end time and so on.
> It would be nice to improve this for better OLAP user experience.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] KarmaGYZ commented on a diff in pull request #23403: [FLINK-32863][runtime-web] Improve Flink UI's time precision from sec…

2023-09-13 Thread via GitHub


KarmaGYZ commented on code in PR #23403:
URL: https://github.com/apache/flink/pull/23403#discussion_r1324171749


##
flink-runtime-web/web-dashboard/src/app/components/humanize-duration.pipe.ts:
##
@@ -46,23 +46,23 @@ export class HumanizeDurationPipe implements PipeTransform {
 if (seconds === 0) {
   return `${ms}ms`;
 } else {
-  return `${seconds}s`;
+  return `${seconds}s ${ms}ms`;
 }
   } else {
-return `${minutes}m ${seconds}s`;
+return `${minutes}m ${seconds}s ${ms}ms`;
   }
 } else {
   if (short) {
 return `${hours}h ${minutes}m`;
   } else {
-return `${hours}h ${minutes}m ${seconds}s`;
+return `${hours}h ${minutes}m ${seconds}s ${ms}ms`;
   }
 }
   } else {
 if (short) {
   return `${days}d ${hours}h`;
 } else {
-  return `${days}d ${hours}h ${minutes}m ${seconds}s`;
+  return `${days}d ${hours}h ${minutes}m ${seconds}s ${ms}ms`;

Review Comment:
   nit: I think the explicit ms might be useless for jobs longer than 1h.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (FLINK-31387) StreamTaskCancellationTest.testCancelTaskShouldPreventAdditionalProcessingTimeTimersFromBeingFired failed with an assertion

2023-09-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-31387.
--

> StreamTaskCancellationTest.testCancelTaskShouldPreventAdditionalProcessingTimeTimersFromBeingFired
>  failed with an assertion
> ---
>
> Key: FLINK-31387
> URL: https://issues.apache.org/jira/browse/FLINK-31387
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: David Morávek
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.18.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46994&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=9253
> {code}
> Mar 09 14:04:42 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.StreamTaskCancellationTest.testCancelTaskShouldPreventAdditionalProcessingTimeTimersFromBeingFired
>   Time elapsed: 0.018 s  <<< FAILURE!
> Mar 09 14:04:42 java.lang.AssertionError: 
> Mar 09 14:04:42 
> Mar 09 14:04:42 Expecting AtomicInteger(0) to have value:
> Mar 09 14:04:42   10
> Mar 09 14:04:42 but did not.
> Mar 09 14:04:42   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskCancellationTest.testCancelTaskShouldPreventAdditionalTimersFromBeingFired(StreamTaskCancellationTest.java:305)
> Mar 09 14:04:42   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskCancellationTest.testCancelTaskShouldPreventAdditionalProcessingTimeTimersFromBeingFired(StreamTaskCancellationTest.java:281)
> Mar 09 14:04:42   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> [...]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] hejufang commented on a diff in pull request #23403: [FLINK-32863][runtime-web] Improve Flink UI's time precision from sec…

2023-09-13 Thread via GitHub


hejufang commented on code in PR #23403:
URL: https://github.com/apache/flink/pull/23403#discussion_r1324189387


##
flink-runtime-web/web-dashboard/src/app/components/humanize-duration.pipe.ts:
##
@@ -46,23 +46,23 @@ export class HumanizeDurationPipe implements PipeTransform {
 if (seconds === 0) {
   return `${ms}ms`;
 } else {
-  return `${seconds}s`;
+  return `${seconds}s ${ms}ms`;
 }
   } else {
-return `${minutes}m ${seconds}s`;
+return `${minutes}m ${seconds}s ${ms}ms`;
   }
 } else {
   if (short) {
 return `${hours}h ${minutes}m`;
   } else {
-return `${hours}h ${minutes}m ${seconds}s`;
+return `${hours}h ${minutes}m ${seconds}s ${ms}ms`;
   }
 }
   } else {
 if (short) {
   return `${days}d ${hours}h`;
 } else {
-  return `${days}d ${hours}h ${minutes}m ${seconds}s`;
+  return `${days}d ${hours}h ${minutes}m ${seconds}s ${ms}ms`;

Review Comment:
   Thanks for your suggestion, it has been fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-33081) Move parallelism override logic into scale method

2023-09-13 Thread Gyula Fora (Jira)
Gyula Fora created FLINK-33081:
--

 Summary: Move parallelism override logic into scale method
 Key: FLINK-33081
 URL: https://issues.apache.org/jira/browse/FLINK-33081
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Gyula Fora
Assignee: Gyula Fora
 Fix For: kubernetes-operator-1.7.0


After FLINK-32589  the parallelism overrides are applied separately from the 
scale call of the autoscaler implementation. We should simplify this by a small 
refactoring



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] maosuhan commented on pull request #23162: [FLINK-32650][protobuf]Added the ability to split flink-protobuf code…

2023-09-13 Thread via GitHub


maosuhan commented on PR #23162:
URL: https://github.com/apache/flink/pull/23162#issuecomment-1717237735

   @ljw-hit Hi, thanks for your effort and the code is already in good shape to 
me. I have left a few comments about unit tests. And could you provide a 
benchmark test for this improvement? For example, how much time of 
encoding/decoding 10M large rows can be saved after this improvement..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-33052) codespeed server is down

2023-09-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-33052:
---
Priority: Blocker  (was: Major)

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33052) codespeed server is down

2023-09-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-33052:
---
Affects Version/s: 1.18.0

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33052) codespeed server is down

2023-09-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-33052:
---
Component/s: Benchmarks

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33052) codespeed server is down

2023-09-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764564#comment-17764564
 ] 

Piotr Nowojski commented on FLINK-33052:


Thanks for investigating this issue [~jingge]. Let me know [~ym] if you will be 
having troubles setting this up again. Has the backup FLINK-30890 been 
preserved?

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33081) Move parallelism override logic into scale method

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-33081:
---
Labels: pull-request-available  (was: )

> Move parallelism override logic into scale method
> -
>
> Key: FLINK-33081
> URL: https://issues.apache.org/jira/browse/FLINK-33081
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Assignee: Gyula Fora
>Priority: Major
>  Labels: pull-request-available
> Fix For: kubernetes-operator-1.7.0
>
>
> After FLINK-32589  the parallelism overrides are applied separately from the 
> scale call of the autoscaler implementation. We should simplify this by a 
> small refactoring



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-kubernetes-operator] gyfora opened a new pull request, #672: [FLINK-33081] Apply parallelism overrides in scale

2023-09-13 Thread via GitHub


gyfora opened a new pull request, #672:
URL: https://github.com/apache/flink-kubernetes-operator/pull/672

   ## What is the purpose of the change
   
   Simplify the scale / parallelism override flow based on previous feedback. 
This change is a refactor and does not introduce new behaviour.
   
   ## Brief change log
   
 - *Unify scale / applyParallelismOverride methods*
 - *Move scaling before reconciling spec diffs*
 - *Refactor JobAutoscalerImpl and extract some methods to simplify the 
core flow*
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changes to the `CustomResourceDescriptors`: 
no
 - Core observer or reconciler logic that is regularly executed: yes
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] echauchot commented on a diff in pull request #22985: [FLINK-21883][scheduler] Implement cooldown period for adaptive scheduler

2023-09-13 Thread via GitHub


echauchot commented on code in PR #22985:
URL: https://github.com/apache/flink/pull/22985#discussion_r1324233808


##
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java:
##
@@ -124,17 +154,33 @@ private void handleDeploymentFailure(ExecutionVertex 
executionVertex, JobExcepti
 
 @Override
 public void onNewResourcesAvailable() {
-maybeRescale();
+rescaleWhenCooldownPeriodIsOver();
 }
 
 @Override
 public void onNewResourceRequirements() {
-maybeRescale();
+rescaleWhenCooldownPeriodIsOver();
 }
 
 private void maybeRescale() {
-if (context.shouldRescale(getExecutionGraph())) {
-getLogger().info("Can change the parallelism of job. Restarting 
job.");
+final Duration timeSinceLastRescale = timeSinceLastRescale();
+rescaleScheduled = false;
+final boolean shouldForceRescale =
+(scalingIntervalMax != null)
+&& (timeSinceLastRescale.compareTo(scalingIntervalMax) 
> 0)
+&& (lastRescale != Instant.EPOCH); // initial rescale 
is not forced
+if (shouldForceRescale || context.shouldRescale(getExecutionGraph())) {
+if (shouldForceRescale) {
+getLogger()
+.info(
+"Time since last rescale ({}) >  {} ({}). 
Force-changing the parallelism of the job. Restarting the job.",
+timeSinceLastRescale,
+
JobManagerOptions.SCHEDULER_SCALING_INTERVAL_MAX.key(),
+scalingIntervalMax);
+} else {
+getLogger().info("Can change the parallelism of the job. 
Restarting the job.");
+}
+lastRescale = Instant.now();
 context.goToRestarting(
 getExecutionGraph(),

Review Comment:
   Here is the summary of the offline discussion @1996fanrui and I had about 
`scalingIntervalMax` and forcing rescale (current code: when `added resources < 
min-parallelism-increase` we force a rescale if `timeSinceLastRescale > 
scalingIntervalMax`):
   1. aim: the longer the pipeline runs, the more the (small) resource gain is 
worth the restarting time.
   2. corner case: a resource `< min-parallelism-increase` arrives when 
`timeSinceLastRescale < scalingIntervalMax` and the pipeline is running for a 
long time (typical case 1) => with the current code, we don't force a rescale 
in that case whereas the added resource would be worth the restarting time.
   => I proposed solution 1: changing the definition of  `scalingIntervalMax` 
to `pipelineRuntimeRescaleMin` meaning pipeline runtime after which we force a 
rescale even if `added resources < min-parallelism-increase`
   => @1996fanrui  proposed solution 2: if `added resources < 
min-parallelism-increase && timeSinceLastRescale < scalingIntervalMax` schedule 
a tryRescale (force a rescale if there is indeed a change in the resource graph 
at that time in case the last TM crashed) after `scalingIntervalMax`
   
   @zentol you were the one who proposed the addition of `scalingIntervalMax` 
in the FLIP discussion thread. Do you prefer solution 1 or solution 2 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-18356) flink-table-planner Exit code 137 returned from process

2023-09-13 Thread Sergey Nuyanzin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764565#comment-17764565
 ] 

Sergey Nuyanzin commented on FLINK-18356:
-

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=53164&view=logs&j=a9db68b9-a7e0-54b6-0f98-010e0aff39e2&t=cdd32e0b-6047-565b-c58f-14054472f1be&l=11710

> flink-table-planner Exit code 137 returned from process
> ---
>
> Key: FLINK-18356
> URL: https://issues.apache.org/jira/browse/FLINK-18356
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines, Tests
>Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0, 1.18.0
>Reporter: Piotr Nowojski
>Assignee: Yunhong Zheng
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Attachments: 1234.jpg, app-profiling_4.gif, 
> image-2023-01-11-22-21-57-784.png, image-2023-01-11-22-22-32-124.png, 
> image-2023-02-16-20-18-09-431.png, image-2023-07-11-19-28-52-851.png, 
> image-2023-07-11-19-35-54-530.png, image-2023-07-11-19-41-18-626.png, 
> image-2023-07-11-19-41-37-105.png
>
>
> {noformat}
> = test session starts 
> ==
> platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
> cachedir: .tox/py37-cython/.pytest_cache
> rootdir: /__w/3/s/flink-python
> collected 568 items
> pyflink/common/tests/test_configuration.py ..[  
> 1%]
> pyflink/common/tests/test_execution_config.py ...[  
> 5%]
> pyflink/dataset/tests/test_execution_environment.py .
> ##[error]Exit code 137 returned from process: file name '/bin/docker', 
> arguments 'exec -i -u 1002 
> 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - python
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729&view=logs&j=9cada3cb-c1d3-5621-16da-0f718fb86602&t=8d78fe4f-d658-5c70-12f8-4921589024c3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33082) Azure Pipelines 4 is not responding on AZP

2023-09-13 Thread Sergey Nuyanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Nuyanzin updated FLINK-33082:

Affects Version/s: 1.17.2

> Azure Pipelines 4 is not responding on AZP
> --
>
> Key: FLINK-33082
> URL: https://issues.apache.org/jira/browse/FLINK-33082
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.17.2
>Reporter: Sergey Nuyanzin
>Priority: Major
>  Labels: test-stability
>
> it impacts this build for 1.17 
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=53163&view=logs&s=9fca669f-5c5f-59c7-4118-e31c641064f0&j=6e8542d7-de38-5a33-4aca-458d6c87066d]
> {noformat}
> ##[error]We stopped hearing from agent Azure Pipelines 4. Verify the agent 
> machine is running and has a healthy network connection. Anything that 
> terminates an agent process, starves it for CPU, or blocks its network access 
> can cause this error. For more information, see: 
> https://go.microsoft.com/fwlink/?linkid=846610
> Agent: Azure Pipelines 4
> Started: Today at 4:59 AM
> Duration: 6h 24m 38s
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33082) Azure Pipelines 4 is not responding on AZP

2023-09-13 Thread Sergey Nuyanzin (Jira)
Sergey Nuyanzin created FLINK-33082:
---

 Summary: Azure Pipelines 4 is not responding on AZP
 Key: FLINK-33082
 URL: https://issues.apache.org/jira/browse/FLINK-33082
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Reporter: Sergey Nuyanzin


it impacts this build for 1.17 
[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=53163&view=logs&s=9fca669f-5c5f-59c7-4118-e31c641064f0&j=6e8542d7-de38-5a33-4aca-458d6c87066d]

{noformat}
##[error]We stopped hearing from agent Azure Pipelines 4. Verify the agent 
machine is running and has a healthy network connection. Anything that 
terminates an agent process, starves it for CPU, or blocks its network access 
can cause this error. For more information, see: 
https://go.microsoft.com/fwlink/?linkid=846610
Agent: Azure Pipelines 4
Started: Today at 4:59 AM
Duration: 6h 24m 38s
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-18356) flink-table-planner Exit code 137 returned from process

2023-09-13 Thread Sergey Nuyanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Nuyanzin updated FLINK-18356:

Affects Version/s: 1.19.0

> flink-table-planner Exit code 137 returned from process
> ---
>
> Key: FLINK-18356
> URL: https://issues.apache.org/jira/browse/FLINK-18356
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines, Tests
>Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0, 1.18.0, 
> 1.19.0
>Reporter: Piotr Nowojski
>Assignee: Yunhong Zheng
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Attachments: 1234.jpg, app-profiling_4.gif, 
> image-2023-01-11-22-21-57-784.png, image-2023-01-11-22-22-32-124.png, 
> image-2023-02-16-20-18-09-431.png, image-2023-07-11-19-28-52-851.png, 
> image-2023-07-11-19-35-54-530.png, image-2023-07-11-19-41-18-626.png, 
> image-2023-07-11-19-41-37-105.png
>
>
> {noformat}
> = test session starts 
> ==
> platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
> cachedir: .tox/py37-cython/.pytest_cache
> rootdir: /__w/3/s/flink-python
> collected 568 items
> pyflink/common/tests/test_configuration.py ..[  
> 1%]
> pyflink/common/tests/test_execution_config.py ...[  
> 5%]
> pyflink/dataset/tests/test_execution_environment.py .
> ##[error]Exit code 137 returned from process: file name '/bin/docker', 
> arguments 'exec -i -u 1002 
> 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - python
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729&view=logs&j=9cada3cb-c1d3-5621-16da-0f718fb86602&t=8d78fe4f-d658-5c70-12f8-4921589024c3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-18356) flink-table-planner Exit code 137 returned from process

2023-09-13 Thread Sergey Nuyanzin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764568#comment-17764568
 ] 

Sergey Nuyanzin commented on FLINK-18356:
-

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=53161&view=logs&j=a9db68b9-a7e0-54b6-0f98-010e0aff39e2&t=cdd32e0b-6047-565b-c58f-14054472f1be&l=12718

> flink-table-planner Exit code 137 returned from process
> ---
>
> Key: FLINK-18356
> URL: https://issues.apache.org/jira/browse/FLINK-18356
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines, Tests
>Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0, 1.18.0
>Reporter: Piotr Nowojski
>Assignee: Yunhong Zheng
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Attachments: 1234.jpg, app-profiling_4.gif, 
> image-2023-01-11-22-21-57-784.png, image-2023-01-11-22-22-32-124.png, 
> image-2023-02-16-20-18-09-431.png, image-2023-07-11-19-28-52-851.png, 
> image-2023-07-11-19-35-54-530.png, image-2023-07-11-19-41-18-626.png, 
> image-2023-07-11-19-41-37-105.png
>
>
> {noformat}
> = test session starts 
> ==
> platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
> cachedir: .tox/py37-cython/.pytest_cache
> rootdir: /__w/3/s/flink-python
> collected 568 items
> pyflink/common/tests/test_configuration.py ..[  
> 1%]
> pyflink/common/tests/test_execution_config.py ...[  
> 5%]
> pyflink/dataset/tests/test_execution_environment.py .
> ##[error]Exit code 137 returned from process: file name '/bin/docker', 
> arguments 'exec -i -u 1002 
> 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - python
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729&view=logs&j=9cada3cb-c1d3-5621-16da-0f718fb86602&t=8d78fe4f-d658-5c70-12f8-4921589024c3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] echauchot commented on a diff in pull request #22985: [FLINK-21883][scheduler] Implement cooldown period for adaptive scheduler

2023-09-13 Thread via GitHub


echauchot commented on code in PR #22985:
URL: https://github.com/apache/flink/pull/22985#discussion_r1324233808


##
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java:
##
@@ -124,17 +154,33 @@ private void handleDeploymentFailure(ExecutionVertex 
executionVertex, JobExcepti
 
 @Override
 public void onNewResourcesAvailable() {
-maybeRescale();
+rescaleWhenCooldownPeriodIsOver();
 }
 
 @Override
 public void onNewResourceRequirements() {
-maybeRescale();
+rescaleWhenCooldownPeriodIsOver();
 }
 
 private void maybeRescale() {
-if (context.shouldRescale(getExecutionGraph())) {
-getLogger().info("Can change the parallelism of job. Restarting 
job.");
+final Duration timeSinceLastRescale = timeSinceLastRescale();
+rescaleScheduled = false;
+final boolean shouldForceRescale =
+(scalingIntervalMax != null)
+&& (timeSinceLastRescale.compareTo(scalingIntervalMax) 
> 0)
+&& (lastRescale != Instant.EPOCH); // initial rescale 
is not forced
+if (shouldForceRescale || context.shouldRescale(getExecutionGraph())) {
+if (shouldForceRescale) {
+getLogger()
+.info(
+"Time since last rescale ({}) >  {} ({}). 
Force-changing the parallelism of the job. Restarting the job.",
+timeSinceLastRescale,
+
JobManagerOptions.SCHEDULER_SCALING_INTERVAL_MAX.key(),
+scalingIntervalMax);
+} else {
+getLogger().info("Can change the parallelism of the job. 
Restarting the job.");
+}
+lastRescale = Instant.now();
 context.goToRestarting(
 getExecutionGraph(),

Review Comment:
   Here is the summary of the offline discussion @1996fanrui and I had about 
`scalingIntervalMax` and forcing rescale (current code: when `added resources < 
min-parallelism-increase` we force a rescale if `timeSinceLastRescale > 
scalingIntervalMax`):
   1. aim: the longer the pipeline runs, the more the (small) resource gain is 
worth the restarting time.
   2. corner case: a resource `< min-parallelism-increase` arrives when 
`timeSinceLastRescale < scalingIntervalMax` and the pipeline is running for a 
long time (typical case 1) => with the current code, we don't force a rescale 
in that case whereas the added resource would be worth the restarting time.
   => I proposed solution 1: changing the definition of  `scalingIntervalMax` 
to `pipelineRuntimeRescaleMin` meaning pipeline runtime after which we force a 
rescale even if `added resources < min-parallelism-increase`
   => @1996fanrui  proposed solution 2: if `added resources < 
min-parallelism-increase && timeSinceLastRescale < scalingIntervalMax` schedule 
a tryRescale after `scalingIntervalMax` : force a rescale if there is indeed a 
change in the resource graph at that time in case the last TM crashed 
   
   @zentol you were the one who proposed the addition of `scalingIntervalMax` 
in the FLIP discussion thread. Do you prefer solution 1 or solution 2 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] echauchot commented on pull request #22985: [FLINK-21883][scheduler] Implement cooldown period for adaptive scheduler

2023-09-13 Thread via GitHub


echauchot commented on PR #22985:
URL: https://github.com/apache/flink/pull/22985#issuecomment-1717288424

   > @dmvk as you authored part on the code in that part, can you review the PR 
as well ?
   
   The discussions went around `scalingIntervalMax`, @zentol was the one who 
proposed the addition of this parameter in the FLIP discussion thread. So maybe 
it is not needed that we add both of you as reviewers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-33083) SupportsReadingMetadata is not applied when loading a CompiledPlan

2023-09-13 Thread Dawid Wysakowicz (Jira)
Dawid Wysakowicz created FLINK-33083:


 Summary: SupportsReadingMetadata is not applied when loading a 
CompiledPlan
 Key: FLINK-33083
 URL: https://issues.apache.org/jira/browse/FLINK-33083
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.17.1, 1.16.2
Reporter: Dawid Wysakowicz


If a few conditions are met, we can not apply ReadingMetadata interface:
# source overwrites:
 {code}
@Override
public boolean supportsMetadataProjection() {
return false;
}
 {code}
# source does not implement {{SupportsProjectionPushDown}}
# table has metadata columns e.g.
{code}
CREATE TABLE src (
  physical_name STRING,
  physical_sum INT,
  timestamp TIMESTAMP_LTZ(3) NOT NULL METADATA VIRTUAL
)
{code}
# we query the table {{SELECT * FROM src}}

It fails with:
{code}
Caused by: java.lang.IllegalArgumentException: Row arity: 1, but serializer 
arity: 2
at 
org.apache.flink.table.runtime.typeutils.RowDataSerializer.copy(RowDataSerializer.java:124)
{code}

The reason is {{SupportsReadingMetadataSpec}} is created only in the 
{{PushProjectIntoTableSourceScanRule}}, but the rule is not applied when 1 & 2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764570#comment-17764570
 ] 

Yuan Mei commented on FLINK-33052:
--

[~pnowojski] We have the historical data backed up. But just in case, we will 
still have the worker node there before the new setup.

 

Also, I would say this ticket is not a blocker for release 1.18, since 1.18 is 
already branch-cut and in the stage of manual testing?

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan Mei updated FLINK-33052:
-
Priority: Critical  (was: Blocker)

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Critical
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] wangyang0918 commented on pull request #23164: [FLINK- 32775]Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-09-13 Thread via GitHub


wangyang0918 commented on PR #23164:
URL: https://github.com/apache/flink/pull/23164#issuecomment-1717326350

   @architgyl Could you please verify this PR in a real YARN cluster whether it 
solves your original requirement about hive config?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] mbalassi commented on pull request #668: Give cluster/job role access to k8s services API

2023-09-13 Thread via GitHub


mbalassi commented on PR #668:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/668#issuecomment-1717336055

   @sbrother could you please open a ticket for this, as a good example see 
this one:
   https://issues.apache.org/jira/browse/FLINK-33066
   
   Could you clarify what do you mean by "connect to a session cluster"? Did 
you mean executing `flink list` from one of containers in your JobManager pod?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] mbalassi commented on pull request #671: [FLINK-33066] Support all k8s methods to configure env variable in operatorPod

2023-09-13 Thread via GitHub


mbalassi commented on PR #671:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/671#issuecomment-1717348931

   Thanks, @dongwoo6kim. Please include the documentation in this change, 
unfortunately we maintain that manually here:
   
https://github.com/apache/flink-kubernetes-operator/blob/main/docs/content/docs/operations/helm.md?plain=1#L73


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] TanYuxin-tyx commented on pull request #23401: [FLINK-30906][task] TwoInputStreamTask and MultipleInputStreamTask passes wrong configuration when create input processor

2023-09-13 Thread via GitHub


TanYuxin-tyx commented on PR #23401:
URL: https://github.com/apache/flink/pull/23401#issuecomment-1717360318

   @reswqa Thanks for fixing the configuration bug. LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] reswqa commented on pull request #23401: [FLINK-30906][task] TwoInputStreamTask and MultipleInputStreamTask passes wrong configuration when create input processor

2023-09-13 Thread via GitHub


reswqa commented on PR #23401:
URL: https://github.com/apache/flink/pull/23401#issuecomment-1717361467

   Thanks! merging...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] reswqa merged pull request #23401: [FLINK-30906][task] TwoInputStreamTask and MultipleInputStreamTask passes wrong configuration when create input processor

2023-09-13 Thread via GitHub


reswqa merged PR #23401:
URL: https://github.com/apache/flink/pull/23401


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-33084) Migrate globalJobParameter in ExecutionConfig to configuration instance

2023-09-13 Thread Junrui Li (Jira)
Junrui Li created FLINK-33084:
-

 Summary: Migrate globalJobParameter in ExecutionConfig to 
configuration instance
 Key: FLINK-33084
 URL: https://issues.apache.org/jira/browse/FLINK-33084
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Configuration
Reporter: Junrui Li
 Fix For: 1.19.0


Currently, the globalJobParameter field in ExecutionConfig has not been 
migrated to the Configuration. Considering the goal of unifying configuration 
options, it is necessary to migrate it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33052) codespeed server is down

2023-09-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764581#comment-17764581
 ] 

Piotr Nowojski commented on FLINK-33052:


We have been always treating benchmarking issues the same way as regular test 
issues. If CI for tests was down, that would have been a clear blocker, 
preventing everyone from merging any code, doing any releases. The same applies 
for benchmarking, the only difference is that benchmarks are asynchronous and 
not executed per every PR due to load that they are generating.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Critical
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-30906) TwoInputStreamTask passes wrong configuration object when creating input processor

2023-09-13 Thread Weijie Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weijie Guo closed FLINK-30906.
--
Resolution: Fixed

master(1.19) via 70f4c40f15f38ed404d8e031a08d534326535ced.

> TwoInputStreamTask passes wrong configuration object when creating input 
> processor
> --
>
> Key: FLINK-30906
> URL: https://issues.apache.org/jira/browse/FLINK-30906
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.17.0, 1.16.1
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> It seems _StreamTwoInputProcessorFactory.create_ is passed with wrong 
> configuration object: the taskManagerConfiguration should be __ 
> _getEnvironment().getTaskManagerInfo().getConfiguration()._ 
>  
> And in the following logic, it seems to indeed try to load taskmanager 
> options from this configuration object, like state-backend and 
> taskmanager.memory.managed.consumer-weights 
>  
> [1]https://github.com/apache/flink/blob/111342f37bdc0d582d3f7af458d9869f0548299f/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/TwoInputStreamTask.java#L98



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30906) TwoInputStreamTask passes wrong configuration object when creating input processor

2023-09-13 Thread Weijie Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weijie Guo updated FLINK-30906:
---
Fix Version/s: 1.19.0

> TwoInputStreamTask passes wrong configuration object when creating input 
> processor
> --
>
> Key: FLINK-30906
> URL: https://issues.apache.org/jira/browse/FLINK-30906
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.17.0, 1.16.1
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> It seems _StreamTwoInputProcessorFactory.create_ is passed with wrong 
> configuration object: the taskManagerConfiguration should be __ 
> _getEnvironment().getTaskManagerInfo().getConfiguration()._ 
>  
> And in the following logic, it seems to indeed try to load taskmanager 
> options from this configuration object, like state-backend and 
> taskmanager.memory.managed.consumer-weights 
>  
> [1]https://github.com/apache/flink/blob/111342f37bdc0d582d3f7af458d9869f0548299f/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/TwoInputStreamTask.java#L98



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33052) codespeed server is down

2023-09-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-33052:
---
Priority: Blocker  (was: Critical)

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] TanYuxin-tyx commented on pull request #23404: [FLINK-33076][network] Reduce serialization overhead of broadcast emit from ChannelSelectorRecordWriter.

2023-09-13 Thread via GitHub


TanYuxin-tyx commented on PR #23404:
URL: https://github.com/apache/flink/pull/23404#issuecomment-1717364731

   Thanks @reswqa for fixing the bug. LGTM now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764581#comment-17764581
 ] 

Piotr Nowojski edited comment on FLINK-33052 at 9/13/23 10:28 AM:
--

We have been always treating benchmarking issues the same way as regular test 
issues. If CI for tests was down, that would have been a clear blocker, 
preventing everyone from merging any code, doing any releases. The same applies 
for benchmarking, the only difference is that benchmarks are asynchronous and 
not executed per every PR due to load that they are generating.

Some of the past examples:
FLINK-23153, FLINK-23879, FLINK-29886, FLINK-30015, FLINK-15171

Also as far as I remember that has been discussed on the dev mailing list at 
least once or twice.


was (Author: pnowojski):
We have been always treating benchmarking issues the same way as regular test 
issues. If CI for tests was down, that would have been a clear blocker, 
preventing everyone from merging any code, doing any releases. The same applies 
for benchmarking, the only difference is that benchmarks are asynchronous and 
not executed per every PR due to load that they are generating.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] reswqa commented on pull request #23404: [FLINK-33076][network] Reduce serialization overhead of broadcast emit from ChannelSelectorRecordWriter.

2023-09-13 Thread via GitHub


reswqa commented on PR #23404:
URL: https://github.com/apache/flink/pull/23404#issuecomment-1717367967

   Thanks for the review! Squashed the fix-up commit. I will merge this if 
pipeline passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (FLINK-33084) Migrate globalJobParameter in ExecutionConfig to configuration instance

2023-09-13 Thread Lijie Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijie Wang reassigned FLINK-33084:
--

Assignee: Junrui Li

> Migrate globalJobParameter in ExecutionConfig to configuration instance
> ---
>
> Key: FLINK-33084
> URL: https://issues.apache.org/jira/browse/FLINK-33084
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Configuration
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
> Fix For: 1.19.0
>
>
> Currently, the globalJobParameter field in ExecutionConfig has not been 
> migrated to the Configuration. Considering the goal of unifying configuration 
> options, it is necessary to migrate it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33084) Migrate globalJobParameter in ExecutionConfig to configuration instance

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-33084:
---
Labels: pull-request-available  (was: )

> Migrate globalJobParameter in ExecutionConfig to configuration instance
> ---
>
> Key: FLINK-33084
> URL: https://issues.apache.org/jira/browse/FLINK-33084
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Configuration
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> Currently, the globalJobParameter field in ExecutionConfig has not been 
> migrated to the Configuration. Considering the goal of unifying configuration 
> options, it is necessary to migrate it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] JunRuiLee opened a new pull request, #23409: [FLINK-33084][runtime] Migrate globalJobParameter to configuration.

2023-09-13 Thread via GitHub


JunRuiLee opened a new pull request, #23409:
URL: https://github.com/apache/flink/pull/23409

   
   
   ## What is the purpose of the change
   
   Migrate globalJobParameter to configuration.
   
   ## Brief change log
   
   Migrate globalJobParameter to configuration.
   
   
   ## Verifying this change
   
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't 
know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / **no**)
 - If yes, how is the feature documented? (not applicable / docs / JavaDocs 
/ **not documented**)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] wangzzu commented on a diff in pull request #23399: [FLINK-33061][docs] Translate failure-enricher documentation to Chinese

2023-09-13 Thread via GitHub


wangzzu commented on code in PR #23399:
URL: https://github.com/apache/flink/pull/23399#discussion_r1324344461


##
docs/content/docs/deployment/advanced/failure_enrichers.md:
##
@@ -42,7 +42,7 @@ To implement a custom FailureEnricher plugin, you need to:
 
 Then, create a jar which includes your `FailureEnricher`, 
`FailureEnricherFactory`, `META-INF/services/` and all external dependencies.
 Make a directory in `plugins/` of your Flink distribution with an arbitrary 
name, e.g. "failure-enrichment", and put the jar into this directory.
-See [Flink Plugin]({% link deployment/filesystems/plugins.md %}) for more 
details.
+See [Flink Plugin]({{< ref "docs/deployment/filesystems/plugins" >}}) for more 
details.

Review Comment:
   Good advice, here I fixed it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #23409: [FLINK-33084][runtime] Migrate globalJobParameter to configuration.

2023-09-13 Thread via GitHub


flinkbot commented on PR #23409:
URL: https://github.com/apache/flink/pull/23409#issuecomment-1717412004

   
   ## CI report:
   
   * 20e1f350e3ef6dd6a37701317ea7a9feccab9b0e UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-21949) Support ARRAY_AGG aggregate function

2023-09-13 Thread Jiabao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiabao Sun updated FLINK-21949:
---
Summary: Support ARRAY_AGG aggregate function  (was: Support collect to 
array aggregate function)

> Support ARRAY_AGG aggregate function
> 
>
> Key: FLINK-21949
> URL: https://issues.apache.org/jira/browse/FLINK-21949
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / API
>Affects Versions: 1.12.0
>Reporter: Jiabao Sun
>Assignee: Jiabao Sun
>Priority: Minor
> Fix For: 1.19.0
>
>
> Some nosql databases like mongodb and elasticsearch support nested data types.
> Aggregating multiple rows into ARRAY is a common requirement.
> The CollectToArray function is similar to Collect, except that it returns 
> ARRAY instead of MULTISET.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33085) Improve the error message when the invalidate lookupTableSource without primary key is used as temporal join table

2023-09-13 Thread Yunhong Zheng (Jira)
Yunhong Zheng created FLINK-33085:
-

 Summary: Improve the error message when the invalidate 
lookupTableSource without primary key is used as temporal join table
 Key: FLINK-33085
 URL: https://issues.apache.org/jira/browse/FLINK-33085
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Affects Versions: 1.19.0
Reporter: Yunhong Zheng
 Fix For: 1.19.0


Improve the error message when the invalidate lookupTableSource without primary 
key is used as temporal join table.  This pr can check the legality of 
temporary table join syntax in sqlToRel phase and make the thrown error clearer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-33011) Operator deletes HA data unexpectedly

2023-09-13 Thread Gyula Fora (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-33011.
--
Fix Version/s: kubernetes-operator-1.7.0
   Resolution: Fixed

merged to main 82739f62adda33e686da7d8aa30cbd41ea13012f

> Operator deletes HA data unexpectedly
> -
>
> Key: FLINK-33011
> URL: https://issues.apache.org/jira/browse/FLINK-33011
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: 1.17.1, kubernetes-operator-1.6.0
> Environment: Flink: 1.17.1
> Flink Kubernetes Operator: 1.6.0
>Reporter: Ruibin Xing
>Assignee: Gyula Fora
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: kubernetes-operator-1.7.0
>
> Attachments: flink_operator_logs_0831.csv
>
>
> We encountered a problem where the operator unexpectedly deleted HA data.
> The timeline is as follows:
> 12:08 We submitted the first spec, which suspended the job with savepoint 
> upgrade mode.
> 12:08 The job was suspended, while the HA data was preserved, and the log 
> showed the observed job deployment status was MISSING.
> 12:10 We submitted the second spec, which deployed the job with the last 
> state upgrade mode.
> 12:10 Logs showed the operator deleted both the Flink deployment and the HA 
> data again.
> 12:10 The job failed to start because the HA data was missing.
> According to the log, the deletion was triggered by 
> https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168
> I think this would only be triggered if the job deployment status wasn't 
> MISSING. But the log before the deletion showed the observed job status was 
> MISSING at that moment.
> Related logs:
>  
> {code:java}
> 2023-08-30 12:08:48.190 + o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
> 2023-08-30 12:10:27.010 + o.a.f.k.o.o.d.ApplicationObserver [INFO 
> ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous 
> status: MISSING
> 2023-08-30 12:10:27.533 + o.a.f.k.o.l.AuditUtils         [INFO 
> ][default/pipeline-pipeline-se-3] >>> Event  | Info    | SPECCHANGED     | 
> UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : 
> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362
>  -> 
> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365,
>  podTemplate.metadata.labels.app.kubernetes.io~1version : 
> 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 
> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, 
> job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), 
> starting reconciliation.
> 2023-08-30 12:10:27.679 + o.a.f.k.o.s.NativeFlinkService [INFO 
> ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA 
> metadata.
> {code}
> A more complete log file is attached. Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] swuferhong opened a new pull request, #23410: [FLINK-33085][table-planner] Improve the error message when the invalidated lookupTableSource without primary key is used as temporal joi

2023-09-13 Thread via GitHub


swuferhong opened a new pull request, #23410:
URL: https://github.com/apache/flink/pull/23410

   
   
   ## What is the purpose of the change
   
   Improve the error message when the invalidate `lookupTableSource` without 
primary key is used as temporal join table.  This pr can check the legality of 
temporary table join syntax in `sqlToRel` phase and make the thrown error 
clearer.
   
   
   ## Brief change log
   
   - Adding the check logical in `SqlToRelConverter`.
   - Adding test in `LookupJoinTest` and `TemporalJoinTest`
   
   ## Verifying this change
   
   - Adding test in `LookupJoinTest` and `TemporalJoinTest`
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers: no
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper:  no
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? no docs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-33085) Improve the error message when the invalidate lookupTableSource without primary key is used as temporal join table

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-33085:
---
Labels: pull-request-available  (was: )

> Improve the error message when the invalidate lookupTableSource without 
> primary key is used as temporal join table
> --
>
> Key: FLINK-33085
> URL: https://issues.apache.org/jira/browse/FLINK-33085
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Planner
>Affects Versions: 1.19.0
>Reporter: Yunhong Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> Improve the error message when the invalidate lookupTableSource without 
> primary key is used as temporal join table.  This pr can check the legality 
> of temporary table join syntax in sqlToRel phase and make the thrown error 
> clearer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-ml] lindong28 commented on pull request #248: [FLINK-32704] Supports spilling to disk when feedback channel memory buffer is full

2023-09-13 Thread via GitHub


lindong28 commented on PR #248:
URL: https://github.com/apache/flink-ml/pull/248#issuecomment-1717534003

   Thanks for the update! LGTM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] lindong28 merged pull request #248: [FLINK-32704] Supports spilling to disk when feedback channel memory buffer is full

2023-09-13 Thread via GitHub


lindong28 merged PR #248:
URL: https://github.com/apache/flink-ml/pull/248


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (FLINK-32704) Supports spilling to disk when feedback channel memory buffer is full

2023-09-13 Thread Dong Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned FLINK-32704:


Assignee: Jiang Xin

> Supports spilling to disk when feedback channel memory buffer is full
> -
>
> Key: FLINK-32704
> URL: https://issues.apache.org/jira/browse/FLINK-32704
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / Machine Learning
>Reporter: Jiang Xin
>Assignee: Jiang Xin
>Priority: Major
>  Labels: pull-request-available
> Fix For: ml-2.4.0
>
>
> Currently, the Flink ML Iteration cache feedback data in memory, which would 
> cause OOM in some cases. We need to support spilling to disk when feedback 
> channel memory buffer is full.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32704) Supports spilling to disk when feedback channel memory buffer is full

2023-09-13 Thread Dong Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764635#comment-17764635
 ] 

Dong Lin commented on FLINK-32704:
--

Merged to apache/flink-ml master branch 865404910caf53259df5cea1fc25ca29f96ae9bd

> Supports spilling to disk when feedback channel memory buffer is full
> -
>
> Key: FLINK-32704
> URL: https://issues.apache.org/jira/browse/FLINK-32704
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / Machine Learning
>Reporter: Jiang Xin
>Assignee: Jiang Xin
>Priority: Major
>  Labels: pull-request-available
> Fix For: ml-2.4.0
>
>
> Currently, the Flink ML Iteration cache feedback data in memory, which would 
> cause OOM in some cases. We need to support spilling to disk when feedback 
> channel memory buffer is full.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] flinkbot commented on pull request #23410: [FLINK-33085][table-planner] Improve the error message when the invalidated lookupTableSource without primary key is used as temporal join t

2023-09-13 Thread via GitHub


flinkbot commented on PR #23410:
URL: https://github.com/apache/flink/pull/23410#issuecomment-1717541375

   
   ## CI report:
   
   * a5f93f4153d0b8430d836a35a6f79b80caa76457 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode

2023-09-13 Thread Yangze Guo (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764640#comment-17764640
 ] 

Yangze Guo commented on FLINK-33053:


JFYI, I'm still investigating the root cause of this, but I found the issue 
will be fixed if we add a safetynet in 
ZooKeeperLeaderRetrievalDriver#close like this:
 
{code:java}
client.watchers()
   .removeAll()
   .ofType(Watcher.WatcherType.Any)
   .forPath(connectionInformationPath);{code}

> Watcher leak in Zookeeper HA mode
> -
>
> Key: FLINK-33053
> URL: https://issues.apache.org/jira/browse/FLINK-33053
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.0, 1.18.0, 1.17.1
>Reporter: Yangze Guo
>Priority: Blocker
> Attachments: 26.dump.zip, 26.log, 
> taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json
>
>
> We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA 
> mode. TM's watches on the leader of JobMaster has not been stopped after job 
> finished.
> Here is how we re-produce this issue:
>  - Start a session cluster and enable Zookeeper HA mode.
>  - Continuously and concurrently submit short queries, e.g. WordCount to the 
> cluster.
>  - echo -n wchp | nc \{zk host} \{zk port} to get current watches.
> We can see a lot of watches on 
> /flink/\{cluster_name}/leader/\{job_id}/connection_info.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-kubernetes-operator] dongwoo6kim commented on pull request #671: [FLINK-33066] Support all k8s methods to configure env variable in operatorPod

2023-09-13 Thread via GitHub


dongwoo6kim commented on PR #671:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/671#issuecomment-1717548787

   Thanks @mbalassi. I have updated the docs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] hejufang commented on pull request #23403: [FLINK-32863][runtime-web] Improve Flink UI's time precision from sec…

2023-09-13 Thread via GitHub


hejufang commented on PR #23403:
URL: https://github.com/apache/flink/pull/23403#issuecomment-1717554816

   @masteryhx Thank you for your suggestion.  I have adjusted the precision of 
checkpoint related time.  please review. cc @KarmaGYZ 
   
![img_v2_a30822c5-c560-4789-9226-ca3899483b7g](https://github.com/apache/flink/assets/28342990/33f7b5b4-b01d-4217-b19c-b7425b0568d1)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (FLINK-32704) Supports spilling to disk when feedback channel memory buffer is full

2023-09-13 Thread Dong Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin closed FLINK-32704.

Resolution: Fixed

> Supports spilling to disk when feedback channel memory buffer is full
> -
>
> Key: FLINK-32704
> URL: https://issues.apache.org/jira/browse/FLINK-32704
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / Machine Learning
>Reporter: Jiang Xin
>Assignee: Jiang Xin
>Priority: Major
>  Labels: pull-request-available
> Fix For: ml-2.4.0
>
>
> Currently, the Flink ML Iteration cache feedback data in memory, which would 
> cause OOM in some cases. We need to support spilling to disk when feedback 
> channel memory buffer is full.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] Jiabao-Sun opened a new pull request, #23411: [FLINK-21949][table] Support ARRAY_AGG aggregate function

2023-09-13 Thread via GitHub


Jiabao-Sun opened a new pull request, #23411:
URL: https://github.com/apache/flink/pull/23411

   
   
   ## What is the purpose of the change
   
   [FLINK-21949][table] Support ARRAY_AGG aggregate function
   
   Some nosql databases like mongodb and elasticsearch support nested data 
types.
   Aggregating multiple rows into ARRAY is a common requirement.
   
   ## Brief change log
   
   Introduce built in function `ARRAY_AGG([ ALL | DISTINCT ] expression)` to 
return an array that concatenates the input rows
   and returns NULL if there are no input rows. NULL values will be ignored. 
Use DISTINCT for one unique instance of each value.
   
   ```sql
   SELECT ARRAY_AGG(f1)
 FROM tmp
GROUP BY f0
   ```
   
   
![image](https://github.com/apache/flink/assets/27403841/4ba953d0-92bf-485f-afc2-3fc292fc81ce)
   
   Note that we have made some simplifications based on Calcite's 
`SqlLibraryOperators.ARRAY_AGG`.
   ```sql
   -- calcite
   ARRAY_AGG([ ALL | DISTINCT ] value [ RESPECT NULLS | IGNORE NULLS ] [ ORDER 
BY orderItem [, orderItem ]* ] )
   -- flink
   ARRAY_AGG([ ALL | DISTINCT ] expression)
   ```
   
   **The differences from Calcite are as follows:**
 1. **Null values are ignored.**
 2. **The order by expression within the function is not supported because 
the complete row record cannot be accessed within the function implementation.**
 3. **The function returns null when there's no input rows, but calcite 
definition returns an empty array. The behavior was referenced from BigQuery 
and Postgres.**
   
   - 
https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#array_agg
   - https://www.postgresql.org/docs/8.4/functions-aggregate.html
   
   
   ## Verifying this change
   ITCase and UnitCase are added.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes)
 - The serializers: (no )
 - The runtime per-record code paths (performance sensitive): (no)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
 - The S3 file system connector: (no)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes)
 - If yes, how is the feature documented? (docs)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-21949) Support ARRAY_AGG aggregate function

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-21949:
---
Labels: pull-request-available  (was: )

> Support ARRAY_AGG aggregate function
> 
>
> Key: FLINK-21949
> URL: https://issues.apache.org/jira/browse/FLINK-21949
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / API
>Affects Versions: 1.12.0
>Reporter: Jiabao Sun
>Assignee: Jiabao Sun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> Some nosql databases like mongodb and elasticsearch support nested data types.
> Aggregating multiple rows into ARRAY is a common requirement.
> The CollectToArray function is similar to Collect, except that it returns 
> ARRAY instead of MULTISET.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-21949) Support ARRAY_AGG aggregate function

2023-09-13 Thread Jiabao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764658#comment-17764658
 ] 

Jiabao Sun commented on FLINK-21949:


The pull request is ready for review now.

This implementation made some simplifications based on Calcite's 
SqlLibraryOperators.ARRAY_AGG.
{code:java}
// calcite
ARRAY_AGG([ ALL | DISTINCT ] value [ RESPECT NULLS | IGNORE NULLS ] [ ORDER BY 
orderItem [, orderItem ]* ] )
// flink
ARRAY_AGG([ ALL | DISTINCT ] expression)
{code}

The differences from Calcite are as follows:
# Null values are ignored.
# The order by expression within the function is not supported because the 
complete row record cannot be accessed within the function implementation.
# The function returns null when there's no input rows, but calcite definition 
returns an empty array. The behavior was referenced from BigQuery and Postgres.

https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#array_agg
https://www.postgresql.org/docs/8.4/functions-aggregate.html

> Support ARRAY_AGG aggregate function
> 
>
> Key: FLINK-21949
> URL: https://issues.apache.org/jira/browse/FLINK-21949
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / API
>Affects Versions: 1.12.0
>Reporter: Jiabao Sun
>Assignee: Jiabao Sun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> Some nosql databases like mongodb and elasticsearch support nested data types.
> Aggregating multiple rows into ARRAY is a common requirement.
> The CollectToArray function is similar to Collect, except that it returns 
> ARRAY instead of MULTISET.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] flinkbot commented on pull request #23411: [FLINK-21949][table] Support ARRAY_AGG aggregate function

2023-09-13 Thread via GitHub


flinkbot commented on PR #23411:
URL: https://github.com/apache/flink/pull/23411#issuecomment-1717597767

   
   ## CI report:
   
   * 10081ad9bdba84b3dac22fb7a6137994cc79622b UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-33083) SupportsReadingMetadata is not applied when loading a CompiledPlan

2023-09-13 Thread Yunhong Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764666#comment-17764666
 ] 

Yunhong Zheng commented on FLINK-33083:
---

It looks like this is a planner bug, and we don't have related tests to cover 
the situation that connector implement 

SupportsReadingMetadata and supportsMetadataProjection return false:
{code:java}
default boolean supportsMetadataProjection() {
return false;
}{code}
 

> SupportsReadingMetadata is not applied when loading a CompiledPlan
> --
>
> Key: FLINK-33083
> URL: https://issues.apache.org/jira/browse/FLINK-33083
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 1.16.2, 1.17.1
>Reporter: Dawid Wysakowicz
>Priority: Major
>
> If a few conditions are met, we can not apply ReadingMetadata interface:
> # source overwrites:
>  {code}
> @Override
> public boolean supportsMetadataProjection() {
> return false;
> }
>  {code}
> # source does not implement {{SupportsProjectionPushDown}}
> # table has metadata columns e.g.
> {code}
> CREATE TABLE src (
>   physical_name STRING,
>   physical_sum INT,
>   timestamp TIMESTAMP_LTZ(3) NOT NULL METADATA VIRTUAL
> )
> {code}
> # we query the table {{SELECT * FROM src}}
> It fails with:
> {code}
> Caused by: java.lang.IllegalArgumentException: Row arity: 1, but serializer 
> arity: 2
>   at 
> org.apache.flink.table.runtime.typeutils.RowDataSerializer.copy(RowDataSerializer.java:124)
> {code}
> The reason is {{SupportsReadingMetadataSpec}} is created only in the 
> {{PushProjectIntoTableSourceScanRule}}, but the rule is not applied when 1 & 2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-33083) SupportsReadingMetadata is not applied when loading a CompiledPlan

2023-09-13 Thread Dawid Wysakowicz (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Wysakowicz reassigned FLINK-33083:


Assignee: Dawid Wysakowicz

> SupportsReadingMetadata is not applied when loading a CompiledPlan
> --
>
> Key: FLINK-33083
> URL: https://issues.apache.org/jira/browse/FLINK-33083
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 1.16.2, 1.17.1
>Reporter: Dawid Wysakowicz
>Assignee: Dawid Wysakowicz
>Priority: Major
>
> If a few conditions are met, we can not apply ReadingMetadata interface:
> # source overwrites:
>  {code}
> @Override
> public boolean supportsMetadataProjection() {
> return false;
> }
>  {code}
> # source does not implement {{SupportsProjectionPushDown}}
> # table has metadata columns e.g.
> {code}
> CREATE TABLE src (
>   physical_name STRING,
>   physical_sum INT,
>   timestamp TIMESTAMP_LTZ(3) NOT NULL METADATA VIRTUAL
> )
> {code}
> # we query the table {{SELECT * FROM src}}
> It fails with:
> {code}
> Caused by: java.lang.IllegalArgumentException: Row arity: 1, but serializer 
> arity: 2
>   at 
> org.apache.flink.table.runtime.typeutils.RowDataSerializer.copy(RowDataSerializer.java:124)
> {code}
> The reason is {{SupportsReadingMetadataSpec}} is created only in the 
> {{PushProjectIntoTableSourceScanRule}}, but the rule is not applied when 1 & 2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28758) FlinkKafkaConsumer fails to stop with savepoint

2023-09-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764673#comment-17764673
 ] 

Piotr Nowojski commented on FLINK-28758:


I will try to take care of that

> FlinkKafkaConsumer fails to stop with savepoint 
> 
>
> Key: FLINK-28758
> URL: https://issues.apache.org/jira/browse/FLINK-28758
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.15.0, 1.17.0, 1.16.1
> Environment: Flink version:1.15.0
> deploy mode :K8s applicaiton Mode.   local mini cluster also have this 
> problem.
> Kafka Connector : use Kafka SourceFunction . No new Api.
>Reporter: hjw
>Assignee: Piotr Nowojski
>Priority: Critical
> Attachments: image-2022-10-13-19-47-56-635.png
>
>
> I post a stop with savepoint request to Flink Job throught rest api.
> A Error happened in Kafka connector close.
> The job will enter restarting .
> It is successful to use savepoint command alone.
> {code:java}
> 13:33:42.857 [Kafka Fetcher for Source: nlp-kafka-source -> nlp-clean 
> (1/1)#0] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer 
> clientId=consumer-hjw-3, groupId=hjw] Kafka consumer has been closed
> 13:33:42.857 [Kafka Fetcher for Source: cpp-kafka-source -> cpp-clean 
> (1/1)#0] INFO org.apache.kafka.common.utils.AppInfoParser - App info 
> kafka.consumer for consumer-hjw-4 unregistered
> 13:33:42.857 [Kafka Fetcher for Source: cpp-kafka-source -> cpp-clean 
> (1/1)#0] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer 
> clientId=consumer-hjw-4, groupId=hjw] Kafka consumer has been closed
> 13:33:42.860 [Source: nlp-kafka-source -> nlp-clean (1/1)#0] DEBUG 
> org.apache.flink.streaming.runtime.tasks.StreamTask - Cleanup StreamTask 
> (operators closed: false, cancelled: false)
> 13:33:42.860 [jobmanager-io-thread-4] INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline 
> checkpoint 5 by task eeefcb27475446241861ad8db3f33144 of job 
> d6fed247feab1c0bcc1b0dcc2cfb4736 at 79edfa88-ccc3-4140-b0e1-7ce8a7f8669f @ 
> 127.0.0.1 (dataPort=-1).
> org.apache.flink.util.SerializedThrowable: Task name with subtask : Source: 
> nlp-kafka-source -> nlp-clean (1/1)#0 Failure reason: Task has failed.
>  at 
> org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1388)
>  at 
> org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1331)
>  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
>  at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:343)
> Caused by: org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>  at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>  ... 3 common frames omitted
> Caused by: org.apache.flink.util.SerializedThrowable: null
>  at 
> org.apache.flink.streaming.connectors.kafka.internals.Handover.close(Handover.java:177)
>  at 
> org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.cancel(KafkaFetcher.java:164)
>  at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.cancel(FlinkKafkaConsumerBase.java:945)
>  at 
> org.apache.flink.streaming.api.operators.StreamSource.stop(StreamSource.java:128)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.stopOperatorForStopWithSavepoint(SourceStreamTask.java:305)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.lambda$triggerStopWithSavepointAsync$1(SourceStreamTask.java:285)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
>  at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMai

[jira] [Assigned] (FLINK-28758) FlinkKafkaConsumer fails to stop with savepoint

2023-09-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-28758:
--

Assignee: Piotr Nowojski  (was: Mark Cho)

> FlinkKafkaConsumer fails to stop with savepoint 
> 
>
> Key: FLINK-28758
> URL: https://issues.apache.org/jira/browse/FLINK-28758
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.15.0, 1.17.0, 1.16.1
> Environment: Flink version:1.15.0
> deploy mode :K8s applicaiton Mode.   local mini cluster also have this 
> problem.
> Kafka Connector : use Kafka SourceFunction . No new Api.
>Reporter: hjw
>Assignee: Piotr Nowojski
>Priority: Critical
> Attachments: image-2022-10-13-19-47-56-635.png
>
>
> I post a stop with savepoint request to Flink Job throught rest api.
> A Error happened in Kafka connector close.
> The job will enter restarting .
> It is successful to use savepoint command alone.
> {code:java}
> 13:33:42.857 [Kafka Fetcher for Source: nlp-kafka-source -> nlp-clean 
> (1/1)#0] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer 
> clientId=consumer-hjw-3, groupId=hjw] Kafka consumer has been closed
> 13:33:42.857 [Kafka Fetcher for Source: cpp-kafka-source -> cpp-clean 
> (1/1)#0] INFO org.apache.kafka.common.utils.AppInfoParser - App info 
> kafka.consumer for consumer-hjw-4 unregistered
> 13:33:42.857 [Kafka Fetcher for Source: cpp-kafka-source -> cpp-clean 
> (1/1)#0] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer 
> clientId=consumer-hjw-4, groupId=hjw] Kafka consumer has been closed
> 13:33:42.860 [Source: nlp-kafka-source -> nlp-clean (1/1)#0] DEBUG 
> org.apache.flink.streaming.runtime.tasks.StreamTask - Cleanup StreamTask 
> (operators closed: false, cancelled: false)
> 13:33:42.860 [jobmanager-io-thread-4] INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline 
> checkpoint 5 by task eeefcb27475446241861ad8db3f33144 of job 
> d6fed247feab1c0bcc1b0dcc2cfb4736 at 79edfa88-ccc3-4140-b0e1-7ce8a7f8669f @ 
> 127.0.0.1 (dataPort=-1).
> org.apache.flink.util.SerializedThrowable: Task name with subtask : Source: 
> nlp-kafka-source -> nlp-clean (1/1)#0 Failure reason: Task has failed.
>  at 
> org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1388)
>  at 
> org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1331)
>  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
>  at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:343)
> Caused by: org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>  at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>  ... 3 common frames omitted
> Caused by: org.apache.flink.util.SerializedThrowable: null
>  at 
> org.apache.flink.streaming.connectors.kafka.internals.Handover.close(Handover.java:177)
>  at 
> org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.cancel(KafkaFetcher.java:164)
>  at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.cancel(FlinkKafkaConsumerBase.java:945)
>  at 
> org.apache.flink.streaming.api.operators.StreamSource.stop(StreamSource.java:128)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.stopOperatorForStopWithSavepoint(SourceStreamTask.java:305)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.lambda$triggerStopWithSavepointAsync$1(SourceStreamTask.java:285)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
>  at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
>

[jira] [Commented] (FLINK-31966) Flink Kubernetes operator lacks TLS support

2023-09-13 Thread Gyula Fora (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764674#comment-17764674
 ] 

Gyula Fora commented on FLINK-31966:


[~tagarr] I think that sounds reasonable. I think this would work but I don't 
really know the exact expectation of users requiring this feature unfortunately 
:) 

> Flink Kubernetes operator lacks TLS support 
> 
>
> Key: FLINK-31966
> URL: https://issues.apache.org/jira/browse/FLINK-31966
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Adrian Vasiliu
>Priority: Major
>
> *Summary*
> The Flink Kubernetes operator lacks support inside the FlinkDeployment 
> operand for configuring Flink with TLS (both one-way and mutual) for the 
> internal communication between jobmanagers and taskmanagers, and for the 
> external REST endpoint. Although a workaround exists to configure the job and 
> task managers, this breaks the operator and renders it unable to reconcile.
> *Additional information*
>  * The Apache Flink operator supports passing through custom flink 
> configuration to be applied to job and task managers.
>  * If you supply SSL-based properties, the operator can no longer speak to 
> the deployed job manager. The operator is reading the flink conf and using it 
> to create a connection to the job manager REST endpoint, but it uses the 
> truststore file paths within flink-conf.yaml, which are unresolvable from the 
> operator. This leaves the operator hanging in a pending state as it cannot 
> complete a reconcile.
> *Proposal*
> Our proposal is to make changes to the operator code. A simple change exists 
> that would be enough to enable anonymous SSL at the REST endpoint, but more 
> invasive changes would be required to enable full mTLS throughout.
> The simple change to enable anonymous SSL would be for the operator to parse 
> flink-conf and podTemplate to identify the Kubernetes resource that contains 
> the certificate from the job manager keystore and use it inside the 
> operator’s trust store.
> In the case of mutual TLS, further changes are required: the operator would 
> need to generate a certificate signed by the same issuing authority as the 
> job manager’s certificates and then use it in a keystore when challenged by 
> that job manager. We propose that the operator becomes responsible for making 
> CertificateSigningRequests to generate certificates for job manager, task 
> manager and operator. The operator can then coordinate deploying the job and 
> task managers with the correct flink-conf and volume mounts. This would also 
> work for anonymous SSL.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei commented on FLINK-33052:
--

# I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:30 PM:
---

Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.


was (Author: ym):
# I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:31 PM:
---

Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.


was (Author: ym):
Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:31 PM:
---

Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.


was (Author: ym):
Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:31 PM:
---

Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.


was (Author: ym):
Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-connector-kafka] boring-cyborg[bot] commented on pull request #48: [FLINK-28758] Fix stop-with-savepoint for FlinkKafkaConsumer

2023-09-13 Thread via GitHub


boring-cyborg[bot] commented on PR #48:
URL: 
https://github.com/apache/flink-connector-kafka/pull/48#issuecomment-1717647945

   Thanks for opening this pull request! Please check out our contributing 
guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-28758) FlinkKafkaConsumer fails to stop with savepoint

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-28758:
---
Labels: pull-request-available  (was: )

> FlinkKafkaConsumer fails to stop with savepoint 
> 
>
> Key: FLINK-28758
> URL: https://issues.apache.org/jira/browse/FLINK-28758
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.15.0, 1.17.0, 1.16.1
> Environment: Flink version:1.15.0
> deploy mode :K8s applicaiton Mode.   local mini cluster also have this 
> problem.
> Kafka Connector : use Kafka SourceFunction . No new Api.
>Reporter: hjw
>Assignee: Piotr Nowojski
>Priority: Critical
>  Labels: pull-request-available
> Attachments: image-2022-10-13-19-47-56-635.png
>
>
> I post a stop with savepoint request to Flink Job throught rest api.
> A Error happened in Kafka connector close.
> The job will enter restarting .
> It is successful to use savepoint command alone.
> {code:java}
> 13:33:42.857 [Kafka Fetcher for Source: nlp-kafka-source -> nlp-clean 
> (1/1)#0] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer 
> clientId=consumer-hjw-3, groupId=hjw] Kafka consumer has been closed
> 13:33:42.857 [Kafka Fetcher for Source: cpp-kafka-source -> cpp-clean 
> (1/1)#0] INFO org.apache.kafka.common.utils.AppInfoParser - App info 
> kafka.consumer for consumer-hjw-4 unregistered
> 13:33:42.857 [Kafka Fetcher for Source: cpp-kafka-source -> cpp-clean 
> (1/1)#0] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer 
> clientId=consumer-hjw-4, groupId=hjw] Kafka consumer has been closed
> 13:33:42.860 [Source: nlp-kafka-source -> nlp-clean (1/1)#0] DEBUG 
> org.apache.flink.streaming.runtime.tasks.StreamTask - Cleanup StreamTask 
> (operators closed: false, cancelled: false)
> 13:33:42.860 [jobmanager-io-thread-4] INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline 
> checkpoint 5 by task eeefcb27475446241861ad8db3f33144 of job 
> d6fed247feab1c0bcc1b0dcc2cfb4736 at 79edfa88-ccc3-4140-b0e1-7ce8a7f8669f @ 
> 127.0.0.1 (dataPort=-1).
> org.apache.flink.util.SerializedThrowable: Task name with subtask : Source: 
> nlp-kafka-source -> nlp-clean (1/1)#0 Failure reason: Task has failed.
>  at 
> org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1388)
>  at 
> org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1331)
>  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
>  at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:343)
> Caused by: org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException
>  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>  at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>  ... 3 common frames omitted
> Caused by: org.apache.flink.util.SerializedThrowable: null
>  at 
> org.apache.flink.streaming.connectors.kafka.internals.Handover.close(Handover.java:177)
>  at 
> org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.cancel(KafkaFetcher.java:164)
>  at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.cancel(FlinkKafkaConsumerBase.java:945)
>  at 
> org.apache.flink.streaming.api.operators.StreamSource.stop(StreamSource.java:128)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.stopOperatorForStopWithSavepoint(SourceStreamTask.java:305)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.lambda$triggerStopWithSavepointAsync$1(SourceStreamTask.java:285)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
>  at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
>  at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMai

[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:35 PM:
---

Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18 (RC0 is out, waiting for RC1)
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker for 1.18, we can chat off-line as well.


was (Author: ym):
Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker for 1.18, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:35 PM:
---

Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker for 1.18, we can chat off-line as well.


was (Author: ym):
Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:39 PM:
---

Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18 (RC0 is out, waiting for RC1)
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker for 1.18, we can chat off-line as 
well. Or at least I do not want this to block the RC? 


was (Author: ym):
Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18 (RC0 is out, waiting for RC1)
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker for 1.18, we can chat off-line as well.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:41 PM:
---

Hey [~pnowojski] 
 # Strictly speaking, yes it is a blocker for release.
 # But since 1.18 has been branch cut and most of the tests have already been 
done (No feature is allowed to be merged), maybe we should not make this a 
blocker issue for RC1 of 1.18 (RC0 is out, waiting for RC1)?
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # I agree we should have this set up before 1.18 release.


was (Author: ym):
Hey [~pnowojski] 
 # I think this issue is very critical as you mentioned
 # Since 1.18 has been branch cut and most of the tests have already been done 
(No feature is allowed to be merged). I do not see why this should be a blocker 
issue for release 1.18 (RC0 is out, waiting for RC1)
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # If you think this is indeed a blocker for 1.18, we can chat off-line as 
well. Or at least I do not want this to block the RC? 

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 1:44 PM:
---

Hey [~pnowojski] 
 # Strictly speaking, yes it is a blocker for release.
 # But since 1.18 has been branch cut and most of the tests have already been 
done (No feature is allowed to be merged), maybe we should not make this a 
blocker issue for RC?
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # I agree we should have this set up before 1.18 release.


was (Author: ym):
Hey [~pnowojski] 
 # Strictly speaking, yes it is a blocker for release.
 # But since 1.18 has been branch cut and most of the tests have already been 
done (No feature is allowed to be merged), maybe we should not make this a 
blocker issue for RC1 of 1.18 (RC0 is out, waiting for RC1)?
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # I agree we should have this set up before 1.18 release.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-kubernetes-operator] mxm commented on a diff in pull request #672: [FLINK-33081] Apply parallelism overrides in scale

2023-09-13 Thread via GitHub


mxm commented on code in PR #672:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/672#discussion_r1324308777


##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##
@@ -74,6 +76,36 @@ public JobAutoScalerImpl(
 this.infoManager = new AutoscalerInfoManager();
 }
 
+@Override
+public void scale(FlinkResourceContext ctx) {
+var conf = ctx.getObserveConfig();
+var resource = ctx.getResource();
+var resourceId = ResourceID.fromResource(resource);
+var autoscalerMetrics = getOrInitAutoscalerFlinkMetrics(ctx, 
resourceId);
+
+try {
+if (resource.getSpec().getJob() == null || 
!conf.getBoolean(AUTOSCALER_ENABLED)) {
+LOG.debug("Autoscaler is disabled");
+return;
+}
+
+// Initialize metrics only if autoscaler is enabled
+var status = resource.getStatus();
+if (status.getLifecycleState() != ResourceLifecycleState.STABLE
+|| 
!status.getJobStatus().getState().equals(JobStatus.RUNNING.name())) {
+LOG.info("Autoscaler is waiting for RUNNING job state");
+lastEvaluatedMetrics.remove(resourceId);
+return;
+}
+
+updateParallelismOverrides(ctx, conf, resource, resourceId, 
autoscalerMetrics);
+} catch (Throwable e) {
+onError(ctx, resource, autoscalerMetrics, e);
+} finally {
+applyParallelismOverrides(ctx);

Review Comment:
   At first sight, this looks like the overrides will get applied, even if the 
autoscaler is disabled. There is another check though that prevents this here: 
https://github.com/apache/flink-kubernetes-operator/pull/672/files?diff=unified&w=1#diff-7df0c6b50a32c0055e6a1dcfcf9ab25cddb2a245b2125119fd9b57d65918698dR128
 (line 128)
   
   A bit confusing. See other comment line 88.



##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##
@@ -74,6 +76,36 @@ public JobAutoScalerImpl(
 this.infoManager = new AutoscalerInfoManager();
 }
 
+@Override
+public void scale(FlinkResourceContext ctx) {
+var conf = ctx.getObserveConfig();
+var resource = ctx.getResource();
+var resourceId = ResourceID.fromResource(resource);
+var autoscalerMetrics = getOrInitAutoscalerFlinkMetrics(ctx, 
resourceId);
+
+try {
+if (resource.getSpec().getJob() == null || 
!conf.getBoolean(AUTOSCALER_ENABLED)) {
+LOG.debug("Autoscaler is disabled");
+return;
+}
+
+// Initialize metrics only if autoscaler is enabled
+var status = resource.getStatus();
+if (status.getLifecycleState() != ResourceLifecycleState.STABLE
+|| 
!status.getJobStatus().getState().equals(JobStatus.RUNNING.name())) {
+LOG.info("Autoscaler is waiting for RUNNING job state");
+lastEvaluatedMetrics.remove(resourceId);
+return;
+}
+
+updateParallelismOverrides(ctx, conf, resource, resourceId, 
autoscalerMetrics);

Review Comment:
   ```suggestion
   runScalingLogic(ctx, conf, resource, resourceId, 
autoscalerMetrics);
   ```



##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##
@@ -74,6 +76,36 @@ public JobAutoScalerImpl(
 this.infoManager = new AutoscalerInfoManager();
 }
 
+@Override
+public void scale(FlinkResourceContext ctx) {
+var conf = ctx.getObserveConfig();
+var resource = ctx.getResource();
+var resourceId = ResourceID.fromResource(resource);
+var autoscalerMetrics = getOrInitAutoscalerFlinkMetrics(ctx, 
resourceId);
+
+try {
+if (resource.getSpec().getJob() == null || 
!conf.getBoolean(AUTOSCALER_ENABLED)) {
+LOG.debug("Autoscaler is disabled");

Review Comment:
   Would reset the overrides here.



##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##
@@ -74,6 +76,36 @@ public JobAutoScalerImpl(
 this.infoManager = new AutoscalerInfoManager();
 }
 
+@Override
+public void scale(FlinkResourceContext ctx) {
+var conf = ctx.getObserveConfig();
+var resource = ctx.getResource();
+var resourceId = ResourceID.fromResource(resource);
+var autoscalerMetrics = getOrInitAutoscalerFlinkMetrics(ctx, 
resourceId);
+

Review Comment:
   An alternative would be to apply the current overrides here and the new 
overrides after the scaling. That would get rid of the finally block.



-- 
This is an automated message from the Apache G

[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode

2023-09-13 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764702#comment-17764702
 ] 

Zili Chen commented on FLINK-33053:
---

I noticed that the {{TreeCache}}'s close call {{removeWatches}} instead of 
{{removeAllWatches}} called by your scripts above.

{{removeWatches}} only remove the watcher in client side so remain the server 
side watcher as is.

> Watcher leak in Zookeeper HA mode
> -
>
> Key: FLINK-33053
> URL: https://issues.apache.org/jira/browse/FLINK-33053
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.0, 1.18.0, 1.17.1
>Reporter: Yangze Guo
>Priority: Blocker
> Attachments: 26.dump.zip, 26.log, 
> taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json
>
>
> We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA 
> mode. TM's watches on the leader of JobMaster has not been stopped after job 
> finished.
> Here is how we re-produce this issue:
>  - Start a session cluster and enable Zookeeper HA mode.
>  - Continuously and concurrently submit short queries, e.g. WordCount to the 
> cluster.
>  - echo -n wchp | nc \{zk host} \{zk port} to get current watches.
> We can see a lot of watches on 
> /flink/\{cluster_name}/leader/\{job_id}/connection_info.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33052) codespeed server is down

2023-09-13 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764675#comment-17764675
 ] 

Yuan Mei edited comment on FLINK-33052 at 9/13/23 2:14 PM:
---

Hey [~pnowojski] 
 # Strictly speaking, yes it is a blocker for release.
 # But since 1.18 has been branch cut and most of the tests have already been 
done (No feature is allowed to be merged), maybe we should not make this a 
blocker issue for RC?
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # I agree we should have this set up as soon as possible, best before 1.18 
release.


was (Author: ym):
Hey [~pnowojski] 
 # Strictly speaking, yes it is a blocker for release.
 # But since 1.18 has been branch cut and most of the tests have already been 
done (No feature is allowed to be merged), maybe we should not make this a 
blocker issue for RC?
 # [~Zakelly] has already been working on this, but as you can see this issue 
takes time (applying grants for buying new machines and set up everything, 
e.t.c)
 # I agree we should have this set up before 1.18 release.

> codespeed server is down
> 
>
> Key: FLINK-33052
> URL: https://issues.apache.org/jira/browse/FLINK-33052
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Test Infrastructure
>Affects Versions: 1.18.0
>Reporter: Jing Ge
>Assignee: Zakelly Lan
>Priority: Blocker
>
> No update in #flink-dev-benchmarks slack channel since 25th August.
> It was a EC2 running in a legacy aws account. Currently on one knows which 
> account it is. 
>  
> https://apache-flink.slack.com/archives/C0471S0DFJ9/p1693932155128359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] dawidwys opened a new pull request, #23412: [FLINK-33083] Properly apply ReadingMetadataSpec for a TableSourceScan

2023-09-13 Thread via GitHub


dawidwys opened a new pull request, #23412:
URL: https://github.com/apache/flink/pull/23412

   ## What is the purpose of the change
   
   The PR fixes creating a TableSourceScan to also include a proper 
ReadingMetadataSpec which has been applied on the source.
   
   
   ## Brief change log
   
   * revert the solution added in #22894 because it works around the outcome of 
the bug rather than fixes the bug
   * create a ReadingMetadataSpec when applying it on the `TableSource` in a 
`TableSourceScan`
   
   ## Verifying this change
   
   * tests added in #22894 should still pass
   * added a dedicated test 
`TableSourceJsonPlanITCase#testReadingMetadataWithProjectionPushDownDisabled`
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't 
know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / **no**)
 - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-33083) SupportsReadingMetadata is not applied when loading a CompiledPlan

2023-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-33083:
---
Labels: pull-request-available  (was: )

> SupportsReadingMetadata is not applied when loading a CompiledPlan
> --
>
> Key: FLINK-33083
> URL: https://issues.apache.org/jira/browse/FLINK-33083
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 1.16.2, 1.17.1
>Reporter: Dawid Wysakowicz
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Labels: pull-request-available
>
> If a few conditions are met, we can not apply ReadingMetadata interface:
> # source overwrites:
>  {code}
> @Override
> public boolean supportsMetadataProjection() {
> return false;
> }
>  {code}
> # source does not implement {{SupportsProjectionPushDown}}
> # table has metadata columns e.g.
> {code}
> CREATE TABLE src (
>   physical_name STRING,
>   physical_sum INT,
>   timestamp TIMESTAMP_LTZ(3) NOT NULL METADATA VIRTUAL
> )
> {code}
> # we query the table {{SELECT * FROM src}}
> It fails with:
> {code}
> Caused by: java.lang.IllegalArgumentException: Row arity: 1, but serializer 
> arity: 2
>   at 
> org.apache.flink.table.runtime.typeutils.RowDataSerializer.copy(RowDataSerializer.java:124)
> {code}
> The reason is {{SupportsReadingMetadataSpec}} is created only in the 
> {{PushProjectIntoTableSourceScanRule}}, but the rule is not applied when 1 & 2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] flinkbot commented on pull request #23412: [FLINK-33083] Properly apply ReadingMetadataSpec for a TableSourceScan

2023-09-13 Thread via GitHub


flinkbot commented on PR #23412:
URL: https://github.com/apache/flink/pull/23412#issuecomment-1717739419

   
   ## CI report:
   
   * 75b651538a1fa083c211ab8c4020822590b89043 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode

2023-09-13 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764721#comment-17764721
 ] 

Zili Chen commented on FLINK-33053:
---

See https://lists.apache.org/thread/3b9hn9j4c05yfztlr2zcctbg7sqwdh58.

This seems to be a ZK issue that I met one year ago..

> Watcher leak in Zookeeper HA mode
> -
>
> Key: FLINK-33053
> URL: https://issues.apache.org/jira/browse/FLINK-33053
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.0, 1.18.0, 1.17.1
>Reporter: Yangze Guo
>Priority: Blocker
> Attachments: 26.dump.zip, 26.log, 
> taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json
>
>
> We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA 
> mode. TM's watches on the leader of JobMaster has not been stopped after job 
> finished.
> Here is how we re-produce this issue:
>  - Start a session cluster and enable Zookeeper HA mode.
>  - Continuously and concurrently submit short queries, e.g. WordCount to the 
> cluster.
>  - echo -n wchp | nc \{zk host} \{zk port} to get current watches.
> We can see a lot of watches on 
> /flink/\{cluster_name}/leader/\{job_id}/connection_info.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode

2023-09-13 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764723#comment-17764723
 ] 

Zili Chen commented on FLINK-33053:
---

But we don't have other shared watchers so we can force remove watches as above.

> Watcher leak in Zookeeper HA mode
> -
>
> Key: FLINK-33053
> URL: https://issues.apache.org/jira/browse/FLINK-33053
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.0, 1.18.0, 1.17.1
>Reporter: Yangze Guo
>Priority: Blocker
> Attachments: 26.dump.zip, 26.log, 
> taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json
>
>
> We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA 
> mode. TM's watches on the leader of JobMaster has not been stopped after job 
> finished.
> Here is how we re-produce this issue:
>  - Start a session cluster and enable Zookeeper HA mode.
>  - Continuously and concurrently submit short queries, e.g. WordCount to the 
> cluster.
>  - echo -n wchp | nc \{zk host} \{zk port} to get current watches.
> We can see a lot of watches on 
> /flink/\{cluster_name}/leader/\{job_id}/connection_info.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-kubernetes-operator] sbrother commented on pull request #668: Give cluster/job role access to k8s services API

2023-09-13 Thread via GitHub


sbrother commented on PR #668:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/668#issuecomment-1717779633

   No problem, I just requested an Apache Jira account.
   
   And yes, I think that's what I mean. I had created a Flink Session 
Controller using a basic FlinkDeployment manifest with no job listed (I'm not 
100% on the naming of all the different pods, but this is the pod that exposes 
the Flink dashboard over port 8081). When I sshed into this pod I was surprised 
that I couldn't run `flink list`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



  1   2   >