[jira] [Created] (FLINK-30783) Add pull request template for AWS Connectors repo

2023-01-25 Thread Hong Liang Teoh (Jira)
Hong Liang Teoh created FLINK-30783:
---

 Summary: Add pull request template for AWS Connectors repo
 Key: FLINK-30783
 URL: https://issues.apache.org/jira/browse/FLINK-30783
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / AWS
Reporter: Hong Liang Teoh


Add a pull request template for Apache Flink AWS Connectors 
[https://github.com/apache/flink-connector-aws]

 

This improves our pull request and commit quality.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30784) HiveTableSourceITCase.testPartitionFilter failed with assertion error

2023-01-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30784:
-

 Summary: HiveTableSourceITCase.testPartitionFilter  failed with 
assertion error
 Key: FLINK-30784
 URL: https://issues.apache.org/jira/browse/FLINK-30784
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive
Affects Versions: 1.17.0
Reporter: Matthias Pohl


We see a test failure in {{HiveTableSourceITCase.testPartitionFilter}}:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45184&view=logs&j=a5ef94ef-68c2-57fd-3794-dc108ed1c495&t=2c68b137-b01d-55c9-e603-3ff3f320364b&l=23909

{code}
Jan 25 01:14:55 [ERROR] org.apache.flink.connectors.hive.HiveTableSourceITCase.testPartitionFilter  Time elapsed: 2.212 s  <<< FAILURE!
Jan 25 01:14:55 org.opentest4j.AssertionFailedError: 
Jan 25 01:14:55 
Jan 25 01:14:55 Expecting value to be false but was true
Jan 25 01:14:55 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Jan 25 01:14:55 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
Jan 25 01:14:55 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
Jan 25 01:14:55 at org.apache.flink.connectors.hive.HiveTableSourceITCase.testPartitionFilter(HiveTableSourceITCase.java:314)
Jan 25 01:14:55 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...]
{code}

There's a similar test stability issue still open with FLINK-20975. The 
stacktraces don't match. That's why I decided to open a new one.





[jira] [Created] (FLINK-30785) RocksDB Memory Management end-to-end test failed due to unexpected exception

2023-01-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30785:
-

 Summary: RocksDB Memory Management end-to-end test failed due to 
unexpected exception
 Key: FLINK-30785
 URL: https://issues.apache.org/jira/browse/FLINK-30785
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.17.0
Reporter: Matthias Pohl


We see a test instability with {{RocksDB Memory Management end-to-end test}}. 
The test failed because an exception was detected in the logs:
{code}
2023-01-25T02:47:38.7172354Z Jan 25 02:47:38 Checking for errors...
2023-01-25T02:47:39.1661969Z Jan 25 02:47:39 No errors in log files.
2023-01-25T02:47:39.1662430Z Jan 25 02:47:39 Checking for exceptions...
2023-01-25T02:47:39.2893767Z Jan 25 02:47:39 Found exception in log files; printing first 500 lines; see full logs for details:
[...]
2023-01-25T02:47:39.5674568Z Jan 25 02:47:39 Checking for non-empty .out files...
2023-01-25T02:47:39.5675055Z Jan 25 02:47:39 No non-empty .out files.
2023-01-25T02:47:39.5675352Z Jan 25 02:47:39 
2023-01-25T02:47:39.5676104Z Jan 25 02:47:39 [FAIL] 'RocksDB Memory Management end-to-end test' failed after 1 minutes and 50 seconds! Test exited with exit code 0 but the logs contained errors, exceptions or non-empty .out files
{code}

The only exception being reported in the Flink logs is due to a warning:
{code}
2023-01-25 02:47:38,242 WARN  org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 1 for job 421e4c00ef175b3b133d63cbfe9bca8b. (0 consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1970) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1578) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1173) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1145) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.cancel(DefaultExecutionGraph.java:973) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.scheduler.SchedulerBase.cancel(SchedulerBase.java:671) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMaster.cancel(JobMaster.java:461) ~[flink-dist-1.17-SNAPSHOT.jar:1.17-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_352]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_352]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_352]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_352]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:309) ~[flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:307) ~[flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:222) ~[flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:84) ~[flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:168) ~[flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akka_98d6268d-6cd0-412b-bd3c-ff411c887a5b.jar:1.17-SNAPSHOT]
[...]
{code}

Re: [DISCUSS] FLIP-285: Refactoring the leader election code base in Flink

2023-01-25 Thread Matthias Pohl
Thanks Yang and Chesnay for your feedback. I'm going to go ahead and start a
voting thread, as this discussion thread has already been open for some time.

Best,
Matthias

On Wed, Jan 25, 2023 at 4:08 AM Yang Wang  wrote:

> Having the *start()* in the *LeaderContender* interface and bringing back the
> *LeaderElection* with some new methods makes sense to me.
>
> I have no more concerns now.
>
>
> >- *LeaderContender*: The LeaderContender is integrated as usual except
> >that it accesses the LeaderElection instead of the
> LeaderElectionService.
> >It's going to call startLeaderElection(LeaderContender) where,
> previously,
> >LeaderElectionService.start(LeaderContender) was called.
> >
> > nit: we call the *LeaderElection#startLeaderElection()*, not the
> *LeaderElection#startLeaderElection(LeaderContender)*. Because we have
> already set the leaderContender in the
> *LeaderElection#register(LeaderContender)*.
>
>
> Best,
> Yang
>
>
> On Mon, Jan 23, 2023 at 23:16, Chesnay Schepler wrote:
>
> > Thanks for updating the design. From my side this looks good.
> >
> > On 18/01/2023 17:59, Matthias Pohl wrote:
> > > After another round of discussion, I came up with a (hopefully) final
> > > proposal. The previously discussed approach was still not optimal
> because
> > > the contender ID lived in the LeaderContender even though it is
> actually
> > > LeaderElectionService-internal knowledge. Fixing that helped fix the
> > > overall architecture. Additionally, it brought back the LeaderElection
> > > interface with slightly different methods.
> > >
> > > I updated the "Code Cleanup: Merge
> MultipleComponentLeaderElectionService
> > > into LeaderElectionService" section and moved the old proposal into the
> > > section for rejected alternatives. Feel free to have another look at
> the
> > > updated version [1].
> > >
> > > Matthias
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > >
> > > On Wed, Jan 18, 2023 at 1:40 PM Matthias Pohl 
> > > wrote:
> > >
> > >> Thanks for participating in the discussion, Yang & Chesnay.
> > LeaderElection
> > >> interface extension gave me a headache as well. I added it initially
> > >> because I thought it would be of more value. But essentially, it
> doesn't
> > >> help but make the code harder to understand (as your questions
> > rightfully
> > >> point out). I agree that the FLIP is good enough without this
> > extension. I
> > >> moved it into the Rejected Alternatives section of the FLIP and would
> > >> propose going ahead without it.
> > >>
> > >> I will answer your questions about the LeaderElection extension,
> anyway:
> > >>
> > >> BTW, if the *LeaderElectionService#register(return LeaderElection)*
> and
> > >>> *LeaderElectionService#onGrantLeadership* are guarded by the same lock,
> > then
> > >>> we could ensure that the leaderElection in *LeaderContender* is
> always
> > >>> non-null when it tries to confirm the leadership. And then we do not
> > need
> > >>> the
> > >>> *LeaderContender#initializeLeaderElection*. Right?
> > >>
> > >> No, we still would need LeaderContender#initializeLeaderElection
> because
> > >> the LeaderElectionService needs to be capable of setting the
> > LeaderElection
> > >> within the LeaderContender before triggering the process for granting
> > the
> > >> leadership. This all needs to happen within the
> > >> LeaderElectionService#register(LeaderContender). It's independent of the
> > lock.
> > >>
> > >> With the extension, how does the leader contender get access to the
> > >>> LeaderElection? I would've assumed that LEService returns a
> > LeaderElection
> > >>> when register is called, but according to the diagram this method
> > doesn't
> > >>> return anything. Is that what initiateLeaderElection is doing?
> > >>
> > >> Correct. My initial plan was to make
> > >> LeaderElectionService#register(LeaderContender) return the
> > LeaderElection
> > >> instance. That method could have been called within the
> LeaderContender.
> > >> But this approach has the flaw that LeaderContender would be in charge
> > >> within this control flow where, actually, we would want
> > >> LeaderElectionService to be still in charge to trigger the process for
> > >> granting the leadership. This required the
> > >> LeaderContender.initializeLeaderElection(LeaderElection) method to be
> > added
> > >> to enable the LeaderElectionService to do the initialization. I added
> a
> > >> comment to the corresponding class diagram to make this clearer.
> > >>
> > >> The DefaultLeaderElection will rely on package-private methods of the
> > >>> DLEService to handle confirm/hasLeadership calls?
> > >>
> > >> Correct. I added the missing package-private methods to the class
> > diagram
> > >> in the FLIP to clear things up.
> > >>
> > >> On Wed, Jan 18, 2023 at 11:47 AM Chesnay Schepler  >
> > >> wrote:
> > >>
> > >>> There are a lot of good things 

[VOTE] FLIP-285: Refactoring LeaderElection to make Flink support multi-component leader election out-of-the-box

2023-01-25 Thread Matthias Pohl
Hi everyone,
Since the discussion thread [1] on FLIP-285 [2] didn't bring up any new
items, I want to start voting on FLIP-285. This FLIP will align the leader
election code base again through FLINK-26522 [3]. I also plan to improve
the test coverage for the leader election as part of this change
(covered in FLINK-30338 [4]).

The vote will remain open until at least Jan 30th (at least 72 hours)
unless there are some objections or insufficient votes.

Best,
Matthias

[1] https://lists.apache.org/thread/qrl881wykob3jnmzsof5ho8b9fgkklpt
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
[3] https://issues.apache.org/jira/browse/FLINK-26522
[4] https://issues.apache.org/jira/browse/FLINK-30338

-- 

Matthias Pohl

Software Engineer, Aiven


[jira] [Created] (FLINK-30786) Support merge by name for podTemplate array fields

2023-01-25 Thread Gyula Fora (Jira)
Gyula Fora created FLINK-30786:
--

 Summary: Support merge by name for podTemplate array fields
 Key: FLINK-30786
 URL: https://issues.apache.org/jira/browse/FLINK-30786
 Project: Flink
  Issue Type: New Feature
  Components: Kubernetes Operator
Reporter: Gyula Fora
Assignee: Gyula Fora


The operator currently merges the hierarchical pod template array fields 
(containers, volumes, volume mounts, etc.) by index.

In many cases these array fields already have a name attribute that could be 
used to merge by name. We should allow this as a configurable option.
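
The difference between the two strategies can be illustrated with a minimal sketch. This is not the operator's implementation: the function names, the shallow (non-recursive) merge, and the sample container entries are assumptions made purely for illustration.

```python
from copy import deepcopy


def merge_by_index(base, override):
    """Merge two lists of dicts position by position."""
    merged = []
    for i in range(max(len(base), len(override))):
        item = deepcopy(base[i]) if i < len(base) else {}
        if i < len(override):
            item.update(override[i])  # shallow merge, for illustration only
        merged.append(item)
    return merged


def merge_by_name(base, override, key="name"):
    """Merge two lists of dicts by matching on `key`; unmatched items are appended."""
    merged = [deepcopy(item) for item in base]
    positions = {item[key]: i for i, item in enumerate(merged) if key in item}
    for item in override:
        name = item.get(key)
        if name in positions:
            merged[positions[name]].update(item)
        else:
            merged.append(deepcopy(item))
            if name is not None:
                positions[name] = len(merged) - 1
    return merged


# Hypothetical sample data: a base template with two containers and an
# override that only wants to change the second one.
base = [
    {"name": "flink-main-container", "image": "flink:1.16"},
    {"name": "sidecar", "image": "busybox"},
]
override = [{"name": "sidecar", "image": "busybox:1.36"}]

by_index = merge_by_index(base, override)  # override lands on the first container
by_name = merge_by_name(base, override)    # override lands on the "sidecar" entry
```

With index-based merging the override overwrites the first container, whereas name-based merging only touches the entry called "sidecar".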





[SUMMARY] Flink 1.17 release sync 23rd of January, 2023

2023-01-25 Thread Martijn Visser
Hi everyone,

A summary of the release sync of yesterday:

- We still have 3 performance regressions (
https://issues.apache.org/jira/browse/FLINK-30623,
https://issues.apache.org/jira/browse/FLINK-30625,
https://issues.apache.org/jira/browse/FLINK-30624) that are being worked on
but need to be completed before the release branch cut on the 31st. If we
can't merge PRs to resolve these by Thursday the 26th at the latest, we will
revert the commits that introduced the regressions.
- There are 3 release blockers from a test perspective:
https://issues.apache.org/jira/browse/FLINK-29405,
https://issues.apache.org/jira/browse/FLINK-30727 and
https://issues.apache.org/jira/browse/FLINK-29427. If you are assigned to one
of these tickets, please make sure that you have marked it as "In Progress".
- The feature freeze starts on Tuesday the 31st of January and the release
branch will be cut as soon as the blockers have been resolved. Once the
release branch has been cut, release testing will start.

Best regards,

Qingsheng, Leonard, Matthias and Martijn


[RESULT][VOTE] FLIP-290: Operator state compression (FLINK-30113)

2023-01-25 Thread Etienne Chauchot

Hi all,

I am happy to announce that FLIP-290 Operator state compression
(FLINK-30113) [1] has been accepted.

There are 3 binding votes and 2 non-binding votes [2]:

- Dawid Wysakowicz (binding)
- Piotr Nowojski (binding)
- Martijn Visser (binding)
- ConradJam (non-binding)
- Rui Fan (non-binding)


There were no disapproving votes.

Best
Etienne

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-290+Operator+state+compression
[2] https://lists.apache.org/thread/j09drt66y1gjo7c81lhmosgfhdwrq33g



[jira] [Created] (FLINK-30787) dmesg fails to save data to file due to permissions

2023-01-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30787:
-

 Summary: dmesg fails to save data to file due to permissions
 Key: FLINK-30787
 URL: https://issues.apache.org/jira/browse/FLINK-30787
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.15.3, 1.16.0, 1.17.0
Reporter: Matthias Pohl


We're not collecting the {{dmesg}} output in any build due to a permission 
issue:
{code}
2023-01-12T10:10:25.1598207Z dmesg: read kernel buffer failed: Operation not permitted
{code}





[jira] [Created] (FLINK-30788) Refactor redundant code in AbstractHaServices

2023-01-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30788:
-

 Summary: Refactor redundant code in AbstractHaServices 
 Key: FLINK-30788
 URL: https://issues.apache.org/jira/browse/FLINK-30788
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.17.0
Reporter: Matthias Pohl


{{AbstractHaServices.createLeaderElectionService}} returns 
{{LeaderElectionService}}. All implementations return 
{{DefaultLeaderElectionService}}. The actual implementation-specific code 
creates the {{LeaderElectionDriverFactory}}.

We can remove the redundant code here.





[jira] [Created] (FLINK-30789) Merge MultipleComponentsLeaderElectionDriver methods notifyAllKnownLeaderInformation and notifyLeaderInformationChange into a single method and

2023-01-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30789:
-

 Summary: Merge MultipleComponentsLeaderElectionDriver methods 
notifyAllKnownLeaderInformation and notifyLeaderInformationChange into a single 
method and 
 Key: FLINK-30789
 URL: https://issues.apache.org/jira/browse/FLINK-30789
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.17.0
Reporter: Matthias Pohl


The new interface proposed in FLIP-285 shall provide only a single method for 
writing data into the HA backend.





[jira] [Created] (FLINK-30790) [Connectors/Jdbc] Refactor of testings

2023-01-25 Thread Jira
João Boto created FLINK-30790:
-

 Summary: [Connectors/Jdbc] Refactor of testings
 Key: FLINK-30790
 URL: https://issues.apache.org/jira/browse/FLINK-30790
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / JDBC
Reporter: João Boto


This is an attempt to get better results when executing the tests of the JDBC 
connector.

On the current master branch it takes about 15 minutes to execute all tests 
(varying between 10 and 20 minutes).

 
{code:java}
[INFO] Reactor Summary for Flink : Connectors : JDBC : Parent 3.1-SNAPSHOT:
[INFO] 
[INFO] Flink : Connectors : JDBC : Parent . SUCCESS [ 22.019 s]
[INFO] Flink : Connectors : JDBC .. SUCCESS [13:45 min]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time:  14:07 min
[INFO] Finished at: 2022-12-23T09:22:46Z
{code}
 

 

The main problems that we see in the tests are:
 * Parameterized tests that set up database containers (this becomes more 
time-consuming as more databases are added)
 * Creation of containers per class (this is fine, as we want tests to be 
independent, but the database could be set up once and cleaned at the end of 
each class)
 * No easy way to extend a test to another database; in the end we copy a lot 
of code
 * A lot of code to create and populate tables for testing. We have 
JdbcTestBase, which uses JdbcTestFixture to create a kind of 'Book Store', but 
this is used in many tests that do not use all of the tables implemented in 
the store. 

 

 





[jira] [Created] (FLINK-30791) Codespeed machine is not responding

2023-01-25 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30791:
--

 Summary: Codespeed machine is not responding
 Key: FLINK-30791
 URL: https://issues.apache.org/jira/browse/FLINK-30791
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.16.0, 1.17.0
Reporter: Piotr Nowojski


Neither speedcenter: [http://codespeed.dak8s.net:8000/]

nor jenkins: [http://codespeed.dak8s.net:8080|http://codespeed.dak8s.net:8080/]

are responding





Doubt with kafka commits taking too long

2023-01-25 Thread André Leite
Hello team,
I'm having trouble understanding why the Kafka consumer offset commits of my
Flink job are taking so long. I have a checkpoint interval of 1s and the
following warning appears. I'm currently using version 1.14.

*Committing offsets to Kafka takes longer than the checkpoint interval.
Skipping commit of previous offsets because newer complete checkpoint
offsets are available. This does not compromise Flink's checkpoint
integrity*

For comparison, in some Kafka streams we have running, the commit latency is
around 100 ms.
Can you point me in the right direction? Are there any metrics that I can
look at?

Best regards,

André Leite


Re: [SUMMARY] Flink 1.17 release sync 23rd of January, 2023

2023-01-25 Thread Lijie Wang
Hi Martijn,

I'm working on FLINK-30624, and it may take a while to be resolved. Do you
mean we should resolve it before the 26th? I had assumed the deadline was
the 31st (the date of the feature freeze).

Best,
Lijie

On Wed, Jan 25, 2023 at 18:07, Martijn Visser wrote:

> Hi everyone,
>
> A summary of the release sync of yesterday:
>
> - We still have 3 performance regressions (
> https://issues.apache.org/jira/browse/FLINK-30623,
> https://issues.apache.org/jira/browse/FLINK-30625,
> https://issues.apache.org/jira/browse/FLINK-30624) that are being worked
> on
> but need to be completed before the release branch cut on the 31st. If we
> can't merge PRs to resolve this (at latest on Thursday the 26th) we will
> revert the commits that introduced the regressions.
> - There are 3 release blockers from a test perspective:
> https://issues.apache.org/jira/browse/FLINK-29405,
> https://issues.apache.org/jira/browse/FLINK-30727 and
> https://issues.apache.org/jira/browse/FLINK-29427. Please make sure that
> if
> you are assigned to this ticket, that you have marked the ticket as "In
> Progress".
> - The feature freeze starts on Thursday the 31st of January and the release
> branch will be cut as soon as the blockers have been resolved. When the
> release branch has been cut, the release testing will start.
>
> Best regards,
>
> Qingsheng, Leonard, Matthias and Martijn
>


[jira] [Created] (FLINK-30792) clean up not uploaded state changes after materialization complete

2023-01-25 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-30792:
---

 Summary: clean up not uploaded state changes after materialization 
complete
 Key: FLINK-30792
 URL: https://issues.apache.org/jira/browse/FLINK-30792
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.16.0
Reporter: Feifan Wang


We should clean up not-uploaded state changes after materialization has 
completed; otherwise (status quo): 
 # subsequent checkpoints contain wrong state changes from before the 
completed materialization
 # a FileNotFound exception may occur when recovering from such a problematic 
checkpoint, because the state change files from before the completed 
materialization may have been deleted when the checkpoint was subsumed.

Since state changes from before the completed materialization in 
FsStateChangelogWriter#notUploaded will not be used in any subsequent 
checkpoint, I suggest cleaning them up while handling the materialization 
result. 

What do you think about this? [~ym], [~roman] 
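
The suggested cleanup could be sketched roughly as follows. This is a toy model and not Flink's FsStateChangelogWriter: the class, the method names, and the assumption that pending changes are keyed by a monotonically increasing sequence number (with the materialization result carrying an "up to" sequence number) are all hypothetical.

```python
class ChangelogWriterSketch:
    """Toy model of a changelog writer that tracks not-yet-uploaded changes."""

    def __init__(self):
        self.next_seq = 0
        # Hypothetical layout: sequence number -> pending state change.
        self.not_uploaded = {}

    def append(self, change):
        """Record a state change and return the sequence number assigned to it."""
        seq = self.next_seq
        self.not_uploaded[seq] = change
        self.next_seq += 1
        return seq

    def on_materialization_completed(self, materialized_up_to):
        """Drop pending changes that the materialized snapshot already covers."""
        stale = [seq for seq in self.not_uploaded if seq < materialized_up_to]
        for seq in stale:
            del self.not_uploaded[seq]
        return len(stale)

    def changes_for_checkpoint(self):
        """Return the pending changes a subsequent checkpoint would include."""
        return [self.not_uploaded[seq] for seq in sorted(self.not_uploaded)]


writer = ChangelogWriterSketch()
for change in ["a=1", "b=2", "a=3"]:
    writer.append(change)

# Materialization covers everything before sequence number 2.
dropped = writer.on_materialization_completed(materialized_up_to=2)
remaining = writer.changes_for_checkpoint()
```

Without the cleanup step, the two changes already covered by the materialization would still be included in the next checkpoint, which is the first problem listed above.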





[jira] [Created] (FLINK-30793) PyFlink YARN per-job on Docker test fails on Azure due to permission issues

2023-01-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30793:
-

 Summary: PyFlink YARN per-job on Docker test fails on Azure due to 
permission issues
 Key: FLINK-30793
 URL: https://issues.apache.org/jira/browse/FLINK-30793
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hadoop Compatibility, Deployment / YARN
Affects Versions: 1.17.0
Reporter: Matthias Pohl


The following build failed due to some HDFS/YARN permission issues in the 
PyFlink YARN per-job on Docker e2e test:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45202&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=160c9ae5-96fd-516e-1c91-deb81f59292a&l=10587

{code}
[...]
Jan 26 02:17:31 23/01/26 02:12:20 FATAL hs.JobHistoryServer: Error starting JobHistoryServer
Jan 26 02:17:31 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Error creating done directory: [hdfs://master.docker-hadoop-cluster-network:9000/tmp/hadoop-yarn/staging/history/done]
Jan 26 02:17:31 at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.tryCreatingHistoryDirs(HistoryFileManager.java:698)
Jan 26 02:17:31 at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.createHistoryDirs(HistoryFileManager.java:634)
Jan 26 02:17:31 at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.serviceInit(HistoryFileManager.java:595)
Jan 26 02:17:31 at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
Jan 26 02:17:31 at org.apache.hadoop.mapreduce.v2.hs.JobHistory.serviceInit(JobHistory.java:96)
Jan 26 02:17:31 at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
Jan 26 02:17:31 at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
Jan 26 02:17:31 at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.serviceInit(JobHistoryServer.java:152)
Jan 26 02:17:31 at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
Jan 26 02:17:31 at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:228)
Jan 26 02:17:31 at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.main(JobHistoryServer.java:238)
Jan 26 02:17:31 Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode="/":hdfs:hadoop:drwxr-xr-x
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:350)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:251)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:189)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1756)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1740)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1699)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:60)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3007)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1141)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:659)
Jan 26 02:17:31 at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
[...]
{code}


