[jira] [Created] (FLINK-26041) AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure hang on azure

2022-02-09 Thread Yun Gao (Jira)
Yun Gao created FLINK-26041:
---

 Summary: 
AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure 
hang on azure
 Key: FLINK-26041
 URL: https://issues.apache.org/jira/browse/FLINK-26041
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Yun Gao



{code:java}
Feb 08 13:04:58 "main" #1 prio=5 os_prio=0 tid=0x7fdcf000b800 nid=0x47bd 
waiting on condition [0x7fdcf697b000]
Feb 08 13:04:58java.lang.Thread.State: WAITING (parking)
Feb 08 13:04:58 at sun.misc.Unsafe.park(Native Method)
Feb 08 13:04:58 - parking to wait for  <0x8f644330> (a 
java.util.concurrent.CompletableFuture$Signaller)
Feb 08 13:04:58 at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
Feb 08 13:04:58 at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
Feb 08 13:04:58 at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
Feb 08 13:04:58 at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
Feb 08 13:04:58 at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
Feb 08 13:04:58 at 
org.apache.flink.util.AutoCloseableAsync.close(AutoCloseableAsync.java:36)
Feb 08 13:04:58 at 
org.apache.flink.test.recovery.AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure(AbstractTaskManagerProcessFailureRecoveryTest.java:209)
Feb 08 13:04:58 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
Feb 08 13:04:58 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Feb 08 13:04:58 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Feb 08 13:04:58 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 08 13:04:58 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
Feb 08 13:04:58 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
Feb 08 13:04:58 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
Feb 08 13:04:58 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
Feb 08 13:04:58 at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Feb 08 13:04:58 at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Feb 08 13:04:58 at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Feb 08 13:04:58 at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Feb 08 13:04:58 at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Feb 08 13:04:58 at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
Feb 08 13:04:58 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
Feb 08 13:04:58 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
Feb 08 13:04:58 at 
org.junit.runners.ParentRunner.run(ParentRunner.java:413)
Feb 08 13:04:58 at org.junit.runners.Suite.runChild(Suite.java:128)
Feb 08 13:04:58 at org.junit.runners.Suite.runChild(Suite.java:27)

{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30901&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=14617



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26042) PyFlinkEmbeddedSubInterpreterTests.test_udf_without_arguments failed on azure

2022-02-09 Thread Yun Gao (Jira)
Yun Gao created FLINK-26042:
---

 Summary: PyFlinkEmbeddedSubInterpreterTests.test_udf_without_arguments 
failed on azure
 Key: FLINK-26042
 URL: https://issues.apache.org/jira/browse/FLINK-26042
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.15.0
Reporter: Yun Gao



{code:java}
2022-02-08T02:55:16.0701246Z Feb 08 02:55:16 
=== FAILURES ===
2022-02-08T02:55:16.0702483Z Feb 08 02:55:16  
PyFlinkEmbeddedSubInterpreterTests.test_udf_without_arguments _
2022-02-08T02:55:16.0703190Z Feb 08 02:55:16 
2022-02-08T02:55:16.0703959Z Feb 08 02:55:16 self = 

2022-02-08T02:55:16.0704967Z Feb 08 02:55:16 
2022-02-08T02:55:16.0705639Z Feb 08 02:55:16 def 
test_udf_without_arguments(self):
2022-02-08T02:55:16.0706641Z Feb 08 02:55:16 one = udf(lambda: 1, 
result_type=DataTypes.BIGINT(), deterministic=True)
2022-02-08T02:55:16.0707595Z Feb 08 02:55:16 two = udf(lambda: 2, 
result_type=DataTypes.BIGINT(), deterministic=False)
2022-02-08T02:55:16.0713079Z Feb 08 02:55:16 
2022-02-08T02:55:16.0714866Z Feb 08 02:55:16 table_sink = 
source_sink_utils.TestAppendSink(['a', 'b'],
2022-02-08T02:55:16.0716495Z Feb 08 02:55:16
   [DataTypes.BIGINT(), DataTypes.BIGINT()])
2022-02-08T02:55:16.0717411Z Feb 08 02:55:16 
self.t_env.register_table_sink("Results", table_sink)
2022-02-08T02:55:16.0718059Z Feb 08 02:55:16 
2022-02-08T02:55:16.0719148Z Feb 08 02:55:16 t = 
self.t_env.from_elements([(1, 2), (2, 5), (3, 1)], ['a', 'b'])
2022-02-08T02:55:16.0719974Z Feb 08 02:55:16 >   t.select(one(), 
two()).execute_insert("Results").wait()
2022-02-08T02:55:16.0720697Z Feb 08 02:55:16 
2022-02-08T02:55:16.0721294Z Feb 08 02:55:16 
pyflink/table/tests/test_udf.py:252: 
2022-02-08T02:55:16.0722119Z Feb 08 02:55:16 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2022-02-08T02:55:16.0722943Z Feb 08 02:55:16 pyflink/table/table_result.py:76: 
in wait
2022-02-08T02:55:16.0723686Z Feb 08 02:55:16 
get_method(self._j_table_result, "await")()
2022-02-08T02:55:16.0725024Z Feb 08 02:55:16 
.tox/py37-cython/lib/python3.7/site-packages/py4j/java_gateway.py:1322: in 
__call__
2022-02-08T02:55:16.0726044Z Feb 08 02:55:16 answer, self.gateway_client, 
self.target_id, self.name)
2022-02-08T02:55:16.0726824Z Feb 08 02:55:16 pyflink/util/exceptions.py:146: in 
deco
2022-02-08T02:55:16.0727569Z Feb 08 02:55:16 return f(*a, **kw)
2022-02-08T02:55:16.0728326Z Feb 08 02:55:16 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2022-02-08T02:55:16.0728995Z Feb 08 02:55:16 
2022-02-08T02:55:16.0729717Z Feb 08 02:55:16 answer = 'x'
2022-02-08T02:55:16.0730447Z Feb 08 02:55:16 gateway_client = 

2022-02-08T02:55:16.0731465Z Feb 08 02:55:16 target_id = 'o26503', name = 
'await'
2022-02-08T02:55:16.0732045Z Feb 08 02:55:16 
2022-02-08T02:55:16.0732763Z Feb 08 02:55:16 def get_return_value(answer, 
gateway_client, target_id=None, name=None):
2022-02-08T02:55:16.0733699Z Feb 08 02:55:16 """Converts an answer 
received from the Java gateway into a Python object.
2022-02-08T02:55:16.0734508Z Feb 08 02:55:16 
2022-02-08T02:55:16.0735205Z Feb 08 02:55:16 For example, string 
representation of integers are converted to Python
2022-02-08T02:55:16.0736228Z Feb 08 02:55:16 integer, string 
representation of objects are converted to JavaObject
2022-02-08T02:55:16.0736974Z Feb 08 02:55:16 instances, etc.
2022-02-08T02:55:16.0737508Z Feb 08 02:55:16 
2022-02-08T02:55:16.0738185Z Feb 08 02:55:16 :param answer: the string 
returned by the Java gateway
2022-02-08T02:55:16.0739074Z Feb 08 02:55:16 :param gateway_client: the 
gateway client used to communicate with the Java
2022-02-08T02:55:16.0739994Z Feb 08 02:55:16 Gateway. Only 
necessary if the answer is a reference (e.g., object,
2022-02-08T02:55:16.0740723Z Feb 08 02:55:16 list, map)
2022-02-08T02:55:16.0741491Z Feb 08 02:55:16 :param target_id: the name 
of the object from which the answer comes from
2022-02-08T02:55:16.0742350Z Feb 08 02:55:16 (e.g., *object1* in 
`object1.hello()`). Optional.
2022-02-08T02:55:16.0743175Z Feb 08 02:55:16 :param name: the name of 
the member from which the answer comes from
2022-02-08T02:55:16.0744315Z Feb 08 02:55:16 (e.g., *hello* in 
`object1.hello()`). Optional.
2022-02-08T02:55:16.0744973Z Feb 08 02:55:16 """
2022-02-08T02:55:16.0745608Z Feb 08 02:55:16 if is_error(answer)[0]:
2022-02-08T02:55:16.0746484Z Feb 08 02:55:16 if len(answer) > 1:
2022-02-08T02:55:16.0747162Z Feb 08 02:55:16 type = answer[1]
2022-02-08T02:55:16

[jira] [Created] (FLINK-26043) Add periodic kerberos relogin to DelegationTokenManager

2022-02-09 Thread Gabor Somogyi (Jira)
Gabor Somogyi created FLINK-26043:
-

 Summary: Add periodic kerberos relogin to DelegationTokenManager
 Key: FLINK-26043
 URL: https://issues.apache.org/jira/browse/FLINK-26043
 Project: Flink
  Issue Type: Sub-task
Affects Versions: 1.15.0
Reporter: Gabor Somogyi








[jira] [Created] (FLINK-26044) Properly declare internal UNNEST function

2022-02-09 Thread Timo Walther (Jira)
Timo Walther created FLINK-26044:


 Summary: Properly declare internal UNNEST function
 Key: FLINK-26044
 URL: https://issues.apache.org/jira/browse/FLINK-26044
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Timo Walther
Assignee: Timo Walther


UNNEST is not declared as an internal function. We should declare it similarly 
to REPLICATE ROWS.





Re: Statefun async http request via RequestReplyFunctionBuilder

2022-02-09 Thread Till Rohrmann
Hi Galen,

Great to hear it :-) I've assigned you to the ticket [1]. Next you can open
a PR against the repository and we will review it.

[1] https://issues.apache.org/jira/browse/FLINK-25933

Cheers,
Till

On Tue, Feb 8, 2022 at 11:58 PM Galen Warren 
wrote:

> I'm ready to pick this one up, I have some code that's working locally.
> Shall I create a PR?
>
>
>
> On Wed, Feb 2, 2022 at 3:17 PM Igal Shilman 
> wrote:
>
> > Great, ping me when you would like to pick this up.
> >
> > For the related issue, I think that can be a good addition indeed!
> >
> > On Wed, Feb 2, 2022 at 8:55 PM Galen Warren 
> > wrote:
> >
> > > Gotcha, thanks. I may be able to work on that one in a couple weeks if
> > > you're looking for help.
> > >
> > > Unrelated question -- another thing that would be useful for me would
> be
> > > the ability to set a maximum backoff interval in
> > BoundedExponentialBackoff
> > > or the async equivalent. My situation is this. I'd like to set a long
> > > maxRequestDuration, so it takes a good while for Statefun to "give up"
> > on a
> > > function call, i.e. perhaps several hours or even days. During that
> time,
> > > if the backoff interval doubles on each failure, those backoff
> intervals
> > > get pretty long.
> > >
> > > Sometimes, in exponential backoff implementations, I've seen the
> concept
> > of
> > > a max backoff interval, i.e. once the backoff interval reaches that
> > point,
> > > it won't go any higher. So I could set it to, say, 60 seconds, and no
> > > matter how long it would retry the function, the interval between
> retries
> > > wouldn't be more than that.
> > >
> > > Do you think that would be a useful addition? I could post something to
> > the
> > > dev list if you want.
> > >
> > > On Wed, Feb 2, 2022 at 2:39 PM Igal Shilman  wrote:
> > >
> > > > Hi Galen,
> > > > You are right, it is not possible, but there is no real reason for
> > that.
> > > > We should fix this, and I've created the following JIRA issue [1]
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-25933
> > > >
> > > > On Wed, Feb 2, 2022 at 6:30 PM Galen Warren  >
> > > > wrote:
> > > >
> > > > > Is it possible to choose the async HTTP transport using
> > > > > RequestReplyFunctionBuilder? It looks to me that it is not, but I
> > > wanted
> > > > to
> > > > > double check. Thanks.
> > > > >
> > > >
> > >
> >
>
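The capped exponential backoff described in the thread above can be sketched as
follows. This is an illustrative Python sketch of the general technique, not
Statefun's actual BoundedExponentialBackoff implementation; the function and
parameter names are made up:

```python
def next_backoff(current_ms: int, max_backoff_ms: int) -> int:
    """Double the backoff interval, but never exceed the configured cap."""
    return min(current_ms * 2, max_backoff_ms)


def backoff_schedule(initial_ms: int, max_backoff_ms: int, retries: int) -> list:
    """Return the sequence of backoff intervals for a given number of retries."""
    intervals, current = [], initial_ms
    for _ in range(retries):
        intervals.append(current)
        current = next_backoff(current, max_backoff_ms)
    return intervals


# Starting at 1s with a 60s cap, intervals grow 1s, 2s, 4s, ... and then
# plateau at 60s for the remainder of the (possibly days-long) retry window.
schedule = backoff_schedule(1000, 60_000, 8)
```

With such a cap, a long maxRequestDuration no longer implies unboundedly long
gaps between retries.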


[jira] [Created] (FLINK-26045) Document error status codes for the REST API

2022-02-09 Thread Jira
David Morávek created FLINK-26045:
-

 Summary: Document error status codes for the REST API
 Key: FLINK-26045
 URL: https://issues.apache.org/jira/browse/FLINK-26045
 Project: Flink
  Issue Type: Technical Debt
  Components: Documentation, Runtime / REST
Reporter: David Morávek
 Fix For: 1.15.0


We've introduced new error status codes in FLINK-24275, which are not yet 
reflected in the REST documentation and the OpenAPI spec file.





[jira] [Created] (FLINK-26046) Link to OpenAPI specification is dead

2022-02-09 Thread Jira
David Morávek created FLINK-26046:
-

 Summary: Link to OpenAPI specification is dead
 Key: FLINK-26046
 URL: https://issues.apache.org/jira/browse/FLINK-26046
 Project: Flink
  Issue Type: Technical Debt
  Components: Documentation
Reporter: David Morávek
 Fix For: 1.15.0


In the REST API docs, the link to the OpenAPI specification is missing the 
flink prefix, so it leads nowhere.

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/





Azure Pipelines are dealing with an incident, causing pipeline runs to fail

2022-02-09 Thread Martijn Visser
Hi everyone,

Please keep in mind that Azure Pipelines currently is dealing with an
incident [1] which causes all CI pipeline runs on Azure to fail. When the
incident has been resolved, it will be required to retrigger your pipeline
to see if the pipeline then passes.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82

[1] https://status.dev.azure.com/_event/287959626


[jira] [Created] (FLINK-26047) Support usrlib in HDFS for YARN application mode

2022-02-09 Thread Biao Geng (Jira)
Biao Geng created FLINK-26047:
-

 Summary: Support usrlib in HDFS for YARN application mode
 Key: FLINK-26047
 URL: https://issues.apache.org/jira/browse/FLINK-26047
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / YARN
Reporter: Biao Geng


In YARN application mode, we currently support using user JARs and lib JARs from 
HDFS. For example, we can run commands like:
{quote}./bin/flink run-application -t yarn-application \
-Dyarn.provided.lib.dirs="hdfs://myhdfs/my-remote-flink-dist-dir" \
hdfs://myhdfs/jars/my-application.jar{quote}
For {{usrlib}}, we currently only support local directories. I propose adding 
HDFS support for {{usrlib}} so that it works better with 
CLASSPATH_INCLUDE_USER_JAR. It can also benefit use cases such as submitting 
Flink jobs from a notebook.





[jira] [Created] (FLINK-26048) Yarn access should be separated by semicolon

2022-02-09 Thread MichealShin (Jira)
MichealShin created FLINK-26048:
---

 Summary: Yarn access should be separated by semicolon
 Key: FLINK-26048
 URL: https://issues.apache.org/jira/browse/FLINK-26048
 Project: Flink
  Issue Type: Bug
Reporter: MichealShin


Yarn access should be separated by semicolons; otherwise, the config cannot be 
read correctly.





[jira] [Created] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-09 Thread fanrui (Jira)
fanrui created FLINK-26049:
--

 Summary: The tolerable-failed-checkpoints logic is invalid when 
checkpoint trigger failed
 Key: FLINK-26049
 URL: https://issues.apache.org/jira/browse/FLINK-26049
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.14.3, 1.13.5
Reporter: fanrui
 Fix For: 1.15.0
 Attachments: image-2022-02-09-18-08-17-868.png, 
image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png

After triggerCheckpoint, if the checkpoint fails, Flink executes the 
tolerable-failed-checkpoints logic. But if triggerCheckpoint itself fails, Flink 
doesn't execute the tolerable-failed-checkpoints logic.
h1. How to reproduce this issue?

In our online environment, an HDFS SRE deleted the Flink base dir by mistake, so 
the Flink job didn't have permission to create the checkpoint dir and 
triggering checkpoints failed.

Several things didn't meet expectations:
 * The JM just logs _"Failed to trigger checkpoint for job 
6f09d4a15dad42b24d52c987f5471f18 since Trigger checkpoint failure"_ and doesn't 
show the root cause or exception.
 * The user set tolerable-failed-checkpoints=0, but when triggerCheckpoint 
failed, Flink didn't execute the tolerable-failed-checkpoints logic.
 * When triggerCheckpoint failed, numberOfFailedCheckpoints was always 0.
 * When triggerCheckpoint failed, no checkpoint info appeared on the checkpoint 
history page.

 

!image-2022-02-09-18-08-17-868.png!

 

!image-2022-02-09-18-08-34-992.png!

!image-2022-02-09-18-08-42-920.png!

 
h3. *All metrics were normal, so we only found out the next day that 
checkpointing was failing, and by then it had been failing for a whole day. 
That is not acceptable to Flink users.*

I have some ideas:
 # Should the tolerable-failed-checkpoints logic be executed when 
triggerCheckpoint fails?
 # When triggerCheckpoint fails, should numberOfFailedCheckpoints be increased?
 # When triggerCheckpoint fails, should the checkpoint info be shown on the 
checkpoint history page?
 # The JM just shows "Failed to trigger checkpoint"; should we show the 
detailed exception to make the root cause easier to find?

 

Could we make these changes? Please correct me if I'm wrong.
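As a reference point for the discussion, the expected tolerable-failed-checkpoints 
behavior can be sketched as a simple consecutive-failure counter. This is an 
illustrative Python sketch of the decision logic, not Flink's actual 
CheckpointFailureManager code; the class and method names are invented:

```python
class CheckpointFailureCounter:
    """Fail the job once consecutive checkpoint failures exceed the tolerance.

    The issue reported above is that trigger-time failures never reach this
    kind of counter, so tolerable-failed-checkpoints=0 has no effect on them.
    """

    def __init__(self, tolerable_failures: int):
        self.tolerable = tolerable_failures
        self.consecutive_failures = 0

    def on_success(self) -> None:
        # A completed checkpoint resets the consecutive-failure count.
        self.consecutive_failures = 0

    def on_failure(self) -> bool:
        """Record one failed (or failed-to-trigger) checkpoint.

        Returns True when the tolerance is exhausted and the job should fail.
        """
        self.consecutive_failures += 1
        return self.consecutive_failures > self.tolerable
```

Under this model, routing trigger failures through the same counter would make 
tolerable-failed-checkpoints=0 fail the job on the very first trigger failure.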





[jira] [Created] (FLINK-26051) row_number =1 and Subsequent SQL has "case when" and "where" statement : The window can only be ordered in ASCENDING mode

2022-02-09 Thread chuncheng wu (Jira)
chuncheng wu created FLINK-26051:


 Summary: row_number =1 and Subsequent SQL has "case when" and 
"where" statement : The window can only be ordered in ASCENDING mode
 Key: FLINK-26051
 URL: https://issues.apache.org/jira/browse/FLINK-26051
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.12.2
Reporter: chuncheng wu


Hello,

I have two SQL statements. One has rn=1, and the subsequent SQL has "case when" 
and "where" statements. This results in the exception below, which happens when 
the logical plan is turned into a physical plan:
{quote}_org.apache.flink.table.api.TableException: The window can only be 
ordered in ASCENDING mode._

    _at 
org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecOverAggregate.translateToPlanInternal(StreamExecOverAggregate.scala:98)_
    _at 
org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecOverAggregate.translateToPlanInternal(StreamExecOverAggregate.scala:52)_
    _at 
org.apache.flink.table.planner.plan.nodes.exec.ExecNode$class.translateToPlan(ExecNode.scala:59)_
    _at 
org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecOverAggregateBase.translateToPlan(StreamExecOverAggregateBase.scala:42)_
    _at 
org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecCalc.translateToPlanInternal(StreamExecCalc.scala:54)_
    _at 
org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecCalc.translateToPlanInternal(StreamExecCalc.scala:39)_
    _at 
org.apache.flink.table.planner.plan.nodes.exec.ExecNode$class.translateToPlan(ExecNode.scala:59)_
    _at 
org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecCalcBase.translateToPlan(StreamExecCalcBase.scala:38)_
    _at 
org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$translateToPlan$1.apply(StreamPlanner.scala:66)_
    _at 
org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$translateToPlan$1.apply(StreamPlanner.scala:65)_
    _at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)_
    _at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)_
    _at scala.collection.Iterator$class.foreach(Iterator.scala:891)_
    _at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)_
    _at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)_
    _at scala.collection.AbstractIterable.foreach(Iterable.scala:54)_
    _at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)_
    _at scala.collection.AbstractTraversable.map(Traversable.scala:104)_
    _at 
org.apache.flink.table.planner.delegation.StreamPlanner.translateToPlan(StreamPlanner.scala:65)_
    _at 
org.apache.flink.table.planner.delegation.StreamPlanner.explain(StreamPlanner.scala:103)_
    _at 
org.apache.flink.table.planner.delegation.StreamPlanner.explain(StreamPlanner.scala:42)_
    _at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.explainInternal(TableEnvironmentImpl.java:630)_
    _at 
org.apache.flink.table.api.internal.TableImpl.explain(TableImpl.java:582)_
    _at 
com.meituan.grocery.data.flink.test.BugTest.testRowNumber(BugTest.java:69)_
    _at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)_
    _at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)_
    _at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)_
    _at java.base/java.lang.reflect.Method.invoke(Method.java:568)_
    _at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)_
    _at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)_
    _at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)_
    _at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)_
    _at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)_
    _at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)_
    _at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)_
    _at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)_
    _at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)_
    _at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)_
    _at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)_
    _at 
org.junit.jupiter.engine.execut

[jira] [Created] (FLINK-26050) Too Many small sst files in rocksdb state backend when using processing time window

2022-02-09 Thread shen (Jira)
shen created FLINK-26050:


 Summary: Too Many small sst files in rocksdb state backend when 
using processing time window
 Key: FLINK-26050
 URL: https://issues.apache.org/jira/browse/FLINK-26050
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.14.3, 1.10.2
Reporter: shen


When using processing time windows, some workloads produce a lot of small SST 
files (several KB) in the RocksDB local directory, which may cause a "Too many 
files" error.

Using the RocksDB tool ldb to inspect the content of these SST files shows:
 * the column family of these small SST files is "processing_window-timers";
 * most of the SST files are in level-1;
 * the records in the SST files are almost all kTypeDeletion;
 * the creation times of the SST files correspond to the checkpoint interval.

These small SST files seem to be generated when a Flink checkpoint is 
triggered. Although all their content is delete tags, they are not compacted 
and deleted by RocksDB compaction because they do not intersect with each other 
(RocksDB [compaction trivial 
move|https://github.com/facebook/rocksdb/wiki/Compaction-Trivial-Move]), and 
there seems to be no chance to delete them because they are small and do not 
intersect with other SST files.

 

I will attach a simple program to reproduce the problem.

 

Since timers in a processing time window are generated in strictly ascending 
order (both put and delete), a workload can happen to generate level-0 SST 
files that do not intersect with each other (for example: the processing window 
size is much smaller than the checkpoint interval, and no window content 
crosses the checkpoint interval). In that case, many small SST files keep being 
generated until the job is restored from a savepoint or incremental 
checkpointing is disabled.

A similar problem may exist when users use timers in operators under the same 
kind of workload.
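The non-intersection condition that makes RocksDB pick a trivial move instead of 
a real compaction can be illustrated with a simple key-range overlap check (an 
illustrative Python sketch, not RocksDB code):

```python
def ranges_overlap(a: tuple, b: tuple) -> bool:
    """Each SST file covers a [smallest_key, largest_key] range; compaction can
    trivially move a file down a level when it overlaps nothing there."""
    return a[0] <= b[1] and b[0] <= a[1]


# Timers fire in strictly ascending order, so each checkpoint flushes a file
# whose key range lies entirely after the previous one: every file is
# trivially moved, and the delete tags inside are never merged away.
files = [(0, 99), (100, 199), (200, 299)]
all_disjoint = all(
    not ranges_overlap(files[i], files[i + 1]) for i in range(len(files) - 1)
)
```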

 

Code to reproduce the problem:
{code:java}
package org.apache.flink;

import lombok.extern.slf4j.Slf4j;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.Collections;
import java.util.List;
import java.util.Random;

@Slf4j
public class StreamApp {
  public static void main(String[] args) throws Exception {
    Configuration config = new Configuration();
    config.set(RestOptions.ADDRESS, "127.0.0.1");
    config.set(RestOptions.PORT, 10086);
    config.set(TaskManagerOptions.NUM_TASK_SLOTS, 6);
    new StreamApp().configureApp(StreamExecutionEnvironment.createLocalEnvironment(1, config));
  }

  public void configureApp(StreamExecutionEnvironment env) throws Exception {
    env.enableCheckpointing(2); // 20sec
    RocksDBStateBackend rocksDBStateBackend =
        new RocksDBStateBackend("file:///Users/shenjiaqi/Workspace/jira/flink-51/checkpoints/", true); // need to be reconfigured
    rocksDBStateBackend.setDbStoragePath("/Users/shenjiaqi/Workspace/jira/flink-51/flink/rocksdb_local_db"); // need to be reconfigured
    env.setStateBackend(rocksDBStateBackend);
    env.getCheckpointConfig().setCheckpointTimeout(10);
    env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);
    env.setParallelism(1);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.getConfig().setTaskCancellationInterval(1);

    for (int i = 0; i < 1; ++i) {
      createOnePipeline(env);
    }

    env.execute("StreamApp");
  }

  private void createOnePipeline(StreamExecutionEnvironment env) {
    // data source is configured so that little window content crosses the checkpoint interval
    DataStreamSource stream = env.addSource(new Generator(1, 3000, 3600));

    stream.keyBy(x -> x)
        // make sure the window size is less than the checkpoint interval. though 100ms is
        // too small, I think increasing this value can still reproduce the problem over a
        // longer time.
        .window(TumblingProcessingTimeWindows.of(Time.milliseconds(100)))
        .process(new ProcessWindowFunction() {
          @Override
          public void process(String s, ProcessWindowFunction.Context context,
              Iter

Re: Azure Pipelines are dealing with an incident, causing pipeline runs to fail

2022-02-09 Thread Martijn Visser
Hi everyone,

The issue should now be resolved.

Best regards,

Martijn

On Wed, 9 Feb 2022 at 10:55, Martijn Visser  wrote:

> Hi everyone,
>
> Please keep in mind that Azure Pipelines currently is dealing with an
> incident [1] which causes all CI pipeline runs on Azure to fail. When the
> incident has been resolved, it will be required to retrigger your pipeline
> to see if the pipeline then passes.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
>
> [1] https://status.dev.azure.com/_event/287959626
>


[jira] [Created] (FLINK-26052) Update Chinese documentation regarding FLIP-203

2022-02-09 Thread Dawid Wysakowicz (Jira)
Dawid Wysakowicz created FLINK-26052:


 Summary: Update Chinese documentation regarding FLIP-203
 Key: FLINK-26052
 URL: https://issues.apache.org/jira/browse/FLINK-26052
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation, Runtime / Checkpointing
Reporter: Dawid Wysakowicz


Relevant english commits: 
* c1f5c5320150402fc0cb4fbf3a31f9a27b1e4d9a
* cd8ea8d5b207569f68acc5a3c8db95cd2ca47ba6





[DISCUSS] FLINK-25927: Make flink-connector-base dependency usage consistent across all connectors

2022-02-09 Thread Alexander Fedulov
Hi everyone,

I would like to discuss the best approach to address the issue raised
in FLINK-25927 [1]. It can be summarized as follows:

> flink-connector-base is currently inconsistently used in connectors
(directly shaded in some and transitively pulled in via
flink-connector-files which is itself shaded in the table uber jar)

> FLINK-24687 [2] moved flink-connector-files out of the flink-table uber jar

> It is necessary to make usage of flink-connector-base consistent across
all connectors

One approach is to stop shading flink-connector-files in all connectors and
instead package it in flink-dist, making it a part of Flink-wide provided
public API. This approach is investigated in the following PoC PR: 18545
[3].  The issue with this approach is that it breaks any existing CI and
IDE setups that do not directly rely on flink-dist and also do not include
flink-connector-files as an explicit dependency.

In theory, a nice alternative would be to make it a part of a dependency
that is ubiquitously provided, for instance, flink-streaming-java. Doing
that for flink-streaming-java would, however,  introduce a dependency cycle
and is currently not feasible.

It would be great to hear your opinions on what could be the best way
forward here.

[1] https://issues.apache.org/jira/browse/FLINK-25927
[2] https://issues.apache.org/jira/browse/FLINK-24687
[3] https://github.com/apache/flink/pull/18545


Thanks,
Alexander Fedulov


[DISCUSS] Drop Jepsen tests

2022-02-09 Thread Chesnay Schepler
For a few years now we have had a set of Jepsen tests that verify the 
correctness of Flink's coordination layer in the case of process crashes.
In the past they have indeed found issues and thus provided value to the 
project, and in general the core idea behind them (and Jepsen for that matter) 
is very sound.


However, so far we have neither made attempts to make further use of Jepsen 
(limiting ourselves to very basic tests) nor familiarized ourselves 
with the tests or Jepsen at all.
As a result these tests are difficult to maintain. They (and Jepsen) are 
written in Clojure, which makes debugging, changes and upstreaming 
contributions very difficult.
Additionally, the tests also make use of a very complicated 
(Ververica-internal) terraform+ansible setup to spin up and tear down 
AWS machines. While it works (and is actually pretty cool), it's 
difficult to adjust because the people who wrote it have left the company.


Why I'm raising this now (and not earlier) is because so far keeping the 
tests running wasn't much of a problem; bump a few dependencies here and 
there and we're good to go.


However, this has changed with the recent upgrade to Zookeeper 3.5, 
which isn't supported by Jepsen out-of-the-box, completely breaking the 
tests. We'd now have to write a new Zookeeper 3.5+ integration for 
Jepsen (again, in Clojure). While I started working on that and could 
likely finish it, I started to wonder whether it even makes sense to do 
so, and whether we couldn't invest this time elsewhere.


Let me know what you think.



Re: Azure Pipelines are dealing with an incident, causing pipeline runs to fail

2022-02-09 Thread Martijn Visser
Unfortunately it looks like there are still failures. Will keep you posted

On Wed, 9 Feb 2022 at 11:51, Martijn Visser  wrote:

> Hi everyone,
>
> The issue should now be resolved.
>
> Best regards,
>
> Martijn
>
> On Wed, 9 Feb 2022 at 10:55, Martijn Visser  wrote:
>
>> Hi everyone,
>>
>> Please keep in mind that Azure Pipelines currently is dealing with an
>> incident [1] which causes all CI pipeline runs on Azure to fail. When the
>> incident has been resolved, it will be required to retrigger your pipeline
>> to see if the pipeline then passes.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>> [1] https://status.dev.azure.com/_event/287959626
>>
>


Re: Azure Pipelines are dealing with an incident, causing pipeline runs to fail

2022-02-09 Thread Robert Metzger
I filed a support request with Microsoft:
https://developercommunity.visualstudio.com/t/Number-of-Microsoft-hosted-agents-droppe/1658827?from=email&space=21&sort=newest

On Wed, Feb 9, 2022 at 1:04 PM Martijn Visser  wrote:

> Unfortunately it looks like there are still failures. Will keep you posted
>
> On Wed, 9 Feb 2022 at 11:51, Martijn Visser  wrote:
>
> > Hi everyone,
> >
> > The issue should now be resolved.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Wed, 9 Feb 2022 at 10:55, Martijn Visser 
> wrote:
> >
> >> Hi everyone,
> >>
> >> Please keep in mind that Azure Pipelines currently is dealing with an
> >> incident [1] which causes all CI pipeline runs on Azure to fail. When
> the
> >> incident has been resolved, it will be required to retrigger your
> pipeline
> >> to see if the pipeline then passes.
> >>
> >> Best regards,
> >>
> >> Martijn Visser
> >> https://twitter.com/MartijnVisser82
> >>
> >> [1] https://status.dev.azure.com/_event/287959626
> >>
> >
>


[jira] [Created] (FLINK-26053) Fix parser generator warnings

2022-02-09 Thread Francesco Guardiani (Jira)
Francesco Guardiani created FLINK-26053:
---

 Summary: Fix parser generator warnings
 Key: FLINK-26053
 URL: https://issues.apache.org/jira/browse/FLINK-26053
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / API
Reporter: Francesco Guardiani


When building flink-sql-parser, javacc logs a couple of warnings:


{code}
Warning: Choice conflict in [...] construct at line 2920, column 5.
 Expansion nested within construct and expansion following construct
 have common prefixes, one of which is: "PLAN"
 Consider using a lookahead of 2 or more for nested expansion.
Warning: Choice conflict involving two expansions at
 line 2930, column 9 and line 2932, column 9 respectively.
 A common prefix is: "STATEMENT"
 Consider using a lookahead of 2 for earlier expansion.
Warning: Choice conflict involving two expansions at
 line 2952, column 9 and line 2954, column 9 respectively.
 A common prefix is: "STATEMENT"
 Consider using a lookahead of 2 for earlier expansion.
{code}

Taken from 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30871&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=075c2716-8010-5565-fe08-3c4bb45824a4&l=3911

We should investigate them, as they're often a symptom of a bug in our parser 
template, which might result in unexpected parsing errors.
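A typical resolution for warnings of this kind is to give javacc an explicit lookahead hint, as the warning itself suggests. The following grammar fragment is only an illustrative sketch; the production and token names are assumptions, not the actual contents of Flink's parser template:

```
// Hypothetical javacc production illustrating the suggested fix.
// "PLAN" is a common prefix of both alternatives, so the default
// LOOKAHEAD(1) cannot disambiguate them; LOOKAHEAD(2) lets the parser
// peek one more token before choosing a branch.
SqlNode SqlPlanStatement() :
{
    SqlNode stmt;
}
{
    LOOKAHEAD(2)
    <PLAN> <FOR> stmt = SqlStatement() { return stmt; }
|
    <PLAN> stmt = SqlStatement() { return stmt; }
}
```

Note that a LOOKAHEAD hint silences the warning but also hides genuine ambiguities, so each conflict should still be checked against the intended grammar.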



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26054) Enable maven-enforcer to disable table-planner and table-runtime as dependencies

2022-02-09 Thread Francesco Guardiani (Jira)
Francesco Guardiani created FLINK-26054:
---

 Summary: Enable maven-enforcer to disable table-planner and 
table-runtime as dependencies
 Key: FLINK-26054
 URL: https://issues.apache.org/jira/browse/FLINK-26054
 Project: Flink
  Issue Type: Bug
  Components: Build System, Table SQL / Planner
Reporter: Francesco Guardiani
Assignee: Francesco Guardiani


From https://github.com/apache/flink/pull/18676#discussion_r802502438, we 
should enable the enforcer for the table-planner and table-runtime modules, as 
described in flink-table/README.md
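A minimal sketch of what such an enforcer rule could look like in a consuming module's pom.xml. This is illustrative only: the banned artifact IDs and the recommendation to use the planner loader are assumptions about the intended setup, not the actual Flink build configuration:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>ban-table-planner-and-runtime</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <!-- modules should not depend on these internal artifacts directly -->
              <exclude>org.apache.flink:flink-table-planner_*</exclude>
              <exclude>org.apache.flink:flink-table-runtime</exclude>
            </excludes>
          </bannedDependencies>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The build then fails fast when a module accidentally pulls in one of the banned artifacts, instead of the mistake surfacing later as a classpath conflict.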



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26055) Use new FlinkVersion enum for serde plans

2022-02-09 Thread Timo Walther (Jira)
Timo Walther created FLINK-26055:


 Summary: Use new FlinkVersion enum for serde plans
 Key: FLINK-26055
 URL: https://issues.apache.org/jira/browse/FLINK-26055
 Project: Flink
  Issue Type: Sub-task
Reporter: Timo Walther


The new FlinkVersion enum should be used to encode a Flink version from and to 
JSON.
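A rough sketch of what encoding a version enum to and from a JSON plan value could look like. The class, enum constants, and method names below are illustrative assumptions, not Flink's actual FlinkVersion API:

```java
// Hypothetical sketch of an enum that round-trips a Flink version
// through the string stored in a serialized (JSON) plan.
public class FlinkVersionSerdeSketch {

    enum FlinkVersion {
        V1_13("1.13"),
        V1_14("1.14"),
        V1_15("1.15");

        private final String versionString;

        FlinkVersion(String versionString) {
            this.versionString = versionString;
        }

        // Encode: the string written into the JSON plan.
        String toJsonValue() {
            return versionString;
        }

        // Decode: fail fast on unknown versions instead of guessing.
        static FlinkVersion fromJsonValue(String value) {
            for (FlinkVersion v : values()) {
                if (v.versionString.equals(value)) {
                    return v;
                }
            }
            throw new IllegalArgumentException("Unknown Flink version: " + value);
        }
    }

    public static void main(String[] args) {
        FlinkVersion v = FlinkVersion.fromJsonValue("1.15");
        System.out.println(v + " -> " + v.toJsonValue());
    }
}
```

Keeping the mapping inside the enum makes the set of serializable versions explicit and turns an unsupported version string into an immediate error at deserialization time.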



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26056) CLONE - [FLIP-171] KDS implementation of Async Sink Table API

2022-02-09 Thread Ahmed Hamdy (Jira)
Ahmed Hamdy created FLINK-26056:
---

 Summary: CLONE - [FLIP-171] KDS implementation of Async Sink Table 
API
 Key: FLINK-26056
 URL: https://issues.apache.org/jira/browse/FLINK-26056
 Project: Flink
  Issue Type: New Feature
  Components: Connectors / Kinesis
Reporter: Ahmed Hamdy
Assignee: Ahmed Hamdy
 Fix For: 1.15.0


h2. Motivation

*User stories:*
As a Flink user, I’d like to use the Table API for the new Kinesis Data Streams 
 sink.

*Scope:*
 * Introduce {{AsyncDynamicTableSink}} that enables sinking Tables into async 
implementations.
 * Implement a new {{KinesisDynamicTableSink}} that uses the 
{{KinesisDataStreamSink}} async implementation and implements 
{{AsyncDynamicTableSink}}.
 * The implementation introduces Async Sink configurations as optional options 
in the table definition, with default values derived from the 
{{KinesisDataStream}} default values.
 * Unit/Integration testing: modify KinesisTableAPI tests for the new 
implementation, add unit tests for {{AsyncDynamicTableSink}}, 
{{KinesisDynamicTableSink}} and {{KinesisDynamicTableSinkFactory}}.
 * Java / code-level docs.

h2. References

More details to be found 
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-171%3A+Async+Sink]

 

*Update:*

^^ Status Update ^^
__List of all work outstanding for 1.15 release__

[Merged] https://github.com/apache/flink/pull/18165 - KDS DataStream Docs
[Merged] https://github.com/apache/flink/pull/18396 - [hotfix] for infinite loop 
if not flushing during commit
[Merged] https://github.com/apache/flink/pull/18421 - Mark Kinesis Producer as 
deprecated (Prod: FLINK-24227)
[Merged] https://github.com/apache/flink/pull/18348 - KDS Table API Sink & Docs
[Merged] https://github.com/apache/flink/pull/18488 - base sink retry entries 
in order not in reverse
[Merged] https://github.com/apache/flink/pull/18512 - changing failed requests 
handler to accept List in AsyncSinkWriter
[Merged] https://github.com/apache/flink/pull/18483 - Do not expose the element 
converter
[Merged] https://github.com/apache/flink/pull/18468 - Adding Kinesis data 
streams sql uber-jar

Ready for review:
[SUCCESS ] https://github.com/apache/flink/pull/18314 - KDF DataStream Sink & 
Docs
[BLOCKED on ^^ ] https://github.com/apache/flink/pull/18426 - rename 
flink-connector-aws-kinesis-data-* to flink-connector-aws-kinesis-* (module 
names) and KinesisData*Sink to Kinesis*Sink (class names)

Pending PR:
* Firehose Table API Sink & Docs
* KDF Table API SQL jar

TBD:
* FLINK-25846: Not shutting down
* FLINK-25848: Validation during start up
* FLINK-25792: flush() bug
* FLINK-25793: throughput exceeded
* Update the defaults of KDS sink and update the docs + do the same for KDF
* add a `AsyncSinkCommonConfig` class (to wrap the 6 params) to the 
`flink-connector-base` and propagate it to the two connectors
- feature freeze
* KDS performance testing
* KDF performance testing
* Clone the new docs to .../contents.zh/... and add the location to the 
corresponding Chinese translation jira - KDS -
* Rename [Amazon AWS Kinesis Streams] to [Amazon Kinesis Data Streams] in docs 
(legacy issue)
- Flink 1.15 release
* KDS end to end sanity test - hits aws apis rather than local docker images
* KDS Python wrappers
* FLINK-25733 - Create A migration guide for Kinesis Table API connector - can 
happen after 1.15
* If `endpoint` is provided, `region` should not be required like it currently 
is
* Test if Localstack container requires the 1ms timeout
* Adaptive level of logging (in discussion)

FYI:
* FLINK-25661 - Add Custom Fatal Exception handler in AsyncSinkWriter - 
https://github.com/apache/flink/pull/18449
* https://issues.apache.org/jira/browse/FLINK-24229 DDB Sink

Chinese translation:
https://issues.apache.org/jira/browse/FLINK-25735 - KDS DataStream Sink
https://issues.apache.org/jira/browse/FLINK-25736 - KDS Table API Sink
https://issues.apache.org/jira/browse/FLINK-25737 - KDF DataStream Sink



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26057) CLONE - [FLIP-171] Add SQL Client Uber jar for KinesisDataStreams Sink.

2022-02-09 Thread Ahmed Hamdy (Jira)
Ahmed Hamdy created FLINK-26057:
---

 Summary: CLONE - [FLIP-171] Add SQL Client Uber jar for 
KinesisDataStreams Sink.
 Key: FLINK-26057
 URL: https://issues.apache.org/jira/browse/FLINK-26057
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Kinesis
Reporter: Ahmed Hamdy
Assignee: Ahmed Hamdy
 Fix For: 1.15.0


h2. Motivation

*User stories:*
As a Flink user, I’d like to have an uber-jar for kinesis-data-streams sink sql 
connector with all dependencies without the need for all source dependencies.

*Scope:*
* Add a new module for {{flink-sql-connector-kinesis-data-streams}} which 
shades {{flink-connector-kinesis-data-streams}} with needed dependencies.
* Add end to end tests for the new connector.

h2. References

More details to be found 
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-171%3A+Async+Sink]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26058) KinesisDynamicTableFactory implements DynamicSourceFactory but uses sink options

2022-02-09 Thread Ahmed Hamdy (Jira)
Ahmed Hamdy created FLINK-26058:
---

 Summary: KinesisDynamicTableFactory implements 
DynamicSourceFactory but uses sink options
 Key: FLINK-26058
 URL: https://issues.apache.org/jira/browse/FLINK-26058
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kinesis
Reporter: Ahmed Hamdy


h2. Description

{{KinesisDynamicTableFactory}} implements {{DynamicTableSourceFactory}} while 
{{KinesisDynamicTableSinkFactory}} implements {{DynamicTableSinkFactory}}; 
however, {{KinesisDynamicTableFactory}} adds the {{SINK_PARTITIONER}} and 
{{SINK_PARTITION_DELIMITER}} options, which are specific to the sink, to its optional table options.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26059) ResourceProfile#ANY can cause overflow exceptions

2022-02-09 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-26059:


 Summary: ResourceProfile#ANY can cause overflow exceptions
 Key: FLINK-26059
 URL: https://issues.apache.org/jira/browse/FLINK-26059
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Chesnay Schepler


ResourceProfile#ANY uses MemorySize.MAX_VALUE for all memory settings. Some of 
the getters add up some of the memory profiles, causing an overflow.

This implies that you must never access any getter without first comparing a 
resource profile against the ANY singleton.
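The overflow hazard described above can be shown with plain long arithmetic: summing two "max" memory sizes silently wraps around, so a sentinel profile has to be special-cased before any getter that adds components. The names below are illustrative, not Flink's actual ResourceProfile API:

```java
// Minimal sketch of the overflow described above.
public class ResourceProfileOverflowSketch {

    // Stand-in for MemorySize.MAX_VALUE used by the ANY profile.
    static final long MAX_MEMORY = Long.MAX_VALUE;

    // Unsafe: adding two max-size components overflows silently.
    static long totalMemoryUnsafe(long heap, long offHeap) {
        return heap + offHeap; // MAX + MAX wraps to -2
    }

    // Safer: detect the sentinel before doing any arithmetic.
    static long totalMemorySafe(long heap, long offHeap) {
        if (heap == MAX_MEMORY || offHeap == MAX_MEMORY) {
            return MAX_MEMORY; // treat the ANY sentinel as unbounded
        }
        return Math.addExact(heap, offHeap); // throws instead of wrapping
    }

    public static void main(String[] args) {
        System.out.println(totalMemoryUnsafe(MAX_MEMORY, MAX_MEMORY)); // negative
        System.out.println(totalMemorySafe(MAX_MEMORY, 1024L));        // MAX_MEMORY
    }
}
```

This is why every getter that sums memory components must compare against the ANY singleton first, as the issue states.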



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26060) Make Python specific exec nodes unsupported

2022-02-09 Thread Francesco Guardiani (Jira)
Francesco Guardiani created FLINK-26060:
---

 Summary: Make Python specific exec nodes unsupported 
 Key: FLINK-26060
 URL: https://issues.apache.org/jira/browse/FLINK-26060
 Project: Flink
  Issue Type: Sub-task
Reporter: Francesco Guardiani
Assignee: Francesco Guardiani


These nodes are using the old type system, which is going to be removed soon. 
We should avoid supporting them in the persisted plan, as we cannot commit to 
supporting them. Once migrated to the new type system, PyFlink won't need these 
nodes anymore and will just rely on the new Table function stack. For more details, 
also check https://issues.apache.org/jira/browse/FLINK-25231



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Konstantin Knauf
Thank you for raising this issue. What risks do you see if we drop it? Do
you see any cheaper alternative to (partially) mitigate those risks?

On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler  wrote:

> For a few years by now we had a set of Jepsen tests that verify the
> correctness of Flinks coordination layer in the case of process crashes.
> In the past it has indeed found issues and thus provided value to the
> project, and in general the core idea of it (and Jepsen for that matter)
> is very sound.
>
> However, so far we neither made attempts to make further use of Jepsen
> (and limited ourselves to very basic tests) nor to familiarize ourselves
> with the tests/jepsen at all.
> As a result these tests are difficult to maintain. They (and Jepsen) are
> written in Clojure, which makes debugging, changes and upstreaming
> contributions very difficult.
> Additionally, the tests also make use of a very complicated
> (Ververica-internal) terraform+ansible setup to spin up and tear down
> AWS machines. While it works (and is actually pretty cool), it's
> difficult to adjust because the people who wrote it have left the company.
>
> Why I'm raising this now (and not earlier) is because so far keeping the
> tests running wasn't much of a problem; bump a few dependencies here and
> there and we're good to go.
>
> However, this has changed with the recent upgrade to Zookeeper 3.5,
> which isn't supported by Jepsen out-of-the-box, completely breaking the
> tests. We'd now have to write a new Zookeeper 3.5+ integration for
> Jepsen (again, in Clojure). While I started working on that and could
> likely finish it, I started to wonder whether it even makes sense to do
> so, and whether we couldn't invest this time elsewhere.
>
> Let me know what you think.
>
>

-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk


Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Chesnay Schepler

The jepsen tests cover 3 cases:
a) JM/TM crashes
b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
c) network partitions

a) can (and probably is) reasonably covered by existing ITCases and e2e 
tests

b) We could probably figure this out ourselves if we wanted to.
c) is the difficult part.

Note that the tests also only cover yarn (per-job/session) and 
standalone (session) deployments.


On 09/02/2022 17:11, Konstantin Knauf wrote:

Thank you for raising this issue. What risks do you see if we drop it? Do
you see any cheaper alternative to (partially) mitigate those risks?

On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler  wrote:


For a few years by now we had a set of Jepsen tests that verify the
correctness of Flinks coordination layer in the case of process crashes.
In the past it has indeed found issues and thus provided value to the
project, and in general the core idea of it (and Jepsen for that matter)
is very sound.

However, so far we neither made attempts to make further use of Jepsen
(and limited ourselves to very basic tests) nor to familiarize ourselves
with the tests/jepsen at all.
As a result these tests are difficult to maintain. They (and Jepsen) are
written in Clojure, which makes debugging, changes and upstreaming
contributions very difficult.
Additionally, the tests also make use of a very complicated
(Ververica-internal) terraform+ansible setup to spin up and tear down
AWS machines. While it works (and is actually pretty cool), it's
difficult to adjust because the people who wrote it have left the company.

Why I'm raising this now (and not earlier) is because so far keeping the
tests running wasn't much of a problem; bump a few dependencies here and
there and we're good to go.

However, this has changed with the recent upgrade to Zookeeper 3.5,
which isn't supported by Jepsen out-of-the-box, completely breaking the
tests. We'd now have to write a new Zookeeper 3.5+ integration for
Jepsen (again, in Clojure). While I started working on that and could
likely finish it, I started to wonder whether it even makes sense to do
so, and whether we couldn't invest this time elsewhere.

Let me know what you think.






Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Chesnay Schepler

b/c are part of the same test.

 * We have a job running,
 * trigger a network partition (failing the job),
 * then crash HDFS (preventing checkpoints and access to the HA
   storageDir),
 * then the partition is resolved and HDFS is started again.

Conceptually I would think we can replicate this by nuking half the 
cluster, crashing HDFS/ZK, and restarting everything.


On 09/02/2022 17:39, Chesnay Schepler wrote:

The jepsen tests cover 3 cases:
a) JM/TM crashes
b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
c) network partitions

a) can (and probably is) reasonably covered by existing ITCases and 
e2e tests

b) We could probably figure this out ourselves if we wanted to.
c) is the difficult part.

Note that the tests also only cover yarn (per-job/session) and 
standalone (session) deployments.


On 09/02/2022 17:11, Konstantin Knauf wrote:
Thank you for raising this issue. What risks do you see if we drop 
it? Do

you see any cheaper alternative to (partially) mitigate those risks?

On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler  
wrote:



For a few years by now we had a set of Jepsen tests that verify the
correctness of Flinks coordination layer in the case of process 
crashes.

In the past it has indeed found issues and thus provided value to the
project, and in general the core idea of it (and Jepsen for that 
matter)

is very sound.

However, so far we neither made attempts to make further use of Jepsen
(and limited ourselves to very basic tests) nor to familiarize 
ourselves

with the tests/jepsen at all.
As a result these tests are difficult to maintain. They (and Jepsen) 
are

written in Clojure, which makes debugging, changes and upstreaming
contributions very difficult.
Additionally, the tests also make use of a very complicated
(Ververica-internal) terraform+ansible setup to spin up and tear down
AWS machines. While it works (and is actually pretty cool), it's
difficult to adjust because the people who wrote it have left the 
company.


Why I'm raising this now (and not earlier) is because so far keeping 
the
tests running wasn't much of a problem; bump a few dependencies here 
and

there and we're good to go.

However, this has changed with the recent upgrade to Zookeeper 3.5,
which isn't supported by Jepsen out-of-the-box, completely breaking the
tests. We'd now have to write a new Zookeeper 3.5+ integration for
Jepsen (again, in Clojure). While I started working on that and could
likely finish it, I started to wonder whether it even makes sense to do
so, and whether we couldn't invest this time elsewhere.

Let me know what you think.






Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread David Morávek
Network partitions are trickier than simply crashing a process. For example
these can be asymmetric: as a TM you're still able to talk to the JM, but
you're not able to talk to other TMs.

In general this could be achieved by manipulating iptables on the host
machine (considering we spawn all the processes locally), but I'm not sure if
that will solve the "make it less complicated for others to contribute"
part :/ Also this kind of test would be executable on *nix systems only.

I assume that jepsen uses the same approach under the hood.

D.

On Wed, Feb 9, 2022 at 5:43 PM Chesnay Schepler  wrote:

> b/c are part of the same test.
>
>   * We have a job running,
>   * trigger a network partition (failing the job),
>   * then crash HDFS (preventing checkpoints and access to the HA
> storageDir),
>   * then the partition is resolved and HDFS is started again.
>
> Conceptually I would think we can replicate this by nuking half the
> cluster, crashing HDFS/ZK, and restarting everything.
>
> On 09/02/2022 17:39, Chesnay Schepler wrote:
> > The jepsen tests cover 3 cases:
> > a) JM/TM crashes
> > b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
> > c) network partitions
> >
> > a) can (and probably is) reasonably covered by existing ITCases and
> > e2e tests
> > b) We could probably figure this out ourselves if we wanted to.
> > c) is the difficult part.
> >
> > Note that the tests also only cover yarn (per-job/session) and
> > standalone (session) deployments.
> >
> > On 09/02/2022 17:11, Konstantin Knauf wrote:
> >> Thank you for raising this issue. What risks do you see if we drop
> >> it? Do
> >> you see any cheaper alternative to (partially) mitigate those risks?
> >>
> >> On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler 
> >> wrote:
> >>
> >>> For a few years by now we had a set of Jepsen tests that verify the
> >>> correctness of Flinks coordination layer in the case of process
> >>> crashes.
> >>> In the past it has indeed found issues and thus provided value to the
> >>> project, and in general the core idea of it (and Jepsen for that
> >>> matter)
> >>> is very sound.
> >>>
> >>> However, so far we neither made attempts to make further use of Jepsen
> >>> (and limited ourselves to very basic tests) nor to familiarize
> >>> ourselves
> >>> with the tests/jepsen at all.
> >>> As a result these tests are difficult to maintain. They (and Jepsen)
> >>> are
> >>> written in Clojure, which makes debugging, changes and upstreaming
> >>> contributions very difficult.
> >>> Additionally, the tests also make use of a very complicated
> >>> (Ververica-internal) terraform+ansible setup to spin up and tear down
> >>> AWS machines. While it works (and is actually pretty cool), it's
> >>> difficult to adjust because the people who wrote it have left the
> >>> company.
> >>>
> >>> Why I'm raising this now (and not earlier) is because so far keeping
> >>> the
> >>> tests running wasn't much of a problem; bump a few dependencies here
> >>> and
> >>> there and we're good to go.
> >>>
> >>> However, this has changed with the recent upgrade to Zookeeper 3.5,
> >>> which isn't supported by Jepsen out-of-the-box, completely breaking the
> >>> tests. We'd now have to write a new Zookeeper 3.5+ integration for
> >>> Jepsen (again, in Clojure). While I started working on that and could
> >>> likely finish it, I started to wonder whether it even makes sense to do
> >>> so, and whether we couldn't invest this time elsewhere.
> >>>
> >>> Let me know what you think.
> >>>
> >>>
> >
>


Re: Statefun async http request via RequestReplyFunctionBuilder

2022-02-09 Thread Galen Warren
Thanks! PR created and noted on the ticket.

On Wed, Feb 9, 2022 at 3:42 AM Till Rohrmann  wrote:

> Hi Galen,
>
> Great to hear it :-) I've assigned you to the ticket [1]. Next you can open
> a PR against the repository and we will review it.
>
> [1] https://issues.apache.org/jira/browse/FLINK-25933
>
> Cheers,
> Till
>
> On Tue, Feb 8, 2022 at 11:58 PM Galen Warren 
> wrote:
>
> > I'm ready to pick this one up, I have some code that's working locally.
> > Shall I create a PR?
> >
> >
> >
> > On Wed, Feb 2, 2022 at 3:17 PM Igal Shilman 
> > wrote:
> >
> > > Great, ping me when you would like to pick this up.
> > >
> > > For the related issue, I think that can be a good addition indeed!
> > >
> > > On Wed, Feb 2, 2022 at 8:55 PM Galen Warren 
> > > wrote:
> > >
> > > > Gotcha, thanks. I may be able to work on that one in a couple weeks
> if
> > > > you're looking for help.
> > > >
> > > > Unrelated question -- another thing that would be useful for me would
> > be
> > > > the ability to set a maximum backoff interval in
> > > BoundedExponentialBackoff
> > > > or the async equivalent. My situation is this. I'd like to set a long
> > > > maxRequestDuration, so it takes a good while for Statefun to "give
> up"
> > > on a
> > > > function call, i.e. perhaps several hours or even days. During that
> > time,
> > > > if the backoff interval doubles on each failure, those backoff
> > intervals
> > > > get pretty long.
> > > >
> > > > Sometimes, in exponential backoff implementations, I've seen the
> > concept
> > > of
> > > > a max backoff interval, i.e. once the backoff interval reaches that
> > > point,
> > > > it won't go any higher. So I could set it to, say, 60 seconds, and no
> > > > matter how long it would retry the function, the interval between
> > retries
> > > > wouldn't be more than that.
> > > >
> > > > Do you think that would be a useful addition? I could post something
> to
> > > the
> > > > dev list if you want.
> > > >
> > > > On Wed, Feb 2, 2022 at 2:39 PM Igal Shilman  wrote:
> > > >
> > > > > Hi Galen,
> > > > > You are right, it is not possible, but there is no real reason for
> > > that.
> > > > > We should fix this, and I've created the following JIRA issue [1]
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-25933
> > > > >
> > > > > On Wed, Feb 2, 2022 at 6:30 PM Galen Warren <
> ga...@cvillewarrens.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Is it possible to choose the async HTTP transport using
> > > > > > RequestReplyFunctionBuilder? It looks to me that it is not, but I
> > > > wanted
> > > > > to
> > > > > > double check. Thanks.
> > > > > >
> > > > >
> > > >
> > >
> >
>
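The capped exponential backoff discussed in the thread above can be sketched in a few lines: the delay doubles on each failure but never exceeds a configured maximum. The class and parameter names are illustrative, not StateFun's actual API:

```java
// Hedged sketch of exponential backoff with a maximum interval.
public class CappedExponentialBackoff {

    private final long initialBackoffMs;
    private final long maxBackoffMs;

    public CappedExponentialBackoff(long initialBackoffMs, long maxBackoffMs) {
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
    }

    // Delay before the given (1-based) retry attempt, capped at maxBackoffMs.
    public long backoffFor(int attempt) {
        long delay = initialBackoffMs;
        for (int i = 1; i < attempt && delay < maxBackoffMs; i++) {
            delay = Math.min(delay * 2, maxBackoffMs); // double, but never past the cap
        }
        return Math.min(delay, maxBackoffMs);
    }

    public static void main(String[] args) {
        CappedExponentialBackoff backoff = new CappedExponentialBackoff(1_000L, 60_000L);
        for (int attempt = 1; attempt <= 10; attempt++) {
            System.out.println("attempt " + attempt + " -> " + backoff.backoffFor(attempt) + " ms");
        }
    }
}
```

With an initial delay of 1 second and a 60-second cap, retries can continue for hours or days (a long maxRequestDuration) without the interval between attempts ever growing past a minute.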


[jira] [Created] (FLINK-26061) FileSystem.getFileStatus fails for directories

2022-02-09 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-26061:
-

 Summary: FileSystem.getFileStatus fails for directories
 Key: FLINK-26061
 URL: https://issues.apache.org/jira/browse/FLINK-26061
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / FileSystem
Affects Versions: 1.15.0
Reporter: Matthias Pohl


{{FileSystem.getFileStatus}} is not supported for directories in object stores 
(like s3). We might want to adjust the interface here, since it's not clear 
from the contract that this fails for certain {{FileSystem}} implementations. This is 
especially unintuitive because there is an {{isDir()}} method provided by the 
resulting {{FileStatus}} instance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26062) [Changelog] Non-deterministic recovery of PriorityQueue states

2022-02-09 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-26062:
-

 Summary: [Changelog] Non-deterministic recovery of PriorityQueue 
states
 Key: FLINK-26062
 URL: https://issues.apache.org/jira/browse/FLINK-26062
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.15.0
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
 Fix For: 1.15.0


Currently, InternalPriorityQueue.poll() is logged as a separate operation, 
without specifying the element that has been polled. On recovery, this recorded 
poll() is replayed.

However, this is not deterministic because the order of PQ elements with equal 
priority is not specified. For example, TimerHeapInternalTimer only compares 
timestamps, which are often equal. This results in polling timers from the queue in 
the wrong order => dropping timers => not firing timers.

 

ProcessingTimeWindowCheckpointingITCase.testAggregatingSlidingProcessingTimeWindow
 fails with materialization enabled and using heap state backend (both 
in-memory and fs-based implementations).

 

The proposed solution is to replace the poll operation with a remove operation 
(which is based on equality).
 
cc: [~masteryhx], [~ym], [~yunta]
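The hazard described above can be sketched with a plain java.util.PriorityQueue: among elements with equal priority, poll() returns an arbitrary one, whereas remove(element) identifies the element by equality. The Timer class below is an illustrative stand-in for TimerHeapInternalTimer, not Flink's actual type:

```java
import java.util.PriorityQueue;

// Sketch: why replaying a logged "poll" is non-deterministic for ties,
// while replaying "remove(element)" is not.
public class PollVsRemoveSketch {

    // Ordered only by timestamp, so equal timestamps compare as ties.
    static final class Timer {
        final long timestamp;
        final String key;

        Timer(long timestamp, String key) {
            this.timestamp = timestamp;
            this.key = key;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Timer)) {
                return false;
            }
            Timer other = (Timer) o;
            return other.timestamp == timestamp && other.key.equals(key);
        }

        @Override
        public int hashCode() {
            return key.hashCode();
        }

        @Override
        public String toString() {
            return key + "@" + timestamp;
        }
    }

    public static void main(String[] args) {
        PriorityQueue<Timer> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.timestamp, b.timestamp));
        Timer t1 = new Timer(42L, "a");
        Timer t2 = new Timer(42L, "b"); // same timestamp: the comparator ties

        queue.add(t1);
        queue.add(t2);

        // poll() returns an arbitrary element among ties, and a logged
        // "poll" entry does not record which one was taken.
        Timer polled = queue.poll();
        System.out.println("polled: " + polled);
        queue.add(polled); // restore the queue

        // remove(t2) identifies the element by equality, so replaying a
        // logged "remove" always drops the same logical timer.
        boolean removed = queue.remove(t2);
        System.out.println("removed t2: " + removed);
    }
}
```

On recovery, replaying remove-by-equality reproduces exactly the state the original run had, regardless of how the heap happened to order equal-priority elements.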



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26063) [Changelog] Incorrect key group logged for PQ.poll and remove

2022-02-09 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-26063:
-

 Summary: [Changelog] Incorrect key group logged for PQ.poll and 
remove
 Key: FLINK-26063
 URL: https://issues.apache.org/jira/browse/FLINK-26063
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.15.0
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
 Fix For: 1.15.0


Key group is logged so that state changes can be re-distributed or shuffled.
It is currently obtained from keyContext during poll() and remove() operations.
However, keyContext is not updated when dequeuing processing time timers.

The impact is relatively small for remove(): in the worst case, the operation 
will be ignored.
poll() should probably be replaced with remove() anyway - see FLINK-26062.

One way to solve this problem is to extract the key group from the polled element, 
if it is a timer.

cc: [~masteryhx], [~ym], [~yunta]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26064) KinesisFirehoseSinkITCase IllegalStateException: Trying to access closed classloader

2022-02-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26064:
--

 Summary: KinesisFirehoseSinkITCase IllegalStateException: Trying 
to access closed classloader
 Key: FLINK-26064
 URL: https://issues.apache.org/jira/browse/FLINK-26064
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kinesis
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=31044&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=ed165f3f-d0f6-524b-5279-86f8ee7d0e2d

(shortened stack trace, as the full one is too large)
{noformat}
Feb 09 20:05:04 java.util.concurrent.ExecutionException: 
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute 
HTTP request: Trying to access closed classloader. Please check if you store 
classloaders directly or indirectly in static fields. If the stacktrace 
suggests that the leak occurs in a third party library and cannot be fixed 
immediately, you can disable this check with the configuration 
'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
Feb 09 20:05:04 at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
(...)
Feb 09 20:05:04 Caused by: 
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute 
HTTP request: Trying to access closed classloader. Please check if you store 
classloaders directly or indirectly in static fields. If the stacktrace 
suggests that the leak occurs in a third party library and cannot be fixed 
immediately, you can disable this check with the configuration 
'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)
(...)
Feb 09 20:05:04 Caused by: java.lang.IllegalStateException: Trying to access 
closed classloader. Please check if you store classloaders directly or 
indirectly in static fields. If the stacktrace suggests that the leak occurs in 
a third party library and cannot be fixed immediately, you can disable this 
check with the configuration 'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:164)
Feb 09 20:05:04 at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResources(FlinkUserCodeClassLoaders.java:188)
Feb 09 20:05:04 at 
java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:348)
Feb 09 20:05:04 at 
java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
Feb 09 20:05:04 at 
java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder$1.run(FactoryFinder.java:352)
Feb 09 20:05:04 at java.security.AccessController.doPrivileged(Native 
Method)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.findServiceProvider(FactoryFinder.java:341)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.find(FactoryFinder.java:313)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.find(FactoryFinder.java:227)
Feb 09 20:05:04 at 
javax.xml.stream.XMLInputFactory.newInstance(XMLInputFactory.java:154)
Feb 09 20:05:04 at 
software.amazon.awssdk.protocols.query.unmarshall.XmlDomParser.createXmlInputFactory(XmlDomParser.java:124)
Feb 09 20:05:04 at 
java.lang.ThreadLocal$SuppliedThreadLocal.initialValue(ThreadLocal.java:284)
Feb 09 20:05:04 at 
java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:180)
Feb 09 20:05:04 at java.lang.ThreadLocal.get(ThreadLocal.java:170)
(...)
{noformat}
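For reference, the workaround named in the error message is a single Flink configuration option. A minimal fragment for flink-conf.yaml (only as a temporary measure while the third-party leak is unfixed):

```yaml
# Disables Flink's leaked-classloader detection, as the error message above
# suggests. Here the leak sits in the AWS SDK, which caches an
# XMLInputFactory in a ThreadLocal that outlives the user-code classloader.
classloader.check-leaked-classloader: false
```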




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Austin Cawley-Edwards
Are there e2e tests that run on kubernetes? Perhaps k8s network policies[1]
would be an option to simulate asymmetric network partitions without
modifying iptables in a more approachable way?

Austin

[1]:
https://kubernetes.io/docs/concepts/services-networking/network-policies/
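The idea above could be sketched as a single policy object. This is a hypothetical example: the pod labels (`component: taskmanager`, `component: jobmanager`) are assumptions rather than Flink's actual Kubernetes labels, and enforcement requires a CNI plugin that supports NetworkPolicy:

```yaml
# Hypothetical sketch: block TaskManager-to-TaskManager ingress while still
# allowing JobManager traffic, approximating an asymmetric network partition.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: partition-taskmanagers
spec:
  podSelector:
    matchLabels:
      component: taskmanager
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              component: jobmanager
```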


On Wed, Feb 9, 2022 at 12:40 PM David Morávek  wrote:

> Network partitions are trickier than simply crashing a process. For example,
> they can be asymmetric: as a TM you're still able to talk to the JM, but
> you're not able to talk to other TMs.
>
> In general this could be achieved by manipulating iptables on the host
> machine (considering we spawn all the processes locally), but I'm not sure
> that will solve the "make it less complicated for others to contribute"
> part :/ Also this kind of test would be executable on *nix systems only.
>
> I assume that jepsen uses the same approach under the hood.
>
> D.
>
> On Wed, Feb 9, 2022 at 5:43 PM Chesnay Schepler 
> wrote:
>
> > b/c are part of the same test.
> >
> >   * We have a job running,
> >   * trigger a network partition (failing the job),
> >   * then crash HDFS (preventing checkpoints and access to the HA
> > storageDir),
> >   * then the partition is resolved and HDFS is started again.
> >
> > Conceptually I would think we can replicate this by nuking half the
> > cluster, crashing HDFS/ZK, and restarting everything.
> >
> > On 09/02/2022 17:39, Chesnay Schepler wrote:
> > > The jepsen tests cover 3 cases:
> > > a) JM/TM crashes
> > > b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
> > > c) network partitions
> > >
> > > a) can (and probably is) reasonably covered by existing ITCases and
> > > e2e tests
> > > b) We could probably figure this out ourselves if we wanted to.
> > > c) is the difficult part.
> > >
> > > Note that the tests also only cover yarn (per-job/session) and
> > > standalone (session) deployments.
> > >
> > > On 09/02/2022 17:11, Konstantin Knauf wrote:
> > >> Thank you for raising this issue. What risks do you see if we drop
> > >> it? Do
> > >> you see any cheaper alternative to (partially) mitigate those risks?
> > >>
> > >> On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler 
> > >> wrote:
> > >>
> > >>> For a few years now we have had a set of Jepsen tests that verify the
> > >>> correctness of Flink's coordination layer in the case of process
> > >>> crashes.
> > >>> In the past it has indeed found issues and thus provided value to the
> > >>> project, and in general the core idea of it (and Jepsen for that
> > >>> matter)
> > >>> is very sound.
> > >>>
> > >>> However, so far we have neither made attempts to make further use of
> > >>> Jepsen (and limited ourselves to very basic tests) nor familiarized
> > >>> ourselves with the tests/Jepsen at all.
> > >>> As a result these tests are difficult to maintain. They (and Jepsen)
> > >>> are
> > >>> written in Clojure, which makes debugging, changes and upstreaming
> > >>> contributions very difficult.
> > >>> Additionally, the tests also make use of a very complicated
> > >>> (Ververica-internal) terraform+ansible setup to spin up and tear down
> > >>> AWS machines. While it works (and is actually pretty cool), it's
> > >>> difficult to adjust because the people who wrote it have left the
> > >>> company.
> > >>>
> > >>> Why I'm raising this now (and not earlier) is because so far keeping
> > >>> the
> > >>> tests running wasn't much of a problem; bump a few dependencies here
> > >>> and
> > >>> there and we're good to go.
> > >>>
> > >>> However, this has changed with the recent upgrade to Zookeeper 3.5,
> > >>> which isn't supported by Jepsen out-of-the-box, completely breaking
> the
> > >>> tests. We'd now have to write a new Zookeeper 3.5+ integration for
> > >>> Jepsen (again, in Clojure). While I started working on that and could
> > >>> likely finish it, I started to wonder whether it even makes sense to
> do
> > >>> so, and whether we couldn't invest this time elsewhere.
> > >>>
> > >>> Let me know what you think.
> > >>>
> > >>>
> > >
> >
>


Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Yang Wang
@Austin
We already have some e2e tests[1] that guard k8s deployments (both session
and application, with or without HA).
And I agree with you that network partition could be simulated by K8s
network policy.


[1].
https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/test_kubernetes_application_ha.sh

Best,
Yang



[jira] [Created] (FLINK-26065) org.apache.flink.table.api.PlanReference$ContentPlanReference $FilePlanReference $ResourcePlanReference violation the api rules

2022-02-09 Thread Yun Gao (Jira)
Yun Gao created FLINK-26065:
---

 Summary: 
org.apache.flink.table.api.PlanReference$ContentPlanReference 
$FilePlanReference $ResourcePlanReference violation the api rules
 Key: FLINK-26065
 URL: https://issues.apache.org/jira/browse/FLINK-26065
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / API
Affects Versions: 1.15.0
Reporter: Yun Gao



{code:java}
Feb 09 21:10:32 [ERROR] Failures: 
Feb 09 21:10:32 [ERROR]   Architecture Violation [Priority: MEDIUM] - Rule 
'Classes in API packages should have at least one API visibility annotation.' 
was violated (3 times):
Feb 09 21:10:32 org.apache.flink.table.api.PlanReference$ContentPlanReference 
does not satisfy: annotated with @Internal or annotated with @Experimental or 
annotated with @PublicEvolving or annotated with @Public or annotated with 
@Deprecated
Feb 09 21:10:32 org.apache.flink.table.api.PlanReference$FilePlanReference does 
not satisfy: annotated with @Internal or annotated with @Experimental or 
annotated with @PublicEvolving or annotated with @Public or annotated with 
@Deprecated
Feb 09 21:10:32 org.apache.flink.table.api.PlanReference$ResourcePlanReference 
does not satisfy: annotated with @Internal or annotated with @Experimental or 
annotated with @PublicEvolving or annotated with @Public or annotated with 
@Deprecated
Feb 09 21:10:32 [INFO] 
Feb 09 21:10:32 [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0
{code}
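The rule can be satisfied by annotating each nested class. The sketch below is illustrative only: `PublicEvolving` here is a stand-in for Flink's real `org.apache.flink.annotation.PublicEvolving`, and `hasApiAnnotation` merely mirrors the architecture rule rather than reproducing the actual ArchUnit check.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-in model of the fix: every nested PlanReference subclass carries an
// API visibility annotation, which is what the violated rule demands.
public class PlanReferenceSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @interface PublicEvolving {}

    @PublicEvolving
    public abstract static class PlanReference {}

    @PublicEvolving
    public static class ContentPlanReference extends PlanReference {}

    @PublicEvolving
    public static class FilePlanReference extends PlanReference {}

    @PublicEvolving
    public static class ResourcePlanReference extends PlanReference {}

    // Mirrors the rule: the class must declare at least one API annotation.
    public static boolean hasApiAnnotation(Class<?> clazz) {
        return clazz.isAnnotationPresent(PublicEvolving.class);
    }

    public static void main(String[] args) {
        Class<?>[] classes = {
            ContentPlanReference.class,
            FilePlanReference.class,
            ResourcePlanReference.class
        };
        for (Class<?> c : classes) {
            if (!hasApiAnnotation(c)) {
                throw new AssertionError(c + " lacks an API annotation");
            }
        }
        System.out.println("all nested plan references annotated");
    }
}
```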

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=31051&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=26427



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [VOTE] FLIP-212: Introduce Flink Kubernetes Operator

2022-02-09 Thread Thomas Weise
+1 (binding)

On Mon, Feb 7, 2022 at 1:10 AM Yangze Guo  wrote:
>
> +1 (binding)
>
> Best,
> Yangze Guo
>
> On Mon, Feb 7, 2022 at 5:04 PM K Fred  wrote:
> >
> > +1 (non-binding)
> >
> > Best Regards
> > Peng Yuan
> >
> > On Mon, Feb 7, 2022 at 4:49 PM Danny Cranmer 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks for this effort!
> > > Danny Cranmer
> > >
> > > On Mon, Feb 7, 2022 at 7:59 AM Biao Geng  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > Best,
> > > > Biao Geng
> > > >
> > > > On Mon, Feb 7, 2022 at 2:31 PM Peter Huang wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > >
> > > > > Best Regards
> > > > > Peter Huang
> > > > >
> > > > > On Sun, Feb 6, 2022 at 7:35 PM Yang Wang 
> > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > > On Mon, Feb 7, 2022 at 10:25 AM Xintong Song wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 7, 2022 at 12:52 AM Márton Balassi <
> > > > > balassi.mar...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > > On Sat, Feb 5, 2022 at 5:35 PM Israel Ekpo wrote:
> > > > > > > >
> > > > > > > > > I am very excited to see this.
> > > > > > > > >
> > > > > > > > > Thanks for driving the effort
> > > > > > > > >
> > > > > > > > > +1 (non-binding)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Feb 5, 2022 at 10:53 AM Shqiprim Bunjaku <
> > > > > > > > > shqiprimbunj...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 (non-binding)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, Feb 5, 2022 at 4:39 PM Chenya Zhang <
> > > > > > > > chenyazhangche...@gmail.com
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > >
> > > > > > > > > > > Thanks folks for leading this effort and making it happen
> > > so
> > > > > > fast!
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Chenya
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Feb 5, 2022 at 12:02 AM Gyula Fóra <
> > > > gyf...@apache.org>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Thomas!
> > > > > > > > > > > >
> > > > > > > > > > > > +1 (binding) from my side
> > > > > > > > > > > >
> > > > > > > > > > > > Happy to see this effort getting some traction!
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Gyula
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Feb 5, 2022 at 3:00 AM Thomas Weise <
> > > > t...@apache.org>
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'd like to start a vote on FLIP-212: Introduce Flink
> > > > > > > Kubernetes
> > > > > > > > > > > > > Operator [1] which has been discussed in [2].
> > > > > > > > > > > > >
> > > > > > > > > > > > > The vote will be open for at least 72 hours unless
> > > there
> > > > is
> > > > > > an
> > > > > > > > > > > > > objection or not enough votes.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > [2]
> > > > > > > > >
> > > https://lists.apache.org/thread/1z78t6rf70h45v7fbd2m93rm2y1bvh0z
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Israel Ekpo
> > > > > > > > > Lead Instructor, IzzyAcademy.com
> > > > > > > > > https://www.youtube.com/c/izzyacademy
> > > > > > > > > https://izzyacademy.com/
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >


[RESULT] [VOTE] FLIP-212: Introduce Flink Kubernetes Operator

2022-02-09 Thread Thomas Weise
Hi everyone,

FLIP-212 [1] has been accepted. There were 7 binding +1 votes and 6
non-binding +1 votes. No other votes.

binding:
Gyula Fóra
Márton Balassi
Xintong Song
Yang Wang
Danny Cranmer
Yangze Guo
Thomas Weise

non-binding:
Chenya Zhang
Shqiprim Bunjaku
Israel Ekpo
Peter Huang
Biao Geng
K Fred

Thank you all for voting, we are excited to take this forward. The
next step will be the creation of the repository.

Cheers,
Thomas

[1]  
https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator




[jira] [Created] (FLINK-26066) Introduce FileStoreRead

2022-02-09 Thread Caizhi Weng (Jira)
Caizhi Weng created FLINK-26066:
---

 Summary: Introduce FileStoreRead
 Key: FLINK-26066
 URL: https://issues.apache.org/jira/browse/FLINK-26066
 Project: Flink
  Issue Type: Sub-task
  Components: Table Store
Reporter: Caizhi Weng
 Fix For: 1.15.0


Apart from {{FileStoreWrite}}, we also need a {{FileStoreRead}} operation to 
read actual key-values for a specific partition and bucket.
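A rough sketch of what such a read operation might look like. The names (`FileStoreRead`, `KeyValue`) mirror the ticket, but the signatures and the in-memory stand-in below are assumptions, not the eventual Table Store API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FileStoreReadSketch {

    // A key-value pair as stored in one partition/bucket.
    public static class KeyValue {
        public final String key;
        public final String value;

        public KeyValue(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical counterpart to FileStoreWrite: reads the key-values of
    // one specific partition and bucket.
    public interface FileStoreRead {
        List<KeyValue> read(String partition, int bucket);
    }

    // In-memory stand-in, keyed by "partition/bucket".
    public static class InMemoryFileStoreRead implements FileStoreRead {
        private final Map<String, List<KeyValue>> store = new HashMap<>();

        public void put(String partition, int bucket, KeyValue kv) {
            store.computeIfAbsent(partition + "/" + bucket, k -> new ArrayList<>())
                    .add(kv);
        }

        @Override
        public List<KeyValue> read(String partition, int bucket) {
            return store.getOrDefault(partition + "/" + bucket, List.of());
        }
    }

    public static void main(String[] args) {
        InMemoryFileStoreRead read = new InMemoryFileStoreRead();
        read.put("dt=2022-02-09", 0, new KeyValue("k1", "v1"));
        System.out.println(read.read("dt=2022-02-09", 0).size()); // prints 1
    }
}
```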



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26067) ZooKeeperLeaderElectionConnectionHandlingTest. testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled failed due to timeout

2022-02-09 Thread Yun Gao (Jira)
Yun Gao created FLINK-26067:
---

 Summary: ZooKeeperLeaderElectionConnectionHandlingTest. 
testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled 
failed due to timeout
 Key: FLINK-26067
 URL: https://issues.apache.org/jira/browse/FLINK-26067
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Yun Gao



{code:java}
Feb 09 08:58:56 [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 18.67 s <<< FAILURE! - in 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionConnectionHandlingTest
Feb 09 08:58:56 [ERROR] 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
  Time elapsed: 8.096 s  <<< ERROR!
Feb 09 08:58:56 java.util.concurrent.TimeoutException
Feb 09 08:58:56 at 
org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:106)
Feb 09 08:58:56 at 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionConnectionHandlingTest$TestingContender.awaitRevokeLeadership(ZooKeeperLeaderElectionConnectionHandlingTest.java:211)
Feb 09 08:58:56 at 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionConnectionHandlingTest.lambda$testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled$2(ZooKeeperLeaderElectionConnectionHandlingTest.java:100)
Feb 09 08:58:56 at 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionConnectionHandlingTest.runTestWithZooKeeperConnectionProblem(ZooKeeperLeaderElectionConnectionHandlingTest.java:164)
Feb 09 08:58:56 at 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionConnectionHandlingTest.runTestWithLostZooKeeperConnection(ZooKeeperLeaderElectionConnectionHandlingTest.java:109)
Feb 09 08:58:56 at 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled(ZooKeeperLeaderElectionConnectionHandlingTest.java:96)
Feb 09 08:58:56 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
Feb 09 08:58:56 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Feb 09 08:58:56 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Feb 09 08:58:56 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 09 08:58:56 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
Feb 09 08:58:56 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
Feb 09 08:58:56 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
Feb 09 08:58:56 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
Feb 09 08:58:56 at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
Feb 09 08:58:56 at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource$CloseableStatement.evaluate(TestingFatalErrorHandlerResource.java:94)
Feb 09 08:58:56 at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource$CloseableStatement.access$200(TestingFatalErrorHandlerResource.java:86)
Feb 09 08:58:56 at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource$1.evaluate(TestingFatalErrorHandlerResource.java:58)
Feb 09 08:58:56 at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Feb 09 08:58:56 at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Feb 09 08:58:56 at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
Feb 09 08:58:56 at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Feb 09 08:58:56 at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
Feb 09 08:58:56 at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
Feb 09 08:58:56 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
Feb 09 08:58:56 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
Feb 09 08:58:56 at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
Feb 09 08:58:56 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
Feb 09 08:58:56 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
Feb 09 08:58:56 at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
Feb 09 08:58:56 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)

{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30979&view=logs&j=a57e0635-3fad-5b08-57c7-a4

[jira] [Created] (FLINK-26068) ZooKeeperJobGraphsStoreITCase.testPutAndRemoveJobGraph failed on azure

2022-02-09 Thread Yun Gao (Jira)
Yun Gao created FLINK-26068:
---

 Summary: ZooKeeperJobGraphsStoreITCase.testPutAndRemoveJobGraph 
failed on azure
 Key: FLINK-26068
 URL: https://issues.apache.org/jira/browse/FLINK-26068
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Yun Gao



{code:java}

Feb 09 13:41:24 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$BadVersionException:
 KeeperErrorCode = BadVersion for 
/flink/default/testPutAndRemoveJobGraph/372cd3c2dc2c8b3071d3f8fec2285fb9
Feb 09 13:41:24 at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:122)
Feb 09 13:41:24 at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
Feb 09 13:41:24 at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2384)
Feb 09 13:41:24 at 
org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.SetDataBuilderImpl$7.call(SetDataBuilderImpl.java:398)
Feb 09 13:41:24 at 
org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.SetDataBuilderImpl$7.call(SetDataBuilderImpl.java:385)
Feb 09 13:41:24 at 
org.apache.flink.shaded.curator5.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
Feb 09 13:41:24 at 
org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:382)
Feb 09 13:41:24 at 
org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:358)
Feb 09 13:41:24 at 
org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:36)
Feb 09 13:41:24 at 
org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.setStateHandle(ZooKeeperStateHandleStore.java:268)
Feb 09 13:41:24 at 
org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.replace(ZooKeeperStateHandleStore.java:232)
Feb 09 13:41:24 at 
org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.replace(ZooKeeperStateHandleStore.java:86)
Feb 09 13:41:24 at 
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.putJobGraph(DefaultJobGraphStore.java:226)
Feb 09 13:41:24 at 
org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphsStoreITCase.testPutAndRemoveJobGraph(ZooKeeperJobGraphsStoreITCase.java:123)
Feb 09 13:41:24 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
Feb 09 13:41:24 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Feb 09 13:41:24 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Feb 09 13:41:24 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 09 13:41:24 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
Feb 09 13:41:24 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
Feb 09 13:41:24 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
Feb 09 13:41:24 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
Feb 09 13:41:24 at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
Feb 09 13:41:24 at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Feb 09 13:41:24 at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Feb 09 13:41:24 at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
Feb 09 13:41:24 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
Feb 09 13:41:24 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
Feb 09 13:41:24 at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Feb 09 13:41:24 at 
org.junit.runners.ParentRunner.run(ParentRunner.java

[jira] [Created] (FLINK-26069) KinesisFirehoseSinkITCase failed due to org.testcontainers.containers.ContainerLaunchException: Container startup failed

2022-02-09 Thread Yun Gao (Jira)
Yun Gao created FLINK-26069:
---

 Summary: KinesisFirehoseSinkITCase failed due to 
org.testcontainers.containers.ContainerLaunchException: Container startup failed
 Key: FLINK-26069
 URL: https://issues.apache.org/jira/browse/FLINK-26069
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kinesis
Affects Versions: 1.15.0
Reporter: Yun Gao



{code:java}
2022-02-09T20:52:36.6208358Z Feb 09 20:52:36 [ERROR] Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
2022-02-09T20:52:37.8270432Z Feb 09 20:52:37 [INFO] Running org.apache.flink.connector.firehose.sink.KinesisFirehoseSinkITCase
2022-02-09T20:54:08.9842331Z Feb 09 20:54:08 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 91.02 s <<< FAILURE! - in org.apache.flink.connector.firehose.sink.KinesisFirehoseSinkITCase
2022-02-09T20:54:08.9845140Z Feb 09 20:54:08 [ERROR] org.apache.flink.connector.firehose.sink.KinesisFirehoseSinkITCase  Time elapsed: 91.02 s  <<< ERROR!
2022-02-09T20:54:08.9847119Z Feb 09 20:54:08 org.testcontainers.containers.ContainerLaunchException: Container startup failed
2022-02-09T20:54:08.9848834Z Feb 09 20:54:08    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:336)
2022-02-09T20:54:08.9850502Z Feb 09 20:54:08    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:317)
2022-02-09T20:54:08.9852012Z Feb 09 20:54:08    at org.testcontainers.containers.GenericContainer.starting(GenericContainer.java:1066)
2022-02-09T20:54:08.9853695Z Feb 09 20:54:08    at org.testcontainers.containers.FailureDetectingExternalResource$1.evaluate(FailureDetectingExternalResource.java:29)
2022-02-09T20:54:08.9855316Z Feb 09 20:54:08    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
2022-02-09T20:54:08.9856955Z Feb 09 20:54:08    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2022-02-09T20:54:08.9858330Z Feb 09 20:54:08    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
2022-02-09T20:54:08.9859838Z Feb 09 20:54:08    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
2022-02-09T20:54:08.9861123Z Feb 09 20:54:08    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
2022-02-09T20:54:08.9862747Z Feb 09 20:54:08    at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
2022-02-09T20:54:08.9864691Z Feb 09 20:54:08    at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
2022-02-09T20:54:08.9866384Z Feb 09 20:54:08    at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72)
2022-02-09T20:54:08.9868138Z Feb 09 20:54:08    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
2022-02-09T20:54:08.9869980Z Feb 09 20:54:08    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
2022-02-09T20:54:08.9871255Z Feb 09 20:54:08    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
2022-02-09T20:54:08.9872602Z Feb 09 20:54:08    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
2022-02-09T20:54:08.9874126Z Feb 09 20:54:08    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
2022-02-09T20:54:08.9875899Z Feb 09 20:54:08    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
2022-02-09T20:54:08.9877109Z Feb 09 20:54:08    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
2022-02-09T20:54:08.9878367Z Feb 09 20:54:08    at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
2022-02-09T20:54:08.9879761Z Feb 09 20:54:08    at org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
2022-02-09T20:54:08.9881148Z Feb 09 20:54:08    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:188)
2022-02-09T20:54:08.9882768Z Feb 09 20:54:08    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:154)
2022-02-09T20:54:08.9884214Z Feb 09 20:54:08    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
2022-02-09T20:54:08.9885475Z Feb 09 20:54:08    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
2022-02-09T20:54:08.9886856Z Feb 09 20:54:08    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
2022-02-09T20:54:08.9888037Z Feb 09 20:54:08    at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562
{code}
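The `ContainerLaunchException` above is Testcontainers giving up after the container failed to come up, which on shared CI agents is often transient. Testcontainers offers `withStartupAttempts(int)` on `GenericContainer` for exactly this; the underlying idea can be sketched as a plain-Java retry helper (the names `StartupRetry` and `withAttempts` are illustrative, not a Testcontainers API):

```java
import java.util.concurrent.Callable;

public class StartupRetry {
    // Attempt a startup action up to maxAttempts times, rethrowing the
    // last failure only if every attempt fails -- the same idea as
    // Testcontainers' GenericContainer.withStartupAttempts(int).
    static <T> T withAttempts(int maxAttempts, Callable<T> startup) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return startup.call();
            } catch (Exception e) {
                last = e;  // remember the failure and try again
            }
        }
        throw last;  // all attempts exhausted
    }

    public static void main(String[] args) throws Exception {
        // Simulate a container that fails twice, then starts on attempt 3.
        int[] calls = {0};
        String result = withAttempts(3, () -> {
            if (++calls[0] < 3) throw new IllegalStateException("Container startup failed");
            return "started";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retries mask genuinely flaky infrastructure but not a deterministic startup bug, so the last exception is preserved and rethrown for triage when every attempt fails.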

Re: Azure Pipelines are dealing with an incident, causing pipeline runs to fail

2022-02-09 Thread Martijn Visser
It looks like the issues haven't been resolved yet, unfortunately.

On Wed, 9 Feb 2022 at 13:15, Robert Metzger  wrote:

> I filed a support request with Microsoft:
>
> https://developercommunity.visualstudio.com/t/Number-of-Microsoft-hosted-agents-droppe/1658827?from=email&space=21&sort=newest
>
> On Wed, Feb 9, 2022 at 1:04 PM Martijn Visser 
> wrote:
>
> > Unfortunately, it looks like there are still failures. Will keep you
> > posted.
> >
> > On Wed, 9 Feb 2022 at 11:51, Martijn Visser 
> wrote:
> >
> > > Hi everyone,
> > >
> > > The issue should now be resolved.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Wed, 9 Feb 2022 at 10:55, Martijn Visser 
> > wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> Please keep in mind that Azure Pipelines is currently dealing with an
> > >> incident [1] which causes all CI pipeline runs on Azure to fail. Once
> > >> the incident has been resolved, you will need to retrigger your
> > >> pipeline to see whether it then passes.
> > >>
> > >> Best regards,
> > >>
> > >> Martijn Visser
> > >> https://twitter.com/MartijnVisser82
> > >>
> > >> [1] https://status.dev.azure.com/_event/287959626
> > >>
> > >
> >
>