Re: [DISCUSS] FLIP-36 - Support Interactive Programming in Flink Table API
Hi folks, As the feature freeze of Flink 1.11 has passed and the release branch is cut, I'd like to revive this discussion thread for FLIP-36[1]. A quick summary of FLIP-36: the FLIP proposes to add support for interactive programming in the Flink Table API. Specifically, it lets users cache intermediate results (tables) and use them in later jobs. Any comments or suggestions are much appreciated. I'd like to know what the community thinks about the FLIP. In the meantime, I plan to revive the vote thread if no comments are received within 2 days. Best, Xuannan [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-36%3A+Support+Interactive+Programming+in+Flink+Table+API On Thu, May 7, 2020 at 5:40 PM Xuannan Su wrote: > Hi, > > There is some feedback from @Timo and @Kurt in the voting thread for > FLIP-36 and I want to share my thoughts here. > > 1. How would FLIP-36 look after FLIP-84? > I don't think FLIP-84 will affect FLIP-36 from the public API perspective. > Users can call .cache on a table object and the cached table will be > generated whenever the table job is triggered to execute, either by > Table#executeInsert or StatementSet#execute. I think that FLIP-36 should be > aware of the changes made by FLIP-84, but it shouldn't be a problem. At the > end of the day, FLIP-36 only requires the ability to add a sink to a node, > submit a table job with multiple sinks, and replace the cached table with a > source. > > 2. How can we support cache in a multi-statement SQL file? > The most intuitive way to support cache in a multi-statement SQL file is > by using a view, where the view corresponds to a cached table. > > 3. Unifying the cached table and materialized views > It is true that the cached table and the materialized view are similar in > some ways. However, I think the materialized view is a more complex concept. > First, a materialized view requires some kind of refresh mechanism to > synchronize with the table. Secondly, the life cycle of a materialized view > is longer. The materialized view should be accessible even after the > application exits and should be accessible by another application, while > the cached table is only accessible in the application where it is created. > The cached table is introduced to avoid recomputation of an intermediate > table to support interactive programming in the Flink Table API. And I think > the materialized view needs more discussion and certainly deserves a whole > new FLIP. > > Please let me know your thoughts. > > Best, > Xuannan > > On Wed, Apr 29, 2020 at 3:53 PM Xuannan Su wrote: > >> Hi folks, >> >> FLIP-36 has been updated according to the discussion with Becket. In the >> meantime, any comments are very welcome. >> >> If there are no further comments, I would like to start the voting >> thread by tomorrow. >> >> Thanks, >> Xuannan >> >> >> On Sun, Apr 26, 2020 at 9:34 AM Xuannan Su wrote: >> >>> Hi Becket, >>> >>> You are right. It makes sense to treat the retry of Job 2 as an ordinary >>> job. And the config does introduce some unnecessary confusion. Thank you >>> for your comment. I will update the FLIP. >>> >>> Best, >>> Xuannan >>> >>> On Sat, Apr 25, 2020 at 7:44 AM Becket Qin wrote: >>> Hi Xuannan, Suppose a user submits Job 1, which generates a cached intermediate result, and later on submits Job 2, which should ideally use that intermediate result. In that case, if Job 2 fails due to the missing intermediate result, Job 2 should be retried with its full DAG. After that, when Job 2 runs, it will also re-generate the cache. 
However, once Job 2 has fallen back to the original DAG, should it just be treated as an ordinary job that follows the recovery strategy? Having a separate configuration seems a little confusing. In other words, re-generating the cache is just a byproduct of running the full DAG of Job 2, not the main purpose. It is just like when Job 1 runs to generate the cache: it does not have a separate retry config to make sure the cache is generated. If it fails, it just fails like an ordinary job. What do you think? Thanks, Jiangjie (Becket) Qin On Fri, Apr 24, 2020 at 5:00 PM Xuannan Su wrote: > Hi Becket, > > The intermediate result will indeed be automatically re-generated by > resubmitting the original DAG. And that job could fail as well. In that > case, we need to decide whether we should resubmit the original DAG to > re-generate the intermediate result or give up and throw an exception to > the user. And the config indicates how many resubmissions should happen > before giving up. > > Thanks, > Xuannan > > On Fri, Apr 24, 2020 at 4:19 PM Becket Qin wrote: > > > Hi Xuannan, > > > > I am not entirely sure if I understand the cases you mentioned. The > users > > > can use the
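For readers following the thread, here is a minimal sketch of the programming model under discussion. The cache() call is the API proposed by FLIP-36 and does not exist in released Flink; the table names ("Orders", "SinkA", "SinkB") and the surrounding 1.11-style setup are hypothetical, assumed only for illustration.
{code:java}
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class CacheSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().build());

        // "Orders", "SinkA" and "SinkB" are assumed to be registered in the catalog.
        Table aggregated = tEnv.from("Orders")
                .groupBy($("userId"))
                .select($("userId"), $("amount").sum().as("total"));

        // Proposed by FLIP-36 (not released API): mark the intermediate result for
        // caching; the cache is materialized when a job containing it executes.
        Table cached = aggregated.cache();

        // First job: computes the aggregation and generates the cache as a side effect.
        cached.executeInsert("SinkA");

        // Later job: the planner can replace the cached table with a source reading
        // the materialized intermediate result instead of recomputing it.
        cached.filter($("total").isGreater(100)).executeInsert("SinkB");
    }
}
{code}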
[jira] [Created] (FLINK-17805) InputProcessorUtil doesn't handle indexes for multiple inputs operators correctly
Piotr Nowojski created FLINK-17805: -- Summary: InputProcessorUtil doesn't handle indexes for multiple inputs operators correctly Key: FLINK-17805 URL: https://issues.apache.org/jira/browse/FLINK-17805 Project: Flink Issue Type: Bug Components: Runtime / Network Reporter: Piotr Nowojski Fix For: 1.11.0 This can cause an {{ArrayIndexOutOfBoundsException}} when input gates passed to {{InputProcessorUtil#createCheckpointedInputGatePair}} have lower IDs in the second input compared to input gates from the first one. {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.flink.streaming.runtime.io.CheckpointBarrierUnaligner$ThreadSafeUnaligner.notifyBufferReceived(CheckpointBarrierUnaligner.java:328) at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:218) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:637) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:615) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:603) at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:105) at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:110) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:136) at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:178) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:341) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:206) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:196) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:553) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:526) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:713) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:539) at java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17806) CollectSinkFunctionTest.testCheckpointedFunction fails with expected:<50> but was:<0>
Robert Metzger created FLINK-17806: -- Summary: CollectSinkFunctionTest.testCheckpointedFunction fails with expected:<50> but was:<0> Key: FLINK-17806 URL: https://issues.apache.org/jira/browse/FLINK-17806 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.12.0 Reporter: Robert Metzger CI: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1793&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=4ed44b66-cdd6-5dcf-5f6a-88b07dda665d {code} 2020-05-19T04:28:24.6400821Z [INFO] Running org.apache.flink.streaming.runtime.io.benchmark.StreamNetworkBroadcastThroughputBenchmarkTest 2020-05-19T04:28:25.7508457Z java.util.concurrent.TimeoutException 2020-05-19T04:28:25.7511469Z at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) 2020-05-19T04:28:25.7512331Z at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) 2020-05-19T04:28:25.7513238Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest.sendRequest(CollectSinkFunctionTest.java:391) 2020-05-19T04:28:25.7514280Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest.access$300(CollectSinkFunctionTest.java:60) 2020-05-19T04:28:25.7515499Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest$TestCollectRequestSender.sendRequest(CollectSinkFunctionTest.java:465) 2020-05-19T04:28:25.7516628Z at org.apache.flink.streaming.api.operators.collect.utils.TestCollectClient.run(TestCollectClient.java:77) 2020-05-19T04:28:25.7517231Z java.lang.InterruptedException 2020-05-19T04:28:25.7517907Z at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) 2020-05-19T04:28:25.7518816Z at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) 2020-05-19T04:28:25.7519745Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunction.invoke(CollectSinkFunction.java:246) 2020-05-19T04:28:25.7520773Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest$CheckpointedDataFeeder.run(CollectSinkFunctionTest.java:574) 2020-05-19T04:28:25.7600809Z [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 53.526 s <<< FAILURE! - in org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest 2020-05-19T04:28:25.7602305Z [ERROR] testCheckpointedFunction(org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest) Time elapsed: 26.262 s <<< FAILURE! 
2020-05-19T04:28:25.7603126Z java.lang.AssertionError: expected:<50> but was:<0> 2020-05-19T04:28:25.7603600Z at org.junit.Assert.fail(Assert.java:88) 2020-05-19T04:28:25.7604100Z at org.junit.Assert.failNotEquals(Assert.java:834) 2020-05-19T04:28:25.7604810Z at org.junit.Assert.assertEquals(Assert.java:645) 2020-05-19T04:28:25.7605373Z at org.junit.Assert.assertEquals(Assert.java:631) 2020-05-19T04:28:25.7606127Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest.assertResultsEqual(CollectSinkFunctionTest.java:430) 2020-05-19T04:28:25.7607201Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest.assertResultsEqualAfterSort(CollectSinkFunctionTest.java:441) 2020-05-19T04:28:25.7608296Z at org.apache.flink.streaming.api.operators.collect.CollectSinkFunctionTest.testCheckpointedFunction(CollectSinkFunctionTest.java:321) 2020-05-19T04:28:25.7609371Z at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2020-05-19T04:28:25.7610031Z at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2020-05-19T04:28:25.7610819Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2020-05-19T04:28:25.7611665Z at java.lang.reflect.Method.invoke(Method.java:498) 2020-05-19T04:28:25.7612348Z at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) 2020-05-19T04:28:25.7613094Z at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2020-05-19T04:28:25.7613785Z at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) 2020-05-19T04:28:25.7614493Z at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 2020-05-19T04:28:25.7615370Z at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 2020-05-19T04:28:25.7616067Z at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 2020-05-19T04:28:25.7616711Z at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) 2020-05-19T04:28:25.7617574Z at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2020-05-19T04:28:25.7618159Z at
[jira] [Created] (FLINK-17807) Fix the broken link "/zh/ops/memory/mem_detail.html" in documentation
Jark Wu created FLINK-17807: --- Summary: Fix the broken link "/zh/ops/memory/mem_detail.html" in documentation Key: FLINK-17807 URL: https://issues.apache.org/jira/browse/FLINK-17807 Project: Flink Issue Type: Bug Components: Documentation Reporter: Jark Wu Assignee: Xintong Song We may need to update {{mem_setup_tm.zh.md}} and {{mem_trouble.zh.md}} to resolve the remaining broken link: http://localhost:4000/zh/ops/memory/mem_detail.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17808) Rename checkpoint meta file to "_metadata" until it has completed writing
Yun Tang created FLINK-17808: Summary: Rename checkpoint meta file to "_metadata" until it has completed writing Key: FLINK-17808 URL: https://issues.apache.org/jira/browse/FLINK-17808 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.10.0 Reporter: Yun Tang Fix For: 1.12.0 In practice, some developers or customers use some strategy to find the most recent _metadata file as the checkpoint to recover from (e.g. as many proposals in FLINK-9043 suggest). However, the existence of a "_metadata" file does not mean the checkpoint has been completed, as the writing that creates the "_metadata" file could break on a force quit (e.g. yarn application -kill). We could create the checkpoint meta stream to write data to a file named "_metadata.inprogress" and rename it to "_metadata" once writing has completed. By doing so, we could ensure the "_metadata" file is never broken. -- This message was sent by Atlassian Jira (v8.3.4#803005)
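The write-then-rename pattern the ticket proposes is a common way to make file creation appear atomic to readers. A self-contained sketch of the idea (not Flink's actual checkpoint code; the directory and content are illustrative):
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AtomicMetadataWrite {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("chk-42");
        Files.createDirectories(dir);
        Path inProgress = dir.resolve("_metadata.inprogress");
        Path finalName = dir.resolve("_metadata");

        // Write everything under the in-progress name first. If the process is
        // force-killed here, no "_metadata" file exists, so a half-written
        // checkpoint can never be mistaken for a complete one.
        Files.write(inProgress, "serialized checkpoint metadata".getBytes(StandardCharsets.UTF_8));

        // Rename only after writing has fully completed. With ATOMIC_MOVE the
        // file appears under its final name in one step (on file systems that
        // support it), so "_metadata" is either absent or complete.
        Files.move(inProgress, finalName, StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}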
[jira] [Created] (FLINK-17809) BashJavaUtil script logic does not work for paths with spaces
Chesnay Schepler created FLINK-17809: Summary: BashJavaUtil script logic does not work for paths with spaces Key: FLINK-17809 URL: https://issues.apache.org/jira/browse/FLINK-17809 Project: Flink Issue Type: Bug Components: Deployment / Scripts Affects Versions: 1.10.0, 1.11.0 Reporter: Chesnay Schepler Assignee: Chesnay Schepler Fix For: 1.11.0, 1.10.2 Multiple paths aren't quoted (class path, conf_dir), resulting in errors if they contain spaces. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Need Help on Flink suitability to our usecase
Hi, I have the following use case to implement in my organization. Say there is a huge relational database (1,000 tables for each of our 30k customers) in our monolith setup. We want to reduce the load on the DB and prevent the applications from hitting it for the latest events, so an extract is done from the redo logs onto Kafka. We need to set up a streaming platform based on the table updates that happen (read from Kafka); we need to form events and send them to consumers. Each consumer may be interested in the same table but in different updates/columns according to their business needs, and the events are then delivered to their endpoint/Kinesis/SQS/a Kafka topic. So the case here is *1* table update : *m* events : *n* sinks. The expected peak load is easily 100k to a million table updates per second (all customers put together). The latency expected by most customers is less than a second, mostly 100-500ms. Is this use case suited for Flink? I went through the Flink book and documentation. These are the questions I have: 1) If we have a situation like this (*1* table update : *m* events : *n* sinks), is it better to write our own microservice or is it better to implement it through Flink? 1 a) How does checkpointing happen if we have *1* input : *n* output situations? 1 b) There are no heavy transformations; the most we might do is check that the required columns are present in the DB updates and decide whether to create an event. So there is an alternative thought to write a service in Node, since it is more I/O and less processing. 2) I see that we write a job, it is deployed, and Flink takes care of the rest in handling parallelism, latency and throughput. But what I need is to write a generic framework so that we can handle any table structure; we should not end up writing one job driver for each case. There are at least 200 types of events in the existing monolith system which might move to this new system once built. 3) How do we keep a Flink cluster highly available? From the book, I get that internal task-level failures are handled gracefully in Flink. But what if the Flink cluster goes down, how do we make sure it is HA? I had earlier worked with Spark and we had issues managing it. (It was not much of a problem there, since the latency requirement was 15 min and we could ramp another cluster up within that time.) These are absolutely real-time cases and we cannot miss even one message/event. There are also thoughts on whether to use Kafka Streams/Apache Storm for the same. [They are being investigated by a different set of folks.] Thanks, Prasanna.
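On question 1, the 1-update : m-events : n-sinks shape maps naturally onto Flink's side outputs: one ProcessFunction can emit any number of events per input record and route them to separate streams, each with its own sink, while checkpoint barriers flow through the whole DAG as one job (answering 1a: the snapshot covers the source offsets and all sinks together). A hedged sketch under stated assumptions: the TableUpdate type, the event formatting, and the print sinks standing in for Kafka/Kinesis/SQS are all hypothetical.
{code:java}
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class FanOutSketch {

    // Hypothetical, simplified stand-in for a parsed redo-log record.
    public static class TableUpdate {
        public String table = "customer_orders";
        public String payload = "...";
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In the real setup this would be a Kafka source over the redo-log topic.
        DataStream<TableUpdate> updates = env.fromElements(new TableUpdate());

        // One tag per additional consumer; the main output serves the first one.
        final OutputTag<String> consumerB = new OutputTag<String>("consumer-b") {};

        // 1 update -> m events: each consumer gets its own projection/filtering.
        SingleOutputStreamOperator<String> eventsForA = updates.process(
                new ProcessFunction<TableUpdate, String>() {
                    @Override
                    public void processElement(TableUpdate u, Context ctx, Collector<String> out) {
                        out.collect("A-event:" + u.table);           // consumer A's columns
                        ctx.output(consumerB, "B-event:" + u.table); // consumer B's columns
                    }
                });

        // m events -> n sinks: Kafka/Kinesis/SQS sinks in practice; prints here.
        eventsForA.addSink(new PrintSinkFunction<>());
        eventsForA.getSideOutput(consumerB).addSink(new PrintSinkFunction<>());

        env.execute("fan-out sketch");
    }
}
{code}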
[jira] [Created] (FLINK-17810) Add document for K8s application mode
Yang Wang created FLINK-17810: - Summary: Add document for K8s application mode Key: FLINK-17810 URL: https://issues.apache.org/jira/browse/FLINK-17810 Project: Flink Issue Type: Sub-task Reporter: Yang Wang Add documentation on how to start/stop a K8s application cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Google Season of Document
Dear aljoscha and sjwiesman, I am Hossam Elsafty, a graduate of the Faculty of Engineering, Computer and System Department. As one of my academic courses, I have finished Technical Reports Writing. Regarding my previous experience, I have done two internships as a Software Engineer, including one at Valeo Egypt and the other at the Saudi Arabia Ministry of National Guard. Also, I am certified by Udacity for finishing the Machine Learning Engineer Nanodegree program. Currently, I have been working as a Big Data Engineer for 1 year at Ejada System in Saudi Arabia, where I have worked on many projects for the Saudi Ministry of Finance using the Cloudera Hadoop ecosystem. I am sending this mail to ask if I can apply to your organization in Google Season of Docs. I am looking forward to your reply. Regards, Hossam Fawzy Elsafty Computer and System Department 2019 Faculty of Engineering, Alexandria University
Discussion on Project Idea for Session Of docs 2020
Hello Aljoscha, Sjwiesman, I am working on Big Data technologies and have hands-on experience with Flink, Spark, and Kafka. These are some tasks done by me: 1. I did a POC where I created a Docker image of a Flink job and ran it on a K8S cluster on my local machine. Attaching my POC project: https://github.com/sanghisha145/flink_on_k8s 2. I am also working on writing a FlinkKafkaConsumer job with an autoscaling feature to reduce consumer lag. I really find the project "Restructure the Table API & SQL Documentation" interesting and feel that I can contribute whole-heartedly to documentation on it. Let me know where I can start? PS: I have not written any open documentation but have written many documents for my organization (created many technical articles on our Confluence pages). Thanks Divya
Re: [NOTICE] Azure Pipelines Status
Microsoft has not increased our capacity yet (even though it was promised to me yesterday again). I have now merged a hotfix disabling the e2e test execution on pull requests to have enough capacity on master. Please run e2e tests using your private Azure accounts. Thanks for your understanding! Best, Robert On Thu, May 14, 2020 at 11:23 AM Robert Metzger wrote: > Roughly speaking, I see the following problematic areas (I had initially > tried running the E2E tests on those machines): > > a) e2e tests starting Docker images (including Kubernetes). Since the > tests on the Ali infra are running in Docker themselves, we need to adjust > the test scripts (which is not trivial, because both containers need to be > in the same network, and the volume mount paths are different). > > b) tests that modify the underlying file system: common_kubernetes.sh > installs stuff in "/usr/local/bin/". (Now that I think about it, it's not a > problem in the docker environment.) > > c) Tests that don't clean up properly when failing. IIRC I saw docker > containers left over by test_streaming_kinesis.sh when I was trying to run the > E2E tests on the Ali machines. > > And then there are pull requests that propose changes to the e2e scripts that > mess something up :) > We certainly need to isolate the e2e test execution somehow. Maybe we > could launch VMs on the Ali machines for running the E2Es? (Using Vagrant.) > > If Microsoft is not going to provide us with more test capacity, I will > evaluate other options for the E2E tests. > > > On Thu, May 14, 2020 at 10:36 AM Till Rohrmann > wrote: > >> Thanks for the update Robert. >> >> One idea to make the e2e tests also run on the Alibaba infrastructure would be >> to >> ensure that e2e tests clean up after they have run. Do we know which e2e >> tests don't do this properly? >> >> Cheers, >> Till >> >> On Thu, May 14, 2020 at 8:38 AM Robert Metzger >> wrote: >> >> > Hi all, >> > >> > tl;dr: I will have to cancel some E2E test executions of pull requests >> > because we have reached the capacity limit of Flink's Azure Pipelines >> > account. >> > >> > Long version: We have two types of agent pools in Azure Pipelines: >> > Microsoft-hosted VMs and an Alibaba-hosted Docker environment. >> > In the Microsoft VMs, we are running the E2E tests, because there we have an >> > environment that will always be destroyed after each execution (and the >> E2E >> > tests often leave dangling docker containers, processes etc.; and they >> > modify files in system directories). >> > In the Alibaba-hosted Docker environment, we are compiling and running >> the >> > regular Maven tests. >> > >> > We only have 10 Microsoft-hosted VMs available, and each E2E execution >> > takes around 3.5 hours. That means we have a capacity of ~70 E2E >> > runs a day. >> > On Tuesday, we had 110 builds; on Wednesday, 98 builds. >> > Because of this, I will (manually) cancel some E2E test executions for >> pull >> > requests. If I see that a PR is explicitly changing something on the E2E >> tests, >> > I will keep it. If I see that a PR is a docs change, has other test >> > failures etc., I will cancel the E2E execution. >> > >> > If you want to verify that the E2E tests are passing for your own >> changes, >> > you can set up Azure Pipelines for your GitHub account; it's free and >> works >> > quite well. 
Here's a tutorial: >> > >> > >> https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-Tutorial:SettingupAzurePipelinesforaforkoftheFlinkrepository >> > >> > What can we do to avoid this situation in the future? >> > Sadly, Microsoft does not allow buying additional processing slots for >> open >> > source projects [1]. However, I'm in touch with a product manager at >> > Microsoft who promised me (yesterday) to increase the limit for us. >> > >> > In the Alibaba environment, we have 80 slots available and usually no >> > capacity constraints. This means we don't need to make compromises >> there. >> > >> > Sorry for this inconvenience. >> > >> > Best, >> > Robert >> > >> > PS: I'm considering keeping this thread as a permanent "status update" >> > thread for Azure Pipelines >> > >> > [1] >> > >> > >> https://developercommunity.visualstudio.com/content/problem/1028884/additionally-purchased-microsoft-hosted-build-agen.html >> > >> >
[jira] [Created] (FLINK-17811) Update docker hub Flink page
Andrey Zagrebin created FLINK-17811: --- Summary: Update docker hub Flink page Key: FLINK-17811 URL: https://issues.apache.org/jira/browse/FLINK-17811 Project: Flink Issue Type: Task Components: Deployment / Docker Reporter: Andrey Zagrebin In FLINK-17161, we refactored the Flink docker image docs. We should also update the [docker hub Flink image description|https://hub.docker.com/_/flink?tab=description] and possibly link the related Flink docs about docker integration there. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17812) Bundled reporters in plugins/ directory
Chesnay Schepler created FLINK-17812: Summary: Bundled reporters in plugins/ directory Key: FLINK-17812 URL: https://issues.apache.org/jira/browse/FLINK-17812 Project: Flink Issue Type: Improvement Components: Build System, Runtime / Metrics Reporter: Chesnay Schepler Assignee: Chesnay Schepler Fix For: 1.11.0 In FLINK-16963 we converted all metric reporters into plugins. Along with that we removed all relocations, as they are now unnecessary. However, users have so far enabled reporters by moving them to /lib, which could now result in dependency clashes. To prevent this, and to ease the overall setup, we could bundle all reporters in the plugins directory of the distribution instead of opt. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17813) Manually test unaligned checkpoints on a cluster
Piotr Nowojski created FLINK-17813: -- Summary: Manually test unaligned checkpoints on a cluster Key: FLINK-17813 URL: https://issues.apache.org/jira/browse/FLINK-17813 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing, Runtime / Network Reporter: Piotr Nowojski Assignee: Roman Khachatryan Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Google Season of Document
Hi Hossam, Please apply! We look forward to reviewing your application. Let us know if you have any questions. Seth On Tue, May 19, 2020 at 8:09 AM Hossam Elsafty wrote: > Dear aljoscha and sjwiesman, > > I am Hossam Elsafty, a graduate of the Faculty of Engineering, Computer > and System Department. As one of my academic courses, I have finished > Technical Reports Writing. > > Regarding my previous experience, I have done two internships as a > Software Engineer, including one at Valeo Egypt and the other at the Saudi > Arabia Ministry of National Guard. Also, I am certified by Udacity for > finishing the Machine Learning Engineer Nanodegree program. > > Currently, I have been working as a Big Data Engineer for 1 year at Ejada System > in Saudi Arabia, where I have worked on many projects for the Saudi Ministry of > Finance using the Cloudera Hadoop ecosystem. > > I am sending this mail to ask if I can apply to your organization in > Google Season of Docs. > I am looking forward to your reply. > > Regards, > Hossam Fawzy Elsafty > > Computer and System Department 2019 > Faculty of Engineering, Alexandria University > >
Re: Discussion on Project Idea for Session Of docs 2020
Hi Divya, Thrilled to see your interest in the project. We are looking to work with someone to expand Flink SQL's reach. What we proposed on the blog are our initial ideas, but we are open to any suggestions or modifications to the proposal if you have ideas about how it could be improved. I am happy to answer any specific questions, but otherwise, the first step is to apply via the GSoD website[1]. Seth [1] https://developers.google.com/season-of-docs/docs/tech-writer-guide On Tue, May 19, 2020 at 8:09 AM Divya Sanghi wrote: > Hello Aljoscha, Sjwiesman, > > I am working on Big Data technologies and have hands-on experience with > Flink, Spark, and Kafka. > > These are some tasks done by me: > 1. I did a POC where I created a Docker image of a Flink job and ran it on a > K8S cluster on my local machine. > Attaching my POC project: https://github.com/sanghisha145/flink_on_k8s > > 2. I am also working on writing a FlinkKafkaConsumer job with an autoscaling feature > to reduce consumer lag. > > I really find the project "Restructure the Table API & SQL Documentation" > interesting and feel that I can contribute whole-heartedly to > documentation on it. > > Let me know where I can start? > > PS: I have not written any open documentation but have written many documents for my > organization (created many technical articles on our Confluence pages). > > Thanks > Divya >
Re: Interested In Google Season of Docs!
Hi Roopal, *Restructure the Table API & SQL Documentation* Right now the Table / SQL documentation is very technical and dry to read. It is more like a technical specification than documentation that a new user could read. This project is to take the existing content and rewrite/reorganize it to make what the community already has more approachable. *Extend the Table API & SQL Documentation* This is to add brand-new content. This could be tutorials and walkthroughs to help onboard new users, or expansions of an under-documented section such as Hive interop. These projects admittedly sound very similar, and I will make a note to better explain them in future applications if the community applies next year. Please let me know if you have any more questions, Seth On Mon, May 18, 2020 at 2:08 AM Roopal Jain wrote: > Hi Marta, > > Thank you for responding to my email. I went through the link you attached. > I have a few questions: > 1. As per https://flink.apache.org/news/2020/05/04/season-of-docs.html , > two ideas are mentioned: one is restructuring the Table API & SQL > documentation and the other is extending the same. I want to know how these > two are different. Can one person do both? > 2. You mentioned that FLIP-60 is very broad; should I pick one > section from it, for example the Table API, and start contributing content from > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/ in > the order specified in FLIP-60? Can you please guide me more on how I should > proceed (choosing a specific area)? > > > Thanks & Regards, > Roopal Jain > > > On Sat, 16 May 2020 at 17:39, Marta Paes Moreira > wrote: > > > Hi, Roopal. > > > > Thanks for reaching out; we're glad to see that you're interested in > > giving the Flink docs a try! > > > > To participate in Google Season of Docs (GSoD), you'll have to submit an > > application once the application period opens (June 9th). Because FLIP-60 > > is very broad, the first step would be to think about what parts of it > > you'd like to focus on — or, what you'd like to have as a project > proposal > > [1]. > > > > You can find more information about the whole process of participating in > > GSoD in [2]. > > > > Let me know if I can help you with anything else or if you need support > > with choosing your focus areas. > > > > Marta > > > > [1] > > > https://developers.google.com/season-of-docs/docs/tech-writer-application-hints#project_proposal > > [2] > > > https://developers.google.com/season-of-docs/docs/tech-writer-application-hints > > > > On Sat, May 16, 2020 at 8:32 AM Roopal Jain wrote: > > > >> Hello Flink Dev Community! > >> > >> I am interested in participating in Google Season of Docs for Apache > >> Flink. > >> I went through the FLIP-60 detailed proposal and thought this is something > >> I could do well. I am currently working as a software engineer and have a > >> B.E. in Computer Engineering from one of India's reputed engineering > >> colleges. I have prior open-source experience, having mentored for Google > >> Summer of Code and Google Code-In. > >> I have prior work experience with Apache Spark and a good grasp of SQL, > >> Java, > >> and Python. > >> Please guide me more on how to get started? > >> > >> Thanks & Regards, > >> Roopal Jain > >> > > >
[jira] [Created] (FLINK-17814) Translate native kubernetes document to Chinese
Yang Wang created FLINK-17814: - Summary: Translate native kubernetes document to Chinese Key: FLINK-17814 URL: https://issues.apache.org/jira/browse/FLINK-17814 Project: Flink Issue Type: Task Reporter: Yang Wang [https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/native_kubernetes.html] Translate the native kubernetes document to Chinese. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17815) Change KafkaConnector to give per-partition metric group to WatermarkGenerator
Aljoscha Krettek created FLINK-17815: Summary: Change KafkaConnector to give per-partition metric group to WatermarkGenerator Key: FLINK-17815 URL: https://issues.apache.org/jira/browse/FLINK-17815 Project: Flink Issue Type: Sub-task Components: Connectors / Kafka Reporter: Aljoscha Krettek We currently give a reference to the general consumer {{MetricGroup}}, which means that all {{WatermarkGenerators}} would write to the same metric group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
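For context, a hedged sketch of the per-partition scoping the ticket asks for. Only {{MetricGroup#addGroup(String key, String value)}} is existing Flink API; the TopicPartition type and helper method below are hypothetical stand-ins, not the connector's actual code.
{code:java}
import org.apache.flink.metrics.MetricGroup;

class PerPartitionMetrics {

    // Hypothetical stand-in for the connector's internal partition handle.
    static class TopicPartition {
        final String topic;
        final int partition;
        TopicPartition(String topic, int partition) {
            this.topic = topic;
            this.partition = partition;
        }
    }

    static MetricGroup groupFor(MetricGroup consumerGroup, TopicPartition p) {
        // Each WatermarkGenerator would register its metrics against this
        // subgroup, so values from different partitions no longer collide
        // in the one shared consumer group.
        return consumerGroup
                .addGroup("topic", p.topic)
                .addGroup("partition", String.valueOf(p.partition));
    }
}
{code}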
[jira] [Created] (FLINK-17816) Change Latency Marker to work with "scheduleAtFixedDelay" instead of "scheduleAtFixedRate"
Stephan Ewen created FLINK-17816: Summary: Change Latency Marker to work with "scheduleAtFixedDelay" instead of "scheduleAtFixedRate" Key: FLINK-17816 URL: https://issues.apache.org/jira/browse/FLINK-17816 Project: Flink Issue Type: Bug Components: Runtime / Task Reporter: Stephan Ewen Latency Markers and other periodic timers are scheduled with {{scheduleAtFixedRate}}. That means every X time the callable is called. If it blocks (backpressure) is can be called immediately again. I would suggest to switch this to {{scheduleAtFixedDelay}} to avoid calling for a lot of latency marker injections when there is no way to actually execute the injection call. -- This message was sent by Atlassian Jira (v8.3.4#803005)
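The difference is easy to demonstrate with the JDK scheduler, where the fixed-delay variant is called scheduleWithFixedDelay (Flink's internal scheduling may differ). At a fixed rate, a run that blocks longer than the period causes the missed executions to fire back-to-back as soon as the task is free again; with a fixed delay, the next run always waits the full delay after the previous one finishes:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedRateVsFixedDelay {
    public static void main(String[] args) throws InterruptedException {
        // Fixed rate: period 100 ms, but each run blocks ~350 ms, so missed
        // ticks pile up and then fire in rapid succession (the backlog the
        // ticket wants to avoid for latency markers under backpressure).
        ScheduledExecutorService rate = Executors.newSingleThreadScheduledExecutor();
        rate.scheduleAtFixedRate(() -> {
            System.out.println("rate tick at " + System.currentTimeMillis());
            sleep(350); // simulate a blocked injection call
        }, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(2000);
        rate.shutdownNow();

        // Fixed delay: the next run starts 100 ms *after* the previous one
        // finished, so runs are spaced ~450 ms apart and no backlog forms.
        ScheduledExecutorService delay = Executors.newSingleThreadScheduledExecutor();
        delay.scheduleWithFixedDelay(() -> {
            System.out.println("delay tick at " + System.currentTimeMillis());
            sleep(350);
        }, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(2000);
        delay.shutdownNow();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}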
[jira] [Created] (FLINK-17817) CollectResultFetcher fails with EOFException in AggregateReduceGroupingITCase
Robert Metzger created FLINK-17817: -- Summary: CollectResultFetcher fails with EOFException in AggregateReduceGroupingITCase Key: FLINK-17817 URL: https://issues.apache.org/jira/browse/FLINK-17817 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.11.0 Reporter: Robert Metzger CI: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1826&view=logs&j=e25d5e7e-2a9c-5589-4940-0b638d75a414&t=f83cd372-208c-5ec4-12a8-337462457129 {code} 2020-05-19T10:34:18.3224679Z [ERROR] testSingleAggOnTable_SortAgg(org.apache.flink.table.planner.runtime.batch.sql.agg.AggregateReduceGroupingITCase) Time elapsed: 7.537 s <<< ERROR! 2020-05-19T10:34:18.3225273Z java.lang.RuntimeException: Failed to fetch next result 2020-05-19T10:34:18.3227634Z at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:92) 2020-05-19T10:34:18.3228518Z at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:63) 2020-05-19T10:34:18.3229170Z at org.apache.flink.shaded.guava18.com.google.common.collect.Iterators.addAll(Iterators.java:361) 2020-05-19T10:34:18.3229863Z at org.apache.flink.shaded.guava18.com.google.common.collect.Lists.newArrayList(Lists.java:160) 2020-05-19T10:34:18.3230586Z at org.apache.flink.table.planner.runtime.utils.BatchTestBase.executeQuery(BatchTestBase.scala:300) 2020-05-19T10:34:18.3231303Z at org.apache.flink.table.planner.runtime.utils.BatchTestBase.check(BatchTestBase.scala:141) 2020-05-19T10:34:18.3231996Z at org.apache.flink.table.planner.runtime.utils.BatchTestBase.checkResult(BatchTestBase.scala:107) 2020-05-19T10:34:18.3232847Z at org.apache.flink.table.planner.runtime.batch.sql.agg.AggregateReduceGroupingITCase.testSingleAggOnTable(AggregateReduceGroupingITCase.scala:176) 2020-05-19T10:34:18.3233694Z at org.apache.flink.table.planner.runtime.batch.sql.agg.AggregateReduceGroupingITCase.testSingleAggOnTable_SortAgg(AggregateReduceGroupingITCase.scala:122) 2020-05-19T10:34:18.3234461Z at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2020-05-19T10:34:18.3234983Z at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2020-05-19T10:34:18.3235632Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2020-05-19T10:34:18.3236615Z at java.lang.reflect.Method.invoke(Method.java:498) 2020-05-19T10:34:18.3237256Z at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) 2020-05-19T10:34:18.3237965Z at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2020-05-19T10:34:18.3238750Z at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) 2020-05-19T10:34:18.3239314Z at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 2020-05-19T10:34:18.3239838Z at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 2020-05-19T10:34:18.3240362Z at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 2020-05-19T10:34:18.3240803Z at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) 2020-05-19T10:34:18.3243624Z at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2020-05-19T10:34:18.3244531Z at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 2020-05-19T10:34:18.3245325Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) 2020-05-19T10:34:18.3246086Z at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) 2020-05-19T10:34:18.3246765Z at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 2020-05-19T10:34:18.3247390Z at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 2020-05-19T10:34:18.3248012Z at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 2020-05-19T10:34:18.3248779Z at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 2020-05-19T10:34:18.3249417Z at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 2020-05-19T10:34:18.3250357Z at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) 2020-05-19T10:34:18.3251021Z at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) 2020-05-19T10:34:18.3251597Z at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2020-05-19T10:34:18.3252141Z at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 2020-05-19T10:34:18.3252798Z at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) 2020-05-19T10:34:18.3253527Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) 202
Re: [NOTICE] Azure Pipelines Status
Microsoft has now doubled our CI capacity (to 20 concurrent VMs for executing e2e tests). If the e2e test execution has normalized by tomorrow, I will revert the hotfix, enabling e2e tests on PRs again. Sorry for the back and forth. On Tue, May 19, 2020 at 3:11 PM Robert Metzger wrote: > Microsoft has not increased our capacity yet (even though it was promised > to me yesterday again). > > I have now merged a hotfix disabling the e2e test execution on pull > requests to have enough capacity on master. > Please run e2e tests using your private Azure accounts. Thanks for your > understanding! > > Best, > Robert > > > On Thu, May 14, 2020 at 11:23 AM Robert Metzger > wrote: > >> Roughly speaking, I see the following problematic areas (I had initially >> tried running the E2E tests on those machines): >> >> a) e2e tests starting Docker images (including Kubernetes). Since the >> tests on the Ali infra are running in Docker themselves, we need to adjust >> the test scripts (which is not trivial, because both containers need to be >> in the same network, and the volume mount paths are different). >> >> b) tests that modify the underlying file system: common_kubernetes.sh >> installs stuff in "/usr/local/bin/". (Now that I think about it, it's not a >> problem in the docker environment.) >> >> c) Tests that don't clean up properly when failing. IIRC I saw docker >> containers left over by test_streaming_kinesis.sh when I was trying to run the >> E2E tests on the Ali machines. >> >> And then there are pull requests that propose changes to the e2e scripts that >> mess something up :) >> We certainly need to isolate the e2e test execution somehow. Maybe we >> could launch VMs on the Ali machines for running the E2Es? (Using Vagrant.) >> >> If Microsoft is not going to provide us with more test capacity, I will >> evaluate other options for the E2E tests. >> >> >> On Thu, May 14, 2020 at 10:36 AM Till Rohrmann >> wrote: >> >>> Thanks for the update Robert. >>> >>> One idea to make the e2e tests also run on the Alibaba infrastructure would be >>> to >>> ensure that e2e tests clean up after they have run. Do we know which e2e >>> tests don't do this properly? >>> >>> Cheers, >>> Till >>> >>> On Thu, May 14, 2020 at 8:38 AM Robert Metzger >>> wrote: >>> >>> > Hi all, >>> > >>> > tl;dr: I will have to cancel some E2E test executions of pull requests >>> > because we have reached the capacity limit of Flink's Azure Pipelines >>> > account. >>> > >>> > Long version: We have two types of agent pools in Azure Pipelines: >>> > Microsoft-hosted VMs and an Alibaba-hosted Docker environment. >>> > In the Microsoft VMs, we are running the E2E tests, because there we have an >>> > environment that will always be destroyed after each execution (and >>> the E2E >>> > tests often leave dangling docker containers, processes etc.; and they >>> > modify files in system directories). >>> > In the Alibaba-hosted Docker environment, we are compiling and running >>> the >>> > regular Maven tests. >>> > >>> > We only have 10 Microsoft-hosted VMs available, and each E2E execution >>> > takes around 3.5 hours. That means we have a capacity of ~70 E2E >>> > runs a day. >>> > On Tuesday, we had 110 builds; on Wednesday, 98 builds. >>> > Because of this, I will (manually) cancel some E2E test executions for >>> pull >>> > requests. If I see that a PR is explicitly changing something on the E2E >>> tests, >>> > I will keep it. If I see that a PR is a docs change, has other test >>> > failures etc., I will cancel the E2E execution. 
>>> > >>> > If you want to verify that the E2E tests are passing for your own >>> changes, >>> > you can set up Azure Pipelines for your GitHub account; it's free and >>> works >>> > quite well. Here's a tutorial: >>> > >>> > >>> https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-Tutorial:SettingupAzurePipelinesforaforkoftheFlinkrepository >>> > >>> > What can we do to avoid this situation in the future? >>> > Sadly, Microsoft does not allow buying additional processing slots for >>> open >>> > source projects [1]. However, I'm in touch with a product manager at >>> > Microsoft who promised me (yesterday) to increase the limit for us. >>> > >>> > In the Alibaba environment, we have 80 slots available and usually no >>> > capacity constraints. This means we don't need to make compromises >>> there. >>> > >>> > Sorry for this inconvenience. >>> > >>> > Best, >>> > Robert >>> > >>> > PS: I'm considering keeping this thread as a permanent "status update" >>> > thread for Azure Pipelines >>> > >>> > [1] >>> > >>> > >>> https://developercommunity.visualstudio.com/content/problem/1028884/additionally-purchased-microsoft-hosted-build-agen.html >>> > >>> >>
[jira] [Created] (FLINK-17818) CSV Reader with Pojo Type and no field names fails
Danish Amjad created FLINK-17818: Summary: CSV Reader with Pojo Type and no field names fails Key: FLINK-17818 URL: https://issues.apache.org/jira/browse/FLINK-17818 Project: Flink Issue Type: Bug Components: API / DataSet Affects Versions: 1.10.1 Reporter: Danish Amjad When a file is read with a CsvReader and a POJO type is specified but the field names are not, the output is obviously not correct. The API does not throw any error despite a null check inside the API, i.e. {code} Preconditions.checkNotNull(pojoFields, "POJO fields must be specified (not null) if output type is a POJO."); {code} The *root cause* of the problem is that the _fieldNames_ argument is a varargs parameter, and the variable is 'empty but not null' when not given any value. So the check passes without failing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
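A self-contained illustration of the varargs pitfall the reporter describes; readCsvAsPojo is a hypothetical stand-in for the real API, and the check mirrors the quoted Preconditions call:
{code:java}
import java.util.Objects;

public class VarargsPitfall {
    // Stand-in for an API that validates its varargs parameter with a
    // null check, like the call quoted in the ticket.
    static void readCsvAsPojo(Class<?> pojoType, String... pojoFields) {
        Objects.requireNonNull(pojoFields,
                "POJO fields must be specified (not null) if output type is a POJO.");
        System.out.println("fields given: " + pojoFields.length);
    }

    public static void main(String[] args) {
        // Caller forgets the field names. Java passes an *empty array*, not
        // null, so the null check passes and no error is raised, which is
        // exactly the silent failure reported here.
        readCsvAsPojo(String.class); // prints "fields given: 0"

        // Only an explicit null argument would trip the check:
        // readCsvAsPojo(String.class, (String[]) null); // throws NPE
    }
}
{code}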
[jira] [Created] (FLINK-17819) Yarn error unhelpful when forgetting HADOOP_CLASSPATH
Arvid Heise created FLINK-17819: --- Summary: Yarn error unhelpful when forgetting HADOOP_CLASSPATH Key: FLINK-17819 URL: https://issues.apache.org/jira/browse/FLINK-17819 Project: Flink Issue Type: Improvement Components: Deployment / YARN Reporter: Arvid Heise When running {code:bash} flink run -m yarn-cluster -yjm 1768 -ytm 50072 -ys 32 ... {code} without exporting HADOOP_CLASSPATH, we get the unhelpful message {noformat} Could not build the program from JAR file: JAR file does not exist: -yjm {noformat} I'd expect something like {noformat} yarn-cluster can only be used with exported HADOOP_CLASSPATH, see for more information{noformat} I suggest loading a stub for Yarn cluster deployment if the actual implementation fails to load, which prints this error when used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
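A minimal sketch of the suggested stub idea, under stated assumptions: the loader class, interface, and message strings are hypothetical (only the probed Hadoop class name is a real class). If the real implementation cannot be loaded, a placeholder is returned whose only job is to fail with an actionable message:
{code:java}
public class YarnDeploymentLoader {

    interface ClusterDeployment { void deploy(); }

    static ClusterDeployment load() {
        try {
            // Probe for a class that is only present when HADOOP_CLASSPATH is set.
            Class.forName("org.apache.hadoop.yarn.client.api.YarnClient");
            // ... construct the real YARN deployment here ...
            return () -> System.out.println("deploying to YARN");
        } catch (ClassNotFoundException e) {
            // Stub: used only when the real implementation cannot be loaded,
            // so the user sees the actual cause instead of a parsing error.
            return () -> {
                throw new IllegalStateException(
                        "yarn-cluster can only be used with an exported HADOOP_CLASSPATH; "
                                + "see the Flink YARN documentation for details.");
            };
        }
    }

    public static void main(String[] args) {
        load().deploy();
    }
}
{code}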
[jira] [Created] (FLINK-17820) Memory threshold is ignored for channel state
Roman Khachatryan created FLINK-17820: - Summary: Memory threshold is ignored for channel state Key: FLINK-17820 URL: https://issues.apache.org/jira/browse/FLINK-17820 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing, Runtime / Task Affects Versions: 1.11.0 Reporter: Roman Khachatryan Assignee: Roman Khachatryan Fix For: 1.11.0 Config parameter state.backend.fs.memory-threshold is ignored for channel state, causing each subtask to have a file per checkpoint, regardless of the size of the channel state (of this subtask). This also causes slow cleanup and delays the next checkpoint. The problem is that {{ChannelStateCheckpointWriter.finishWriteAndResult}} calls flush(), which actually flushes the data to disk. From the FSDataOutputStream.flush Javadoc: "A completed flush does not mean that the data is necessarily persistent. Data persistence can only be assumed after calls to close() or sync()." Possible solutions: 1. do not flush in {{ChannelStateCheckpointWriter.finishWriteAndResult}} (which can lead to data loss in a wrapping stream); 2. change the {{FsCheckpointStateOutputStream.flush}} behavior; 3. wrap {{FsCheckpointStateOutputStream}} to prevent flush. -- This message was sent by Atlassian Jira (v8.3.4#803005)
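As a sketch of solution 3, a wrapper can suppress flush() so small channel state stays in memory until the threshold decision, deferring persistence to close(). This is a simplified stand-in, not Flink's FsCheckpointStateOutputStream:
{code:java}
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class NonFlushingStream extends FilterOutputStream {

    public NonFlushingStream(OutputStream delegate) {
        super(delegate);
    }

    @Override
    public void flush() {
        // Intentionally a no-op: flushing here would force small channel
        // state onto disk and create a file per subtask per checkpoint.
        // Per the FSDataOutputStream contract quoted above, persistence is
        // only guaranteed by close()/sync() anyway.
    }

    @Override
    public void close() throws IOException {
        out.flush(); // persist exactly once, when the state handle is finalized
        out.close();
    }
}
{code}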
[jira] [Created] (FLINK-17821) Kafka010TableITCase>KafkaTableTestBase.testKafkaSourceSink failed on AZP
Zhu Zhu created FLINK-17821: --- Summary: Kafka010TableITCase>KafkaTableTestBase.testKafkaSourceSink failed on AZP Key: FLINK-17821 URL: https://issues.apache.org/jira/browse/FLINK-17821 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.12.0 Reporter: Zhu Zhu https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1871&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=34f486e1-e1e4-5dd2-9c06-bfdd9b9c74a8&l=12032 2020-05-19T16:29:40.7239430Z Test testKafkaSourceSink[legacy = false, topicId = 1](org.apache.flink.streaming.connectors.kafka.table.Kafka010TableITCase) failed with: 2020-05-19T16:29:40.7240291Z java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 2020-05-19T16:29:40.7241033Z at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 2020-05-19T16:29:40.7241542Z at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) 2020-05-19T16:29:40.7242127Z at org.apache.flink.table.planner.runtime.utils.TableEnvUtil$.execInsertSqlAndWaitResult(TableEnvUtil.scala:31) 2020-05-19T16:29:40.7242729Z at org.apache.flink.table.planner.runtime.utils.TableEnvUtil.execInsertSqlAndWaitResult(TableEnvUtil.scala) 2020-05-19T16:29:40.7243239Z at org.apache.flink.streaming.connectors.kafka.table.KafkaTableTestBase.testKafkaSourceSink(KafkaTableTestBase.java:145) 2020-05-19T16:29:40.7243691Z at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2020-05-19T16:29:40.7244273Z at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2020-05-19T16:29:40.7244729Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2020-05-19T16:29:40.7245117Z at java.lang.reflect.Method.invoke(Method.java:498) 2020-05-19T16:29:40.7245515Z at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) 2020-05-19T16:29:40.7245956Z at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2020-05-19T16:29:40.7246419Z at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) 2020-05-19T16:29:40.7246870Z at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 2020-05-19T16:29:40.7247287Z at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) 2020-05-19T16:29:40.7251320Z at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2020-05-19T16:29:40.7251833Z at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 2020-05-19T16:29:40.7252251Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) 2020-05-19T16:29:40.7252716Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) 2020-05-19T16:29:40.7253117Z at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 2020-05-19T16:29:40.7253502Z at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 2020-05-19T16:29:40.7254041Z at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 2020-05-19T16:29:40.7254528Z at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 2020-05-19T16:29:40.7255500Z at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 2020-05-19T16:29:40.7256064Z at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 2020-05-19T16:29:40.7256438Z at org.junit.runners.Suite.runChild(Suite.java:128) 2020-05-19T16:29:40.7256758Z at org.junit.runners.Suite.runChild(Suite.java:27) 2020-05-19T16:29:40.7257118Z at 
org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 2020-05-19T16:29:40.7257486Z at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 2020-05-19T16:29:40.7257885Z at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 2020-05-19T16:29:40.7258389Z at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 2020-05-19T16:29:40.7258821Z at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 2020-05-19T16:29:40.7259219Z at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 2020-05-19T16:29:40.7259664Z at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 2020-05-19T16:29:40.7260098Z at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) 2020-05-19T16:29:40.7260635Z at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2020-05-19T16:29:40.7261065Z at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 2020-05-19T16:29:40.7261467Z at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) 2020-05-19T16:29:40.7261952Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:27
[jira] [Created] (FLINK-17822) Flink CLI end-to-end test failed with "JavaGcCleanerWrapper$PendingCleanersRunner cannot access class jdk.internal.misc.SharedSecrets" in JDK 11
Dian Fu created FLINK-17822: --- Summary: Flink CLI end-to-end test failed with "JavaGcCleanerWrapper$PendingCleanersRunner cannot access class jdk.internal.misc.SharedSecrets" in JDK 11 Key: FLINK-17822 URL: https://issues.apache.org/jira/browse/FLINK-17822 Project: Flink Issue Type: Bug Reporter: Dian Fu Instance: https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/1887/logs/600 {code} 2020-05-19T21:59:39.8829043Z 2020-05-19 21:59:25,193 ERROR org.apache.flink.util.JavaGcCleanerWrapper [] - FATAL UNEXPECTED - Failed to invoke waitForReferenceProcessing 2020-05-19T21:59:39.8829849Z java.lang.IllegalAccessException: class org.apache.flink.util.JavaGcCleanerWrapper$PendingCleanersRunner cannot access class jdk.internal.misc.SharedSecrets (in module java.base) because module java.base does not export jdk.internal.misc to unnamed module @54e3658c 2020-05-19T21:59:39.8830707Z at jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:361) ~[?:?] 2020-05-19T21:59:39.8831166Z at java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:591) ~[?:?] 2020-05-19T21:59:39.8831744Z at java.lang.reflect.Method.invoke(Method.java:558) ~[?:?] 2020-05-19T21:59:39.8832596Z at org.apache.flink.util.JavaGcCleanerWrapper$PendingCleanersRunner.getJavaLangRefAccess(JavaGcCleanerWrapper.java:362) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8833667Z at org.apache.flink.util.JavaGcCleanerWrapper$PendingCleanersRunner.tryRunPendingCleaners(JavaGcCleanerWrapper.java:351) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8834712Z at org.apache.flink.util.JavaGcCleanerWrapper$CleanerManager.tryRunPendingCleaners(JavaGcCleanerWrapper.java:207) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8835686Z at org.apache.flink.util.JavaGcCleanerWrapper.tryRunPendingCleaners(JavaGcCleanerWrapper.java:158) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8836652Z at org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:94) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8838033Z at org.apache.flink.runtime.memory.UnsafeMemoryBudget.verifyEmpty(UnsafeMemoryBudget.java:64) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8839259Z at org.apache.flink.runtime.memory.MemoryManager.verifyEmpty(MemoryManager.java:172) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8840148Z at org.apache.flink.runtime.taskexecutor.slot.TaskSlot.verifyMemoryFreed(TaskSlot.java:311) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8841035Z at org.apache.flink.runtime.taskexecutor.slot.TaskSlot.lambda$closeAsync$1(TaskSlot.java:301) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8841603Z at java.util.concurrent.CompletableFuture.uniRunNow(CompletableFuture.java:815) ~[?:?] 2020-05-19T21:59:39.8842069Z at java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:799) ~[?:?] 2020-05-19T21:59:39.8842844Z at java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2121) ~[?:?] 
2020-05-19T21:59:39.8843828Z at org.apache.flink.runtime.taskexecutor.slot.TaskSlot.closeAsync(TaskSlot.java:300) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8844790Z at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl.freeSlotInternal(TaskSlotTableImpl.java:404) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8845754Z at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl.freeSlot(TaskSlotTableImpl.java:365) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8846842Z at org.apache.flink.runtime.taskexecutor.TaskExecutor.freeSlotInternal(TaskExecutor.java:1589) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8847711Z at org.apache.flink.runtime.taskexecutor.TaskExecutor.freeSlot(TaskExecutor.java:967) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8848295Z at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?] 2020-05-19T21:59:39.8848732Z at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?] 2020-05-19T21:59:39.8849228Z at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?] 2020-05-19T21:59:39.8849669Z at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?] 2020-05-19T21:59:39.8850656Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] 2020-05-19T21:59:39.8851589Z at
[jira] [Created] (FLINK-17823) Resolve the race condition while releasing RemoteInputChannel
Zhijiang created FLINK-17823: Summary: Resolve the race condition while releasing RemoteInputChannel Key: FLINK-17823 URL: https://issues.apache.org/jira/browse/FLINK-17823 Project: Flink Issue Type: Bug Components: Runtime / Network Affects Versions: 1.11.0 Reporter: Zhijiang Assignee: Zhijiang Fix For: 1.11.0 RemoteInputChannel#releaseAllResources might be called by the canceler thread. Meanwhile, the task thread can also call RemoteInputChannel#getNextBuffer. This can cause two potential problems: * The task thread might get a null buffer after the canceler thread has already released all the buffers, which might cause a misleading NPE in getNextBuffer. * The task thread and the canceler thread might pull the same buffer concurrently, which causes an unexpected exception when the same buffer is recycled twice. The solution is to properly synchronize the buffer queue in the release method to avoid the same buffer being pulled by both the canceler thread and the task thread. And in the getNextBuffer method, we add some explicit checks to avoid a misleading NPE and to surface valid exceptions instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
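A simplified model of the fix described above: guarding the buffer queue with one lock makes release and poll mutually exclusive, so a buffer can never be recycled twice, and an explicit released check replaces the misleading NPE. Names mirror the ticket, but this is an illustration, not Flink's actual code:
{code:java}
import java.util.ArrayDeque;

public class ChannelModel {

    static class Buffer {
        void recycle() { /* return the memory segment to the pool */ }
    }

    private final ArrayDeque<Buffer> receivedBuffers = new ArrayDeque<>();
    private boolean released;

    void onBuffer(Buffer b) {
        synchronized (receivedBuffers) {
            if (released) { b.recycle(); return; } // late arrival after release
            receivedBuffers.add(b);
        }
    }

    Buffer getNextBuffer() {
        synchronized (receivedBuffers) {
            // Explicit check instead of a misleading NPE from polling a queue
            // that the canceler thread has already drained.
            if (released) {
                throw new IllegalStateException("Channel has already been released.");
            }
            return receivedBuffers.poll();
        }
    }

    void releaseAllResources() {
        synchronized (receivedBuffers) {
            if (released) return;
            released = true;
            Buffer b;
            while ((b = receivedBuffers.poll()) != null) {
                b.recycle(); // each buffer is recycled exactly once
            }
        }
    }
}
{code}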
[jira] [Created] (FLINK-17824) "Resuming Savepoint" e2e stalls indefinitely
Robert Metzger created FLINK-17824:
-----------------------------------
Summary: "Resuming Savepoint" e2e stalls indefinitely
Key: FLINK-17824
URL: https://issues.apache.org/jira/browse/FLINK-17824
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing, Tests
Reporter: Robert Metzger

CI: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1887&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=94459a52-42b6-5bfc-5d74-690b5d3c6de8

{code}
2020-05-19T21:05:52.9696236Z ==
2020-05-19T21:05:52.9696860Z Running 'Resuming Savepoint (file, async, scale down) end-to-end test'
2020-05-19T21:05:52.9697243Z ==
2020-05-19T21:05:52.9713094Z TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-52970362751
2020-05-19T21:05:53.1194478Z Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.12-SNAPSHOT-bin/flink-1.12-SNAPSHOT
2020-05-19T21:05:53.2180375Z Starting cluster.
2020-05-19T21:05:53.9986167Z Starting standalonesession daemon on host fv-az558.
2020-05-19T21:05:55.5997224Z Starting taskexecutor daemon on host fv-az558.
2020-05-19T21:05:55.6223837Z Waiting for Dispatcher REST endpoint to come up...
2020-05-19T21:05:57.0552482Z Waiting for Dispatcher REST endpoint to come up...
2020-05-19T21:05:57.9446865Z Waiting for Dispatcher REST endpoint to come up...
2020-05-19T21:05:59.0098434Z Waiting for Dispatcher REST endpoint to come up...
2020-05-19T21:06:00.0569710Z Dispatcher REST endpoint is up.
2020-05-19T21:06:07.7099937Z Job (a92a74de8446a80403798bb4806b73f3) is running.
2020-05-19T21:06:07.7855906Z Waiting for job to process up to 200 records, current progress: 114 records ...
2020-05-19T21:06:55.5755111Z
2020-05-19T21:06:55.5756550Z
2020-05-19T21:06:55.5757225Z The program finished with the following exception:
2020-05-19T21:06:55.5757566Z
2020-05-19T21:06:55.5765453Z org.apache.flink.util.FlinkException: Could not stop with a savepoint job "a92a74de8446a80403798bb4806b73f3".
2020-05-19T21:06:55.5766873Z    at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:485)
2020-05-19T21:06:55.5767980Z    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:854)
2020-05-19T21:06:55.5769014Z    at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:477)
2020-05-19T21:06:55.5770052Z    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:921)
2020-05-19T21:06:55.5771107Z    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:982)
2020-05-19T21:06:55.5772223Z    at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
2020-05-19T21:06:55.5773325Z    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:982)
2020-05-19T21:06:55.5774871Z Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
2020-05-19T21:06:55.5777183Z    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
2020-05-19T21:06:55.5778884Z    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
2020-05-19T21:06:55.5779920Z    at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:483)
2020-05-19T21:06:55.5781175Z    ... 6 more
2020-05-19T21:06:55.5782391Z Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
2020-05-19T21:06:55.5783885Z    at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:890)
2020-05-19T21:06:55.5784992Z    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
2020-05-19T21:06:55.5786492Z    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
2020-05-19T21:06:55.5787601Z    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
2020-05-19T21:06:55.5788682Z    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
2020-05-19T21:06:55.5790308Z    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
2020-05-19T21:06:55.5791664Z    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
2020-05-19T21:06:55.5792767Z    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
2020-05-19T21:06:55.5793756Z    at akka.jap
{code}
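For readers following the stack trace: the CLI's stop action boils down to requesting stop-with-savepoint from the cluster and blocking on the resulting future, which is where the CheckpointException surfaced here. Below is a hedged sketch of the equivalent client call (the test itself drives bin/flink stop; the REST address, savepoint directory, and demo class are placeholders):

{code}
import org.apache.flink.api.common.JobID;
import org.apache.flink.client.deployment.StandaloneClusterId;
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;

// Illustrative sketch, not the e2e test's actual code: stop a running job with a
// savepoint and wait for the savepoint path.
public class StopWithSavepointDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setString("rest.address", "localhost"); // placeholder endpoint
        conf.setInteger("rest.port", 8081);
        try (RestClusterClient<StandaloneClusterId> client =
                new RestClusterClient<>(conf, StandaloneClusterId.getInstance())) {
            // Blocks like CliFrontend#stop; a CheckpointException from the coordinator
            // surfaces here as an ExecutionException, as in the log above.
            String savepointPath = client
                    .stopWithSavepoint(
                            JobID.fromHexString("a92a74de8446a80403798bb4806b73f3"),
                            false /* advanceToEndOfEventTime */,
                            "file:///tmp/savepoints" /* placeholder directory */)
                    .get();
            System.out.println("Savepoint written to " + savepointPath);
        }
    }
}
{code}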
Re: MODERATE for dev@flink.apache.org
Hi,

Looks like you are trying to send email to the Apache Flink dev@ mailing list without subscribing yet. Please subscribe to the dev mailing list [1] to be able to see the reply and follow up with your question.

Thanks,
Henry
On behalf of the Apache Flink PMC

[1] https://flink.apache.org/community.html#how-to-subscribe-to-a-mailing-list

> -- Forwarded message --
> From: vishalovercome
> To: dev@flink.apache.org
> Date: Tue, 19 May 2020 22:13:39 -0700 (MST)
> Subject: Issues using StreamingFileSink (Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties, hadoop-metrics2.properties)
>
> I've written a Flink job to stream updates to S3 using StreamingFileSink. I used the presto plugin for checkpointing and the hadoop plugin for writing part files to S3. I was going through my application logs and found these messages which I couldn't understand and would like to know if there's some step I might have missed. Even if these messages are harmless, I would like to understand what they mean.
>
> 2020-05-19 19:23:33,277 INFO org.apache.flink.fs.shaded.hadoop3.org.apache.commons.beanutils.FluentPropertyBeanIntrospector - Error when creating PropertyDescriptor for public final void org.apache.flink.fs.shaded.hadoop3.org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)! Ignoring this property.
> *2020-05-19 19:23:33,310 WARN org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.metrics2.impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties*
> 2020-05-19 19:23:33,397 INFO org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.metrics2.impl.MetricsSystemImpl - Scheduled Metric snapshot period at 10 second(s).
> 2020-05-19 19:23:33,397 INFO org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.metrics2.impl.MetricsSystemImpl - s3a-file-system metrics system started
> 2020-05-19 19:23:34,833 INFO org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.conf.Configuration.deprecation - fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
>
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
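For readers hitting the same messages: the highlighted WARN appears to come from the bundled Hadoop S3A metrics system failing to find an optional metrics properties file, and is typically harmless. For reference, a minimal sketch of the kind of job described above (the bucket path and source data are placeholders, not the poster's code):

{code}
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// Illustrative sketch of a StreamingFileSink job writing to S3; "my-bucket" and the
// source elements are placeholders. Part files are committed on checkpoints, so
// checkpointing must be enabled for data to become visible in S3.
public class S3SinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // commit in-progress part files every minute

        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3a://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        env.fromElements("a", "b", "c").addSink(sink);
        env.execute("s3-sink-demo");
    }
}
{code}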
[jira] [Created] (FLINK-17825) HA end-to-end gets killed due to timeout
Robert Metzger created FLINK-17825:
-----------------------------------
Summary: HA end-to-end gets killed due to timeout
Key: FLINK-17825
URL: https://issues.apache.org/jira/browse/FLINK-17825
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination, Tests
Reporter: Robert Metzger
Assignee: Robert Metzger

CI: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1867&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5

{code}
2020-05-19T20:46:50.9034002Z Killed TM @ 104061
2020-05-19T20:47:05.8510180Z Killed TM @ 107775
2020-05-19T20:47:55.1181475Z Killed TM @ 108337
2020-05-19T20:48:16.7907005Z Test (pid: 89099) did not finish after 540 seconds.
2020-05-19T20:48:16.790Z Printing Flink logs and killing it:
[...]
2020-05-19T20:48:19.1016912Z /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_ha_datastream.sh: line 125: 89099 Terminated ( cmdpid=$BASHPID; ( sleep $TEST_TIMEOUT_SECONDS; echo "Test (pid: $cmdpid) did not finish after $TEST_TIMEOUT_SECONDS seconds."; echo "Printing Flink logs and killing it:"; cat ${FLINK_DIR}/log/*; kill "$cmdpid" ) & watchdog_pid=$!; echo $watchdog_pid > $TEST_DATA_DIR/job_watchdog.pid; run_ha_test 4 ${STATE_BACKEND_TYPE} ${STATE_BACKEND_FILE_ASYNC} ${STATE_BACKEND_ROCKS_INCREMENTAL} ${ZOOKEEPER_VERSION} )
2020-05-19T20:48:19.1017985Z Stopping job timeout watchdog (with pid=89100)
2020-05-19T20:48:19.1018621Z /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_ha_datastream.sh: line 112: kill: (89100) - No such process
2020-05-19T20:48:19.1019000Z Killing JM watchdog @ 91127
2020-05-19T20:48:19.1019199Z Killing TM watchdog @ 91883
2020-05-19T20:48:19.1019424Z [FAIL] Test script contains errors.
2020-05-19T20:48:19.1019639Z Checking of logs skipped.
2020-05-19T20:48:19.1019785Z
2020-05-19T20:48:19.1020329Z [FAIL] 'Running HA (rocks, non-incremental) end-to-end test' failed after 9 minutes and 0 seconds! Test exited with exit code 1
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)