[GitHub] aljoscha commented on issue #6784: [FLINK-7811] Add support for Scala 2.12
aljoscha commented on issue #6784: [FLINK-7811] Add support for Scala 2.12 URL: https://github.com/apache/flink/pull/6784#issuecomment-431645254 Also, we won't merge it like this but have to think about how to integrate the Scala 2.12 build. This PR only builds Scala 2.12, for testing, without building Scala 2.11. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-7811) Add support for Scala 2.12
[ https://issues.apache.org/jira/browse/FLINK-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658116#comment-16658116 ] ASF GitHub Bot commented on FLINK-7811: --- aljoscha commented on issue #6784: [FLINK-7811] Add support for Scala 2.12 URL: https://github.com/apache/flink/pull/6784#issuecomment-431645254 Also, we won't merge it like this but have to think about how to integrate the Scala 2.12 build. This PR only builds Scala 2.12, for testing, without building Scala 2.11. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add support for Scala 2.12 > -- > > Key: FLINK-7811 > URL: https://issues.apache.org/jira/browse/FLINK-7811 > Project: Flink > Issue Type: Sub-task > Components: Scala API >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10599) Provide documentation for the modern kafka connector
[ https://issues.apache.org/jira/browse/FLINK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated FLINK-10599: - Component/s: Documentation > Provide documentation for the modern kafka connector > > > Key: FLINK-10599 > URL: https://issues.apache.org/jira/browse/FLINK-10599 > Project: Flink > Issue Type: Sub-task > Components: Documentation, Kafka Connector >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Fix For: 1.7.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10609) Source distribution tests fails with logging ClassCastException
[ https://issues.apache.org/jira/browse/FLINK-10609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658166#comment-16658166 ] Till Rohrmann commented on FLINK-10609: --- This should also apply to the current master, right [~Zentol]? > Source distribution tests fails with logging ClassCastException > --- > > Key: FLINK-10609 > URL: https://issues.apache.org/jira/browse/FLINK-10609 > Project: Flink > Issue Type: Bug > Components: Build System, Tests >Affects Versions: 1.5.5, 1.6.2 >Reporter: Chesnay Schepler >Assignee: Chesnay Schepler >Priority: Major > Fix For: 1.5.5, 1.6.2 > > > [~fhueske] and [~dawidwys] have reported a test error while running the tests > for the source distribution of {{1.5.5-rc1}} and {{1.6.2-rc1}}. > 1.6: > {code:java} > ERROR StatusLogger Unable to create class > org.apache.flink.streaming.connectors.elasticsearch5.shaded.org.apache.logging.slf4j.SLF4JLoggerContextFactory > specified in > file:/home/fhueske/Downloads/flink-1.6.2/flink-connectors/flink-connector-elasticsearch5/target/classes/META-INF/log4j-provider.properties > java.lang.ClassNotFoundException: > org.apache.flink.streaming.connectors.elasticsearch5.shaded.org.apache.logging.slf4j.SLF4JLoggerContextFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.logging.log4j.spi.Provider.loadLoggerContextFactory(Provider.java:96) > at org.apache.logging.log4j.LogManager.(LogManager.java:91) > at > org.elasticsearch.common.logging.ESLoggerFactory.getLogger(ESLoggerFactory.java:49) > at org.elasticsearch.common.logging.Loggers.getLogger(Loggers.java:105) > at org.elasticsearch.node.Node.(Node.java:237) > at > org.apache.flink.streaming.connectors.elasticsearch.EmbeddedElasticsearchNodeEnvironmentImpl$PluginNode.(EmbeddedElasticsearchNodeEnvironmentImpl.java:78) > at > org.apache.flink.streaming.connectors.elasticsearch.EmbeddedElasticsearchNodeEnvironmentImpl.start(EmbeddedElasticsearchNodeEnvironmentImpl.java:54) > at > org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.prepare(ElasticsearchSinkTestBase.java:72) > > {code} > 1.5: > {code:java} > Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; > support was removed in 8.0 > Running org.apache.flink.addons.hbase.HBaseConnectorITCase > Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.557 sec <<< > FAILURE! - in org.apache.flink.addons.hbase.HBaseConnectorITCase > org.apache.flink.addons.hbase.HBaseConnectorITCase Time elapsed: 0.556 sec > <<< ERROR! > java.lang.ClassCastException: > org.apache.commons.logging.impl.SLF4JLocationAwareLog cannot be cast to > org.apache.commons.logging.impl.Log4JLogger > org.apache.flink.addons.hbase.HBaseConnectorITCase Time elapsed: 0.557 sec > <<< ERROR! > java.lang.NullPointerException{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] yanghua opened a new pull request #6889: [FLINK-10599][Documentation] Provide documentation for the modern kafka connector
yanghua opened a new pull request #6889: [FLINK-10599][Documentation] Provide documentation for the modern kafka connector URL: https://github.com/apache/flink/pull/6889 ## What is the purpose of the change *This pull request provides documentation for the modern kafka connector* ## Brief change log - *Provide documentation for the modern kafka connector* ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (yes / **no**) - If yes, how is the feature documented? (not applicable / **docs** / JavaDocs / not documented) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-10599) Provide documentation for the modern kafka connector
[ https://issues.apache.org/jira/browse/FLINK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658169#comment-16658169 ] ASF GitHub Bot commented on FLINK-10599: yanghua opened a new pull request #6889: [FLINK-10599][Documentation] Provide documentation for the modern kafka connector URL: https://github.com/apache/flink/pull/6889 ## What is the purpose of the change *This pull request provides documentation for the modern kafka connector* ## Brief change log - *Provide documentation for the modern kafka connector* ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (yes / **no**) - If yes, how is the feature documented? (not applicable / **docs** / JavaDocs / not documented) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide documentation for the modern kafka connector > > > Key: FLINK-10599 > URL: https://issues.apache.org/jira/browse/FLINK-10599 > Project: Flink > Issue Type: Sub-task > Components: Documentation, Kafka Connector >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10599) Provide documentation for the modern kafka connector
[ https://issues.apache.org/jira/browse/FLINK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-10599: --- Labels: pull-request-available (was: ) > Provide documentation for the modern kafka connector > > > Key: FLINK-10599 > URL: https://issues.apache.org/jira/browse/FLINK-10599 > Project: Flink > Issue Type: Sub-task > Components: Documentation, Kafka Connector >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] yanghua commented on issue #6889: [FLINK-10599][Documentation] Provide documentation for the modern kafka connector
yanghua commented on issue #6889: [FLINK-10599][Documentation] Provide documentation for the modern kafka connector URL: https://github.com/apache/flink/pull/6889#issuecomment-431659542 Hi @aljoscha, I provided a simple introduction and documentation for the modern kafka connector. Since I am not a native English speaker, you are welcome to point out where the documentation is inappropriate. Also, if it is missing something, please feel free to tell me. Thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-10599) Provide documentation for the modern kafka connector
[ https://issues.apache.org/jira/browse/FLINK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658171#comment-16658171 ] ASF GitHub Bot commented on FLINK-10599: yanghua commented on issue #6889: [FLINK-10599][Documentation] Provide documentation for the modern kafka connector URL: https://github.com/apache/flink/pull/6889#issuecomment-431659542 Hi @aljoscha, I provided a simple introduction and documentation for the modern kafka connector. Since I am not a native English speaker, you are welcome to point out where the documentation is inappropriate. Also, if it is missing something, please feel free to tell me. Thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide documentation for the modern kafka connector > > > Key: FLINK-10599 > URL: https://issues.apache.org/jira/browse/FLINK-10599 > Project: Flink > Issue Type: Sub-task > Components: Documentation, Kafka Connector >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10603) Reduce kafka test duration
[ https://issues.apache.org/jira/browse/FLINK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658177#comment-16658177 ] vinoyang commented on FLINK-10603: -- This issue is caused by the increase in the timeout of testFailOnNoBroker (from 60 seconds to 120 seconds). The method's timeout was originally extended because it needs to call the Kafka client API partitionsFor. In the Kafka 2.0 client its default timeout is 60s. According to the official description of the relevant configuration, the timeout period is determined by the configuration item (max.block.ms). So we tried to reduce that configuration to 30s. > Reduce kafka test duration > -- > > Key: FLINK-10603 > URL: https://issues.apache.org/jira/browse/FLINK-10603 > Project: Flink > Issue Type: Sub-task > Components: Kafka Connector, Tests >Affects Versions: 1.7.0 >Reporter: Chesnay Schepler >Assignee: vinoyang >Priority: Major > > The tests for the modern kafka connector take more than 10 minutes which is > simply unacceptable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-10603) Reduce kafka test duration
[ https://issues.apache.org/jira/browse/FLINK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658177#comment-16658177 ] vinoyang edited comment on FLINK-10603 at 10/21/18 11:33 AM: - This issue is caused by the increase in the timeout of testFailOnNoBroker (from 60 seconds to 120 seconds). The method's timeout was originally extended because it needs to call the Kafka client API partitionsFor. In the Kafka 2.0 client its default timeout is 60s. According to the official description of the relevant configuration, the timeout period is determined by the configuration item ([#max.block.ms]https://kafka.apache.org/documentation/). So we tried to reduce that configuration to 30s. was (Author: yanghua): This issue is caused by the increase in the timeout of testFailOnNoBroker (from 60 seconds to 120 seconds). The method's timeout was originally extended because it needs to call the Kafka client API partitionsFor. In the Kafka 2.0 client its default timeout is 60s. According to the official description of the relevant configuration, the timeout period is determined by the configuration item (max.block.ms). So we tried to reduce that configuration to 30s. > Reduce kafka test duration > -- > > Key: FLINK-10603 > URL: https://issues.apache.org/jira/browse/FLINK-10603 > Project: Flink > Issue Type: Sub-task > Components: Kafka Connector, Tests >Affects Versions: 1.7.0 >Reporter: Chesnay Schepler >Assignee: vinoyang >Priority: Major > > The tests for the modern kafka connector take more than 10 minutes which is > simply unacceptable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-10603) Reduce kafka test duration
[ https://issues.apache.org/jira/browse/FLINK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658177#comment-16658177 ] vinoyang edited comment on FLINK-10603 at 10/21/18 11:34 AM: - This issue is caused by the increase in the timeout of testFailOnNoBroker (from 60 seconds to 120 seconds). The method's timeout was originally extended because it needs to call the Kafka client API partitionsFor. In the Kafka 2.0 client its default timeout is 60s. According to the official description of the relevant configuration, the timeout period is determined by the configuration item ([max.block.ms|https://kafka.apache.org/documentation/]). So we tried to reduce that configuration to 30s. was (Author: yanghua): This issue is caused by the increase in the timeout of testFailOnNoBroker (from 60 seconds to 120 seconds). The method's timeout was originally extended because it needs to call the Kafka client API partitionsFor. In the Kafka 2.0 client its default timeout is 60s. According to the official description of the relevant configuration, the timeout period is determined by the configuration item ([#max.block.ms]https://kafka.apache.org/documentation/). So we tried to reduce that configuration to 30s. > Reduce kafka test duration > -- > > Key: FLINK-10603 > URL: https://issues.apache.org/jira/browse/FLINK-10603 > Project: Flink > Issue Type: Sub-task > Components: Kafka Connector, Tests >Affects Versions: 1.7.0 >Reporter: Chesnay Schepler >Assignee: vinoyang >Priority: Major > > The tests for the modern kafka connector take more than 10 minutes which is > simply unacceptable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
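To make the timeout discussion above concrete, the following is a minimal sketch of the proposed test configuration. The broker address is a made-up placeholder; max.block.ms is a standard Kafka producer setting that bounds how long KafkaProducer#partitionsFor may block, and 30s is the value proposed in the comment.

```java
import java.util.Properties;

public class KafkaTestTimeoutConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
        // max.block.ms bounds how long KafkaProducer#partitionsFor (and send())
        // may block; the Kafka 2.0 client defaults to 60s, and the comment above
        // proposes lowering it to 30s for the connector tests.
        props.setProperty("max.block.ms", "30000");
        System.out.println("test producer config: " + props);
    }
}
```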
[jira] [Updated] (FLINK-10603) Reduce kafka test duration
[ https://issues.apache.org/jira/browse/FLINK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-10603: --- Labels: pull-request-available (was: ) > Reduce kafka test duration > -- > > Key: FLINK-10603 > URL: https://issues.apache.org/jira/browse/FLINK-10603 > Project: Flink > Issue Type: Sub-task > Components: Kafka Connector, Tests >Affects Versions: 1.7.0 >Reporter: Chesnay Schepler >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > > The tests for the modern kafka connector take more than 10 minutes which is > simply unacceptable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] yanghua opened a new pull request #6890: [FLINK-10603] Reduce kafka test duration
yanghua opened a new pull request #6890: [FLINK-10603] Reduce kafka test duration URL: https://github.com/apache/flink/pull/6890 ## What is the purpose of the change *This pull request reduces kafka test duration* ## Brief change log - *Reduce kafka test duration* ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (yes / **no**) - If yes, how is the feature documented? (not applicable / docs / JavaDocs / **not documented**) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-10603) Reduce kafka test duration
[ https://issues.apache.org/jira/browse/FLINK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658180#comment-16658180 ] ASF GitHub Bot commented on FLINK-10603: yanghua opened a new pull request #6890: [FLINK-10603] Reduce kafka test duration URL: https://github.com/apache/flink/pull/6890 ## What is the purpose of the change *This pull request reduces kafka test duration* ## Brief change log - *Reduce kafka test duration* ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (yes / **no**) - If yes, how is the feature documented? (not applicable / docs / JavaDocs / **not documented**) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Reduce kafka test duration > -- > > Key: FLINK-10603 > URL: https://issues.apache.org/jira/browse/FLINK-10603 > Project: Flink > Issue Type: Sub-task > Components: Kafka Connector, Tests >Affects Versions: 1.7.0 >Reporter: Chesnay Schepler >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > > The tests for the modern kafka connector take more than 10 minutes which is > simply unacceptable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10622) Add end-to-end test for time versioned joins
Till Rohrmann created FLINK-10622: - Summary: Add end-to-end test for time versioned joins Key: FLINK-10622 URL: https://issues.apache.org/jira/browse/FLINK-10622 Project: Flink Issue Type: Sub-task Components: Table API & SQL, Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 We should add an end-to-end test to test the new time versioned joins functionality. Maybe we can add this to the existing Stream SQL end-to-end test {{test_streaming_sql.sh}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10623) Extend streaming SQL end-to-end test to test MATCH_RECOGNIZE
Till Rohrmann created FLINK-10623: - Summary: Extend streaming SQL end-to-end test to test MATCH_RECOGNIZE Key: FLINK-10623 URL: https://issues.apache.org/jira/browse/FLINK-10623 Project: Flink Issue Type: Sub-task Components: Table API & SQL, Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 We should extend the existing {{test_streaming_sql.sh}} to test the newly added {{MATCH_RECOGNIZE}} functionality. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10624) Extend SQL client end-to-end to test new KafkaTableSink
Till Rohrmann created FLINK-10624: - Summary: Extend SQL client end-to-end to test new KafkaTableSink Key: FLINK-10624 URL: https://issues.apache.org/jira/browse/FLINK-10624 Project: Flink Issue Type: Task Components: Kafka Connector, Table API & SQL, Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 With FLINK-9697, we added support for Kafka 2.0. We should also extend the existing streaming client end-to-end test to also test the new {{KafkaTableSink}} against Kafka 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10625) Add MATCH_RECOGNIZE documentation
Till Rohrmann created FLINK-10625: - Summary: Add MATCH_RECOGNIZE documentation Key: FLINK-10625 URL: https://issues.apache.org/jira/browse/FLINK-10625 Project: Flink Issue Type: Sub-task Components: Documentation, Table API & SQL Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 The newly added {{MATCH_RECOGNIZE}} functionality needs to be documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
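Since that documentation does not exist yet, here is only a rough sketch of what a MATCH_RECOGNIZE query could look like; the table Ticker(symbol, price, rowtime) and all column names are invented for illustration, and the clause structure follows the SQL standard feature rather than a confirmed Flink example.

```java
public class MatchRecognizeSketch {
    public static void main(String[] args) {
        // Hypothetical query: per symbol, match a starting row followed by a
        // streak of strictly falling prices, and report the start and lowest price.
        String query =
            "SELECT * FROM Ticker " +
            "MATCH_RECOGNIZE ( " +
            "  PARTITION BY symbol " +
            "  ORDER BY rowtime " +
            "  MEASURES A.price AS startPrice, LAST(B.price) AS lowestPrice " +
            "  ONE ROW PER MATCH " +
            "  AFTER MATCH SKIP PAST LAST ROW " +
            "  PATTERN (A B+) " +
            "  DEFINE B AS B.price < PREV(B.price) " +
            ") AS T";
        // With a StreamTableEnvironment 'tableEnv', tableEnv.sqlQuery(query)
        // would return the matched rows as a Table.
        System.out.println(query);
    }
}
```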
[jira] [Created] (FLINK-10626) Add documentation for time versioned joins
Till Rohrmann created FLINK-10626: - Summary: Add documentation for time versioned joins Key: FLINK-10626 URL: https://issues.apache.org/jira/browse/FLINK-10626 Project: Flink Issue Type: Sub-task Components: Documentation, Table API & SQL Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 The Flink documentation should be updated to cover the newly added functionality. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10627) Extend test_streaming_file_sink to test S3 writer
Till Rohrmann created FLINK-10627: - Summary: Extend test_streaming_file_sink to test S3 writer Key: FLINK-10627 URL: https://issues.apache.org/jira/browse/FLINK-10627 Project: Flink Issue Type: Sub-task Components: filesystem-connector, Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 We should extend the {{test_streaming_file_sink.sh}} to test the new S3 recoverable writer in an integrated way. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10627) Extend test_streaming_file_sink to test S3 writer
[ https://issues.apache.org/jira/browse/FLINK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-10627: -- Description: We should extend the {{test_streaming_file_sink.sh}} to test the new S3 recoverable writer in an integrated way. The test should cover the fault recovery. Moreover it should check for resource cleanup. was:We should extend the {{test_streaming_file_sink.sh}} to test the new S3 recoverable writer in an integrated way. > Extend test_streaming_file_sink to test S3 writer > - > > Key: FLINK-10627 > URL: https://issues.apache.org/jira/browse/FLINK-10627 > Project: Flink > Issue Type: Sub-task > Components: filesystem-connector, Tests >Affects Versions: 1.7.0 >Reporter: Till Rohrmann >Priority: Critical > Fix For: 1.7.0 > > > We should extend the {{test_streaming_file_sink.sh}} to test the new S3 > recoverable writer in an integrated way. > The test should cover the fault recovery. Moreover it should check for > resource cleanup. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10628) Add end-to-end test for REST communication encryption
Till Rohrmann created FLINK-10628: - Summary: Add end-to-end test for REST communication encryption Key: FLINK-10628 URL: https://issues.apache.org/jira/browse/FLINK-10628 Project: Flink Issue Type: Sub-task Components: REST Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 With FLINK-10371 we added support for mutual authentication for the REST communication. We should adapt one of the existing end-to-end tests to require this feature for the REST communication. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-2485) Handle removal of Java Unsafe in Java9
[ https://issues.apache.org/jira/browse/FLINK-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658231#comment-16658231 ] Benoit MERIAUX commented on FLINK-2485: --- Hi, I did a quick search on usage of {code:java}sun.misc.unsafe{code} and there are actually 39 occurrences, mainly in the core module, which concentrates 34 hits, notably the memory package with 26 occurrences. There is a replacement solution with {code:java}VarHandle{code} starting from JDK 9, so it is not backward compatible with versions earlier than JDK 9. Unsafe is still included in JDK 9 and 10, and removed from 11. So this issue is not on the critical path to build with JDK 9 and 10. > Handle removal of Java Unsafe in Java9 > -- > > Key: FLINK-2485 > URL: https://issues.apache.org/jira/browse/FLINK-2485 > Project: Flink > Issue Type: Task > Components: Core >Reporter: Henry Saputra >Priority: Major > > With the potential that Oracle will remove Java Unsafe [1] from Java 9, we need to make > sure we have an upgrade path for Apache Flink > [1] > https://docs.google.com/document/d/1GDm_cAxYInmoHMor-AkStzWvwE9pw6tnz_CebJQxuUE/edit?pli=1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
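As a concrete illustration of the VarHandle replacement mentioned in the comment above, here is a minimal JDK 9+ sketch; it is not Flink's actual memory code, just the general pattern for replacing Unsafe-based atomic field access.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Minimal JDK 9+ sketch: atomic field access via VarHandle instead of sun.misc.Unsafe. */
public class VarHandleCounter {
    private volatile long value;

    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(VarHandleCounter.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public long getAndAdd(long delta) {
        // Replaces the Unsafe.getAndAddLong(object, offset, delta) idiom.
        return (long) VALUE.getAndAdd(this, delta);
    }

    public static void main(String[] args) {
        VarHandleCounter counter = new VarHandleCounter();
        counter.getAndAdd(5);
        System.out.println(counter.value); // prints 5
    }
}
```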
[jira] [Commented] (FLINK-2485) Handle removal of Java Unsafe in Java9
[ https://issues.apache.org/jira/browse/FLINK-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658251#comment-16658251 ] Henry Saputra commented on FLINK-2485: -- Agree. Will reduce the importance of this issue > Handle removal of Java Unsafe in Java9 > -- > > Key: FLINK-2485 > URL: https://issues.apache.org/jira/browse/FLINK-2485 > Project: Flink > Issue Type: Task > Components: Core >Reporter: Henry Saputra >Priority: Major > > With the potential that Oracle will remove Java Unsafe [1] from Java 9, we need to make > sure we have an upgrade path for Apache Flink > [1] > https://docs.google.com/document/d/1GDm_cAxYInmoHMor-AkStzWvwE9pw6tnz_CebJQxuUE/edit?pli=1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-2485) Handle removal of Java Unsafe
[ https://issues.apache.org/jira/browse/FLINK-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henry Saputra updated FLINK-2485: - Priority: Minor (was: Major) Description: With the potential that Oracle will remove Java Unsafe [1] from a future Java release, we need to make sure we have an upgrade path for Apache Flink [1] [https://docs.google.com/document/d/1GDm_cAxYInmoHMor-AkStzWvwE9pw6tnz_CebJQxuUE/edit?pli=1] was: With the potential that Oracle will remove Java Unsafe [1] from Java 9, we need to make sure we have an upgrade path for Apache Flink [1] https://docs.google.com/document/d/1GDm_cAxYInmoHMor-AkStzWvwE9pw6tnz_CebJQxuUE/edit?pli=1 Summary: Handle removal of Java Unsafe (was: Handle removal of Java Unsafe in Java9) > Handle removal of Java Unsafe > - > > Key: FLINK-2485 > URL: https://issues.apache.org/jira/browse/FLINK-2485 > Project: Flink > Issue Type: Task > Components: Core >Reporter: Henry Saputra >Priority: Minor > > With the potential that Oracle will remove Java Unsafe [1] from a future Java release, we need to > make sure we have an upgrade path for Apache Flink > [1] > [https://docs.google.com/document/d/1GDm_cAxYInmoHMor-AkStzWvwE9pw6tnz_CebJQxuUE/edit?pli=1] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] tillrohrmann commented on issue #6876: [FLINK-10251] Handle oversized response messages in AkkaRpcActor
tillrohrmann commented on issue #6876: [FLINK-10251] Handle oversized response messages in AkkaRpcActor URL: https://github.com/apache/flink/pull/6876#issuecomment-431698563 The problem is that you always try to serialize the result. However, it should only be serialized if the sender is remote. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-10251) Handle oversized response messages in AkkaRpcActor
[ https://issues.apache.org/jira/browse/FLINK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658356#comment-16658356 ] ASF GitHub Bot commented on FLINK-10251: tillrohrmann commented on issue #6876: [FLINK-10251] Handle oversized response messages in AkkaRpcActor URL: https://github.com/apache/flink/pull/6876#issuecomment-431698563 The problem is that you always try to serialize the result. However, it should only be serialized if the sender is remote. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Handle oversized response messages in AkkaRpcActor > -- > > Key: FLINK-10251 > URL: https://issues.apache.org/jira/browse/FLINK-10251 > Project: Flink > Issue Type: Improvement > Components: Distributed Coordination >Affects Versions: 1.5.3, 1.6.0, 1.7.0 >Reporter: Till Rohrmann >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > Fix For: 1.5.6, 1.6.3, 1.7.0 > > > The {{AkkaRpcActor}} should check whether an RPC response which is sent to a > remote sender does not exceed the maximum framesize of the underlying > {{ActorSystem}}. If this is the case we should fail fast instead. We can > achieve this by serializing the response and sending the serialized byte > array. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
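A minimal sketch of the behaviour the review comment asks for, using invented names rather than the real AkkaRpcActor internals: the result is serialized, and checked against the maximum frame size, only when the sender is remote; a local sender receives the object reference directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class RpcResponseHandling {

    /** Returns the original object (local sender) or its size-checked serialized form (remote sender). */
    static Object prepareResponse(Serializable result, boolean senderIsRemote, long maxFrameSizeBytes)
            throws IOException {
        if (!senderIsRemote) {
            return result; // local call: no serialization needed
        }
        byte[] serialized = serialize(result); // remote call: serialize eagerly...
        if (serialized.length > maxFrameSizeBytes) { // ...and fail fast if it cannot be shipped
            throw new IOException("RPC response of " + serialized.length
                    + " bytes exceeds maximum frame size of " + maxFrameSizeBytes + " bytes");
        }
        return serialized;
    }

    private static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        return bos.toByteArray();
    }
}
```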
[jira] [Commented] (FLINK-10558) Port YARNHighAvailabilityITCase and YARNSessionCapacitySchedulerITCase to new code base
[ https://issues.apache.org/jira/browse/FLINK-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658357#comment-16658357 ] Till Rohrmann commented on FLINK-10558: --- I think in the future we might add a minimum number of running TMs to the session cluster. This number would ensure that at least that many TMs are always running. If we have this feature, then we don't need to submit a job to make the test work. > Port YARNHighAvailabilityITCase and YARNSessionCapacitySchedulerITCase to new > code base > --- > > Key: FLINK-10558 > URL: https://issues.apache.org/jira/browse/FLINK-10558 > Project: Flink > Issue Type: Sub-task >Reporter: vinoyang >Assignee: TisonKun >Priority: Minor > > {{YARNHighAvailabilityITCase}}, > {{YARNSessionCapacitySchedulerITCase#testClientStartup,}} > {{YARNSessionCapacitySchedulerITCase#testTaskManagerFailure}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10527) Cleanup constant isNewMode in YarnTestBase
[ https://issues.apache.org/jira/browse/FLINK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658358#comment-16658358 ] ASF GitHub Bot commented on FLINK-10527: tillrohrmann closed pull request #6816: [FLINK-10527] Cleanup constant isNewMode in YarnTestBase URL: https://github.com/apache/flink/pull/6816 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java b/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java index f027399be7f..fb2c6ccf2b0 100644 --- a/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java +++ b/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java @@ -132,28 +132,6 @@ public void testDetachedMode() throws InterruptedException, IOException { // before checking any strings outputted by the CLI, first give it time to return clusterRunner.join(); - if (!isNewMode) { - checkForLogString("The Flink YARN client has been started in detached mode"); - - // in legacy mode we have to wait until the TMs are up until we can submit the job - LOG.info("Waiting until two containers are running"); - // wait until two containers are running - while (getRunningContainers() < 2) { - sleep(500); - } - - // additional sleep for the JM/TM to start and establish connection - long startTime = System.nanoTime(); - while (System.nanoTime() - startTime < TimeUnit.NANOSECONDS.convert(10, TimeUnit.SECONDS) && - !(verifyStringsInNamedLogFiles( - new String[]{"YARN Application Master started"}, "jobmanager.log") && - verifyStringsInNamedLogFiles( - new String[]{"Starting TaskManager actor"}, "taskmanager.log"))) { - LOG.info("Still waiting for JM/TM to initialize..."); - sleep(500); - } - } - // actually run a program, otherwise we wouldn't necessarily see any TaskManagers // be brought up Runner jobRunner = startWithArgs(new String[]{"run", @@ -163,14 +141,12 @@ public void testDetachedMode() throws InterruptedException, IOException { jobRunner.join(); - if (isNewMode) { - // in "new" mode we can only wait after the job is submitted, because TMs - // are spun up lazily - LOG.info("Waiting until two containers are running"); - // wait until two containers are running - while (getRunningContainers() < 2) { - sleep(500); - } + // in "new" mode we can only wait after the job is submitted, because TMs + // are spun up lazily + LOG.info("Waiting until two containers are running"); + // wait until two containers are running + while (getRunningContainers() < 2) { + sleep(500); } // make sure we have two TMs running in either mode This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Cleanup constant isNewMode in YarnTestBase > -- > > Key: FLINK-10527 > URL: https://issues.apache.org/jira/browse/FLINK-10527 > Project: Flink > Issue Type: Sub-task > Components: YARN >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > > This seems to be a residual problem with FLINK-10396. It is set to true in > that PR. Currently it has three usage scenarios: > 1. 
assert, caused an error > {code:java} > assumeTrue("The new mode does not start TMs upfront.", !isNewMode); > {code} > 2. if (!isNewMode) the logic in the block would not have been invoked, so the if > block can be removed > 3. if (isNewMode) is always invoked, so the if statement can be removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10629) FlinkKafkaProducerITCase.testFlinkKafkaProducer10FailTransactionCoordinatorBeforeNotify failed on Travis
Till Rohrmann created FLINK-10629: - Summary: FlinkKafkaProducerITCase.testFlinkKafkaProducer10FailTransactionCoordinatorBeforeNotify failed on Travis Key: FLINK-10629 URL: https://issues.apache.org/jira/browse/FLINK-10629 Project: Flink Issue Type: Bug Components: Kafka Connector, Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 The {{FlinkKafkaProducerITCase.testFlinkKafkaProducer10FailTransactionCoordinatorBeforeNotify}} failed on Travis. https://api.travis-ci.org/v3/job/443777257/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] tillrohrmann closed pull request #6816: [FLINK-10527] Cleanup constant isNewMode in YarnTestBase
tillrohrmann closed pull request #6816: [FLINK-10527] Cleanup constant isNewMode in YarnTestBase URL: https://github.com/apache/flink/pull/6816 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java b/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java index f027399be7f..fb2c6ccf2b0 100644 --- a/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java +++ b/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java @@ -132,28 +132,6 @@ public void testDetachedMode() throws InterruptedException, IOException { // before checking any strings outputted by the CLI, first give it time to return clusterRunner.join(); - if (!isNewMode) { - checkForLogString("The Flink YARN client has been started in detached mode"); - - // in legacy mode we have to wait until the TMs are up until we can submit the job - LOG.info("Waiting until two containers are running"); - // wait until two containers are running - while (getRunningContainers() < 2) { - sleep(500); - } - - // additional sleep for the JM/TM to start and establish connection - long startTime = System.nanoTime(); - while (System.nanoTime() - startTime < TimeUnit.NANOSECONDS.convert(10, TimeUnit.SECONDS) && - !(verifyStringsInNamedLogFiles( - new String[]{"YARN Application Master started"}, "jobmanager.log") && - verifyStringsInNamedLogFiles( - new String[]{"Starting TaskManager actor"}, "taskmanager.log"))) { - LOG.info("Still waiting for JM/TM to initialize..."); - sleep(500); - } - } - // actually run a program, otherwise we wouldn't necessarily see any TaskManagers // be brought up Runner jobRunner = startWithArgs(new String[]{"run", @@ -163,14 +141,12 @@ public void testDetachedMode() throws InterruptedException, IOException { jobRunner.join(); - if (isNewMode) { - // in "new" mode we can only wait after the job is submitted, because TMs - // are spun up lazily - LOG.info("Waiting until two containers are running"); - // wait until two containers are running - while (getRunningContainers() < 2) { - sleep(500); - } + // in "new" mode we can only wait after the job is submitted, because TMs + // are spun up lazily + LOG.info("Waiting until two containers are running"); + // wait until two containers are running + while (getRunningContainers() < 2) { + sleep(500); } // make sure we have two TMs running in either mode This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Resolved] (FLINK-10527) Cleanup constant isNewMode in YarnTestBase
[ https://issues.apache.org/jira/browse/FLINK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-10527. --- Resolution: Fixed Fix Version/s: 1.7.0 Fixed via https://github.com/apache/flink/commit/7b040b915504e59243c642b1f4a84c956d96d134 > Cleanup constant isNewMode in YarnTestBase > -- > > Key: FLINK-10527 > URL: https://issues.apache.org/jira/browse/FLINK-10527 > Project: Flink > Issue Type: Sub-task > Components: YARN >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > This seems to be a residual problem with FLINK-10396. It is set to true in > that PR. Currently it has three usage scenarios: > 1. assert, caused an error > {code:java} > assumeTrue("The new mode does not start TMs upfront.", !isNewMode); > {code} > 2. if (!isNewMode) the logic in the block would not have been invoked, so the if > block can be removed > 3. if (isNewMode) is always invoked, so the if statement can be removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10630) Update migration tests for Flink 1.6
Till Rohrmann created FLINK-10630: - Summary: Update migration tests for Flink 1.6 Key: FLINK-10630 URL: https://issues.apache.org/jira/browse/FLINK-10630 Project: Flink Issue Type: Task Components: Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 Similar to FLINK-10084, we need to update the migration tests for the latest Flink version. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10631) Update jepsen tests to run with multiple slots
Till Rohrmann created FLINK-10631: - Summary: Update jepsen tests to run with multiple slots Key: FLINK-10631 URL: https://issues.apache.org/jira/browse/FLINK-10631 Project: Flink Issue Type: Task Components: Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 After fixing FLINK-9455, we should update the Jepsen tests to run with multiple slots per {{TaskExecutor}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10632) Run general purpose test job with failures in per-job mode
Till Rohrmann created FLINK-10632: - Summary: Run general purpose test job with failures in per-job mode Key: FLINK-10632 URL: https://issues.apache.org/jira/browse/FLINK-10632 Project: Flink Issue Type: Sub-task Components: E2E Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 Similar to FLINK-8973, we should add an end-to-end test which runs the general datastream job with failures on a per-job cluster with HA enabled (either directly the {{StandaloneJobClusterEntrypoint}} or a docker image based on this entrypoint). We should kill the TMs as well as the cluster entrypoint and verify that the job recovers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10633) End-to-end test: Prometheus reporter
Till Rohrmann created FLINK-10633: - Summary: End-to-end test: Prometheus reporter Key: FLINK-10633 URL: https://issues.apache.org/jira/browse/FLINK-10633 Project: Flink Issue Type: Sub-task Components: E2E Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 Add an end-to-end test using the {{PrometheusReporter}} to verify that all metrics are properly reported. Additionally verify that the newly introduce RocksDB metrics are accessible (see FLINK-10423). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10634) End-to-end test: Metrics accessible via REST API
Till Rohrmann created FLINK-10634: - Summary: End-to-end test: Metrics accessible via REST API Key: FLINK-10634 URL: https://issues.apache.org/jira/browse/FLINK-10634 Project: Flink Issue Type: Sub-task Components: E2E Tests, Metrics, REST Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 Verify that Flink's metrics can be accessed via the REST API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10635) End-to-end test: Take & resume from savepoint when using per-job mode
Till Rohrmann created FLINK-10635: - Summary: End-to-end test: Take & resume from savepoint when using per-job mode Key: FLINK-10635 URL: https://issues.apache.org/jira/browse/FLINK-10635 Project: Flink Issue Type: Sub-task Components: E2E Tests, Tests Affects Versions: 1.7.0 Reporter: Till Rohrmann Fix For: 1.7.0 Cancel with savepoint a job running in per-job mode (using the {{StandaloneJobClusterEntrypoint}}) and resume from this savepoint. This effectively verifies that FLINK-10309 has been properly fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-10624) Extend SQL client end-to-end to test new KafkaTableSink
[ https://issues.apache.org/jira/browse/FLINK-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang reassigned FLINK-10624: Assignee: vinoyang > Extend SQL client end-to-end to test new KafkaTableSink > --- > > Key: FLINK-10624 > URL: https://issues.apache.org/jira/browse/FLINK-10624 > Project: Flink > Issue Type: Task > Components: Kafka Connector, Table API & SQL, Tests >Affects Versions: 1.7.0 >Reporter: Till Rohrmann >Assignee: vinoyang >Priority: Critical > Fix For: 1.7.0 > > > With FLINK-9697, we added support for Kafka 2.0. We should also extend the > existing streaming client end-to-end test to also test the new > {{KafkaTableSink}} against Kafka 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10624) Extend SQL client end-to-end to test new KafkaTableSink
[ https://issues.apache.org/jira/browse/FLINK-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658464#comment-16658464 ] vinoyang commented on FLINK-10624: -- I created an umbrella issue, FLINK-10598, to collect the issues related to the modern Kafka connector. Do you agree with placing this issue under FLINK-10598? > Extend SQL client end-to-end to test new KafkaTableSink > --- > > Key: FLINK-10624 > URL: https://issues.apache.org/jira/browse/FLINK-10624 > Project: Flink > Issue Type: Task > Components: Kafka Connector, Table API & SQL, Tests >Affects Versions: 1.7.0 >Reporter: Till Rohrmann >Assignee: vinoyang >Priority: Critical > Fix For: 1.7.0 > > > With FLINK-9697, we added support for Kafka 2.0. We should also extend the > existing streaming client end-to-end test to also test the new > {{KafkaTableSink}} against Kafka 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-10630) Update migration tests for Flink 1.6
[ https://issues.apache.org/jira/browse/FLINK-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang reassigned FLINK-10630: Assignee: vinoyang > Update migration tests for Flink 1.6 > > > Key: FLINK-10630 > URL: https://issues.apache.org/jira/browse/FLINK-10630 > Project: Flink > Issue Type: Task > Components: Tests >Affects Versions: 1.7.0 >Reporter: Till Rohrmann >Assignee: vinoyang >Priority: Critical > Fix For: 1.7.0 > > > Similar to FLINK-10084, we need to update the migration tests for the latest > Flink version. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10603) Reduce kafka test duration
[ https://issues.apache.org/jira/browse/FLINK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658470#comment-16658470 ] ASF GitHub Bot commented on FLINK-10603: yanghua commented on issue #6890: [FLINK-10603] Reduce kafka test duration URL: https://github.com/apache/flink/pull/6890#issuecomment-431724328 I don't know why, but it seems that I can't update the PR and trigger Travis from China at the moment. I don't know how long it will take until GitHub works again, but from the results of my local tests, the correct configuration seems to be: ``` properties.setProperty("default.api.timeout.ms", "3000"); // KafkaProducer#partitionsFor ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Reduce kafka test duration > -- > > Key: FLINK-10603 > URL: https://issues.apache.org/jira/browse/FLINK-10603 > Project: Flink > Issue Type: Sub-task > Components: Kafka Connector, Tests >Affects Versions: 1.7.0 >Reporter: Chesnay Schepler >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > > The tests for the modern kafka connector take more than 10 minutes which is > simply unacceptable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] yanghua commented on issue #6890: [FLINK-10603] Reduce kafka test duration
yanghua commented on issue #6890: [FLINK-10603] Reduce kafka test duration URL: https://github.com/apache/flink/pull/6890#issuecomment-431724328 I don't know why, but it seems that I can't update the PR and trigger Travis from China at the moment. I don't know how long it will take until GitHub works again, but from the results of my local tests, the correct configuration seems to be: ``` properties.setProperty("default.api.timeout.ms", "3000"); // KafkaProducer#partitionsFor ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Assigned] (FLINK-10600) Provide End-to-end test cases for modern Kafka connectors
[ https://issues.apache.org/jira/browse/FLINK-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang reassigned FLINK-10600: Assignee: vinoyang > Provide End-to-end test cases for modern Kafka connectors > - > > Key: FLINK-10600 > URL: https://issues.apache.org/jira/browse/FLINK-10600 > Project: Flink > Issue Type: Sub-task > Components: Kafka Connector >Reporter: vinoyang >Assignee: vinoyang >Priority: Blocker > Fix For: 1.7.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10636) ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
dalongliu created FLINK-10636: - Summary: ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed Key: FLINK-10636 URL: https://issues.apache.org/jira/browse/FLINK-10636 Project: Flink Issue Type: Bug Components: Kafka Connector Affects Versions: 1.6.1 Reporter: dalongliu Hi, all, when I upgraded the Flink version from 1.3.2 to 1.6.1, the version of FlinkKafkaConsumer is 0.8. After starting my job, I encountered the following problems: 2018-10-22 09:54:42,987 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-8615545579287424864.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 2018-10-22 09:54:42,993 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.208.75.87/10.208.75.87:2181 2018-10-22 09:54:42,993 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.208.75.85/10.208.75.85:2181 2018-10-22 09:54:42,993 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.208.75.85/10.208.75.85:2181 2018-10-22 09:54:42,994 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.208.75.87/10.208.75.87:2181 2018-10-22 09:54:42,994 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed 2018-10-22 09:54:42,997 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed 2018-10-22 09:54:42,994 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 10.208.75.85/10.208.75.85:2181, initiating session -- This message was sent by Atlassian JIRA (v7.6.3#76005)
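For context on the error in this report: the ZooKeeper client looks up a JAAS login section named 'Client', and the generated /tmp/jaas-*.conf evidently does not contain one. The sketch below is purely illustrative; the file path, login module, and credentials are assumptions, not values taken from the report.

```java
public class JaasClientSetup {
    public static void main(String[] args) {
        // Contents of an illustrative /etc/flink/jaas.conf; the section name
        // 'Client' is the one the ZooKeeper client expects by default:
        //
        //   Client {
        //     org.apache.zookeeper.server.auth.DigestLoginModule required
        //     username="flink"
        //     password="secret";
        //   };
        //
        // Point the JVM at the file before the ZooKeeper/Curator client starts:
        System.setProperty("java.security.auth.login.config", "/etc/flink/jaas.conf");
    }
}
```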
[jira] [Commented] (FLINK-10491) Deadlock during spilling data in SpillableSubpartition
[ https://issues.apache.org/jira/browse/FLINK-10491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658496#comment-16658496 ] zhijiang commented on FLINK-10491: -- [~till.rohrmann], I think it should be covered in Flink 1.7 if it can catch up with the feature freeze deadline. > Deadlock during spilling data in SpillableSubpartition > --- > > Key: FLINK-10491 > URL: https://issues.apache.org/jira/browse/FLINK-10491 > Project: Flink > Issue Type: Bug > Components: Network >Affects Versions: 1.5.4, 1.6.1 >Reporter: Piotr Nowojski >Assignee: zhijiang >Priority: Critical > Labels: pull-request-available > Fix For: 1.5.6, 1.6.3, 1.7.0 > > > Originally reported here: > [https://lists.apache.org/thread.html/472c8f4a2711c5e217fadd9a88f8c73670218e7432bb81ba3f5076db@%3Cuser.flink.apache.org%3E] > Thread dump (from 1.5.3 version) showing two deadlocked threads, because they > are taking two locks in different order: > {noformat} > Thread-1 > "DataSink (DATA#HadoopFileOutputFormat ) (1/2)@11002" prio=5 tid=0x3e2 nid=NA > waiting for monitor entry > waiting for Map (Key Extractor) (1/10)@9967 to release lock on <0x2dfb> (a > java.util.ArrayDeque) > at > org.apache.flink.runtime.io.network.partition.SpillableSubpartition.releaseMemory(SpillableSubpartition.java:223) > at > org.apache.flink.runtime.io.network.partition.ResultPartition.releaseMemory(ResultPartition.java:373) > at > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:355) > - locked <0x2dfd> (a java.util.ArrayDeque) > at > org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:402) > at > org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:203) > - locked <0x2da5> (a java.lang.Object) > at > org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:193) > at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.returnExclusiveSegments(SingleInputGate.java:318) > at > org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.releaseAllResources(RemoteInputChannel.java:259) > at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:578) > at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNextBufferOrEvent(SingleInputGate.java:507) > at > org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.waitAndGetNextInputGate(UnionInputGate.java:213) > at > org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.getNextBufferOrEvent(UnionInputGate.java:163) > at > org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:86) > at > org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) > at > org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73) > at > org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:216) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703) > at java.lang.Thread.run(Thread.java:745) > Thread-2 > "Map (Key Extractor) (1/10)@9967" prio=5 tid=0xaab nid=NA waiting for monitor > entry > java.lang.Thread.State: BLOCKED > blocks DataSink (DATA#HadoopFileOutputFormat ) (1/2)@11002 > waiting for DataSink (DATA#HadoopFileOutputFormat ) (1/2)@11002 to release > lock on <0x2dfd> (a java.util.ArrayDeque) > at > 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:261) > at > org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:171) > at > org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:106) > at > org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:146) > at > org.apache.flink.runtime.io.network.buffer.BufferConsumer.close(BufferConsumer.java:110) > at > org.apache.flink.runtime.io.network.partition.SpillableSubpartition.spillFinishedBufferConsumers(SpillableSubpartition.java:271) > at > org.apache.flink.runtime.io.network.partition.SpillableSubpartition.add(SpillableSubpartition.java:117) > - locked <0x2dfb> (a java.util.ArrayDeque) > at > org.apache.flink.runtime.io.network.partition.SpillableSubpartition.add(SpillableSubpartition.java:96) > - locked <0x2dfc> (a > org.apache.flink.runtime.io.network.partition.SpillableSubpartition) > at > org.apache.flink.runtime.io.network.partition.ResultPartition.addBufferConsumer(ResultPartition.java:255) > at > org.apache.flink.runtime.io.network.api.writer.Recor
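The two stack traces above show a classic lock-ordering inversion: the sink thread holds the buffer pool's lock <0x2dfd> and waits for the subpartition's queue lock <0x2dfb>, while the mapper thread holds <0x2dfb> and waits for <0x2dfd>. Below is a minimal, self-contained Java sketch of that pattern (illustrative only, not Flink code); the usual remedies are a single global lock order or releasing one monitor before acquiring the other.
{code:java}
// Standalone repro of the lock-ordering inversion in the dump above:
// "sink" takes pool -> queue, "mapper" takes queue -> pool.
public class LockOrderDeadlock {
    private static final Object bufferQueue = new Object(); // plays <0x2dfb>
    private static final Object bufferPool = new Object();  // plays <0x2dfd>

    public static void main(String[] args) {
        Thread sink = new Thread(() -> {
            synchronized (bufferPool) {      // like LocalBufferPool.setNumBuffers
                sleep(100);
                synchronized (bufferQueue) { // like SpillableSubpartition.releaseMemory
                    System.out.println("sink done");
                }
            }
        });
        Thread mapper = new Thread(() -> {
            synchronized (bufferQueue) {     // like SpillableSubpartition.add
                sleep(100);
                synchronized (bufferPool) {  // like LocalBufferPool.recycle
                    System.out.println("mapper done");
                }
            }
        });
        sink.start();
        mapper.start(); // with the sleeps, both threads block forever
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}
{code}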
[GitHub] isunjin commented on issue #6684: [FLINK-10205] Batch Job: InputSplit Fault tolerant for DataSource…
isunjin commented on issue #6684: [FLINK-10205] Batch Job: InputSplit Fault tolerant for DataSource… URL: https://github.com/apache/flink/pull/6684#issuecomment-431734239 Great discussion, thanks everybody. @wenlong88, the scenario you mention is exactly what I am trying to fix. [here](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) is a concrete example: a simple word-count job produces inconsistent data on failover; the job should fail, but it instead succeeds with zero output. @tillrohrmann, an **_InputSplitAssigner_** generates a list of _**InputSplit**_s; the order might not matter, but every split should be processed exactly once. If a task fails while processing an _**InputSplit**_, that _**InputSplit**_ should be processed again; in the batch scenario, however, this might not hold. [this](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) repro shows that the current codebase lacks this logic and therefore has a data-inconsistency issue. It is not a problem in the streaming scenario, because there an _**InputSplit**_ is treated as a record; e.g., _**ContinuousFileMonitoringFunction**_ collects _**InputSplit**_s, and Flink guarantees that each _**InputSplit**_ is processed exactly once. @wenlong88, will this work in your scenario? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
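To make the "processed again" requirement concrete, here is a hypothetical sketch (the names are illustrative, not Flink's actual InputSplitAssigner API) of a tracker that hands each split out once and re-queues the in-flight split of a failed task, so a failed split is retried instead of silently dropped.
{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical split tracker: every split is eventually processed,
// because a failed task's in-flight split goes back into the queue.
class ReassigningSplitTracker<S> {
    private final Queue<S> unassigned = new ArrayDeque<>();
    private final Map<Integer, S> inFlight = new HashMap<>(); // taskId -> split

    ReassigningSplitTracker(Iterable<S> splits) {
        splits.forEach(unassigned::add);
    }

    // A task asks for its next split; null means "no more input".
    synchronized S nextSplit(int taskId) {
        S split = unassigned.poll();
        if (split != null) {
            inFlight.put(taskId, split);
        }
        return split;
    }

    // On failure, return the in-flight split so a restarted task gets it again.
    synchronized void onTaskFailed(int taskId) {
        S split = inFlight.remove(taskId);
        if (split != null) {
            unassigned.add(split);
        }
    }

    // On success, simply forget the in-flight split.
    synchronized void onSplitCompleted(int taskId) {
        inFlight.remove(taskId);
    }
}
{code}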
[jira] [Commented] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask
[ https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658516#comment-16658516 ] ASF GitHub Bot commented on FLINK-10205: isunjin commented on issue #6684: [FLINK-10205] Batch Job: InputSplit Fault tolerant for DataSource… URL: https://github.com/apache/flink/pull/6684#issuecomment-431734239 Great discussion, thanks everybody. @wenlong88, the scenario you mention is exactly what I am trying to fix. [here](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) is a concrete example: a simple word-count job produces inconsistent data on failover; the job should fail, but it instead succeeds with zero output. @tillrohrmann, an **_InputSplitAssigner_** generates a list of _**InputSplit**_s; the order might not matter, but every split should be processed exactly once. If a task fails while processing an _**InputSplit**_, that _**InputSplit**_ should be processed again; in the batch scenario, however, this might not hold. [this](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) repro shows that the current codebase lacks this logic and therefore has a data-inconsistency issue. It is not a problem in the streaming scenario, because there an _**InputSplit**_ is treated as a record; e.g., _**ContinuousFileMonitoringFunction**_ collects _**InputSplit**_s, and Flink guarantees that each _**InputSplit**_ is processed exactly once. @wenlong88, will this work in your scenario? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Batch Job: InputSplit Fault tolerant for DataSourceTask > --- > > Key: FLINK-10205 > URL: https://issues.apache.org/jira/browse/FLINK-10205 > Project: Flink > Issue Type: Sub-task > Components: JobManager >Affects Versions: 1.6.1, 1.6.2, 1.7.0 >Reporter: JIN SUN >Assignee: JIN SUN >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Today the DataSourceTask pulls InputSplits from the JobManager to achieve better > performance; however, when a DataSourceTask fails and is rerun, it will not get > the same splits as its previous attempt. This can introduce inconsistent results or even data corruption. > Furthermore, if two executions run at the same time (in the batch > scenario), these two executions should process the same splits. > We need to fix this issue to make the inputs of a DataSourceTask > deterministic. The proposal is to save all splits into the ExecutionVertex, and the > DataSourceTask will pull splits from there. > document: > [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] isunjin commented on issue #6684: [FLINK-10205] Batch Job: InputSplit Fault tolerant for DataSource…
isunjin commented on issue #6684: [FLINK-10205] Batch Job: InputSplit Fault tolerant for DataSource… URL: https://github.com/apache/flink/pull/6684#issuecomment-431735667 Great discussion, thanks everybody. @wenlong88, the scenario you mention is exactly what I am trying to fix. [here](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) is a concrete example: a simple word-count job produces inconsistent data on failover; the job should fail, but it instead succeeds with zero output. @tillrohrmann, an **_InputSplitAssigner_** generates a list of _**InputSplit**_s; the order might not matter, but every split should be processed exactly once. If a task fails to process an _**InputSplit**_, that _**InputSplit**_ should be processed again; in the batch scenario, however, this might not hold. The _**DataSourceTask**_ calls the _InputSplitAssigner_ to obtain the next _**InputSplit**_, and depending on the implementation of the _InputSplitAssigner_, a failed _**InputSplit**_ might be discarded. [this](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) repro shows that _**LocatableInputSplitAssigner**_ discards failed _**InputSplit**_s and therefore has a data-inconsistency issue. It is not a problem in the streaming scenario, because there an _**InputSplit**_ is treated as a record; e.g., _**ContinuousFileMonitoringFunction**_ collects _**InputSplit**_s, and Flink guarantees that each _**InputSplit**_ is processed exactly once. @wenlong88, will this work in your scenario? @tillrohrmann, since this is a bug, should we hotfix it rather than wait for the new feature to become available? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
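The failure mode described above boils down to a pull-once assigner: each split is handed out at most once, so a restarted task simply sees "no more input" and the job finishes cleanly with zero output. A hypothetical, self-contained sketch of that behavior (illustrative names, not the actual LocatableInputSplitAssigner):
{code:java}
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

// A pull-once assigner never re-issues a split, even if the task
// that pulled it failed before producing any output.
class PullOnceAssigner {
    private final Queue<String> splits = new ArrayDeque<>();

    PullOnceAssigner(Iterable<String> allSplits) {
        allSplits.forEach(splits::add);
    }

    synchronized String getNextInputSplit() {
        return splits.poll(); // a split is handed out at most once
    }

    public static void main(String[] args) {
        PullOnceAssigner assigner = new PullOnceAssigner(Arrays.asList("split-0"));
        System.out.println("attempt 1 got: " + assigner.getNextInputSplit()); // split-0
        // ... the task fails before emitting any records ...
        System.out.println("attempt 2 got: " + assigner.getNextInputSplit()); // null
        // The restarted task sees no input, finishes "successfully",
        // and the job ends with zero output instead of failing.
    }
}
{code}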
[jira] [Commented] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask
[ https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658535#comment-16658535 ] ASF GitHub Bot commented on FLINK-10205: isunjin commented on issue #6684: [FLINK-10205] Batch Job: InputSplit Fault tolerant for DataSource… URL: https://github.com/apache/flink/pull/6684#issuecomment-431735667 Great discussion, thanks everybody. @wenlong88, the scenario you mention is exactly what I am trying to fix. [here](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) is a concrete example: a simple word-count job produces inconsistent data on failover; the job should fail, but it instead succeeds with zero output. @tillrohrmann, an **_InputSplitAssigner_** generates a list of _**InputSplit**_s; the order might not matter, but every split should be processed exactly once. If a task fails to process an _**InputSplit**_, that _**InputSplit**_ should be processed again; in the batch scenario, however, this might not hold. The _**DataSourceTask**_ calls the _InputSplitAssigner_ to obtain the next _**InputSplit**_, and depending on the implementation of the _InputSplitAssigner_, a failed _**InputSplit**_ might be discarded. [this](https://github.com/isunjin/flink/commit/b61b58d963ea11d34e2eb7ec6f4fe4bfed4dca4a) repro shows that _**LocatableInputSplitAssigner**_ discards failed _**InputSplit**_s and therefore has a data-inconsistency issue. It is not a problem in the streaming scenario, because there an _**InputSplit**_ is treated as a record; e.g., _**ContinuousFileMonitoringFunction**_ collects _**InputSplit**_s, and Flink guarantees that each _**InputSplit**_ is processed exactly once. @wenlong88, will this work in your scenario? @tillrohrmann, since this is a bug, should we hotfix it rather than wait for the new feature to become available? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Batch Job: InputSplit Fault tolerant for DataSourceTask > --- > > Key: FLINK-10205 > URL: https://issues.apache.org/jira/browse/FLINK-10205 > Project: Flink > Issue Type: Sub-task > Components: JobManager >Affects Versions: 1.6.1, 1.6.2, 1.7.0 >Reporter: JIN SUN >Assignee: JIN SUN >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Today the DataSourceTask pulls InputSplits from the JobManager to achieve better > performance; however, when a DataSourceTask fails and is rerun, it will not get > the same splits as its previous attempt. This can introduce inconsistent results or even data corruption. > Furthermore, if two executions run at the same time (in the batch > scenario), these two executions should process the same splits. > We need to fix this issue to make the inputs of a DataSourceTask > deterministic. The proposal is to save all splits into the ExecutionVertex, and the > DataSourceTask will pull splits from there. > document: > [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-9761) Potential buffer leak in PartitionRequestClientHandler during job failures
[ https://issues.apache.org/jira/browse/FLINK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658543#comment-16658543 ] zhijiang edited comment on FLINK-9761 at 10/22/18 3:56 AM: --- I just quickly reviewed the related code and think this is still a problem, which exists only in non-credit-based mode. When {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} is called by the canceler thread while a {{stagedBufferResponse}} exists, we directly set {{stagedBufferResponse = null}}, so there is no chance to consume and release this netty message any more, resulting in a leak. Even if {{stagedMessages}} is not empty, the {{stagedMessagesHandler}} would only consume and release the messages in the {{stagedMessages}} list; it will not consume and release the {{stagedBufferResponse}} first. So I think there is still a logic problem here. [~NicoK], could you double-check whether I guessed the above issue correctly? was (Author: zjwang): I just quickly reviewed the related codes and think this is still a problem which exists only in non-credit-based mode. When {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} is called by canceler thread, and the {{stagedBufferResponse}} is not currently. But we directly set {{stagedBufferResponse = null}}, so it has no chance to consume and release this netty message any more resulting in leak issue. Even though the {{stageMessages}} is not empty, the {{stagedMessageHandler}} would only consume and release the messages in this {{stageMessages}} list, and it will not consume and release {{stagedBufferResponse}} firstly. So it still has logic problem I think. Maybe need [~NicoK] double check if I guessed the above issue correctly. > Potential buffer leak in PartitionRequestClientHandler during job failures > -- > > Key: FLINK-9761 > URL: https://issues.apache.org/jira/browse/FLINK-9761 > Project: Flink > Issue Type: Bug > Components: Network >Affects Versions: 1.5.0 >Reporter: Nico Kruber >Assignee: Nico Kruber >Priority: Critical > Fix For: 1.5.6, 1.6.3, 1.7.0 > > > {{PartitionRequestClientHandler#stagedMessages}} may be accessed from > multiple threads: > 1) Netty's IO thread > 2) During cancellation, > {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} is > called > If {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} > thinks {{stagedMessages}} is empty, however, it will not install the > {{stagedMessagesHandler}} that consumes and releases buffers from received > messages. > Unless some unexpected combination of code calls prevents this from > happening, this would leak the non-recycled buffers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
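A minimal sketch of the release-before-clear pattern the comment above argues is missing (illustrative types, not the actual PartitionRequestClientHandler code): the staged response has to be released before the reference is dropped, otherwise its netty buffer is orphaned and leaks.
{code:java}
// Illustrative holder: releasing the staged response before clearing the
// field is what prevents the leak described above.
final class StagedResponseHolder {
    interface RefCounted {
        void release(); // e.g. decrements a netty ByteBuf reference count
    }

    private RefCounted stagedBufferResponse;

    synchronized void stage(RefCounted response) {
        this.stagedBufferResponse = response;
    }

    // Called by the canceler thread on buffer destruction.
    synchronized void notifyBufferDestroyed() {
        if (stagedBufferResponse != null) {
            // Setting the field to null without this call would leave the
            // message unconsumed and its buffer never recycled.
            stagedBufferResponse.release();
            stagedBufferResponse = null;
        }
    }
}
{code}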
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-431737413 @fhueske Thanks for your patient review; it was very helpful in making the PR more readable and polished. I have resolved your comments. Please take another look at your convenience. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-7243) Add ParquetInputFormat
[ https://issues.apache.org/jira/browse/FLINK-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658549#comment-16658549 ] ASF GitHub Bot commented on FLINK-7243: --- HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-431737413 @fhueske Thanks for your patient review; it was very helpful in making the PR more readable and polished. I have resolved your comments. Please take another look at your convenience. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add ParquetInputFormat > -- > > Key: FLINK-7243 > URL: https://issues.apache.org/jira/browse/FLINK-7243 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: godfrey he >Assignee: Zhenqiu Huang >Priority: Major > Labels: pull-request-available > > Add a {{ParquetInputFormat}} to read data from an Apache Parquet file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] HuangZhenQiu closed pull request #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu closed pull request #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/.gitignore b/.gitignore index 20749c24242..fdf7bedfb26 100644 --- a/.gitignore +++ b/.gitignore @@ -19,6 +19,7 @@ tmp build-target flink-end-to-end-tests/flink-datastream-allround-test/src/main/java/org/apache/flink/streaming/tests/avro/ flink-formats/flink-avro/src/test/java/org/apache/flink/formats/avro/generated/ +flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/generated/ flink-runtime-web/web-dashboard/assets/fonts/ flink-runtime-web/web-dashboard/node_modules/ flink-runtime-web/web-dashboard/bower_components/ diff --git a/flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java b/flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java index 14cf647cd24..4177af72318 100644 --- a/flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java +++ b/flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java @@ -233,6 +233,16 @@ protected static String extractFileExtension(String fileName) { */ protected boolean enumerateNestedFiles = false; + /** +* The flag to specify whether to skip file splits with wrong schema. +*/ + protected boolean skipWrongSchemaFileSplit = false; + + /** +* The flag to specify whether to skip corrupted record. +*/ + protected boolean skipCorruptedRecord = true; + /** * Files filter for determining what files/directories should be included. */ @@ -463,6 +473,14 @@ public void configure(Configuration parameters) { if (!this.enumerateNestedFiles) { this.enumerateNestedFiles = parameters.getBoolean(ENUMERATE_NESTED_FILES_FLAG, false); } + + if (!this.skipWrongSchemaFileSplit) { + this.skipWrongSchemaFileSplit = parameters.getBoolean(SKIP_WRONG_SCHEMA_SPLITS, false); + } + + if (this.skipCorruptedRecord) { + this.skipCorruptedRecord = parameters.getBoolean(SKIP_CORRUPTED_RECORD, true); + } } /** @@ -1077,4 +1095,14 @@ private void abortWait() { * The config parameter which defines whether input directories are recursively traversed. */ public static final String ENUMERATE_NESTED_FILES_FLAG = "recursive.file.enumeration"; + + /** +* The config parameter which defines whether to skip file split with wrong schema. +*/ + public static final String SKIP_WRONG_SCHEMA_SPLITS = "skip.splits.wrong.schema"; + + /** +* The config parameter which defines whether to skip corrupted record. +*/ + public static final String SKIP_CORRUPTED_RECORD = "skip.corrupted.record"; } diff --git a/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetInputFormat.java b/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetInputFormat.java new file mode 100644 index 000..dab1899a1ce --- /dev/null +++ b/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetInputFormat.java @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.formats.parquet; + +import org.apache.flink.api.common.io.CheckpointableInputFormat; +import org.apache.flink.api.common.io.FileInputFormat; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.api.java.typeutils.RowTypeInfo; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.core.fs.FileInputSplit; +import org.apache.flink.core.fs.Path; +import org.apache.flink.formats.parquet.utils.ParquetRecordR
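For reference, the two new flags introduced in the diff above are ordinary configuration parameters read by FileInputFormat#configure. A short sketch of passing them to an input format (the concrete Parquet class is truncated in the excerpt, so this only assumes flink-core and the keys shown in the diff):
{code:java}
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.configuration.Configuration;

public class SkipFlagsExample {
    public static void configureSkipFlags(FileInputFormat<?> format) {
        Configuration parameters = new Configuration();
        // Default is false: a split whose schema does not match is an error
        // unless this flag is set.
        parameters.setBoolean("skip.splits.wrong.schema", true);
        // Default is true: corrupted records are skipped; set to false to
        // fail fast instead.
        parameters.setBoolean("skip.corrupted.record", false);
        // FileInputFormat#configure reads both keys, per the diff above.
        format.configure(parameters);
    }
}
{code}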
[GitHub] HuangZhenQiu removed a comment on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu removed a comment on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-431737413 @fhueske Thanks for your patient review. It is pretty helpful to make the PR more readable and flawless. Resolved your comments. Please read one more round at your most convenient time. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-7243) Add ParquetInputFormat
[ https://issues.apache.org/jira/browse/FLINK-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658562#comment-16658562 ] ASF GitHub Bot commented on FLINK-7243: --- HuangZhenQiu removed a comment on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-431737413 @fhueske Thanks for your patient review. It is pretty helpful to make the PR more readable and flawless. Resolved your comments. Please read one more round at your most convenient time. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add ParquetInputFormat > -- > > Key: FLINK-7243 > URL: https://issues.apache.org/jira/browse/FLINK-7243 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: godfrey he >Assignee: Zhenqiu Huang >Priority: Major > Labels: pull-request-available > > Add a {{ParquetInputFormat}} to read data from a Apache Parquet file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-431737854 @fhueske Thank you for your patient review; it has been very helpful in making the PR more readable and polished. I have resolved your comments. Please take another look at your convenience. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-7243) Add ParquetInputFormat
[ https://issues.apache.org/jira/browse/FLINK-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658563#comment-16658563 ] ASF GitHub Bot commented on FLINK-7243: --- HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-431737854 @fhueske Thank you for your patient review; it has been very helpful in making the PR more readable and polished. I have resolved your comments. Please take another look at your convenience. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add ParquetInputFormat > -- > > Key: FLINK-7243 > URL: https://issues.apache.org/jira/browse/FLINK-7243 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: godfrey he >Assignee: Zhenqiu Huang >Priority: Major > Labels: pull-request-available > > Add a {{ParquetInputFormat}} to read data from an Apache Parquet file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] HuangZhenQiu opened a new pull request #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu opened a new pull request #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483 ## What is the purpose of the change This pull request creates a ParquetInputFormat so that Flink can process files in the Parquet format. Parquet records can be read as Row, POJO, or generic Map. ## Brief change log - *Add a schema converter from Parquet types to Flink internal types* - *Add ParquetRecordReader to read Parquet records as Row* - *Add ParquetInputFormat that can be extended to convert Row to POJO or generic Map* ## Verifying this change This change is already covered by existing tests, such as ParquetRecordReaderTest, ParquetSchemaConverterTest and ParquetInputFormatTest. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (yes) - If yes, how is the feature documented? (JavaDocs) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-7243) Add ParquetInputFormat
[ https://issues.apache.org/jira/browse/FLINK-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658564#comment-16658564 ] ASF GitHub Bot commented on FLINK-7243: --- HuangZhenQiu opened a new pull request #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483 ## What is the purpose of the change This pull request creates a ParquetInputFormat so that Flink can process files in the Parquet format. Parquet records can be read as Row, POJO, or generic Map. ## Brief change log - *Add a schema converter from Parquet types to Flink internal types* - *Add ParquetRecordReader to read Parquet records as Row* - *Add ParquetInputFormat that can be extended to convert Row to POJO or generic Map* ## Verifying this change This change is already covered by existing tests, such as ParquetRecordReaderTest, ParquetSchemaConverterTest and ParquetInputFormatTest. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (yes) - If yes, how is the feature documented? (JavaDocs) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add ParquetInputFormat > -- > > Key: FLINK-7243 > URL: https://issues.apache.org/jira/browse/FLINK-7243 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: godfrey he >Assignee: Zhenqiu Huang >Priority: Major > Labels: pull-request-available > > Add a {{ParquetInputFormat}} to read data from an Apache Parquet file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
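To make the intended usage concrete, here is a minimal sketch of reading Parquet records as Flink Rows with an input format of this kind. The class name `ParquetRowInputFormat` and its (Path, MessageType) constructor are illustrative assumptions, not the PR's confirmed API; `MessageTypeParser` is a real part of parquet-mr.
{code:java}
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.Path;
import org.apache.flink.types.Row;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetRowReadSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Parquet schema of the input file; MessageTypeParser ships with parquet-mr.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message Event { required binary name (UTF8); required int64 ts; }");

        // Hypothetical Row-producing input format in the spirit of this PR;
        // the class name and constructor shape are illustrative assumptions.
        DataSet<Row> rows = env.createInput(
            new ParquetRowInputFormat(new Path("file:///tmp/events.parquet"), schema));

        rows.first(10).print();
    }
}
{code}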
[GitHub] zhijiangW opened a new pull request #6891: [FLINK-10606][network][test] Construct NetworkEnvironment simple for tests
zhijiangW opened a new pull request #6891: [FLINK-10606][network][test] Construct NetworkEnvironment simple for tests URL: https://github.com/apache/flink/pull/6891 ## What is the purpose of the change *Implement a simple constructor for `NetworkEnvironment`, which can be widely used by existing tests to remove duplicate code.* ## Brief change log - *Define a new simple constructor for `NetworkEnvironment`* - *Modify the existing tests to use the new constructor* ## Verifying this change This change is already covered by existing tests. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-10606) Construct NetworkEnvironment simple for tests
[ https://issues.apache.org/jira/browse/FLINK-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658611#comment-16658611 ] ASF GitHub Bot commented on FLINK-10606: zhijiangW opened a new pull request #6891: [FLINK-10606][network][test] Construct NetworkEnvironment simple for tests URL: https://github.com/apache/flink/pull/6891 ## What is the purpose of the change *Implement a simple constructor for `NetworkEnvironment`, which can be widely used by existing tests to remove duplicate code.* ## Brief change log - *Define a new simple constructor for `NetworkEnvironment`* - *Modify the existing tests to use the new constructor* ## Verifying this change This change is already covered by existing tests. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Construct NetworkEnvironment simple for tests > - > > Key: FLINK-10606 > URL: https://issues.apache.org/jira/browse/FLINK-10606 > Project: Flink > Issue Type: Test > Components: Network, Tests >Affects Versions: 1.6.1, 1.7.0 >Reporter: zhijiang >Assignee: zhijiang >Priority: Minor > Labels: pull-request-available > > Currently, five tests create {{NetworkEnvironment}} in different places, > and most of the construction parameters are the same. > We can provide a simpler way of constructing the {{NetworkEnvironment}} with > fewer parameters to deduplicate the common code, which will also benefit > future tests. > We can define a static nested {{NetworkEnvironmentBuilder}} inside > {{NetworkEnvironment}}, or alternatively a simple constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10606) Construct NetworkEnvironment simple for tests
[ https://issues.apache.org/jira/browse/FLINK-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-10606: --- Labels: pull-request-available (was: ) > Construct NetworkEnvironment simple for tests > - > > Key: FLINK-10606 > URL: https://issues.apache.org/jira/browse/FLINK-10606 > Project: Flink > Issue Type: Test > Components: Network, Tests >Affects Versions: 1.6.1, 1.7.0 >Reporter: zhijiang >Assignee: zhijiang >Priority: Minor > Labels: pull-request-available > > Currently, five tests create {{NetworkEnvironment}} in different places, > and most of the construction parameters are the same. > We can provide a simpler way of constructing the {{NetworkEnvironment}} with > fewer parameters to deduplicate the common code, which will also benefit > future tests. > We can define a static nested {{NetworkEnvironmentBuilder}} inside > {{NetworkEnvironment}}, or alternatively a simple constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
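To make the builder option from the ticket concrete, the sketch below shows the shape such a static nested builder could take; the field names and test defaults are illustrative assumptions, not `NetworkEnvironment`'s actual constructor parameters.
{code:java}
// Sketch of the proposed test-friendly builder; names and defaults are assumptions.
public class NetworkEnvironment {

    private final int numNetworkBuffers;
    private final int networkBufferSize;
    private final int partitionRequestInitialBackoff;
    private final int partitionRequestMaxBackoff;

    private NetworkEnvironment(Builder builder) {
        this.numNetworkBuffers = builder.numNetworkBuffers;
        this.networkBufferSize = builder.networkBufferSize;
        this.partitionRequestInitialBackoff = builder.partitionRequestInitialBackoff;
        this.partitionRequestMaxBackoff = builder.partitionRequestMaxBackoff;
    }

    /** Static nested builder with defaults, so most tests can simply call build(). */
    public static class Builder {
        private int numNetworkBuffers = 1024;
        private int networkBufferSize = 32 * 1024;
        private int partitionRequestInitialBackoff = 0;
        private int partitionRequestMaxBackoff = 0;

        public Builder setNumNetworkBuffers(int numNetworkBuffers) {
            this.numNetworkBuffers = numNetworkBuffers;
            return this;
        }

        public Builder setNetworkBufferSize(int networkBufferSize) {
            this.networkBufferSize = networkBufferSize;
            return this;
        }

        public NetworkEnvironment build() {
            return new NetworkEnvironment(this);
        }
    }
}
{code}
A test would then write, e.g., `new NetworkEnvironment.Builder().setNumNetworkBuffers(2).build()` and override only the parameters it actually cares about.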
[GitHub] zhijiangW opened a new pull request #6892: [FLINK-10607][network][test] Unify to remove duplicated NoOpResultPartitionConsumableNotifier
zhijiangW opened a new pull request #6892: [FLINK-10607][network][test] Unify to remove duplicated NoOpResultPartitionConsumableNotifier URL: https://github.com/apache/flink/pull/6892 ## What is the purpose of the change *There are currently two identical `NoOpResultPartitionConsumableNotifier` implementations. We should retain only one and unify all usages in the related tests to avoid mocking.* ## Brief change log - *Keep only one `NoOpResultPartitionConsumableNotifier` for tests* - *Modify all the related tests to avoid mocking `ResultPartitionConsumableNotifier`* ## Verifying this change *This change is already covered by existing tests.* ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Updated] (FLINK-10607) Unify to remove duplicated NoOpResultPartitionConsumableNotifier
[ https://issues.apache.org/jira/browse/FLINK-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-10607: --- Labels: pull-request-available (was: ) > Unify to remove duplicated NoOpResultPartitionConsumableNotifier > > > Key: FLINK-10607 > URL: https://issues.apache.org/jira/browse/FLINK-10607 > Project: Flink > Issue Type: Test > Components: Network, Tests >Affects Versions: 1.6.1, 1.7.0 >Reporter: zhijiang >Assignee: zhijiang >Priority: Minor > Labels: pull-request-available > > Currently there exist two identical {{NoOpResultPartitionConsumableNotifier}} > implementations for different tests. We can deduplicate the common code and > make it public for unified use. This will also benefit future tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10607) Unify to remove duplicated NoOpResultPartitionConsumableNotifier
[ https://issues.apache.org/jira/browse/FLINK-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658662#comment-16658662 ] ASF GitHub Bot commented on FLINK-10607: zhijiangW opened a new pull request #6892: [FLINK-10607][network][test] Unify to remove duplicated NoOpResultPartitionConsumableNotifier URL: https://github.com/apache/flink/pull/6892 ## What is the purpose of the change *There are currently two identical `NoOpResultPartitionConsumableNotifier` implementations. We should retain only one and unify all usages in the related tests to avoid mocking.* ## Brief change log - *Keep only one `NoOpResultPartitionConsumableNotifier` for tests* - *Modify all the related tests to avoid mocking `ResultPartitionConsumableNotifier`* ## Verifying this change *This change is already covered by existing tests.* ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Unify to remove duplicated NoOpResultPartitionConsumableNotifier > > > Key: FLINK-10607 > URL: https://issues.apache.org/jira/browse/FLINK-10607 > Project: Flink > Issue Type: Test > Components: Network, Tests >Affects Versions: 1.6.1, 1.7.0 >Reporter: zhijiang >Assignee: zhijiang >Priority: Minor > Labels: pull-request-available > > Currently there exist two identical {{NoOpResultPartitionConsumableNotifier}} > implementations for different tests. We can deduplicate the common code and > make it public for unified use. This will also benefit future tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
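The pattern being unified here is simple enough to sketch. The interface below is deliberately simplified (the real `ResultPartitionConsumableNotifier` has a different signature), so treat the method shape as an assumption for illustration.
{code:java}
// Simplified stand-in for the real interface; the method signature is an assumption.
interface ResultPartitionConsumableNotifier {
    void notifyPartitionConsumable(String jobId, String partitionId);
}

/** One shared no-op implementation for tests, replacing per-test Mockito mocks. */
final class NoOpResultPartitionConsumableNotifier implements ResultPartitionConsumableNotifier {

    @Override
    public void notifyPartitionConsumable(String jobId, String partitionId) {
        // Intentionally empty: tests that don't assert on consumable
        // notifications can share this single instance instead of a mock.
    }
}
{code}
Compared with a mock, a shared no-op keeps the test's intent explicit and avoids Mockito's classloading and stubbing overhead.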
[GitHub] tzulitai commented on a change in pull request #6875: [FLINK-9808] [state backends] Migrate state when necessary in state backends
tzulitai commented on a change in pull request #6875: [FLINK-9808] [state backends] Migrate state when necessary in state backends URL: https://github.com/apache/flink/pull/6875#discussion_r226911417 ## File path: flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/AbstractRocksDBState.java ## @@ -154,6 +159,29 @@ public void setCurrentNamespace(N namespace) { return backend.db.get(columnFamily, tmpKeySerializationView.getCopyOfBuffer()); } + public byte[] migrateSerializedValue( + byte[] serializedOldValue, + TypeSerializer<V> priorSerializer, + TypeSerializer<V> newSerializer) throws StateMigrationException { + + try { + V value = priorSerializer.deserialize(new DataInputViewStreamWrapper(new ByteArrayInputStream(serializedOldValue))); + + ByteArrayOutputStream baos = new ByteArrayOutputStream(); + DataOutputViewStreamWrapper out = new DataOutputViewStreamWrapper(baos); Review comment: Good idea, will address this. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (FLINK-9808) Implement state conversion procedure in state backends
[ https://issues.apache.org/jira/browse/FLINK-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658664#comment-16658664 ] ASF GitHub Bot commented on FLINK-9808: --- tzulitai commented on a change in pull request #6875: [FLINK-9808] [state backends] Migrate state when necessary in state backends URL: https://github.com/apache/flink/pull/6875#discussion_r226911417 ## File path: flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/AbstractRocksDBState.java ## @@ -154,6 +159,29 @@ public void setCurrentNamespace(N namespace) { return backend.db.get(columnFamily, tmpKeySerializationView.getCopyOfBuffer()); } + public byte[] migrateSerializedValue( + byte[] serializedOldValue, + TypeSerializer<V> priorSerializer, + TypeSerializer<V> newSerializer) throws StateMigrationException { + + try { + V value = priorSerializer.deserialize(new DataInputViewStreamWrapper(new ByteArrayInputStream(serializedOldValue))); + + ByteArrayOutputStream baos = new ByteArrayOutputStream(); + DataOutputViewStreamWrapper out = new DataOutputViewStreamWrapper(baos); Review comment: Good idea, will address this. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement state conversion procedure in state backends > -- > > Key: FLINK-9808 > URL: https://issues.apache.org/jira/browse/FLINK-9808 > Project: Flink > Issue Type: Sub-task > Components: State Backends, Checkpointing >Reporter: Tzu-Li (Gordon) Tai >Assignee: Aljoscha Krettek >Priority: Critical > Labels: pull-request-available > Fix For: 1.7.0 > > > With FLINK-9377 in place, and with config snapshots serving as the single source > of truth for recreating restore serializers, the next step is to > utilize this when performing a full-pass state conversion (i.e., read with the > old / restore serializer, write with the new serializer). > For Flink's heap-based backends, state conversion > inherently happens, since all state is always deserialized after restore with > the restore serializer, and written with the new serializer on snapshots. > For the RocksDB state backend, since state is lazily deserialized, state > conversion needs to happen for each registered state on its first access if > the newly registered serializer has a different serialization schema than the > previous one. > This task should consist of three parts: > 1. Allow {{CompatibilityResult}} to correctly distinguish whether the > new serializer's schema is a) compatible with the serializer as it is, b) > compatible after the serializer has been reconfigured, or c) incompatible. > 2. Introduce state conversion procedures in the RocksDB state backend. This > should occur on the first state access. > 3. Make sure that all other backends no longer do redundant serializer > compatibility checks; that is not required because those backends always > perform full-pass state conversions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
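The essence of the full-pass conversion described above (read with the restore serializer, write with the new serializer) is the same per-value step that the quoted `migrateSerializedValue` hunk performs. Below is a minimal, self-contained sketch of that step; it uses Flink's real stream-wrapper utilities but is written as a standalone helper, not the PR's actual code.
{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataInputViewStreamWrapper;
import org.apache.flink.core.memory.DataOutputViewStreamWrapper;

/** Conceptual helper: full-pass conversion of a single serialized state value. */
final class StateValueConversion {

    static <V> byte[] convert(
            byte[] oldBytes,
            TypeSerializer<V> restoreSerializer,
            TypeSerializer<V> newSerializer) throws IOException {

        // 1) Read the value with the old / restore serializer.
        V value = restoreSerializer.deserialize(
            new DataInputViewStreamWrapper(new ByteArrayInputStream(oldBytes)));

        // 2) Write it back with the new serializer.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        newSerializer.serialize(value, new DataOutputViewStreamWrapper(baos));
        return baos.toByteArray();
    }
}
{code}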
[jira] [Commented] (FLINK-10634) End-to-end test: Metrics accessible via REST API
[ https://issues.apache.org/jira/browse/FLINK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658666#comment-16658666 ] Chesnay Schepler commented on FLINK-10634: -- I don't believe this test is necessary, as several E2E tests already use the metric system (both via reporters and the REST API). > End-to-end test: Metrics accessible via REST API > > > Key: FLINK-10634 > URL: https://issues.apache.org/jira/browse/FLINK-10634 > Project: Flink > Issue Type: Sub-task > Components: E2E Tests, Metrics, REST >Affects Versions: 1.7.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.7.0 > > > Verify that Flink's metrics can be accessed via the REST API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
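For context, such an E2E check boils down to an HTTP GET against the cluster's monitoring REST endpoint. A minimal probe, assuming a cluster reachable at localhost:8081 and the `/jobmanager/metrics` path of Flink's monitoring REST API, might look like this:
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RestMetricsProbe {

    public static void main(String[] args) throws Exception {
        // Assumption: a local cluster whose REST endpoint listens on port 8081.
        URL url = new URL("http://localhost:8081/jobmanager/metrics");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            // The endpoint is expected to return a JSON array of {"id": "<metric-name>"} entries.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
{code}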
[GitHub] tzulitai commented on a change in pull request #6875: [FLINK-9808] [state backends] Migrate state when necessary in state backends
tzulitai commented on a change in pull request #6875: [FLINK-9808] [state backends] Migrate state when necessary in state backends URL: https://github.com/apache/flink/pull/6875#discussion_r226912089 ## File path: flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBKeyedStateBackend.java ## @@ -1389,6 +1388,149 @@ private void copyStateDataHandleData( return Tuple2.of(stateInfo.f0, newMetaInfo); } + private RegisteredKeyValueStateBackendMetaInfo migrateStateIfNecessary( + StateDescriptor stateDesc, + TypeSerializer namespaceSerializer, + Tuple2 stateInfo, + @Nullable StateSnapshotTransformer snapshotTransformer) throws Exception { + + StateMetaInfoSnapshot restoredMetaInfoSnapshot = restoredKvStateMetaInfos.get(stateDesc.getName()); + + Preconditions.checkState( + restoredMetaInfoSnapshot != null, + "Requested to check compatibility of a restored RegisteredKeyedBackendStateMetaInfo," + + " but its corresponding restored snapshot cannot be found."); + + Preconditions.checkState(restoredMetaInfoSnapshot.getBackendStateType() + == StateMetaInfoSnapshot.BackendStateType.KEY_VALUE, + "Incompatible state types. " + + "Was [" + restoredMetaInfoSnapshot.getBackendStateType() + "], " + + "registered as [" + StateMetaInfoSnapshot.BackendStateType.KEY_VALUE + "]."); + + Preconditions.checkState( + Objects.equals(stateDesc.getName(), restoredMetaInfoSnapshot.getName()), + "Incompatible state names. " + + "Was [" + restoredMetaInfoSnapshot.getName() + "], " + + "registered with [" + stateDesc.getName() + "]."); + + final StateDescriptor.Type restoredType = + StateDescriptor.Type.valueOf( + restoredMetaInfoSnapshot.getOption( + StateMetaInfoSnapshot.CommonOptionsKeys.KEYED_STATE_TYPE)); + + if (!Objects.equals(stateDesc.getType(), StateDescriptor.Type.UNKNOWN) + && !Objects.equals(restoredType, StateDescriptor.Type.UNKNOWN)) { + + Preconditions.checkState( + stateDesc.getType() == restoredType, + "Incompatible key/value state types. " + + "Was [" + restoredType + "], " + + "registered with [" + stateDesc.getType() + "]."); + } + + TypeSerializer stateSerializer = stateDesc.getSerializer(); + + RegisteredKeyValueStateBackendMetaInfo newMetaInfo = new RegisteredKeyValueStateBackendMetaInfo<>( + stateDesc.getType(), + stateDesc.getName(), + namespaceSerializer, + stateSerializer, + snapshotTransformer); + + CompatibilityResult namespaceCompatibility = CompatibilityUtil.resolveCompatibilityResult( Review comment: Yes, this indeed is not nice. We actually have a newer replacement class called `TypeSerializerSchemaCompatibility`, but the use of `CompatibilityResult` cannot be completely avoided at the moment. I have been discussing with @dawidwys whether we should just remove `CompatibilityResult` entirely in favor of the newer replacement, but that would likely come as a separate commit / PR. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tzulitai commented on a change in pull request #6875: [FLINK-9808] [state backends] Migrate state when necessary in state backends
tzulitai commented on a change in pull request #6875: [FLINK-9808] [state backends] Migrate state when necessary in state backends URL: https://github.com/apache/flink/pull/6875#discussion_r226912034 ## File path: flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBKeyedStateBackend.java ## @@ -1389,6 +1388,149 @@ private void copyStateDataHandleData( return Tuple2.of(stateInfo.f0, newMetaInfo); } + private RegisteredKeyValueStateBackendMetaInfo migrateStateIfNecessary( + StateDescriptor stateDesc, + TypeSerializer namespaceSerializer, + Tuple2 stateInfo, + @Nullable StateSnapshotTransformer snapshotTransformer) throws Exception { + + StateMetaInfoSnapshot restoredMetaInfoSnapshot = restoredKvStateMetaInfos.get(stateDesc.getName()); + + Preconditions.checkState( + restoredMetaInfoSnapshot != null, + "Requested to check compatibility of a restored RegisteredKeyedBackendStateMetaInfo," + + " but its corresponding restored snapshot cannot be found."); + + Preconditions.checkState(restoredMetaInfoSnapshot.getBackendStateType() + == StateMetaInfoSnapshot.BackendStateType.KEY_VALUE, + "Incompatible state types. " + + "Was [" + restoredMetaInfoSnapshot.getBackendStateType() + "], " + + "registered as [" + StateMetaInfoSnapshot.BackendStateType.KEY_VALUE + "]."); + + Preconditions.checkState( + Objects.equals(stateDesc.getName(), restoredMetaInfoSnapshot.getName()), + "Incompatible state names. " + + "Was [" + restoredMetaInfoSnapshot.getName() + "], " + + "registered with [" + stateDesc.getName() + "]."); + + final StateDescriptor.Type restoredType = + StateDescriptor.Type.valueOf( + restoredMetaInfoSnapshot.getOption( + StateMetaInfoSnapshot.CommonOptionsKeys.KEYED_STATE_TYPE)); + + if (!Objects.equals(stateDesc.getType(), StateDescriptor.Type.UNKNOWN) + && !Objects.equals(restoredType, StateDescriptor.Type.UNKNOWN)) { + + Preconditions.checkState( + stateDesc.getType() == restoredType, + "Incompatible key/value state types. " + + "Was [" + restoredType + "], " + + "registered with [" + stateDesc.getType() + "]."); + } + + TypeSerializer stateSerializer = stateDesc.getSerializer(); + + RegisteredKeyValueStateBackendMetaInfo newMetaInfo = new RegisteredKeyValueStateBackendMetaInfo<>( + stateDesc.getType(), + stateDesc.getName(), + namespaceSerializer, + stateSerializer, + snapshotTransformer); + + CompatibilityResult namespaceCompatibility = CompatibilityUtil.resolveCompatibilityResult( Review comment: Yes, this indeed is not nice. We actually have a newer replacement class called `TypeSerializerSchemaCompatibility`, but the use of `CompatibilityResult` cannot be completely avoided at the moment. I have been discussing with @dawidwys whether we should just remove `CompatibilityResult` entirely in favor of the newer replacement, but that would likely come as a separate commit / PR. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
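For readers following the serializer-compatibility discussion: the three outcomes that FLINK-9808 asks the check to distinguish can be modeled as a small result type. The sketch below mirrors the ticket's a/b/c cases; it is not the actual `TypeSerializerSchemaCompatibility` API.
{code:java}
/** Sketch of a three-way schema-compatibility result, per FLINK-9808's a/b/c cases. */
public final class SchemaCompatibility {

    public enum Type {
        COMPATIBLE_AS_IS,                 // (a) new serializer can be used directly
        COMPATIBLE_AFTER_RECONFIGURATION, // (b) usable once the serializer is reconfigured
        INCOMPATIBLE                      // (c) state cannot be read with the new serializer
    }

    private final Type type;

    private SchemaCompatibility(Type type) {
        this.type = type;
    }

    public static SchemaCompatibility compatibleAsIs() {
        return new SchemaCompatibility(Type.COMPATIBLE_AS_IS);
    }

    public static SchemaCompatibility compatibleAfterReconfiguration() {
        return new SchemaCompatibility(Type.COMPATIBLE_AFTER_RECONFIGURATION);
    }

    public static SchemaCompatibility incompatible() {
        return new SchemaCompatibility(Type.INCOMPATIBLE);
    }

    public boolean requiresReconfiguration() {
        return type == Type.COMPATIBLE_AFTER_RECONFIGURATION;
    }

    public boolean isIncompatible() {
        return type == Type.INCOMPATIBLE;
    }
}
{code}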