[jira] [Created] (FLINK-4310) Move BinaryCompatibility Check plugin to relevant projects
Stephan Ewen created FLINK-4310:
-----------------------------------

             Summary: Move BinaryCompatibility Check plugin to relevant projects
                 Key: FLINK-4310
                 URL: https://issues.apache.org/jira/browse/FLINK-4310
             Project: Flink
          Issue Type: Improvement
          Components: Build System
    Affects Versions: 1.1.0
            Reporter: Stephan Ewen

The Maven plugin that checks binary compatibility is currently run for every project, rather than only the ones where we have public API classes. Since the plugin contributes to build instability in some cases, we can improve stability by running it only in the relevant projects.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
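For illustration, one way to scope such a check is to skip it by default and re-enable it only in modules that expose public API classes. A minimal sketch, assuming the check is done with the japicmp-maven-plugin; the property name and the module split are illustrative, not Flink's actual build setup:

{code:xml}
<!-- Root pom: configure the plugin once, skipped by default. -->
<properties>
  <japicmp.skip>true</japicmp.skip>
</properties>

<build>
  <plugins>
    <plugin>
      <groupId>com.github.siom79.japicmp</groupId>
      <artifactId>japicmp-maven-plugin</artifactId>
      <configuration>
        <skip>${japicmp.skip}</skip>
      </configuration>
    </plugin>
  </plugins>
</build>

<!-- Pom of a module with public API classes: turn the check back on. -->
<properties>
  <japicmp.skip>false</japicmp.skip>
</properties>
{code}

With this layout, a module opts into the compatibility check with a single property override, and modules without stable API surface never run it.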
[jira] [Created] (FLINK-4311) TableInputFormat fails when reused on next split
Niels Basjes created FLINK-4311:
-----------------------------------

             Summary: TableInputFormat fails when reused on next split
                 Key: FLINK-4311
                 URL: https://issues.apache.org/jira/browse/FLINK-4311
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.0.3
            Reporter: Niels Basjes
            Assignee: Niels Basjes
            Priority: Critical

We have written a batch job that uses data from HBase by means of the TableInputFormat. We have found that this class sometimes fails with this exception:

{quote}
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@4f4efe4b rejected from java.util.concurrent.ThreadPoolExecutor@7872d5c1[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1165]
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
	at org.apache.flink.addons.hbase.TableInputFormat.open(TableInputFormat.java:152)
	at org.apache.flink.addons.hbase.TableInputFormat.open(TableInputFormat.java:47)
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:147)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@4f4efe4b rejected from java.util.concurrent.ThreadPoolExecutor@7872d5c1[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1165]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
	at org.apache.hadoop.hbase.client.ResultBoundedCompletionService.submit(ResultBoundedCompletionService.java:142)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.addCallsForCurrentReplica(ScannerCallableWithReplicas.java:269)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:165)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	... 10 more
{quote}

As you can see, the ThreadPoolExecutor was terminated at this point. We tracked it down to the fact that
# the configure method opens the table,
# the open method obtains the result scanner, and
# the close method closes the table.

If a second split arrives on the same instance, the open method fails because the table has already been closed. We also found that this error varies with the version of HBase that is used. I have also seen this exception:

{quote}
Caused by: java.io.IOException: hconnection-0x19d37183 closed
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1146)
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
	... 37 more
{quote}

I found that the [documentation of the InputFormat interface|https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/common/io/InputFormat.html] clearly states:

{quote}IMPORTANT NOTE: Input formats must be written such that an instance can be opened again after it was closed. That is due to the fact that the input format is used for potentially multiple splits. After a split is done, the format's close function is invoked and, if another split is available, the open function is invoked afterwards for the next split.{quote}

It appears that this specific InputFormat has not been checked against this constraint.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
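To make the quoted contract concrete: per-split resources must be created in open() and released in close(), while configure() only reads settings, so that open() can be called again for the next split. A minimal, self-contained sketch of that pattern follows; all class and method names are hypothetical, not the actual patch for this issue:

{code:java}
import java.util.Properties;

// Sketch (hypothetical names) of the open/close contract quoted above:
// configure() only stores settings, open() creates every per-split
// resource, and close() releases them so the instance stays reopenable.
public class ReopenableFormat {

    private String tableName;          // set once in configure()
    private AutoCloseable connection;  // created anew for every split in open()

    /** Called once before any split; must not allocate per-split resources. */
    public void configure(Properties parameters) {
        this.tableName = parameters.getProperty("table");
    }

    /** Called for every split; rebuilds whatever close() tore down. */
    public void open(int splitNumber) throws Exception {
        connection = connect(tableName, splitNumber);
    }

    /** Called after every split; must leave the format ready for open(). */
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
            connection = null;  // so the next open() creates a fresh one
        }
    }

    // Stand-in for creating e.g. an HBase connection, table, and scanner.
    private AutoCloseable connect(String table, int split) {
        return () -> { };
    }
}
{code}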
Re: Map Reduce Sorting
Hi,

I have another question. The reducer sorts its input before it starts with the computation. Which sorting algorithm is it using? In Flink I found QuickSort, HeapSort, etc.

Does the sorting algorithm benefit from pre-sorted partitions? For example, a MergeSort algorithm can sort the partitions of multiple maps together to create a single sorted partition for the reducer. If the map partitions are already sorted, then the MergeSort algorithm can run faster. Are there any benefits if the map partitions are sorted?

Thank you

BR,
Hilmi

Am 02.08.2016 um 10:01 schrieb Hilmi Yildirim:

Hi Fabian,
thank you very much! This answers my question.

BR,
Hilmi

Am 01.08.2016 um 22:29 schrieb Fabian Hueske:

Hi Hilmi,

the results of the combiner are usually not completely sorted, and if they are, this property is not leveraged. This is due to the following reasons:

1) A sort-combiner only sorts as much data as fits into memory. If there is more data, the result consists of multiple sorted sequences.
2) Since recently, Flink features a hash-based combiner which is usually more efficient and does not produce sorted output.
3) Flink's pipelined shipping strategy would require that the receiver merges the result records from all senders on the fly while receiving data via the network. In case of a straggling sender task, all other senders would be blocked due to backpressure. In addition, this would only work if the combiner did a full sort and not several in-memory sorts.

So, a Reducer will always do a full sort of all received data before applying the Reduce function (if available, a combiner is applied before data is written to disk in case of an external sort).

Hope this helps,
Fabian

2016-08-01 18:25 GMT+02:00 Hilmi Yildirim:

Hi,

I have a question regarding when data points are sorted when applying a simple MapReduce job. I have the following code:

data = readFromSource()
data.map().groupBy(0).reduce(...)

This code will be translated into the following execution plan: map -> combiner -> hash partitioning and sorting on 0 -> reduce.

If I am right, the combiner first sorts the data, then applies the combine function, and then partitions the result. Now the partitions are consumed by the reducers. For each mapper/combiner machine, the reducer has an input gateway. For example, if the mappers and combiners run on 10 machines, then each reducer has 10 input gateways. Now, the reducer consumes the data via a MutableObjectIterator. This iterator first consumes data from one input gateway, then from the next, and so on. Is the data of a single input gateway already sorted, given that the combiner has sorted the data already? Is the order of the data points maintained after they are sent through the network?

In my code, the MutableObjectIterator instances are subclasses of NormalizedKeySorter. Does this mean that the data from an input gateway is first sorted before it is handed over to the reduce function? Is this because the order of the data points is not maintained after being sent through the network?

It would be nice if someone could answer my questions. If my assumptions are wrong, please correct me :)

BR,
Hilmi

--
==
Hilmi Yildirim, M.Sc.
Researcher

DFKI GmbH
Intelligente Analytik für Massendaten
DFKI Projektbüro Berlin
Alt-Moabit 91c
D-10559 Berlin
Phone: +49 30 23895 1814
E-Mail: hilmi.yildi...@dfki.de

-
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
-
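For reference, the merge Hilmi describes — combining several already-sorted map partitions into one sorted stream — is a k-way merge, which costs O(n log k) instead of O(n log n) for a full sort. Below is a minimal, self-contained Java sketch of the idea; it is illustrative only, not Flink's actual sorter code (which, as discussed above, lives in classes such as NormalizedKeySorter and the external merge sort):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge over individually sorted runs. A min-heap holds
// the head of each run; repeatedly popping the smallest head yields one
// globally sorted sequence. This is why pre-sorted partitions can help in
// principle, even though Flink does not rely on this property.
public class KWayMerge {

    /** One sorted run together with its current head element. */
    private static final class Head {
        final int value;
        final Iterator<Integer> rest;
        Head(int value, Iterator<Integer> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    /** Merges k sorted runs into one sorted list in O(n log k). */
    public static List<Integer> merge(List<List<Integer>> sortedRuns) {
        PriorityQueue<Head> heap =
            new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.value));
        for (List<Integer> run : sortedRuns) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            out.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three pre-sorted "map partitions" merged into one sorted stream.
        System.out.println(merge(List.of(
            List.of(1, 4, 7), List.of(2, 5, 8), List.of(3, 6, 9))));
        // -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}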
Re: [VOTE] Release Apache Flink 1.1.0 (RC2)
+1 from my side

LICENSE and NOTICE files are good

No binaries are in the release

Build and tests run for
- Hadoop 2.7.2 & Scala 2.11 (Maven 3.0.5)

Tested local standalone installation, logs, out, all good

Tested different memory allocation schemes (heap/offheap) (preallocate/lazy allocate)

Web UI works as expected

On Tue, Aug 2, 2016 at 11:29 PM, Ufuk Celebi wrote:

> Dear Flink community,
>
> Please vote on releasing the following candidate as Apache Flink version
> 1.1.0.
>
> The commit to be voted on:
> 45f7825 (http://git-wip-us.apache.org/repos/asf/flink/commit/45f7825)
>
> Branch:
> release-1.1.0-rc2
> (
> https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.1.0-rc2
> )
>
> The release artifacts to be voted on can be found at:
> http://people.apache.org/~uce/flink-1.1.0-rc2/
>
> The release artifacts are signed with the key with fingerprint 9D403309:
> http://www.apache.org/dist/flink/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapacheflink-1100
>
> There is also a Google doc to coordinate the testing efforts. This is
> the document used for RC1 with some tests removed:
>
> https://docs.google.com/document/d/1cDZGtnGJKLU1fLw8AE_FzkoDLOR8amYT2oc3mD0_lw4/edit?usp=sharing
>
> -
>
> Thanks to everyone who contributed to this release candidate.
>
> The vote is open for the next 24 hours (as per discussion in the RC1
> vote thread) and passes if a majority of at least three +1 PMC votes
> are cast.
>
> The vote ends on Wednesday, August 3rd, 2016.
>
> [ ] +1 Release this package as Apache Flink 1.1.0
> [ ] -1 Do not release this package, because ...
>
> The following commits have been added since RC1:
>
> - c71a0c7: [FLINK-4307] [streaming API] Restore ListState behavior for
> user-facing ListStates
> - ac7028e: Revert "[FLINK-4154] [core] Correction of murmur hash
> breaks backwards compatibility"
>
Re: [VOTE] Release Apache Flink 1.1.0 (RC2)
+1 from my side

Checked LICENSE and NOTICE files

Build and tests run for
- Hadoop 2.6.0

SBT quickstarts work

Basic stream SQL features work

Metrics work with Ganglia, Graphite, JMX and StatsD

On Wed, Aug 3, 2016 at 9:56 PM, Stephan Ewen wrote:

> +1 from my side
>
> LICENSE and NOTICE files are good
>
> No binaries are in the release
>
> Build and tests run for
> - Hadoop 2.7.2 & Scala 2.11 (Maven 3.0.5)
>
> Tested local standalone installation, logs, out, all good
>
> Tested different memory allocation schemes (heap/offheap) (preallocate/lazy
> allocate)
>
> Web UI works as expected
>
> On Tue, Aug 2, 2016 at 11:29 PM, Ufuk Celebi wrote:
>
> > Dear Flink community,
> >
> > Please vote on releasing the following candidate as Apache Flink version
> > 1.1.0.
> >
> > The commit to be voted on:
> > 45f7825 (http://git-wip-us.apache.org/repos/asf/flink/commit/45f7825)
> >
> > Branch:
> > release-1.1.0-rc2
> > (
> > https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.1.0-rc2
> > )
> >
> > The release artifacts to be voted on can be found at:
> > http://people.apache.org/~uce/flink-1.1.0-rc2/
> >
> > The release artifacts are signed with the key with fingerprint 9D403309:
> > http://www.apache.org/dist/flink/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapacheflink-1100
> >
> > There is also a Google doc to coordinate the testing efforts. This is
> > the document used for RC1 with some tests removed:
> >
> > https://docs.google.com/document/d/1cDZGtnGJKLU1fLw8AE_FzkoDLOR8amYT2oc3mD0_lw4/edit?usp=sharing
> >
> > -
> >
> > Thanks to everyone who contributed to this release candidate.
> >
> > The vote is open for the next 24 hours (as per discussion in the RC1
> > vote thread) and passes if a majority of at least three +1 PMC votes
> > are cast.
> >
> > The vote ends on Wednesday, August 3rd, 2016.
> >
> > [ ] +1 Release this package as Apache Flink 1.1.0
> > [ ] -1 Do not release this package, because ...
> >
> > The following commits have been added since RC1:
> >
> > - c71a0c7: [FLINK-4307] [streaming API] Restore ListState behavior for
> > user-facing ListStates
> > - ac7028e: Revert "[FLINK-4154] [core] Correction of murmur hash
> > breaks backwards compatibility"
> >
Re: [VOTE] Release Apache Flink 1.1.0 (RC2)
+1

I tested (on the last RC, which is also applicable here):
- read the README
- ran "mvn clean verify"
- source contains no binaries
- a few weeks back I extensively tested CEP and the Cassandra sink (we fixed stuff and now everything is good to go)
- verified that the quickstarts point to the correct version

On Wed, 3 Aug 2016 at 08:29 Till Rohrmann wrote:

> +1 from my side
>
> Checked LICENSE and NOTICE files
>
> Build and tests run for
> - Hadoop 2.6.0
>
> SBT quickstarts work
>
> Basic stream SQL features work
>
> Metrics work with Ganglia, Graphite, JMX and StatsD
>
> On Wed, Aug 3, 2016 at 9:56 PM, Stephan Ewen wrote:
>
> > +1 from my side
> >
> > LICENSE and NOTICE files are good
> >
> > No binaries are in the release
> >
> > Build and tests run for
> > - Hadoop 2.7.2 & Scala 2.11 (Maven 3.0.5)
> >
> > Tested local standalone installation, logs, out, all good
> >
> > Tested different memory allocation schemes (heap/offheap)
> > (preallocate/lazy allocate)
> >
> > Web UI works as expected
> >
> > On Tue, Aug 2, 2016 at 11:29 PM, Ufuk Celebi wrote:
> >
> > > Dear Flink community,
> > >
> > > Please vote on releasing the following candidate as Apache Flink
> > > version 1.1.0.
> > >
> > > The commit to be voted on:
> > > 45f7825 (http://git-wip-us.apache.org/repos/asf/flink/commit/45f7825)
> > >
> > > Branch:
> > > release-1.1.0-rc2
> > > (
> > > https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.1.0-rc2
> > > )
> > >
> > > The release artifacts to be voted on can be found at:
> > > http://people.apache.org/~uce/flink-1.1.0-rc2/
> > >
> > > The release artifacts are signed with the key with fingerprint
> > > 9D403309:
> > > http://www.apache.org/dist/flink/KEYS
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/orgapacheflink-1100
> > >
> > > There is also a Google doc to coordinate the testing efforts. This is
> > > the document used for RC1 with some tests removed:
> > >
> > > https://docs.google.com/document/d/1cDZGtnGJKLU1fLw8AE_FzkoDLOR8amYT2oc3mD0_lw4/edit?usp=sharing
> > >
> > > -
> > >
> > > Thanks to everyone who contributed to this release candidate.
> > >
> > > The vote is open for the next 24 hours (as per discussion in the RC1
> > > vote thread) and passes if a majority of at least three +1 PMC votes
> > > are cast.
> > >
> > > The vote ends on Wednesday, August 3rd, 2016.
> > >
> > > [ ] +1 Release this package as Apache Flink 1.1.0
> > > [ ] -1 Do not release this package, because ...
> > >
> > > The following commits have been added since RC1:
> > >
> > > - c71a0c7: [FLINK-4307] [streaming API] Restore ListState behavior for
> > > user-facing ListStates
> > > - ac7028e: Revert "[FLINK-4154] [core] Correction of murmur hash
> > > breaks backwards compatibility"
> > >
[ANNOUNCE] Introducing a feature branch for FLIP-6 (cluster management)
Hi all!

We would like to start working on FLIP-6. Because it is such a big change, I would like to start developing it concurrently to the master branch for a while, and merge it once it is in a stable state.

For that reason, I'd like to fork a feature branch "clustermanagement" in the Flink repository. Please let me know if you have concerns with respect to that.

Greetings,
Stephan
Re: Introduction
Welcome! :)

On Sunday, July 31, 2016, Neelesh Salian wrote:

> Hello folks,
>
> I am Neelesh Salian; I recently joined the Flink community and I wanted to
> take this opportunity to formally introduce myself.
>
> I have been working with the Hadoop and Spark ecosystems over the past two
> years and found Flink really interesting for the streaming use case.
>
> Excited to start working and help build the community. :)
> Thank you.
> Have a great Sunday.
>
> --
> Neelesh Srinivas Salian
> Customer Operations Engineer
>
Re: [ANNOUNCE] Introducing a feature branch for FLIP-6 (cluster management)
+1, seems good

On Wed, 3 Aug 2016 at 11:05 Stephan Ewen wrote:

> Hi all!
>
> We would like to start working on FLIP-6.
>
> Because it is such a big change, I would like to start developing it
> concurrently to the master branch for a while, and merge it once it is in
> a stable state.
>
> For that reason, I'd like to fork a feature branch "clustermanagement" in
> the Flink repository. Please let me know if you have concerns with respect
> to that.
>
> Greetings,
> Stephan
>