pnowojski commented on a change in pull request #468:
URL: https://github.com/apache/flink-web/pull/468#discussion_r717258622
##########
File path: _posts/2021-09-21-release-1.14.0.md
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Apache Flink 1.14.0 Release Announcement"
+date: 2021-09-21T08:00:00.000Z
+categories: news
+authors:
+- joemoe:
+  name: "Johannes Moser"
+
+excerpt: The Apache Flink community is excited to announce the release of Flink 1.14.0! More than 200 contributor worked on over 1,000 issues. The release brings exciting new features like a more seamless streaming/batch integration, automatic network memory tuning, a hybrid source to switch data streams between storgage systems (e.g., Kafka/S3), Fine-grained resource management, PyFlink performance and debugging enhancements, and a Pulsar connector.
+
+---
+
+The Apache Software Foundation recently released its annual report and Apache Flink once again made
+it on the list of the top 5 most active projects! This remarkable
+activity also shows in the new 1.14.0 release. Once again, more than 200 contributors worked on
+over 1,000 issues. We are proud of how this community is consistently moving the project forward.
+
+This release brings many new features and improvements in areas such as the SQL API, more connector support, checkpointing, and PyFlink.
+A major area of changes in this release is the integrated streaming & batch experience. We believe
+that, in practice, unbounded stream processing goes hand-in-hand with bounded- and batch processing tasks in practice,
+because many use cases require processing historic data from various sources alongside streaming data.
+Examples are data exploration when developing new applications, bootstrapping state for new applications, training
+models to be applied in a streaming application, re-processing data after fixes/upgrades, and .
+
+In Flink 1.14, we finally made it possible to **mix bounded and unbounded streams in an application**:
+Flink now supports taking checkpoints of applications that are partially running and partially finished (some
+operators reached the end of the bounded inputs). Additionally, **bounded streams now take a final checkpoint**
+when reaching their end to ensure smooth committing of results in sinks.
+
+The **batch execution mode now supports programs that use a mixture of the DataStream API and the SQL/Table API**
+(previously only pure Table/SQL or DataStream programs).
+
+The unified Source and Sink APIs have gotten an update, and we started **consolidating the connector ecosystem around the unified APIs**. We added a new **hybrid source** can bridge between multiple storage systems.
+You can now do things like read old data from Amazon S3 and then switch over to Apache Kafka.
+
+Furthermore, this release takes another step in our initiative of making Flink more self-tuning and
+to require less Stream-Processor-specific knowledge to operate. We started that initiative in the previous
+release with [Reactive Scaling](/news/2021/05/03/release-1.13.0.html#reactive-scaling) and are now
+adding **automatic network memory tuning** (*a.k.a. Buffer Debloating*). This feature speeds up checkpoints
+under load without sacrificing performance or increasing checkpoint size, by continuously adjusting the
+network buffering to ensure best throughput while having minimal in-flight data. See the
+[Buffer Debloating section](#buffer-debloating) for details.
+
+In addition, this release furthers our initiative in making Flink more self-tuning. We aim make Flink
+easier to operate without necessarily requiring a lot of Stream-Processor-specific knowledge.
+We started this initiative in the previous release with [reactive scaling](/news/2021/05/03/release-1.13.0.html#reactive-scaling)
+and are now adding **automatic network memory tuning** (*a.k.a. Buffer Debloating*).
+This feature speeds up checkpoints under high load without sacrificing performance or
+increasing checkpoint size by continuously adjusting network buffers to ensure the best
+throughput while having minimal in-flight data. See the [Buffer Debloating section](#buffer-debloating)
+for more details.

Review comment:
   Those paragraphs are duplicated

##########
File path: _posts/2021-09-21-release-1.14.0.md
##########
@@ -0,0 +1,344 @@
+There are many more improvements and new additions throughout various components, as we discuss below.
+We also had to say goodbye to some features that have been superceded by newer ones in recent releases,
+most prominently we are **removing the old SQL execution engine** and are
+**removing the active integration with Apache Mesos**.
+
+We hope you like the new release and we'd be eager to learn about your experience with it, which yet
+unsolved problems it solves, what new use-cases it unlocks for you.
+
+{% toc %}
+
+# Unified Batch and Stream Processing experience
+
+One of Flink's unique characteristics is how it integrates streaming and batch processing,
+using common unified APIs, and a runtime that supports multiple execution paradigms.
+
+As motivated in the introduction, we believe Streaming and Batch always go hand in hand. This quote from
+a [report on Facebook's streaming infrastructure](https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf)
+echos this sentiment nicely.
+
+> Streaming versus batch processing is not an either/or decision. Originally, all data warehouse
+> processing at Facebook was batch processing. We began developing Puma and Swift about five years
+> ago. As we showed in Section [...], using a mix of streaming and batch processing can speed up
+> long pipelines by hours.
+
+Having both the real-time and the historic computations in the same engine also ensures consistency
+between semantics and makes results well comparable. Here is an [article by Alibaba](https://www.ververica.com/blog/apache-flinks-stream-batch-unification-powers-alibabas-11.11-in-2020)
+about unifying business reporting with Apache Flink and getting consistent reports that way.
+
+While unified streaming & batch are already possible in earlier versions, this release brings
+some features that unlock new use cases, as well as a series of quality-of-life improvements.
+
+## Checkpointing and Bounded Streams
+
+Flink's checkpointing mechanism could originally only draw checkpoints when all tasks in an application's
+DAG were running. That means that mixed bounded/unbounded applications were not really possible.
+In addition, applications on bounded inputs that were executed in a streaming way (not in a batch way)
+stopped checkpointing towards the end, when some tasks finished. Lingering data for exactly-once
+sinks was the result, because in the absence of checkpoints, the latest output data was not committed.
+
+With [FLIP-147](https://cwiki.apache.org/confluence/display/FLINK/FLIP-147%3A+Support+Checkpoints+After+Tasks+Finished)
+Flink now supports checkpoints after tasks are finished, and takes a final checkpoint at the end of a
+bounded stream, ensuring that all sink data is committed before the job ends (similar to how
+*stop-with-savepoint* behaves).
+
+To activate this feature, add `execution.checkpointing.checkpoints-after-tasks-finish.enabled: true`
+to your configuration. Keeping with the tradition of making big new features opt-in for the first release,
+the feature is not active by default in Flink 1.14. We expect it to become the default in the next release.
+
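+For illustration, a minimal sketch of enabling the flag programmatically (only the config
+key above comes from this release; the surrounding setup is generic):
+
+```java
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+Configuration conf = new Configuration();
+// Opt-in flag for checkpoints after some tasks have finished (off by default in 1.14).
+conf.setBoolean("execution.checkpointing.checkpoints-after-tasks-finish.enabled", true);
+
+StreamExecutionEnvironment env =
+        StreamExecutionEnvironment.getExecutionEnvironment(conf);
+env.enableCheckpointing(30_000); // checkpoint every 30 seconds
+```
+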
+As a side note: Why would we even want to execute applications over bounded streams in a streaming way,
+rather than in a batch-y way? There can be many reasons, like (a) the sink that you write to needs to be
+written to in a streaming way (like the Kafka Sink), or (b) that you do want to exploit the
+streaming-inherent quasi-ordering-by-time, such as in the [Kappa+ Architecture](https://youtu.be/4qSlsYogALo?t=666).
+
+## Mixing Batch DataStream API and Table API/SQL
+
+SQL and the Table API are becoming the default starting points for new projects. The declarative
+nature and richness of built-in types and operations make it easy to develop applications fast.
+It is not uncommon, though, for developers to eventually hit the limit of SQL's expressiveness for
+certain types of event-driven business logic (or hit the point when it becomes grotesque to express
+that logic in SQL).
+
+At that point, the natural step is to blend in a piece of stateful DataStream API logic, before
+switching back to SQL again.
+
+In Flink 1.14, bounded batch-executed SQL/Table programs can convert their intermediate
+Tables to a DataStream, apply some DataSteam API operations, and convert it back to a Table. Flink builds
+a dataflow DAG mixing declarative optimized SQL execution with batch-executed DataStream logic.
+Check out the [relevant docs](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/data_stream_api/#converting-between-datastream-and-table) for details.
+
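+A rough sketch of the round trip (it assumes a bounded table `orders` with made-up field
+names was registered beforehand):
+
+```java
+import org.apache.flink.api.common.RuntimeExecutionMode;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.table.api.Table;
+import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
+import org.apache.flink.types.Row;
+
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+env.setRuntimeMode(RuntimeExecutionMode.BATCH);
+StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
+
+Table orders = tableEnv.sqlQuery("SELECT customer, amount FROM orders");
+
+// Hop over to the DataStream API for logic that is awkward to express in SQL ...
+DataStream<Row> stream = tableEnv.toDataStream(orders);
+DataStream<Row> bigOrders = stream.filter(r -> r.<Long>getFieldAs("amount") > 100L);
+
+// ... and hand the result back to the Table API.
+Table result = tableEnv.fromDataStream(bigOrders);
+```
+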
+## Hybrid Source
+
+The new [Hybrid Source](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/)
+produces a combined stream from other sources, by reading those other sources one after the other,
+seamlessly switching over from one to the other.
+
+With the Hybrid Source, you can read for example from tiered storage setups as if there was one
+stream that spans everything. Consider the example below: New data lands in Kafa and is eventually
+migrated to S3 (typically in compressed columnar format, for cost efficiency and performance).
+The Hybrid Source can read a stream that starts all the way in the past on S3 and transitions over
+to the real-time data in Kafka.
+
+<figure style="align-content: center">
+  <img src="{{ site.baseurl }}/img/blog/2021-09-25-release-1.14.0/hybrid_source.png" style="display: block; margin-left: auto; margin-right: auto; width: 600px"/>
+</figure>
+
+We believe that this is an exciting step in realizing the full promise of logs and the *Kappa Architecture.*
+Even if older parts of a log of events are physically migrated to different storage
+(because cheaper, better compression, faster to read) you can still treat and process it as one
+contiguous log.
+
+The Hybrid Source foundation is in Flink 1.14. Over the next releases, we expect to add more
+utilities and patterns for typical switching strategies.
+
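+As a sketch of the API for the scenario above (the bucket, broker, and topic names are
+placeholders, and the switch timestamp would come from wherever your migration boundary
+is known):
+
+```java
+import org.apache.flink.api.common.serialization.SimpleStringSchema;
+import org.apache.flink.connector.base.source.hybrid.HybridSource;
+import org.apache.flink.connector.file.src.FileSource;
+import org.apache.flink.connector.file.src.reader.TextLineFormat;
+import org.apache.flink.connector.kafka.source.KafkaSource;
+import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
+import org.apache.flink.core.fs.Path;
+
+long switchTimestamp = 1632182400000L; // placeholder: where the historical data ends
+
+FileSource<String> historical =
+        FileSource.forRecordStreamFormat(new TextLineFormat(), new Path("s3://my-bucket/events/"))
+                .build();
+
+KafkaSource<String> realtime =
+        KafkaSource.<String>builder()
+                .setBootstrapServers("broker:9092")
+                .setTopics("events")
+                .setValueOnlyDeserializer(new SimpleStringSchema())
+                .setStartingOffsets(OffsetsInitializer.timestamp(switchTimestamp + 1))
+                .build();
+
+// Read all of the S3 data first, then switch over to the live Kafka topic.
+HybridSource<String> hybridSource =
+        HybridSource.builder(historical).addSource(realtime).build();
+```
+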
+## Consolidating Sources and Sink
+
+With the new unified (streaming/batch) source and sink APIs becoming stable, we started the
+big effort to consolidate all connectors around those APIs. At the same time, are
+better aligning connectors between DataStream and SQL/Table API. First are the *Kafka* and
+*File* Soures and Sinks for the DataStream API.
+
+The result of this effort (that we expect to span at least 1-2 futher releases) will be a much
+more smooth and consistent experience for Flink users when connecting to external systems;
+something were we need to acknowledge that there is room for improvement in Flink.
+
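+For a taste of the unified style, a sketch of the new Kafka sink (broker address and
+topic name are placeholders):
+
+```java
+import org.apache.flink.api.common.serialization.SimpleStringSchema;
+import org.apache.flink.connector.base.DeliveryGuarantee;
+import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
+import org.apache.flink.connector.kafka.sink.KafkaSink;
+
+KafkaSink<String> sink =
+        KafkaSink.<String>builder()
+                .setBootstrapServers("broker:9092")
+                .setRecordSerializer(
+                        KafkaRecordSerializationSchema.builder()
+                                .setTopic("output")
+                                .setValueSerializationSchema(new SimpleStringSchema())
+                                .build())
+                .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
+                .build();
+
+// Attach it to any DataStream<String> with stream.sinkTo(sink).
+```
+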
+# Improvements to Operations
+
+## Buffer debloating
+
+*Buffer Debloating* is a new technology in Flink that automatically tunes the usage
+of network memory to ensure throughput, while minimizing in-flight data and thus minimizing
+checkpoint latency and cost.
+
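+The feature is opt-in. A sketch of the configuration, assuming the
+`taskmanager.network.memory.buffer-debloat.*` keys from this release's configuration
+reference (these are TaskManager-level options that normally go into flink-conf.yaml; a
+local environment is used here only to keep the snippet self-contained):
+
+```java
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+Configuration conf = new Configuration();
+conf.setBoolean("taskmanager.network.memory.buffer-debloat.enabled", true);
+// Target time to consume the in-flight data; 1000 ms is the default described below.
+conf.setString("taskmanager.network.memory.buffer-debloat.target", "1 s");
+
+StreamExecutionEnvironment env =
+        StreamExecutionEnvironment.createLocalEnvironment(conf);
+```
+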
+Apache Flink buffers a certain amount of data in its network stack to be able to utilize the
+bandwidth of fast networks. A Flink application running under high throughput uses some of
+all of that memory. Aligned checkpoints flow with the data through the network buffers in milliseconds.
+
+When a Flink application becomes (temporarily) backpressured (for example when being backpressured
+by an external system, or when hitting skewed records), there is typically now a lot more data in
+the network buffers than is necessary for the network to support the application's current throughput
+(which is lowered due to backpressure). There is even an adverse effect: more buffered data means
+the checkpoints need to do more work. Aligned checkpoint barriers need to wait for more data to be
+processed, unaligned checkpoints need to persist more in-flight data.
+
+This observation that under backpressure there is more data in the network buffers than than necessary is
+where *Buffer Debloating* starts: It changes the network stack from keeping up to X bytes of data
+to keeping data that is worth X milliseconds of receiver computing time. With the default setting
+of 1000 milliseconds, that means the network stack will buffer as much data as the receiving task can
+process in 1000 milliseconds. This value is constantly measured and adjusted, so the system keeps
+this characteristic under even under varying conditions. As a result, Flink can now provide

Review comment:
   ```suggestion
   this characteristic even under varying conditions. As a result, Flink can now provide
   ```

##########
File path: _posts/2021-09-21-release-1.14.0.md
##########
@@ -0,0 +1,344 @@
+Furthermore, this release takes another step in our initiative of making Flink more self-tuning and
+to require less Stream-Processor-specific knowledge to operate. We started that initiative in the previous
+release with [Reactive Scaling](/news/2021/05/03/release-1.13.0.html#reactive-scaling) and are now
+adding **automatic network memory tuning** (*a.k.a. Buffer Debloating*). This feature speeds up checkpoints
+under load without sacrificing performance or increasing checkpoint size, by continuously adjusting the

Review comment:
   > without sacrificing performance
   
   That's unfortunately not true. It causes a slowdown of a couple of percent that might be hard to get rid of. What's worse, this performance drop appears even in heavily backpressured jobs, with low throughput and a long time to process a single record. In other words, it doesn't slow down the maximum theoretical throughput, but it might potentially slow down every job.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org