kywe665 commented on a change in pull request #4225:
URL: https://github.com/apache/hudi/pull/4225#discussion_r763528662
##########
File path: website/docs/use_cases.md
##########
@@ -6,12 +6,14 @@ toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
-## Near Real-Time Ingestion
+Apache Hudi provides the foundational features required to build a state-of-the-art Lakehouse.
+The following are examples of use cases that illustrate why many choose to use Apache Hudi:
 
-Hudi offers some great benefits across ingestion of all kinds. Hudi helps __enforces a minimum file size on DFS__. This helps
-solve the ["small files problem"](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) for HDFS and Cloud Stores alike,
-significantly improving query performance. Hudi adds the much needed ability to atomically commit new data, shielding queries from
-ever seeing partial writes and helping ingestion recover gracefully from failures.
+## A Streaming Data Lake
+As outlined in depth in this blog post, https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform, Apache Hudi

Review comment:
   good call, i adjusted
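Since the new section positions Hudi around streaming-style ingestion and atomic commits, a short sketch may help readers of this thread picture the write path. This is a hypothetical illustration, not text from the page under review: the table name, path, and column names are made up, and it only uses Hudi datasource options documented in the quickstart.

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a SparkSession launched with the Hudi Spark bundle on
# the classpath (e.g. --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>).
spark = (SparkSession.builder
    .appName("hudi-ingest-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())

# Made-up records; a real pipeline would read these from Kafka, files, etc.
df = spark.createDataFrame(
    [("uuid-1", "us-west", 1638316800, 27.70)],
    ["uuid", "region", "ts", "fare"],
)

hudi_options = {
    "hoodie.table.name": "trips",                             # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",        # unique key per record
    "hoodie.datasource.write.partitionpath.field": "region",  # physical layout on storage
    "hoodie.datasource.write.precombine.field": "ts",         # latest ts wins for duplicate keys
    "hoodie.datasource.write.operation": "upsert",
}

# Each write lands as a single atomic commit on the table's timeline,
# so queries never observe partially written files.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/warehouse/trips"))  # made-up path; typically s3://, gs://, hdfs://
```

The atomic commit on the timeline is what the old "Near Real-Time Ingestion" text described as shielding queries from partial writes.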
##########
File path: website/docs/use_cases.md
##########
@@ -79,3 +53,86 @@ To achieve this, Hudi has embraced similar concepts from stream processing frame
 [Flink](https://flink.apache.org) or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
 For the more curious, a more detailed explanation of the benefits of Incremental Processing can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
+
+### Unified Batch and Streaming
+
+The world we live in is polarized - even on data analytics storage - into real-time and offline/batch storage. Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart)
+are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [Clickhouse](https://clickhouse.tech/), fed by event buses like
+[Kafka](https://kafka.apache.org) or [Pulsar](https://pulsar.apache.org). This model is prohibitively expensive, unless only a small fraction of your data lake data
+needs sub-second query responses, such as for system monitoring or interactive real-time analysis.
+
+The same data gets ingested into data lake storage much later (say, every few hours or so) and then runs through batch ETL pipelines, with data freshness
+too poor for any kind of near-real-time analytics. On the other hand, data lakes provide access to interactive SQL engines like Presto/SparkSQL, which can scale
+horizontally with ease and answer even more complex queries within a few seconds.
+
+By bringing streaming primitives to data lake storage, Hudi opens up new possibilities: it can ingest data within a few minutes and also author incremental data
+pipelines that are orders of magnitude faster than traditional batch processing. By bringing __data freshness down to a few minutes__, Hudi can provide a much more
+efficient alternative to real-time datamarts for a large class of data applications. Also, Hudi requires no upfront server infrastructure investment,
+and thus enables faster analytics on much fresher data without increasing the operational overhead. This external [article](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/)
+further validates this newer model.
+
+## Cloud-Native Tables
+Apache Hudi makes it easy to define tables, manage schema and metadata, and bring SQL semantics to cloud file storage.
+
+Some may first hear about Hudi as an "open table format", and this is true, but it is also just one small layer of the full Hudi stack.

Review comment:
   updated
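To make the incremental-pipeline claim in the "Unified Batch and Streaming" section concrete, here is a hypothetical sketch of a Hudi incremental query in the same PySpark setting as the ingest sketch above. The begin instant and table path are made up; a real job would load the instant from its previous run's checkpoint.

```python
# Hypothetical follow-on to the ingest sketch above: a Hudi incremental query
# fetches only records committed after a given instant, instead of rescanning
# the whole table the way a batch ETL job would.

# Made-up commit instant; a real job would load this from its previous
# run's checkpoint.
begin_time = "20211201000000"

incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load("/tmp/warehouse/trips"))  # same made-up path as the write sketch

# Downstream logic now operates on a small, fresh slice of the table,
# which is what makes minute-level pipeline freshness practical.
incremental_df.createOrReplaceTempView("trips_incremental")
spark.sql("SELECT uuid, region, ts, fare FROM trips_incremental").show()
```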
