bhasudha commented on a change in pull request #4225:
URL: https://github.com/apache/hudi/pull/4225#discussion_r763258156

##########
File path: website/docs/use_cases.md
##########
@@ -79,3 +53,86 @@ To achieve this, Hudi has embraced similar concepts from stream processing frame
 [Flink](https://flink.apache.org) or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
 For the more curious, a more detailed explanation of the benefits of Incremental Processing can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
+### Unified Batch and Streaming
+
+The world we live in is polarized - even on data analytics storage - into real-time and offline/batch storage. Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart)
+are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [Clickhouse](https://clickhouse.tech/), fed by event buses like
+[Kafka](https://kafka.apache.org) or [Pulsar](https://pulsar.apache.org). This model is prohibitively expensive, unless a small fraction of your data lake data
+needs sub-second query responses such as system monitoring or interactive real-time analysis.
+
+The same data gets ingested into data lake storage much later (say every few hours or so) and then runs through batch ETL pipelines, with intolerable data freshness
+to do any kind of near-realtime analytics. On the other hand, the data lakes provide access to interactive SQL engines like Presto/SparkSQL, which can horizontally scale
+easily and provide return even more complex queries, within few seconds.
+
+By bringing streaming primitives to data lake storage, Hudi opens up new possibilities by being able to ingest data within few minutes and also author incremental data
+pipelines that are orders of magnitude faster than traditional batch processing. By bringing __data freshness to a few minutes__, Hudi can provide a much efficient alternative,
+for a large class of data applications, compared to real-time datamarts. Also, Hudi has no upfront server infrastructure investments
+and thus enables faster analytics on much fresher analytics, without increasing the operational overhead. This external [article](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/)
+further validates this newer model.
+
+## Cloud-Native Tables
+Apache Hudi makes it easy to define tables, manage schema, metadata, and bring SQL semantics to cloud file storage.
+Some may first hear about Hudi as an "open table format" and this is true, but it is also just one small layer the full Hudi stack.

Review comment:
"small layer of the full Hudi stack"