[GitHub] [hudi] bhasudha commented on a change in pull request #4225: [HUDI-2922] - Docs - Improve Hudi use cases

GitBox Mon, 06 Dec 2021 10:19:12 -0800


bhasudha commented on a change in pull request #4225:
URL: https://github.com/apache/hudi/pull/4225#discussion_r763258156




##########
File path: website/docs/use_cases.md
##########
@@ -79,3 +53,86 @@ To achieve this, Hudi has embraced similar concepts from stream processing frame
 [Flink](https://flink.apache.org) or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
 For the more curious, a more detailed explanation of the benefits of Incremental Processing can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
 
+### Unified Batch and Streaming
+
+The world we live in is polarized - even on data analytics storage - into real-time and offline/batch storage. Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart)
+are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [Clickhouse](https://clickhouse.tech/), fed by event buses like
+[Kafka](https://kafka.apache.org) or [Pulsar](https://pulsar.apache.org). This model is prohibitively expensive, unless a small fraction of your data lake data
+needs sub-second query responses such as system monitoring or interactive real-time analysis.
+
+The same data gets ingested into data lake storage much later (say every few hours or so) and then runs through batch ETL pipelines, with intolerable data freshness
+to do any kind of near-realtime analytics. On the other hand, the data lakes provide access to interactive SQL engines like Presto/SparkSQL, which can horizontally scale
+easily and provide return even more complex queries, within few seconds.
+
+By bringing streaming primitives to data lake storage, Hudi opens up new possibilities by being able to ingest data within few minutes and also author incremental data
+pipelines that are orders of magnitude faster than traditional batch processing. By bringing __data freshness to a few minutes__, Hudi can provide a much efficient alternative,
+for a large class of data applications, compared to real-time datamarts. Also, Hudi has no upfront server infrastructure investments
+and thus enables faster analytics on much fresher analytics, without increasing the operational overhead. This external [article](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/)
+further validates this newer model.
+
+## Cloud-Native Tables
+Apache Hudi makes it easy to define tables, manage schema, metadata, and bring SQL semantics to cloud file storage.
+Some may first hear about Hudi as an "open table format" and this is true, but it is also just one small layer the full Hudi stack.

Review comment:
       "small layer of the full Hudi stack"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
