kywe665 commented on a change in pull request #4225:
URL: https://github.com/apache/hudi/pull/4225#discussion_r763528662
##########
File path: website/docs/use_cases.md
##########
@@ -6,12 +6,14 @@ toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
-## Near Real-Time Ingestion
+Apache Hudi provides the foundational features required to build a state-of-the-art Lakehouse.
+The following are examples of use cases that illustrate why many choose to use Apache Hudi:
 
-Hudi offers some great benefits across ingestion of all kinds. Hudi helps __enforces a minimum file size on DFS__. This helps
-solve the ["small files problem"](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) for HDFS and Cloud Stores alike,
-significantly improving query performance. Hudi adds the much needed ability to atomically commit new data, shielding queries from
-ever seeing partial writes and helping ingestion recover gracefully from failures.
+## A Streaming Data Lake
+As outlined in depth in this blog post, https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform, Apache Hudi

Review comment:
   good call, i adjusted
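Since the new section positions Hudi around streaming-style ingestion and atomic commits, a short sketch may help readers of this thread picture the write path. This is a hypothetical illustration, not text from the page under review: the table name, path, and column names are made up, and it only uses Hudi datasource options documented in the quickstart.

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a SparkSession launched with the Hudi Spark bundle on
# the classpath (e.g. --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>).
spark = (SparkSession.builder
    .appName("hudi-ingest-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())

# Made-up records; a real pipeline would read these from Kafka, files, etc.
df = spark.createDataFrame(
    [("uuid-1", "us-west", 1638316800, 27.70)],
    ["uuid", "region", "ts", "fare"],
)

hudi_options = {
    "hoodie.table.name": "trips",                             # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",        # unique key per record
    "hoodie.datasource.write.partitionpath.field": "region",  # physical layout on storage
    "hoodie.datasource.write.precombine.field": "ts",         # latest ts wins for duplicate keys
    "hoodie.datasource.write.operation": "upsert",
}

# Each write lands as a single atomic commit on the table's timeline,
# so queries never observe partially written files.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/warehouse/trips"))  # made-up path; typically s3://, gs://, hdfs://
```

The atomic commit on the timeline is what the old "Near Real-Time Ingestion" text described as shielding queries from partial writes.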
##########
File path: website/docs/use_cases.md
##########
@@ -79,3 +53,86 @@ To achieve this, Hudi has embraced similar concepts from stream processing frame
 [Flink](https://flink.apache.org) or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
 For the more curious, a more detailed explanation of the benefits of Incremental Processing can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
+
+### Unified Batch and Streaming
+
+The world we live in is polarized - even on data analytics storage - into real-time and offline/batch storage. Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart)
+are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [Clickhouse](https://clickhouse.tech/), fed by event buses like
+[Kafka](https://kafka.apache.org) or [Pulsar](https://pulsar.apache.org). This model is prohibitively expensive, unless only a small fraction of your data lake data
+needs sub-second query responses, such as for system monitoring or interactive real-time analysis.
+
+The same data gets ingested into data lake storage much later (say, every few hours or so) and then runs through batch ETL pipelines, with data freshness
+too poor for any kind of near-real-time analytics. On the other hand, data lakes provide access to interactive SQL engines like Presto/SparkSQL, which can scale
+horizontally with ease and answer even more complex queries within a few seconds.
+
+By bringing streaming primitives to data lake storage, Hudi opens up new possibilities: it can ingest data within a few minutes and also author incremental data
+pipelines that are orders of magnitude faster than traditional batch processing. By bringing __data freshness down to a few minutes__, Hudi can provide a much more
+efficient alternative to real-time datamarts for a large class of data applications. Also, Hudi requires no upfront server infrastructure investment,
+and thus enables faster analytics on much fresher data without increasing the operational overhead. This external [article](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/)
+further validates this newer model.
+
+## Cloud-Native Tables
+Apache Hudi makes it easy to define tables, manage schema and metadata, and bring SQL semantics to cloud file storage.
+
+Some may first hear about Hudi as an "open table format", and this is true, but it is also just one small layer of the full Hudi stack.

Review comment:
   updated
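To make the incremental-pipeline claim in the "Unified Batch and Streaming" section concrete, here is a hypothetical sketch of a Hudi incremental query in the same PySpark setting as the ingest sketch above. The begin instant and table path are made up; a real job would load the instant from its previous run's checkpoint.

```python
# Hypothetical follow-on to the ingest sketch above: a Hudi incremental query
# fetches only records committed after a given instant, instead of rescanning
# the whole table the way a batch ETL job would.

# Made-up commit instant; a real job would load this from its previous
# run's checkpoint.
begin_time = "20211201000000"

incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load("/tmp/warehouse/trips"))  # same made-up path as the write sketch

# Downstream logic now operates on a small, fresh slice of the table,
# which is what makes minute-level pipeline freshness practical.
incremental_df.createOrReplaceTempView("trips_incremental")
spark.sql("SELECT uuid, region, ts, fare FROM trips_incremental").show()
```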
