Re: [PR] [RFC-92] Pluggable Table Format Support [hudi]

via GitHub Mon, 24 Mar 2025 15:48:33 -0700


yihua commented on code in PR #12998:
URL: https://github.com/apache/hudi/pull/12998#discussion_r2011026118



##########
rfc/rfc-92/rfc-92.md:
##########
@@ -0,0 +1,102 @@
+
+# RFC-92: Pluggable Table Formats in Hudi
+
+## Proposers
+
+*   Balaji Varadarajan
+
+## Approvers
+
+*   Vinoth Chandar
+*   Ethan Guo
+
+## Status
+
+JIRA: <TBD>
+
+## Abstract
+
+This RFC proposes support for different backing table format implementations 
inside Hudi. For the past 4 years at-least, we have been consistently defining 
Hudi as a broader platform and software 
[stack](https://hudi.apache.org/docs/hudi_stack) that delivers much of these 
benefits. Hudi's table format makes choices specific to data lake workloads, 
allowing efficient read/write (even the recent 
[blog](https://bytearray.substack.com/p/computer-science-behind-lakehouse) from 
Vinoth), has major differences and advantages compared to other approaches. The 
community plans to centrally focus on the native Hudi storage format.
+
+However, there may be benefits to allowing other storage layouts/table formats 
to fit under Hudi's higher level functionality. This also has non-technical 
benefits of insulating the project from vendor marketing wars. Most 
contributors (such as myself) are happily part of the global Hudi open-source 
community, for the sake of just building technology.
+
+## Background
+
+Expanding further, there are plenty of valid technical reasons on why Hudi 
should allow different storage layouts, under the upper layer reader/writer and 
table services implementations.
+
+1. We have use-cases, for cloud-native/high performance implementations of 
timeline (\`HoodieTimeline\`) and metadata (\`HoodieMetadata\` interface). In 
our use-case, we would like to explore backing them using NoSQL datastore like 
DynamoDB, for ultra-low latency queries.
+2. Hudi already supports 
[different](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/storage/HoodieStorageLayout.java)
 storage formats/layouts. Tables can be bucketed, consistent hashed or 
organized by having data laid out in order of arrival (default).
+3. Hudi already allows plug-ability and customization at various layers like 
record merger, indexes and other core write/read paths.
+4. It's very standard practice in databases to allow multiple storage backends 
(MySQL supports myISAM, innodb/btree, myrocks/lsm). This may be crucial step 
towards the database northstar vision.
+5. As a long time member of the Hudi community and open-source enthusiast, I 
think supporting other table formats, even existing ones like Apache Iceberg or 
Delta Lake, benefits those communities as well.
+    1. For e.g. the Hudi Streamer tool being used at our data lake (and 
hundreds more) for ingestion/incremental ETL can also benefit other communities.
+    2. Hudi provides out-of-box automatic table management that is manual in 
projects like Iceberg. With such an implementation, common data lake services 
can be reused across formats.
+    3. Hudi's high performance writer path can be extended to other formats 
(to the extent possible, that is not dependent on features like indexing that 
is only in Hudi's native format)
+    4. there are more such services and functionalities to be unlocked.
+
+
+Some non-technical reasons:
+1. Though Hudi is clearly defined as a platform over the years, there is so 
much vendor attention in the space for the past couple years, where Hudi is 
minimized to a table format and compared. This change will help highlight the 
value of Hudi's open software services, beyond just open formats.
+2. It may be controversial to say this. But, the project has been facing a lot 
of vendor FUD due to different vendors supporting different table formats. It 
is neither in the interest nor the business of the project community to be part 
of vendor wars. Opening up the table format layer to different implementations 
avoids these distractions for regular OSS contributors with no vendor 
interests, and helps focus on open-source software design and development.

Review Comment:
   +1 on avoiding distractions



##########
rfc/rfc-92/rfc-92.md:
##########
@@ -0,0 +1,102 @@
+1. We have use-cases, for cloud-native/high performance implementations of 
timeline (\`HoodieTimeline\`) and metadata (\`HoodieMetadata\` interface). In 
our use-case, we would like to explore backing them using NoSQL datastore like 
DynamoDB, for ultra-low latency queries.

Review Comment:
   The community would like to work on the metaserver in the long term.  Some 
of these interfaces may also be useful from the metaserver perspective.  For 
example, the metaserver can leverage the DynamoDB-based implementation for the 
timeline.



##########
rfc/rfc-92/rfc-92.md:
##########
@@ -0,0 +1,102 @@
+
+  
+## **Implementation**
+
+The main implementation step here is to create abstraction called 
TableFormatPlugin which handles table format operations such as 

Review Comment:
   I also think the implementation should NOT live within the Hudi project, 
for the same reasons.



##########
rfc/rfc-92/rfc-92.md:
##########
@@ -0,0 +1,102 @@
+4. It's very standard practice in databases to allow multiple storage backends 
(MySQL supports myISAM, innodb/btree, myrocks/lsm). This may be crucial step 
towards the database northstar vision.

Review Comment:
   Technically speaking, we can have an LSM-based data layout, e.g., HUDI-5454.  
We didn't pursue this since it didn't provide good performance for lake 
workloads.



