bvaradar opened a new issue, #14263:
URL: https://github.com/apache/hudi/issues/14263

   **Background Context**
   
   Avro `Schema` objects are currently used directly across the codebase. To centralize schema management, we need to introduce a dedicated Hudi schema abstraction. As part of RFC-99 (Hudi Type System), this issue and its child issues track Phase 1: Schema Consolidation, the critical first step toward a unified Hudi type system.
   
   **Why Consolidation Matters**
   _Centralized Control_: a single point of schema API management enables consistent behavior.
   _Future Extensibility_: a wrapper approach allows adding Hudi-specific functionality without breaking existing code.
   _Maintainability_: schema operations become easier to debug, optimize, and evolve.
   
   **What Will Be Consolidated**
   
   - In-Memory Schema Processing: write-path and table-metadata schema operations.
   - Table Schema Management: schema creation, validation, and compatibility checking for table operations.
   - Write Path Schema Handling: schema processing during record write, update, and merge operations.
   - Schema Evolution Operations: schema compatibility validation, field addition/removal, and type promotion.
   - Metadata Schema Processing: table metadata operations that involve schema manipulation (NOT the persisted schema format).
   - Query Planning Schema: schema operations used for query optimization and planning.
   
   Specific Module Areas:
   - `org.apache.hudi.table.*` - table service schema operations
   - `org.apache.hudi.io.*` - file I/O schema processing
   - `org.apache.hudi.metadata.*` - metadata service schema operations
   - `org.apache.hudi.client.*` - client schema validation and processing
   - Schema compatibility utilities in `AvroSchemaUtils` → `HoodieSchemaUtils`
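The `AvroSchemaUtils` → `HoodieSchemaUtils` move above can be pictured with a minimal sketch. Nothing below exists in Hudi yet: the method names and bodies are hypothetical placeholders, and schemas are modeled as their JSON text (Avro schemas are JSON) so the example runs without an Avro dependency.

```java
// Sketch only: HoodieSchemaUtils is the facade proposed in this issue, not an
// existing Hudi class. Method bodies are placeholders standing in for real
// schema logic.
public final class HoodieSchemaUtils {

    private HoodieSchemaUtils() {
        // static utility class
    }

    // Single point of control for compatibility checks, so policy such as
    // allowed type promotions can evolve in one place. A real implementation
    // would delegate to Avro schema resolution during the binary-compatible phase.
    public static boolean isCompatible(String writerSchemaJson, String readerSchemaJson) {
        return writerSchemaJson.equals(readerSchemaJson); // placeholder check
    }

    // Single point of control for field lookups used by write-path merging.
    public static boolean hasField(String schemaJson, String fieldName) {
        return schemaJson.contains("\"" + fieldName + "\""); // placeholder probe
    }
}
```

Because every call site routes through one facade, later phases can change behavior (or the backing schema type) in a single place instead of across every module listed above.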
   
   
   **What Will Not Be Touched**
   
   On-Disk Formats (changing these would break Avro compatibility):
   - Record Serialization: Avro records written to Hoodie delta/log files remain Avro.
   - Commit Metadata Storage: the schema field stored in commit metadata stays a serialized Avro schema.
   - Timeline Service: all timeline metadata continues to use Avro serialization.
   - External Avro Sources: hudi-utilities Avro source connectors remain unchanged.
   - Parquet Schema Mapping: Parquet ↔ Avro schema conversion stays as-is.
   
   **Integration Boundaries:**
   
   - Query Engine Integration: Spark/Flink type conversions remain at engine boundaries.
   - Schema Registry: external schema registry integrations remain unchanged.
   - Backward Compatibility: all existing APIs maintain identical behavior.
   
   **Conversion Boundaries:**
   The key approach is explicit conversion points:
   - Memory → Disk: `HoodieSchema.toAvroSchema()` before writing to files/metadata.
   - Disk → Memory: `HoodieSchema.fromAvroSchema()` after reading from files/metadata.
   - Engine Boundaries: convert at Spark/Flink integration points.
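Under the assumption that the wrapper's only Avro touch points are `fromAvroSchema`/`toAvroSchema` (names taken from the list above), the conversion discipline might look like this sketch. The nested `HoodieSchema` class is a dependency-free stand-in, not the real proposed type, which would presumably wrap `org.apache.avro.Schema`:

```java
// Sketch of the conversion discipline: Avro appears only at the explicit
// boundary calls; everything in between sees only HoodieSchema.
public final class ConversionBoundaryDemo {

    static final class HoodieSchema {
        private final String avroJson; // wrapped Avro schema, opaque to callers

        private HoodieSchema(String avroJson) {
            this.avroJson = avroJson;
        }

        // Disk -> Memory: wrap after reading from files/metadata.
        static HoodieSchema fromAvroSchema(String avroJson) {
            return new HoodieSchema(avroJson);
        }

        // Memory -> Disk: unwrap before writing to files/metadata.
        String toAvroSchema() {
            return avroJson;
        }
    }

    // Read side: convert once, right after loading the serialized Avro schema
    // from commit metadata; all in-memory processing then uses HoodieSchema.
    static HoodieSchema readTableSchema(String serializedAvroSchema) {
        return HoodieSchema.fromAvroSchema(serializedAvroSchema);
    }

    // Write side: convert once, right before persisting, leaving the on-disk
    // Avro format untouched.
    static String persistTableSchema(HoodieSchema schema) {
        return schema.toAvroSchema();
    }
}
```

Keeping the conversions at exactly these two points is what lets the on-disk formats in the previous section stay pure Avro while the in-memory code migrates.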
   
   **Guarantees**
   - Binary compatibility with Avro in the initial implementation, to make the migration seamless.
   - Public APIs and converters that mirror the existing ones, making the migration largely mechanical.
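A rough illustration of the "mechanical migration" guarantee: because the new utility mirrors the old one's shape and semantics, a call site changes only the class it names. Both classes below are local stand-ins with hypothetical signatures, not actual Hudi APIs.

```java
// Sketch of the migration guarantee: old and new call sites differ only in the
// utility class referenced, and behave identically during the
// binary-compatible phase.
public final class MigrationDemo {

    static final class AvroSchemaUtilsStandIn { // stands in for the Avro-based helper
        static boolean isSchemaCompatible(String prevJson, String newJson) {
            return prevJson.equals(newJson); // placeholder compatibility check
        }
    }

    static final class HoodieSchemaUtilsStandIn { // proposed replacement, same shape
        static boolean isSchemaCompatible(String prevJson, String newJson) {
            // Delegates while binary compatibility is guaranteed, so behavior
            // is identical by construction.
            return AvroSchemaUtilsStandIn.isSchemaCompatible(prevJson, newJson);
        }
    }

    static boolean validateBefore(String prevJson, String newJson) {
        return AvroSchemaUtilsStandIn.isSchemaCompatible(prevJson, newJson);   // old call site
    }

    static boolean validateAfter(String prevJson, String newJson) {
        return HoodieSchemaUtilsStandIn.isSchemaCompatible(prevJson, newJson); // migrated call site
    }
}
```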
   
   **Development Approach**
   - One main PR introducing the new Hudi schema abstraction, without integration.
   - A series of small incremental PRs adding the methods needed to migrate to the new schema path.
   - A final PR cutting over to the new schema path.
   
   


-- 
This is an automated message from the Apache Git Service.