bvaradar opened a new issue, #14263: URL: https://github.com/apache/hudi/issues/14263
**Background Context**

We currently use Avro's `Schema` class across the codebase. To centralize schema management, we need to introduce a dedicated Hudi schema abstraction. As part of RFC-99 (Hudi Type System), this issue and its child issues track Phase 1: Schema Consolidation, the critical first step toward a unified Hudi type system.

**Why Consolidation Matters**

- _Centralized Control_: a single point of schema API management enables consistent behavior.
- _Future Extensibility_: the wrapper approach allows adding Hudi-specific functionality without breaking existing code.
- _Maintainability_: schema operations become easier to debug, optimize, and evolve.

**What Will Be Consolidated**

In-memory schema processing (write path and table metadata operations):

- Table Schema Management: schema creation, validation, and compatibility checking for table operations.
- Write Path Schema Handling: schema processing during record writing, update, and merge operations.
- Schema Evolution Operations: schema compatibility validation, field addition/removal, type promotion.
- Metadata Schema Processing: table metadata that involves schema manipulation (NOT the persisted schema format).
- Query Planning Schema: schema operations used for query optimization and planning.

Specific module areas:

- `org.apache.hudi.table.*` - table service schema operations
- `org.apache.hudi.io.*` - file I/O schema processing
- `org.apache.hudi.metadata.*` - metadata service schema operations
- `org.apache.hudi.client.*` - client schema validation and processing
- Schema compatibility utilities in `AvroSchemaUtils` → `HoodieSchemaUtils`

**What Will Not Be Touched**

On-disk formats (changing these would break Avro compatibility):

- Record Serialization: Avro records written to Hudi delta/log files remain Avro.
- Commit Metadata Storage: the schema field stored in commit metadata stays a serialized Avro schema.
- Timeline Service: all timeline metadata continues to use Avro serialization.
- External Avro Sources: `hudi-utilities` Avro source connectors remain unchanged.
- Parquet Schema Mapping: Parquet ↔ Avro schema conversion stays as-is.

**Integration Boundaries**

- Query Engine Integration: Spark/Flink type conversions remain at engine boundaries.
- Schema Registry: external schema registry integrations are unchanged.
- Backward Compatibility: all existing APIs keep identical behavior.

**Conversion Boundaries**

The key approach is explicit conversion points:

- Memory → Disk: `HoodieSchema.toAvroSchema()` before writing to files/metadata.
- Disk → Memory: `HoodieSchema.fromAvroSchema()` after reading from files/metadata.
- Engine Boundaries: convert at Spark/Flink integration points.

**Guarantees**

- Binary compatibility with Avro in the initial implementation to make the migration seamless.
- Similar public APIs and converters so that switching is more or less mechanical.

**Development Approach**

- One main PR to introduce the new Hudi schema without integration.
- A series of small incremental PRs adding methods needed to migrate callers to the new schema path.
- A final PR to cut over to the new schema path.
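The conversion boundaries described above can be sketched roughly as follows. This is a minimal illustration, not the actual proposed implementation: the `HoodieSchema` shape and the `ConversionBoundaryDemo` class are hypothetical, and the Avro schema is represented here by its JSON string so the example is self-contained (a real wrapper would hold an `org.apache.avro.Schema` instance).

```java
// Hypothetical sketch of a HoodieSchema wrapper with explicit conversion
// points at the memory/disk boundary. The Avro schema is represented by its
// JSON string to keep the example dependency-free.
class HoodieSchema {
    private final String avroSchemaJson;

    private HoodieSchema(String avroSchemaJson) {
        this.avroSchemaJson = avroSchemaJson;
    }

    // Disk -> Memory: wrap the Avro schema right after reading files/metadata.
    static HoodieSchema fromAvroSchema(String avroSchemaJson) {
        return new HoodieSchema(avroSchemaJson);
    }

    // Memory -> Disk: unwrap back to Avro just before writing files/metadata.
    String toAvroSchema() {
        return avroSchemaJson;
    }
}

public class ConversionBoundaryDemo {
    public static void main(String[] args) {
        String onDisk = "{\"type\":\"record\",\"name\":\"trips\",\"fields\":[]}";
        // All in-memory processing works against the wrapper...
        HoodieSchema inMemory = HoodieSchema.fromAvroSchema(onDisk);
        // ...and the round trip is lossless, preserving binary compatibility.
        System.out.println(inMemory.toAvroSchema().equals(onDisk)); // prints "true"
    }
}
```

Because the wrapper only adds behavior and never a new serialized form, call sites can migrate one at a time while the on-disk format stays pure Avro.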
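For the `AvroSchemaUtils` → `HoodieSchemaUtils` migration, here is a sketch of the kind of evolution-compatibility rule that would move. Everything in it is illustrative: the method name `isCompatibleEvolution` is an assumption, and a field is simplified to a name plus a nullable flag, whereas the real utilities operate on full Avro types and also handle type promotion and nested records.

```java
import java.util.Map;

// Hypothetical sketch of a schema-evolution compatibility rule of the kind
// AvroSchemaUtils provides today and HoodieSchemaUtils would take over.
// Fields are simplified to name -> isNullable for the example.
class HoodieSchemaUtils {
    // An evolved schema is treated as compatible when no existing field was
    // dropped and every newly added field is nullable, so records written
    // with the old schema can default the missing value to null.
    static boolean isCompatibleEvolution(Map<String, Boolean> oldFields,
                                         Map<String, Boolean> newFields) {
        if (!newFields.keySet().containsAll(oldFields.keySet())) {
            return false; // a dropped field breaks readers of old file groups
        }
        for (Map.Entry<String, Boolean> field : newFields.entrySet()) {
            boolean added = !oldFields.containsKey(field.getKey());
            if (added && !field.getValue()) {
                return false; // newly added fields must be nullable
            }
        }
        return true;
    }
}

public class CompatibilityDemo {
    public static void main(String[] args) {
        Map<String, Boolean> v1 = Map.of("id", false, "ts", false);
        // Adding a nullable field is fine; adding a required one is not.
        System.out.println(HoodieSchemaUtils.isCompatibleEvolution(
            v1, Map.of("id", false, "ts", false, "note", true)));   // prints "true"
        System.out.println(HoodieSchemaUtils.isCompatibleEvolution(
            v1, Map.of("id", false, "ts", false, "score", false))); // prints "false"
    }
}
```

Consolidating such checks behind one utility class is what gives the "single point of schema API management" called out above.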
