balaji-varadarajan-ai commented on code in PR #18274:
URL: https://github.com/apache/hudi/pull/18274#discussion_r2912824867


##########
rfc/rfc-99/rfc-99.md:
##########
@@ -209,4 +209,299 @@ SQL Extensions needs to be added to define the table in a 
hudi type native way.
 
 TODO: There is an open question regarding the need to maintain type ids to 
track schema evolution and how it would interplay with NBCC. 
 
-The main implementation change would require replacing the Avro schema 
references with the new type system. 
+The main implementation change would require replacing the Avro schema 
references with the new type system.
+
+---
+
+## Variant Type Implementation
+
+This section documents the implementation of the VARIANT type in Hudi, which provides first-class support for semi-structured data (e.g., JSON). The Variant type is implemented following Spark 4.0's native VariantType specification.
+
+### Overview
+
+The Variant type enables Hudi to store and query semi-structured data efficiently. It is particularly useful for:
+- Schema-on-read flexibility for evolving data structures
+- Storing JSON-like data without requiring predefined schemas
+
+### Architecture
+
+Variant support is built on a **layered architecture** with version-specific adapters:
+
+```
+┌────────────────────────────────────────────────────┐
+│            Application Layer (Spark SQL)           │
+│    SELECT parse_json('{"a": 1}') as data           │
+└────────────────────────────────────────────────────┘
+                        │
+                        ▼
+┌────────────────────────────────────────────────────┐
+│              Spark Version Adapters                │
+│  ┌──────────────────┐  ┌────────────────────────┐  │
+│  │ BaseSpark3Adapter│  │   BaseSpark4Adapter    │  │
+│  │ (No Variant)     │  │   (Full Variant)       │  │
+│  └──────────────────┘  └────────────────────────┘  │
+└────────────────────────────────────────────────────┘
+                        │
+                        ▼
+┌────────────────────────────────────────────────────┐
+│             HoodieSchema.Variant                   │
+│     (Avro Logical Type + Record Schema)            │
+└────────────────────────────────────────────────────┘
+                        │
+                        ▼
+┌────────────────────────────────────────────────────┐
+│              Parquet Storage                       │
+│    GROUP { value: BINARY, metadata: BINARY }       │
+└────────────────────────────────────────────────────┘
+```
+
+### Variant Schema Definition
+
+The `HoodieSchema.Variant` class in `hudi-common` defines the Variant type:
+
+```java
+public static class Variant extends HoodieSchema {
+    private static final String VARIANT_METADATA_FIELD = "metadata";
+    private static final String VARIANT_VALUE_FIELD = "value";
+    private static final String VARIANT_TYPED_VALUE_FIELD = "typed_value";
+
+    private final boolean isShredded;
+    private final Option<HoodieSchema> typedValueSchema;
+}
+```
+
+#### Two Storage Modes
+
+1. **Unshredded Variant** (Default):
+   - Created with: `HoodieSchema.createVariant()`
+   - Structure: Record with two REQUIRED binary fields
+   - Fields: `metadata` (BYTES, REQUIRED), `value` (BYTES, REQUIRED)
+   - Use case: Simple semi-structured data storage
+
+2. **Shredded Variant** (Future Enhancement):
+   - Created with: `HoodieSchema.createVariantShredded(typedValueSchema)`
+   - Structure: Record with optional `typed_value` field
+   - Fields: `value` (BYTES, OPTIONAL), `metadata` (BYTES, REQUIRED), `typed_value` (optional)
+   - Use case: Schema evolution where certain fields are extracted and typed for optimized access
+
+#### Custom Avro Logical Type
+
+Variant uses a custom Avro logical type for identification:
+
+```java
+public static class VariantLogicalType extends LogicalType {
+    private static final String VARIANT_LOGICAL_TYPE_NAME = "variant";
+}
+```
+
+### On-Disk Representation (Parquet)
+
+Variant data is stored in Parquet as a GROUP type with binary fields:
+
+```
+message schema {
+  required group variant_column {
+    required binary value;
+    required binary metadata;
+  }
+}
+```
+
+#### Binary Format
+
+The Variant type follows Spark 4.0's internal binary representation:
+
+| Component | Description |
+|-----------|-------------|
+| **value** | Binary encoding of the actual data (scalars, objects, arrays) |
+| **metadata** | Dictionary of field names and type information for efficient access |
+
+Example for `{"updated": true, "new_field": 123}`:
+
+```
+Value Bytes:   [0x02, 0x02, 0x01, 0x00, 0x01, 0x00, 0x03, 0x04, 0x0C, 0x7B]
+Metadata Bytes: [0x01, 0x02, 0x00, 0x07, 0x10, "updated", "new_field"]
+```
+
+The metadata contains a dictionary of all field names, while the value contains references to these fields plus the actual data values.
+
+### Schema Evolution Support
+
+Variant types provide **schema-on-read** flexibility:
+
+| Aspect | Behavior |
+|--------|----------|
+| Adding new fields | ✅ Supported - New JSON fields can be added without schema changes |
+| Removing fields | ✅ Supported - Missing fields return null on read |
+| Type changes within JSON | ✅ Supported - Variant can store any JSON-compatible type |
+| Table schema evolution | ✅ Supported - Variant column can be added to existing tables |
+| Hudi schema evolution | ✅ Supported - Works with Hudi's standard schema evolution |
+
+**Important**: The schema flexibility is within the Variant column itself. The table-level schema (including the Variant column definition) still follows Hudi's standard schema evolution rules.
+
+### Column Statistics and Indexing
+
+| Feature | Support Status |
+|---------|----------------|

Review Comment:
   @voonhous: How does this level of support compare with other lakehouse solutions? Do they support column stats?



##########
rfc/rfc-99/rfc-99.md:
##########
@@ -209,4 +209,299 @@ SQL Extensions needs to be added to define the table in a 
hudi type native way.
 
 TODO: There is an open question regarding the need to maintain type ids to 
track schema evolution and how it would interplay with NBCC. 
 
-The main implementation change would require replacing the Avro schema 
references with the new type system. 
+The main implementation change would require replacing the Avro schema 
references with the new type system.
+
+---
+
+## Variant Type Implementation
+
+This section documents the implementation of the VARIANT type in Hudi, which provides first-class support for semi-structured data (e.g., JSON). The Variant type is implemented following Spark 4.0's native VariantType specification.
+
+### Overview
+
+The Variant type enables Hudi to store and query semi-structured data efficiently. It is particularly useful for:
+- Schema-on-read flexibility for evolving data structures
+- Storing JSON-like data without requiring predefined schemas
+
+### Architecture
+
+Variant support is built on a **layered architecture** with version-specific adapters:
+
+```
+┌────────────────────────────────────────────────────┐
+│            Application Layer (Spark SQL)           │
+│    SELECT parse_json('{"a": 1}') as data           │
+└────────────────────────────────────────────────────┘
+                        │
+                        ▼
+┌────────────────────────────────────────────────────┐
+│              Spark Version Adapters                │
+│  ┌──────────────────┐  ┌────────────────────────┐  │
+│  │ BaseSpark3Adapter│  │   BaseSpark4Adapter    │  │
+│  │ (No Variant)     │  │   (Full Variant)       │  │
+│  └──────────────────┘  └────────────────────────┘  │
+└────────────────────────────────────────────────────┘
+                        │
+                        ▼
+┌────────────────────────────────────────────────────┐
+│             HoodieSchema.Variant                   │
+│     (Avro Logical Type + Record Schema)            │
+└────────────────────────────────────────────────────┘
+                        │
+                        ▼
+┌────────────────────────────────────────────────────┐
+│              Parquet Storage                       │
+│    GROUP { value: BINARY, metadata: BINARY }       │
+└────────────────────────────────────────────────────┘
+```
+
+### Variant Schema Definition
+
+The `HoodieSchema.Variant` class in `hudi-common` defines the Variant type:
+
+```java
+public static class Variant extends HoodieSchema {
+    private static final String VARIANT_METADATA_FIELD = "metadata";
+    private static final String VARIANT_VALUE_FIELD = "value";
+    private static final String VARIANT_TYPED_VALUE_FIELD = "typed_value";
+
+    private final boolean isShredded;

Review Comment:
   Is changing from shredded to unshredded, or vice versa, considered a backwards-incompatible change for the table?


