pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy
default values of fields if not present when rewriting incoming record with new
schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r397658217
##########
File path:
hudi-common/src/test/java/org/apache/hudi/common/util/TestHoodieAvroUtils.java
##########
@@ -57,4 +60,16 @@ public void testPropsPresent() {
}
Assert.assertTrue("column pii_col doesn't show up", piiPresent);
}
+
+ @Test
+ public void testDefaultValue() {
+ GenericRecord rec = new GenericData.Record(new Schema.Parser().parse(EXAMPLE_SCHEMA));
+ rec.put("_row_key", "key1");
+ rec.put("non_pii_col", "val1");
+ rec.put("pii_col", "val2");
+ rec.put("timestamp", 3.5);
Review comment:
> conversion to avro is internal to Hudi and a custom avro schema (with
> default values) is not something that user can themselves pass
I do not follow this. As a user, I can always specify the schema to be used,
either via FileBasedSchemaProvider or via a schema registry.
Let me give you an example. Suppose a table has schema S1 and you have
published some records (R1 and R2) with this schema to Kafka. Next you evolve
the schema (it now becomes S2) by adding a new nullable field, as below ->
{"name": "col1", "type":["string", "null"], "default": "dummy"}
You then publish more records (R3 and R4) with S2 and start consuming with
delta streamer. The Kafka topic now holds records with both schemas, and
delta streamer uses S2 as the target schema. When writing to Parquet, I want
R1 and R2 to be written with the default value "dummy" for field "col1",
which is a pretty common case. Users generally prefer a default value for
newly added fields over having them written as null. How do you achieve this
without this PR?
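To make the expected behaviour concrete, here is a minimal sketch in plain
Java with no Avro dependency. It is not the code in this PR: the class name
DefaultValueRewriteSketch, the rewrite method, and the use of a Map from field
name to default value as a stand-in for an Avro schema are all assumptions
made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not Hudi or Avro code.
class DefaultValueRewriteSketch {

    // newSchemaDefaults stands in for the target schema S2:
    // field name -> declared default (null means no default declared).
    static Map<String, Object> rewrite(Map<String, Object> oldRecord,
                                       Map<String, Object> newSchemaDefaults) {
        Map<String, Object> rewritten = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : newSchemaDefaults.entrySet()) {
            String name = field.getKey();
            if (oldRecord.containsKey(name)) {
                // The field exists in the incoming record: copy its value as-is.
                rewritten.put(name, oldRecord.get(name));
            } else {
                // The field was added by schema evolution: use the declared default.
                rewritten.put(name, field.getValue());
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        // S2 = S1 plus the new field "col1" with default "dummy".
        Map<String, Object> s2 = new LinkedHashMap<>();
        s2.put("_row_key", null);
        s2.put("col1", "dummy");

        // R1 was published with S1, so it has no "col1".
        Map<String, Object> r1 = new LinkedHashMap<>();
        r1.put("_row_key", "key1");

        // R3 was published with S2 and carries its own value for "col1".
        Map<String, Object> r3 = new LinkedHashMap<>();
        r3.put("_row_key", "key3");
        r3.put("col1", "real");

        System.out.println(rewrite(r1, s2)); // {_row_key=key1, col1=dummy}
        System.out.println(rewrite(r3, s2)); // {_row_key=key3, col1=real}
    }
}
```

In this sketch R1 picks up "dummy" for "col1" while R3 keeps its own value,
which is the outcome I would expect when both kinds of record are written
with S2 as the target schema.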
Open to hearing your thoughts on this.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services