[ 
https://issues.apache.org/jira/browse/NIFI-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893988#comment-17893988
 ] 

Pierre Villard commented on NIFI-13843:
---------------------------------------

Hey [~dstiegli1] - given that this can be considered as a breaking change, I'd 
not recommend backporting this into 1.x.

> Unknown fields not dropped by JSON Writer as expected by specified schema
> -------------------------------------------------------------------------
>
>                 Key: NIFI-13843
>                 URL: https://issues.apache.org/jira/browse/NIFI-13843
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.27.0, 2.0.0-M4
>            Reporter: Pierre Villard
>            Assignee: Pierre Villard
>            Priority: Major
>             Fix For: 2.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Consider the following use case:
>  * GFF Processor, generating a JSON with 3 fields: a, b, and c
>  * ConvertRecord with JSON Reader / JSON Writer
>  ** Both reader and writer are configured with a schema only specifying 
> fields a and b
> The expected result is a JSON that only contains fields a and b.
> We're following the below path in the code:
>  * AbstractRecordProcessor (L131)
> {code:java}
> Record firstRecord = reader.nextRecord(); {code}
> In this case, the default method for nextRecord() is defined in RecordReader 
> (L50)
> {code:java}
> default Record nextRecord() throws IOException, MalformedRecordException {
>     return nextRecord(true, false);
> } {code}
> where we are NOT dropping the unknown fields (Java doc needs some fixing here 
> as it is saying the opposite)
> We get to 
> {code:java}
> writer.write(firstRecord); {code}
> which gets us to
>  * WriteJsonResult (L206)
> Here, we do a check
> {code:java}
> isUseSerializeForm(record, writeSchema) {code}
> which currently returns true when it should not. Because of this we write the 
> serialised form which ignores the writer schema.
> In this method isUseSerializeForm(), we do check
> {code:java}
> record.getSchema().equals(writeSchema) {code}
> But at this point record.getSchema() returns the schema defined in the reader 
> which is equal to the one defined in the writer - even though the record has 
> additional fields compared to the defined schema.
> The suggested fix is check is to also add a check on
> {code:java}
> record.isDropUnknownFields() {code}
> If dropUnknownFields is false, then we do not use the serialised form.
> While this does solve the issue, I'm a bit conflicted on the current 
> approach. Not only this could have a performance impact (we are likely going 
> to not use the serialized form as often), but it also feels like the default 
> should be to ignore the unknown fields when reading the record.
> If we consider the below scenario:
>  * GFF Processor, generating a JSON with 3 fields: {{{}a{}}}, {{{}b{}}}, and 
> {{c}}
>  * ConvertRecord with JSON Reader / JSON Writer
>  ** JSON reader with a schema only specifying fields {{a}} and {{b}}
>  ** JSON writer with a schema specifying fields {{{}a{}}}, {{{}b{}}}, and 
> {{c}} ({{{}c{}}} defaulting to {{{}null{}}})
> It feels like the expected result should be a JSON with the field {{c}} and a 
> {{null}} value, because the reader would drop the field when reading the JSON 
> and converting it into a record and pass it to the writer.
> If we agree on the above, then it may be easier to juste override 
> {{nextRecord()}} in {{AbstractJsonRowRecordReader}} and default to 
> {{{}nextRecord(true, true){}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to