[
https://issues.apache.org/jira/browse/NIFI-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893974#comment-17893974
]
Daniel Stieglitz commented on NIFI-13843:
-----------------------------------------
[~exceptionfactory] Shouldn't NIFI-13362 also be closed as the solution for
this ticket seems to solve that one also although I have not personally
confirmed this.
> Unknown fields not dropped by JSON Writer as expected by specified schema
> -------------------------------------------------------------------------
>
> Key: NIFI-13843
> URL: https://issues.apache.org/jira/browse/NIFI-13843
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 1.27.0, 2.0.0-M4
> Reporter: Pierre Villard
> Assignee: Pierre Villard
> Priority: Major
> Fix For: 2.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Consider the following use case:
> * GFF Processor, generating a JSON with 3 fields: a, b, and c
> * ConvertRecord with JSON Reader / JSON Writer
> ** Both reader and writer are configured with a schema only specifying
> fields a and b
> The expected result is a JSON that only contains fields a and b.
> We're following the below path in the code:
> * AbstractRecordProcessor (L131)
> {code:java}
> Record firstRecord = reader.nextRecord(); {code}
> In this case, the default method for nextRecord() is defined in RecordReader
> (L50)
> {code:java}
> default Record nextRecord() throws IOException, MalformedRecordException {
> return nextRecord(true, false);
> } {code}
> where we are NOT dropping the unknown fields (Java doc needs some fixing here
> as it is saying the opposite)
> We get to
> {code:java}
> writer.write(firstRecord); {code}
> which gets us to
> * WriteJsonResult (L206)
> Here, we do a check
> {code:java}
> isUseSerializeForm(record, writeSchema) {code}
> which currently returns true when it should not. Because of this we write the
> serialised form which ignores the writer schema.
> In this method isUseSerializeForm(), we do check
> {code:java}
> record.getSchema().equals(writeSchema) {code}
> But at this point record.getSchema() returns the schema defined in the reader
> which is equal to the one defined in the writer - even though the record has
> additional fields compared to the defined schema.
> The suggested fix is check is to also add a check on
> {code:java}
> record.isDropUnknownFields() {code}
> If dropUnknownFields is false, then we do not use the serialised form.
> While this does solve the issue, I'm a bit conflicted on the current
> approach. Not only this could have a performance impact (we are likely going
> to not use the serialized form as often), but it also feels like the default
> should be to ignore the unknown fields when reading the record.
> If we consider the below scenario:
> * GFF Processor, generating a JSON with 3 fields: {{{}a{}}}, {{{}b{}}}, and
> {{c}}
> * ConvertRecord with JSON Reader / JSON Writer
> ** JSON reader with a schema only specifying fields {{a}} and {{b}}
> ** JSON writer with a schema specifying fields {{{}a{}}}, {{{}b{}}}, and
> {{c}} ({{{}c{}}} defaulting to {{{}null{}}})
> It feels like the expected result should be a JSON with the field {{c}} and a
> {{null}} value, because the reader would drop the field when reading the JSON
> and converting it into a record and pass it to the writer.
> If we agree on the above, then it may be easier to juste override
> {{nextRecord()}} in {{AbstractJsonRowRecordReader}} and default to
> {{{}nextRecord(true, true){}}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)