[ 
https://issues.apache.org/jira/browse/NIFI-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Villard updated NIFI-13843:
----------------------------------
    Description: 
Consider the following use case:
 * GFF Processor, generating a JSON with 3 fields: a, b, and c
 * ConvertRecord with JSON Reader / JSON Writer
 ** Both reader and writer are configured with a schema only specifying fields 
a and b

The expected result is a JSON that only contains fields a and b.

We're following the below path in the code:
 * AbstractRecordProcessor (L131)

{code:java}
Record firstRecord = reader.nextRecord(); {code}
In this case, the default method for nextRecord() is defined in RecordReader 
(L50)
{code:java}
default Record nextRecord() throws IOException, MalformedRecordException {
    return nextRecord(true, false);
} {code}
where we are NOT dropping the unknown fields (Java doc needs some fixing here 
as it is saying the opposite)

We get to 
{code:java}
writer.write(firstRecord); {code}
which gets us to
 * WriteJsonResult (L206)

Here, we do a check
{code:java}
isUseSerializeForm(record, writeSchema) {code}
which currently returns true when it should not. Because of this we write the 
serialised form which ignores the writer schema.

In this method isUseSerializeForm(), we do check
{code:java}
record.getSchema().equals(writeSchema) {code}
But at this point record.getSchema() returns the schema defined in the reader 
which is equal to the one defined in the writer - even though the record has 
additional fields compared to the defined schema.

The suggested fix is check is to also add a check on
{code:java}
record.isDropUnknownFields() {code}
If dropUnknownFields is false, then we do not use the serialised form.

While this does solve the issue, I'm a bit conflicted on the current approach. 
Not only this could have a performance impact (we are likely going to not use 
the serialized form as often), but it also feels like the default should be to 
ignore the unknown fields when reading the record.

If we consider the below scenario:
 * GFF Processor, generating a JSON with 3 fields: {{{}a{}}}, {{{}b{}}}, and 
{{c}}
 * ConvertRecord with JSON Reader / JSON Writer
 ** JSON reader with a schema only specifying fields {{a}} and {{b}}
 ** JSON writer with a schema specifying fields {{{}a{}}}, {{{}b{}}}, and {{c}} 
({{{}c{}}} defaulting to {{{}null{}}})

It feels like the expected result should be a JSON with the field {{c}} and a 
{{null}} value, because the reader would drop the field when reading the JSON 
and converting it into a record and pass it to the writer.

If we agree on the above, then it may be easier to juste override 
{{nextRecord()}} in {{AbstractJsonRowRecordReader}} and default to 
{{{}nextRecord(true, true){}}}.

  was:
Consider the following use case:
 * GFF Processor, generating a JSON with 3 fields: a, b, and c
 * ConvertRecord with JSON Reader / JSON Writer
 ** Both reader and writer are configured with a schema only specifying fields 
a and b

The expected result is a JSON that only contains fields a and b.

We're following the below path in the code:
 * AbstractRecordProcessor (L131)

{code:java}
Record firstRecord = reader.nextRecord(); {code}
In this case, the default method for nextRecord() is defined in RecordReader 
(L50)
{code:java}
default Record nextRecord() throws IOException, MalformedRecordException {
    return nextRecord(true, false);
} {code}
where we are NOT dropping the unknown fields (Java doc needs some fixing here 
as it is saying the opposite)

We get to 
{code:java}
writer.write(firstRecord); {code}
which gets us to
 * WriteJsonResult (L206)

Here, we do a check
{code:java}
isUseSerializeForm(record, writeSchema) {code}
which currently returns true when it should not. Because of this we write the 
serialised form which ignores the writer schema.

In this method isUseSerializeForm(), we do check
{code:java}
record.getSchema().equals(writeSchema) {code}
But at this point record.getSchema() returns the schema defined in the reader 
which is equal to the one defined in the writer - even though the record has 
additional fields compared to the defined schema.

The suggested fix is check is to also add a check on
{code:java}
record.isDropUnknownFields() {code}
If dropUnknownFields is false, then we do not use the serialised form.

 


> Unknown fields not dropped by JSON Writer as expected by specified schema
> -------------------------------------------------------------------------
>
>                 Key: NIFI-13843
>                 URL: https://issues.apache.org/jira/browse/NIFI-13843
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.27.0, 2.0.0-M4
>            Reporter: Pierre Villard
>            Assignee: Pierre Villard
>            Priority: Major
>
> Consider the following use case:
>  * GFF Processor, generating a JSON with 3 fields: a, b, and c
>  * ConvertRecord with JSON Reader / JSON Writer
>  ** Both reader and writer are configured with a schema only specifying 
> fields a and b
> The expected result is a JSON that only contains fields a and b.
> We're following the below path in the code:
>  * AbstractRecordProcessor (L131)
> {code:java}
> Record firstRecord = reader.nextRecord(); {code}
> In this case, the default method for nextRecord() is defined in RecordReader 
> (L50)
> {code:java}
> default Record nextRecord() throws IOException, MalformedRecordException {
>     return nextRecord(true, false);
> } {code}
> where we are NOT dropping the unknown fields (Java doc needs some fixing here 
> as it is saying the opposite)
> We get to 
> {code:java}
> writer.write(firstRecord); {code}
> which gets us to
>  * WriteJsonResult (L206)
> Here, we do a check
> {code:java}
> isUseSerializeForm(record, writeSchema) {code}
> which currently returns true when it should not. Because of this we write the 
> serialised form which ignores the writer schema.
> In this method isUseSerializeForm(), we do check
> {code:java}
> record.getSchema().equals(writeSchema) {code}
> But at this point record.getSchema() returns the schema defined in the reader 
> which is equal to the one defined in the writer - even though the record has 
> additional fields compared to the defined schema.
> The suggested fix is check is to also add a check on
> {code:java}
> record.isDropUnknownFields() {code}
> If dropUnknownFields is false, then we do not use the serialised form.
> While this does solve the issue, I'm a bit conflicted on the current 
> approach. Not only this could have a performance impact (we are likely going 
> to not use the serialized form as often), but it also feels like the default 
> should be to ignore the unknown fields when reading the record.
> If we consider the below scenario:
>  * GFF Processor, generating a JSON with 3 fields: {{{}a{}}}, {{{}b{}}}, and 
> {{c}}
>  * ConvertRecord with JSON Reader / JSON Writer
>  ** JSON reader with a schema only specifying fields {{a}} and {{b}}
>  ** JSON writer with a schema specifying fields {{{}a{}}}, {{{}b{}}}, and 
> {{c}} ({{{}c{}}} defaulting to {{{}null{}}})
> It feels like the expected result should be a JSON with the field {{c}} and a 
> {{null}} value, because the reader would drop the field when reading the JSON 
> and converting it into a record and pass it to the writer.
> If we agree on the above, then it may be easier to juste override 
> {{nextRecord()}} in {{AbstractJsonRowRecordReader}} and default to 
> {{{}nextRecord(true, true){}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to