[
https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josiah Johnston updated NIFI-9462:
----------------------------------
Description:
I use GenerateFlowFile with this JSON line content, and send it through an
UpdateRecord processor that uses a JsonTreeReader with schema inference. The
UpdateRecord processor adds a top level `s3_key` element with the filename.
{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}
In the output, the structure of `_source.metadata.*` is always strictly based
on the first record, causing data loss for subsequent records that have
different fields.
{"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
{"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
{"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
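The loss can be checked mechanically. What follows is a minimal Python sketch, separate from the NiFi flow itself: `leaf_paths` and `lost_paths` are hypothetical helper names, and the input and output records are copied from this report.

```python
import json

# Input records and UpdateRecord output, copied from the report.
INPUT_LINES = [
    '{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}',
    '{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}',
    '{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", '
    '"metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}',
]
OUTPUT_LINES = [
    '{"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},'
    '"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}',
    '{"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},'
    '"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}',
    '{"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},'
    '"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}',
]


def leaf_paths(value, prefix=""):
    """Return the dotted paths of every non-null leaf value."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths |= leaf_paths(child, child_prefix)
        return paths
    return set() if value is None else {prefix}


def lost_paths(input_line, output_line):
    """Paths that carry a value in the input but not in the output."""
    before = leaf_paths(json.loads(input_line))
    after = leaf_paths(json.loads(output_line))
    return sorted(before - after)


for number, (src, dst) in enumerate(zip(INPUT_LINES, OUTPUT_LINES), start=1):
    print(f"record {number}: lost {lost_paths(src, dst)}")
```

For the records above, this reports no loss for record 1, `_source.metadata.temperature` for record 2, and `_source.metadata.other_L3_keys` for record 3.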
In general it drops all 3rd level keys that weren't seen in the first record
(_source.metadata.temperature in record 2, _source.metadata.other_L3_keys in
record 3). This behavior only applies to keys in the 3rd level; schema
inference works as documented (scanning through all records) for alternative
keys in the 1st & 2nd level.
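For comparison, here is a rough Python sketch of the behavior I expected from the documentation, where every record contributes fields at every nesting level. It is an illustration only and is not NiFi's inference code: `merge_fields`, `infer_schema`, and `conform` are hypothetical names, and type handling is reduced to Python type names.

```python
import json

# Input records, copied from the report.
LINES = [
    '{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}',
    '{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}',
    '{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", '
    '"metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}',
]


def merge_fields(schema, record):
    """Merge one record's keys into a nested schema, recursing into objects."""
    for key, value in record.items():
        if isinstance(value, dict):
            child = schema.get(key)
            if not isinstance(child, dict):
                child = schema[key] = {}
            merge_fields(child, value)
        else:
            # Assumption: the first type seen wins; real inference would widen types.
            schema.setdefault(key, type(value).__name__)
    return schema


def infer_schema(lines):
    """Build the schema from all records, not only the first one."""
    schema = {}
    for line in lines:
        merge_fields(schema, json.loads(line))
    return schema


def conform(record, schema):
    """Project a record onto the schema, filling absent fields with None."""
    out = {}
    for key, kind in schema.items():
        value = record.get(key)
        if isinstance(kind, dict):
            out[key] = conform(value if isinstance(value, dict) else {}, kind)
        else:
            out[key] = value
    return out


schema = infer_schema(LINES)
for line in LINES:
    print(json.dumps(conform(json.loads(line), schema)))
```

With a schema merged this way, `_source.metadata` carries `voltage`, `temperature`, and `other_L3_keys`, so records 2 and 3 keep their 3rd level values.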
This behavior persists whether I specify the input as JSON lines (shown in this
example), or if I rearrange it to be a JSON array.
I've attached screenshots of a minimal example and settings of JSON reader &
writer.
was:
I use GenerateFlowFile with this JSON line content, and send it through an
UpdateRecord processor that uses a JsonTreeReader with schema inference. The
UpdateRecord processor adds a top level `s3_key` element with the filename.
{quote}{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}
{quote}
In the output, the structure of `_source.metadata.*` is always strictly based
on the first record, causing data loss for subsequent records that have
different fields.
{quote}{"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
{"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
{"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
{quote}
In general it drops all 3rd level keys that weren't seen in the first record
(_source.metadata.temperature in record 2, _source.metadata.other_L3_keys in
record 3). This behavior only applies to keys in the 3rd level; schema
inference works as documented (scanning through all records) for alternative
keys in the 1st & 2nd level.
This behavior persists whether I specify the input as JSON lines (shown in this
example), or if I rearrange it to be a JSON array.
I've attached screenshots of a minimal example and settings of JSON reader &
writer.
> JsonTreeReader schema inference only examines first record for parts of
> structure, causing data loss for subsequent records
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-9462
> URL: https://issues.apache.org/jira/browse/NIFI-9462
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.13.0
> Reporter: Josiah Johnston
> Priority: Major
> Attachments: JSON record set writer.png, JSON tree reader.png,
> flow.png, updateRecord.png
>
>
> I use GenerateFlowFile with this JSON line content, and send it through an
> UpdateRecord processor that uses a JsonTreeReader with schema inference. The
> UpdateRecord processor adds a top level `s3_key` element with the filename.
> {"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
> {"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
> {"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}
> In the output, the structure of `_source.metadata.*` is always strictly based
> on the first record, causing data loss for subsequent records that have
> different fields.
> {"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> In general it drops all 3rd level keys that weren't seen in the first record
> (_source.metadata.temperature in record 2, _source.metadata.other_L3_keys in
> record 3). This behavior only applies to keys in the 3rd level; schema
> inference works as documented (scanning through all records) for alternative
> keys in the 1st & 2nd level.
> This behavior persists whether I specify the input as JSON lines (shown in
> this example), or if I rearrange it to be a JSON array.
> I've attached screenshots of a minimal example and settings of JSON reader &
> writer.