avshenuk commented on PR #16387:
URL: https://github.com/apache/pinot/pull/16387#issuecomment-3092244437
From what I can see, the logic has changed slightly since then. Judging by
`JsonUnnestIngestionFromAvroQueriesTest`, the only difference now is that
```
jsonColumn: [
{"data":{"a":"1","b":"2"},"timestamp":1719390721},
{"data":{"a":"2","b":"4"},"timestamp":1719390722}
]
```
turns into
```
jsonColumn: [
{"data":{"a":"1","b":"2"},"data.a":"1","timestamp":1719390721,"data.b":"2"},
{"data":{"a":"2","b":"4"},"data.a":"2","timestamp":1719390722,"data.b":"4"}
]
```
As you can see, the inner maps of such a JSON object are flattened, so the
resulting data differs from the original.
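To make the difference concrete, here is a minimal sketch that reproduces the
output above. It is not Pinot's actual transformer code: `flatten_map` is a
hypothetical helper that only mimics the observed result, where every nested
map key is added under a dot-separated name while the parent map is kept.

```python
from typing import Any


def flatten_map(value: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Copy every entry and add dot-separated entries for nested map keys.

    Hypothetical helper: it mimics the observed output only, keeping the
    parent map alongside its flattened children.
    """
    out: dict[str, Any] = {}
    for key, child in value.items():
        name = f"{prefix}{key}"
        out[name] = child  # the original entry is kept as-is
        if isinstance(child, dict):
            # nested map: expose its keys as "<name>.<child key>"
            out.update(flatten_map(child, prefix=f"{name}."))
    return out


items = [
    {"data": {"a": "1", "b": "2"}, "timestamp": 1719390721},
    {"data": {"a": "2", "b": "4"}, "timestamp": 1719390722},
]
flattened = [flatten_map(item) for item in items]
```

Each element of `flattened` carries `data`, `data.a`, `data.b` and `timestamp`,
which is the shape the test now produces (key order aside).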
I can see a couple of considerations here:
1. The complex type config always flattens nested maps, regardless of the
"unnest fields" provided. Good or bad, the user has no control over it, and
for a JSON column without arrays it could result in the same issue?
**Possible solution**: introduce a new option similar to "unnest fields", but
for flattening?
2. Different semantics for complex vs. simple items in collections: I can see
why one would want to preserve the original value of the collection after
unnesting, but there is the inconsistency I mentioned before: after unnesting,
the entries of each map item are accessible by dot-separated names, while
non-Map items are not.
Moreover, technically, the "root" map is also lost after unnesting (and one
might need it for whatever reason).
**Possible solutions**:
- Store that root item under a different name?
- Have an option to control whether to preserve the original value under a
different or the same name?
### Maybe instead we could refactor this transformer into a somewhat cleaner, more controllable state?
1. control what fields to "flatten" - only flattens maps (within collections
or not) and only for provided fields
2. control what fields to "unnest" - only unnests collections (without
flattening maps) and only for provided fields
3. control which fields keep the original collection value after unnesting
(the original has to be kept by default for backward compatibility, though the
opposite would be the most logical default imo)
4. combining these 3 lets you decide precisely what end result you want to
achieve
5. allow chaining complex type configs just like transformation configs:
e.g. one config flattens, the next one unnests based on the flattened fields
and so on
Note: the original can be preserved temporarily under the new name
`<original_name>.$ORIGINAL$` and renamed back to its original name ONLY at the
final step of `TransformPipeline`. This way, the unnested object remains
available for further transformations by the user if needed.
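The unnest and keep-original parts of this proposal could look roughly like
the sketch below. Everything in it is an assumption for illustration:
`unnest`, `restore_originals` and the `keep_original` flag are hypothetical
names, not existing Pinot APIs, and the real change would live in the Java
transformer.

```python
from typing import Any

ORIGINAL_SUFFIX = ".$ORIGINAL$"  # temporary name proposed in the note above


def unnest(record: dict[str, Any], field: str,
           keep_original: bool = False) -> list[dict[str, Any]]:
    """Emit one row per element of record[field], without flattening maps."""
    collection = record.get(field)
    if not isinstance(collection, list):
        return [dict(record)]  # nothing to unnest
    rows = []
    for item in collection:
        row = dict(record)
        row[field] = item
        if keep_original:
            # park the whole collection under the temporary name
            row[field + ORIGINAL_SUFFIX] = collection
        rows.append(row)
    return rows


def restore_originals(row: dict[str, Any]) -> dict[str, Any]:
    """Final step: rename '<name>.$ORIGINAL$' back to '<name>'."""
    out = {k: v for k, v in row.items() if not k.endswith(ORIGINAL_SUFFIX)}
    for key, value in row.items():
        if key.endswith(ORIGINAL_SUFFIX):
            # overwrites the unnested item, which is lost in the end
            out[key[: -len(ORIGINAL_SUFFIX)]] = value
    return out


record = {
    "jsonField": [
        {"data": {"a": "1", "b": "2"}, "timestamp": 1719390721},
        {"data": {"a": "2", "b": "4"}, "timestamp": 1719390722},
    ]
}
rows = unnest(record, "jsonField", keep_original=True)
final_rows = [restore_originals(row) for row in rows]
```

Between `unnest` and `restore_originals`, each row exposes the single unnested
item under `jsonField`, so later transformations could still read it before
the original collection is put back.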
### Examples (config names are subject to discussion):
Input:
```
[
  {
    jsonField: [
      {"data":{"a":"1","b":"2"},"timestamp":1719390721},
      {"data":{"a":"2","b":"4"},"timestamp":1719390722}
    ]
  }
]
```
Config:
```
"fieldsToUnnest": "jsonField"
```
Output:
```
[
  {
    jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721}
  },
  {
    jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722}
  }
]
```
-------------------------
Config:
```
"fieldsToUnnestAndKeepOriginals": "jsonField"
```
Output:
```
[
  {
    jsonField.$ORIGINAL$: [ // becomes `jsonField` in the end
      {"data":{"a":"1","b":"2"},"timestamp":1719390721},
      {"data":{"a":"2","b":"4"},"timestamp":1719390722}
    ],
    jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721} // is lost in the end
  },
  {
    jsonField.$ORIGINAL$: [ // becomes `jsonField` in the end
      {"data":{"a":"1","b":"2"},"timestamp":1719390721},
      {"data":{"a":"2","b":"4"},"timestamp":1719390722}
    ],
    jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722} // is lost in the end
  }
]
```
----------------------------
Config 1:
```
"fieldsToFlatten": "jsonField",
"fieldsToUnnestAndKeepOriginals": "jsonField"
```
Output 1:
```
[
  {
    jsonField.$ORIGINAL$: [ // becomes `jsonField` in the end
      {"data":{"a":"1","b":"2"},"timestamp":1719390721},
      {"data":{"a":"2","b":"4"},"timestamp":1719390722}
    ],
    jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721}, // is lost in the end
    jsonField.data: {"a":"1","b":"2"},
    jsonField.timestamp: 1719390721
  },
  {
    jsonField.$ORIGINAL$: [ // becomes `jsonField` in the end
      {"data":{"a":"1","b":"2"},"timestamp":1719390721},
      {"data":{"a":"2","b":"4"},"timestamp":1719390722}
    ],
    jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722}, // is lost in the end
    jsonField.data: {"a":"2","b":"4"},
    jsonField.timestamp: 1719390722
  }
]
```
Config 2 (chained after Config 1):
```
"fieldsToFlatten": "jsonField.data"
```
Output 2:
```
[
  {
    jsonField.$ORIGINAL$: [ // becomes `jsonField` in the end
      {"data":{"a":"1","b":"2"},"timestamp":1719390721},
      {"data":{"a":"2","b":"4"},"timestamp":1719390722}
    ],
    jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721}, // is lost in the end
    jsonField.data: {"a":"1","b":"2"},
    jsonField.data.a: "1",
    jsonField.data.b: "2",
    jsonField.timestamp: 1719390721
  },
  {
    jsonField.$ORIGINAL$: [ // becomes `jsonField` in the end
      {"data":{"a":"1","b":"2"},"timestamp":1719390721},
      {"data":{"a":"2","b":"4"},"timestamp":1719390722}
    ],
    jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722}, // is lost in the end
    jsonField.data: {"a":"2","b":"4"},
    jsonField.data.a: "2",
    jsonField.data.b: "4",
    jsonField.timestamp: 1719390722
  }
]
```
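The two flatten steps in the last example can be read as a chain of one-level
flattens, each limited to the field it names. A minimal sketch, assuming a
hypothetical `flatten_field` helper and starting from a row that has already
been unnested (the `$ORIGINAL$` entry is left out for brevity):

```python
from typing import Any


def flatten_field(row: dict[str, Any], field: str) -> dict[str, Any]:
    """Add '<field>.<key>' for each key of the map at row[field].

    One level only, and only for the named field; nested maps stay intact
    until a later config names them explicitly.
    """
    out = dict(row)
    value = row.get(field)
    if isinstance(value, dict):
        for key, child in value.items():
            out[f"{field}.{key}"] = child
    return out


row = {"jsonField": {"data": {"a": "1", "b": "2"}, "timestamp": 1719390721}}

# Config 1: "fieldsToFlatten": "jsonField"
step1 = flatten_field(row, "jsonField")

# Config 2: "fieldsToFlatten": "jsonField.data", applied to the result of step 1
step2 = flatten_field(step1, "jsonField.data")
```

`step1` gains `jsonField.data` and `jsonField.timestamp`, and `step2` adds
`jsonField.data.a` and `jsonField.data.b` on top, which is why the second
config can only refer to `jsonField.data` once the first one has run.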
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]