avshenuk commented on PR #16387:
URL: https://github.com/apache/pinot/pull/16387#issuecomment-3092244437

   From what I can see, the logic has changed slightly since then. Looking at
   `JsonUnnestIngestionFromAvroQueriesTest` now, the only difference is that
   ```
   jsonColumn: [
     {"data":{"a":"1","b":"2"},"timestamp":1719390721},
     {"data":{"a":"2","b":"4"},"timestamp":1719390722}
   ]
   ```
   turns into
   ```
   jsonColumn: [
     {"data":{"a":"1","b":"2"},"data.a":"1","timestamp":1719390721,"data.b":"2"},
     {"data":{"a":"2","b":"4"},"data.a":"2","timestamp":1719390722,"data.b":"4"}
   ]
   ```
   As you can see, the inner maps of each JSON object are flattened, so the
   resulting data differs from the original.
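
To make the observed difference concrete, here is a minimal Python sketch that reproduces it for a single collection item. It is not Pinot's implementation: `add_flattened_keys` is a hypothetical helper, and how deeper nesting levels are handled is an assumption, since the test data only shows one level.

```python
from typing import Any, Dict


def add_flattened_keys(item: Dict[str, Any], delimiter: str = ".") -> Dict[str, Any]:
    """Return a copy of `item` with dot-separated keys added for nested maps.

    The nested map itself is kept, which mirrors the output shown above.
    Hypothetical helper for illustration only.
    """
    result = dict(item)

    def walk(prefix: str, value: Dict[str, Any]) -> None:
        for key, child in value.items():
            path = f"{prefix}{delimiter}{key}"
            result[path] = child
            # Assumption: deeper maps are expanded the same way.
            if isinstance(child, dict):
                walk(path, child)

    for key, value in item.items():
        if isinstance(value, dict):
            walk(key, value)
    return result


item = {"data": {"a": "1", "b": "2"}, "timestamp": 1719390721}
flattened = add_flattened_keys(item)
```

Here `flattened` holds the original `data` map plus the extra `data.a` and `data.b` entries, which is exactly why it no longer equals the input item.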
   
   So I can see a couple of considerations here:
   1. The complex type config always flattens nested maps, regardless of the
   "unnest fields" provided. Good or bad, the user has no control over it.
   Could a JSON column without arrays hit the same issue?
   **Possible solution**: introduce a new option similar to "unnest fields", but
   for flattening?
   2. Different semantics for complex vs. simple items in collections: I can see
   why one would want to preserve the original value of the collection after
   unnesting, but there is the inconsistency I mentioned before: after
   unnesting, each Map item is accessible by dot-separated names, while the
   non-Map items are not.
   Moreover, technically, the "root" map is also lost after unnesting (and one
   might need it for whatever reason).
   **Possible solutions**:
   - Store that root item under a different name?
   - Have an option to control whether to preserve the original value under a
   different or the same name?
   
   ### Maybe we could instead refactor this transformer into a cleaner, more controllable state?
   1. Control which fields to "flatten": flatten only maps (within collections
   or not), and only for the provided fields.
   2. Control which fields to "unnest": unnest only collections (without
   flattening maps), and only for the provided fields.
   3. Control for which fields to keep the original collection value after
   unnesting (the original has to be kept by default for backward
   compatibility, although the opposite would be the most logical default imo).
   4. Combining these three lets you decide precisely what end result you want
   to achieve.
   5. Allow chaining complex type configs just like transformation configs,
   e.g. one config flattens, the next one unnests based on the flattened
   fields, and so on.
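
As a rough illustration of points 1 and 2, the two operations could be kept fully independent, as in the Python sketch below. The function names `unnest` and `flatten` are hypothetical, and the one-level-per-call flattening is an assumption chosen to match the examples in this comment.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def unnest(record: Record, field: str) -> List[Record]:
    """Emit one record per element of the collection in `field`.

    Maps inside the elements are left untouched (no implicit flattening).
    """
    value = record.get(field)
    if not isinstance(value, list):
        return [dict(record)]
    return [{**record, field: element} for element in value]


def flatten(record: Record, field: str, delimiter: str = ".") -> Record:
    """Add `field.<key>` entries for the map in `field`, one level only.

    Assumption: the source map stays in place, and deeper levels need
    another explicit flatten call on e.g. `field.<key>`.
    """
    result = dict(record)
    value = record.get(field)
    if isinstance(value, dict):
        for key, child in value.items():
            result[f"{field}{delimiter}{key}"] = child
    return result


record = {
    "jsonField": [
        {"data": {"a": "1", "b": "2"}, "timestamp": 1719390721},
        {"data": {"a": "2", "b": "4"}, "timestamp": 1719390722},
    ]
}
# Unnest first, then flatten only the field that was asked for.
rows = [flatten(row, "jsonField") for row in unnest(record, "jsonField")]
```

Because neither function calls the other, a user who only lists a field for unnesting gets the collection split into rows with the nested maps intact.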
   
   Note: the original can be preserved temporarily under the new name
   `<original_name>.$ORIGINAL$` and renamed back to its original name ONLY in
   the final step of `TransformPipeline`. This way, the unnested object would
   still be available for further transformations by the user if needed.
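
The temporary-name idea can be sketched in Python as follows. This is only an illustration of the proposal, not existing Pinot code: `unnest_keep_original` and `restore_originals` are hypothetical names, and `restore_originals` stands in for whatever the final step of `TransformPipeline` would do.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

ORIGINAL_SUFFIX = ".$ORIGINAL$"


def unnest_keep_original(record: Record, field: str) -> List[Record]:
    """Unnest `field`, parking the whole collection under `<field>.$ORIGINAL$`."""
    value = record.get(field)
    if not isinstance(value, list):
        return [dict(record)]
    # Note: every emitted row references the same original list object.
    return [
        {**record, field: element, field + ORIGINAL_SUFFIX: value}
        for element in value
    ]


def restore_originals(record: Record) -> Record:
    """Final step: move each `<field>.$ORIGINAL$` value back to `<field>`.

    The unnested element stored under `<field>` is overwritten, i.e. lost.
    """
    result = dict(record)
    for key in list(record):
        if key.endswith(ORIGINAL_SUFFIX):
            result[key[: -len(ORIGINAL_SUFFIX)]] = result.pop(key)
    return result


record = {
    "jsonField": [
        {"data": {"a": "1", "b": "2"}, "timestamp": 1719390721},
        {"data": {"a": "2", "b": "4"}, "timestamp": 1719390722},
    ]
}
rows = unnest_keep_original(record, "jsonField")
final_rows = [restore_originals(row) for row in rows]
```

Between the two calls, each row exposes the single unnested element under `jsonField`, so later transformations can still read it before the original collection takes the name back.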
   
   ### Examples (config names are subject to discussion):
   Input:
   ```
   [
     {
       jsonField: [
         {"data":{"a":"1","b":"2"},"timestamp":1719390721},
         {"data":{"a":"2","b":"4"},"timestamp":1719390722}
       ]
     }
   ]
   ```
   Config:
   ```
     "fieldsToUnnest": "jsonField"
   ```
   Output:
   ```
   [
     {
       jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721}
     },
     {
       jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722}
     }
   ]
   ```
   -------------------------
   Config:
   ```
     "fieldsToUnnestAndKeepOriginals": "jsonField",
   ```
   
   Output:
   ```
   [
     {
       jsonField.$ORIGINAL$: [  // becomes `jsonField` in the end
         {"data":{"a":"1","b":"2"},"timestamp":1719390721},
         {"data":{"a":"2","b":"4"},"timestamp":1719390722}
       ],
       jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721} // is lost in the end
     },
     {
       jsonField.$ORIGINAL$: [  // becomes `jsonField` in the end
         {"data":{"a":"1","b":"2"},"timestamp":1719390721},
         {"data":{"a":"2","b":"4"},"timestamp":1719390722}
       ],
       jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722} // is lost in the end
     }
   ]
   ```
   ----------------------------
   Config 1:
   ```
     "fieldsToFlatten": "jsonField",
     "fieldsToUnnestAndKeepOriginals": "jsonField",
   ```
   Output 1:
   ```
   [
     {
       jsonField.$ORIGINAL$: [  // becomes `jsonField` in the end
         {"data":{"a":"1","b":"2"},"timestamp":1719390721},
         {"data":{"a":"2","b":"4"},"timestamp":1719390722}
       ],
       jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721}, // is lost in the end
       jsonField.data: {"a":"1","b":"2"},
       jsonField.timestamp: 1719390721
     },
     {
       jsonField.$ORIGINAL$: [  // becomes `jsonField` in the end
         {"data":{"a":"1","b":"2"},"timestamp":1719390721},
         {"data":{"a":"2","b":"4"},"timestamp":1719390722}
       ],
       jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722}, // is lost in the end
       jsonField.data: {"a":"2","b":"4"},
       jsonField.timestamp: 1719390722
     }
   ]
   ```
   Config 2 (applied on top of Config 1):
   ```
     "fieldsToFlatten": "jsonField.data"
   ```
   Output 2:
   ```
   [
     {
       jsonField.$ORIGINAL$: [  // becomes `jsonField` in the end
         {"data":{"a":"1","b":"2"},"timestamp":1719390721},
         {"data":{"a":"2","b":"4"},"timestamp":1719390722}
       ],
       jsonField: {"data":{"a":"1","b":"2"},"timestamp":1719390721}, // is lost in the end
       jsonField.data: {"a":"1","b":"2"},
       jsonField.data.a: "1",
       jsonField.data.b: "2",
       jsonField.timestamp: 1719390721
     },
     {
       jsonField.$ORIGINAL$: [  // becomes `jsonField` in the end
         {"data":{"a":"1","b":"2"},"timestamp":1719390721},
         {"data":{"a":"2","b":"4"},"timestamp":1719390722}
       ],
       jsonField: {"data":{"a":"2","b":"4"},"timestamp":1719390722}, // is lost in the end
       jsonField.data: {"a":"2","b":"4"},
       jsonField.data.a: "2",
       jsonField.data.b: "4",
       jsonField.timestamp: 1719390722
     }
   ]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

