Tushar7012 commented on issue #20052:
URL: https://github.com/apache/datafusion/issues/20052#issuecomment-3813920291
Hi @alamb, I'd like to work on this issue!
## Root Cause Analysis
The regression appears to start with PR #19674, which introduced name-based
struct field matching. The likely sources of overhead are:
1. **`cast_struct_column()` in `nested_struct.rs`** - performs field-by-name
matching with recursive struct handling
2. **`validate_struct_compatibility()`** - comprehensive compatibility
checks on every struct cast
3. **Additional validation in `ColumnarValue::cast_to`** - routes all struct
casts through the new logic
## Proposed Approach
1. **Profile extended tests** to identify which tests are most affected
2. **Optimize hot paths**:
- Add fast-path when source/target schemas are identical
- Skip re-validation when compatibility was already verified at planning time
- Consider `#[inline]` hints for frequently-called casting functions
3. **Reduce overhead**:
- Early bailout in `validate_struct_compatibility()` for identical types
- Lazy evaluation for expensive field matching operations
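To make the fast-path and early-bailout ideas concrete, here is a minimal sketch. It does not use the real arrow-rs or DataFusion types: `DataType`, `Field`, `CastPlan`, `leaf_castable`, and `plan_struct_cast` are all hypothetical stand-ins, and the leaf-cast rule (`Int32` to `Int64` only) is an assumption for illustration. The point is only the shape: compare the field lists first, and do name matching only when they differ.

```rust
// Simplified stand-ins for Arrow's DataType/Field; NOT the real arrow-rs or DataFusion API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

impl Field {
    fn new(name: &str, data_type: DataType) -> Self {
        Field { name: name.to_string(), data_type }
    }
}

/// Hypothetical result of planning a struct cast.
#[derive(Debug, PartialEq)]
enum CastPlan {
    /// Source and target are identical: the input can be reused unchanged.
    Identity,
    /// For each target field, the index of the same-named source field,
    /// or `None` when the source has no such field.
    ByName(Vec<Option<usize>>),
}

/// Assumed leaf-cast rule, for illustration only.
fn leaf_castable(from: &DataType, to: &DataType) -> bool {
    from == to || matches!((from, to), (DataType::Int32, DataType::Int64))
}

fn plan_struct_cast(source: &[Field], target: &[Field]) -> Result<CastPlan, String> {
    // Early bailout: identical field lists need no name matching or validation.
    if source == target {
        return Ok(CastPlan::Identity);
    }
    let mut mapping = Vec::with_capacity(target.len());
    for t in target {
        // Name-based lookup; only reached on the slow path.
        let found = source.iter().position(|s| s.name == t.name);
        if let Some(i) = found {
            match (&source[i].data_type, &t.data_type) {
                // Nested structs are validated recursively (and hit the same fast path).
                (DataType::Struct(sf), DataType::Struct(tf)) => {
                    plan_struct_cast(sf, tf)?;
                }
                (from, to) if leaf_castable(from, to) => {}
                (from, to) => {
                    return Err(format!(
                        "cannot cast field '{}' from {:?} to {:?}",
                        t.name, from, to
                    ));
                }
            }
        }
        mapping.push(found);
    }
    Ok(CastPlan::ByName(mapping))
}
```

In this sketch the equality check is itself a recursive comparison, so it is only a win if it is cheaper than the name-matching path. Whether that holds for the real code is something the profiling step would need to confirm.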
## Next Steps
1. Set up local profiling to identify exact bottlenecks
2. Compare test durations before/after PR #19674
3. Submit targeted optimization PR
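For the before/after comparison, a rough wall-clock helper like the one below could serve as a first pass ahead of a proper profiler. It uses only the standard library; `mean_duration` is a hypothetical name, not an existing DataFusion utility.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` `iters` times and returns the mean
/// wall-clock time per call.
fn mean_duration<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    assert!(iters > 0, "iters must be positive");
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    // `Duration` implements `Div<u32>`, giving the average per iteration.
    start.elapsed() / iters
}
```

Running the same closure on a checkout before and after PR #19674 would give comparable numbers. Wrapping the measured result in `std::hint::black_box` helps keep the optimizer from eliding the work.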
Could I please be assigned to this issue?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]