goldmedal commented on issue #11268:
URL: https://github.com/apache/datafusion/issues/11268#issuecomment-2211734625

   > btw, there is probably a downside for using MapBuilder, since we need to 
create different builder for different types (therefore many macros) which 
easily causes code bloat. We have similar issue while building array function 
before #7988. I'm not sure what is the best approach yet (probably current 
draft function is the best), worth to explore it.
   
   Besides the code bloat, I'm also concerned about the performance. Because 
`MapBuilder` doesn't provide batch append (I think it's not its main purpose), 
we can only append data entry by entry. I might expect it to have O(n) 
complexity.
   
   I did some benchmarks to compare the `MapBuilder` version and the 
`MapArray::from()` version.
   
https://github.com/goldmedal/datafusion/blob/95198c17eddd0a708e61e66076a383d5671646b7/datafusion/functions/src/core/map.rs#L132
   
   I haven't tried any optimizations for them. The results are as follows:
   
   ```rust
   // The first run is for warm-up
   Generating 1 random key-value pair
   Time elapsed in make_map() is: 449.737µs
   Time elapsed in make_map_batch() is: 129.096µs
   
   Generating 10 random key-value pairs
   Time elapsed in make_map() is: 24.932µs
   Time elapsed in make_map_batch() is: 18.007µs
   
   Generating 50 random key-value pairs
   Time elapsed in make_map() is: 34.587µs
   Time elapsed in make_map_batch() is: 17.037µs
   
   Generating 100 random key-value pairs
   Time elapsed in make_map() is: 66.02µs
   Time elapsed in make_map_batch() is: 19.027µs
   
   Generating 500 random key-value pairs
   Time elapsed in make_map() is: 586.476µs
   Time elapsed in make_map_batch() is: 63.832µs
   
   Generating 1000 random key-value pairs
   Time elapsed in make_map() is: 722.771µs
   Time elapsed in make_map_batch() is: 47.455µs
   ```
   
   The scenario for this function may not be for large maps, but I think the 
`MapArray::from()` version always has better performance and won't cause code 
bloat. I prefer this version.
   
   What do you think? @alamb @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to