eyalsatori commented on issue #2036: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/2036#issuecomment-3622531203
**Progress Update: Found the Main Cause of the Performance Regression**

**TL;DR** The regression came from how I handled the [`make_word`](https://github.com/apache/datafusion-sqlparser-rs/blob/main/src/tokenizer.rs#L397) function. With that fixed, string borrowing improves the benchmark from 2,915 µs to 2,801 µs (~4%).

**Root Cause** To avoid the `word.to_uppercase()` call in [`make_word`](https://github.com/apache/datafusion-sqlparser-rs/blob/main/src/tokenizer.rs#L397), I had implemented a custom case-insensitive string comparison function. Profiling showed that this function was expensive and caused the regression.

**Solution** Instead of using the `ALL_KEYWORDS` array, I created a `HashMap` whose keys are the keywords stored as [`UniCase`](https://docs.rs/unicase/latest/unicase/) strings. The map is initialized once at runtime using [`OnceLock`](https://doc.rust-lang.org/std/sync/struct.OnceLock.html), which gives O(1) average-case lookups.

**Next Steps** Although the benchmark shows only a ~4% improvement, I expect the real-world impact to be larger: this approach greatly reduces allocations, so programs with more fragmented heaps should benefit more. I feel confident continuing in this direction.
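
For illustration, here is a minimal, std-only sketch of the lookup pattern described above. It is not the code from the branch: `CaseInsensitive` is a hypothetical ASCII-only stand-in for `UniCase<&str>`, the three-variant `Keyword` enum is a stand-in for the real keyword list, and `keyword_map` and `lookup_keyword` are made-up names.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::OnceLock;

// Hypothetical, trimmed-down keyword enum (the real crate has many more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Keyword {
    Select,
    From,
    Where,
}

// Hypothetical stand-in for `UniCase<&str>`: a borrowed string that
// compares and hashes ASCII case-insensitively, without allocating.
#[derive(Debug, Clone, Copy)]
struct CaseInsensitive<'a>(&'a str);

impl PartialEq for CaseInsensitive<'_> {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(other.0)
    }
}

impl Eq for CaseInsensitive<'_> {}

impl Hash for CaseInsensitive<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the uppercased bytes so that strings that compare equal
        // under `eq` also produce the same hash.
        for b in self.0.bytes() {
            state.write_u8(b.to_ascii_uppercase());
        }
    }
}

// Build the keyword map on first use; later calls reuse the same map.
fn keyword_map() -> &'static HashMap<CaseInsensitive<'static>, Keyword> {
    static MAP: OnceLock<HashMap<CaseInsensitive<'static>, Keyword>> = OnceLock::new();
    MAP.get_or_init(|| {
        HashMap::from([
            (CaseInsensitive("SELECT"), Keyword::Select),
            (CaseInsensitive("FROM"), Keyword::From),
            (CaseInsensitive("WHERE"), Keyword::Where),
        ])
    })
}

// Look up a word from the input without creating an uppercase copy.
fn lookup_keyword(word: &str) -> Option<Keyword> {
    keyword_map().get(&CaseInsensitive(word)).copied()
}
```

With this sketch, `lookup_keyword("select")` and `lookup_keyword("SeLeCt")` both resolve to `Keyword::Select`, while a non-keyword such as `my_table` yields `None`. The borrowed key is what avoids the per-word `String` allocation that `to_uppercase()` would incur.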
