eyalleshem commented on issue #2036: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/2036#issuecomment-3613724666
## Progress Update: Borrowed Tokenizer Implementation & Performance Analysis

I've reached a point where I have working code that borrows the following token types:

* `Word` (identifiers and keywords)
* `SingleQuotedString`
* `Comment` (single-line and multiline)

The implementation is available in [this commit](https://github.com/apache/datafusion-sqlparser-rs/commit/f8c666c4f9af412e4737800aa17db61b78ad02aa) (not yet opened as a PR; it needs some rebase work first).

### Benchmarking Methodology

To measure the impact, I created a large SQL query (~30K strings parsed into ~12K tokens) that heavily uses the borrowed token types. The test SQL is available [here](https://gist.github.com/eyalleshem/4314670906cd8feb5f8f2d523b1f9399).

I built a benchmark that tokenizes the query 1000 times, using:

- [dhat](https://github.com/nnethercote/dhat-rs) for memory allocation profiling
- Simple timers for runtime measurements (with dhat disabled to avoid profiling overhead)

I also created a benchmark example (`examples/benchmark_test.rs`) that uses dhat to measure allocation impact and help quantify the performance improvements from this commit.

### Results

The numbers are interesting but somewhat disappointing.

#### Memory (analyzed by dhat)

**Before changes:**

```
dhat: Total:     4,440,724,375 bytes in 11,641,632 blocks
dhat: At t-gmax: 2,929,791 bytes in 5,111 blocks
```

**After changes:**

```
dhat: Total:     4,333,007,766 bytes in 511,513 blocks
dhat: At t-gmax: 2,884,856 bytes in 102 blocks
```

We reduced allocation operations dramatically:

- **Total allocations**: 11,641,632 → 511,513 blocks (a 95.6% reduction)
- **Peak allocations**: 5,111 → 102 blocks (a 98% reduction)
- **Total memory allocated**: only a small reduction (~2.4%)

#### Runtime

Performance regressed from **2,915 µs per query** to **3,743 µs per query** (~28% slower).

### Analysis

#### Why is the total memory improvement negligible?
Looking at the dhat analysis, the dominant memory usage comes from `buf.push` in `tokenize_with_location_into_buf` ([link](https://github.com/apache/datafusion-sqlparser-rs/blob/main/src/tokenizer.rs#L904)). From the dhat output (a 2-iteration example):

```
Total:     8,650,016 bytes (99.9%, 278,144,506.25/s) in 28 blocks (2.73%, 900.35/s)
Avg size:  308,929.14 bytes, avg lifetime: 474.39 µs (1.53% of program duration)
Max:       2,883,568 bytes in 2 blocks, avg size: 1,441,784 bytes
```

The math checks out: for 12,300 tokens, amortized vector growth rounds the capacity up to 16,384 entries at 88 bytes each, i.e. 1,441,792 bytes per buffer × 2 token arrays = 2,883,584 bytes, closely matching dhat's report.

**The borrowed-tokens optimization only affects the strings _within_ each token struct, not the vector allocation itself, which dominates memory usage.**

#### About the performance regression

I haven't done runtime profiling yet to identify specific hotspots, but we added position-tracking logic without removing the old peekable-char logic. I expected that reducing allocation operations by 95%+ would cut system calls enough to offset this overhead, but that didn't materialize.

One theory: the benchmark runs on a fresh heap that isn't fragmented yet, making allocations cheaper (less searching for free blocks, less OS interaction). The allocation reduction might show more benefit in a real-world scenario with a fragmented heap, but I'm not certain about this.

### Next Steps

I plan to investigate the 28% performance regression to understand where the overhead is coming from before making a final decision on this approach.

### Questions for Maintainers

Given the radical nature of these tokenization changes and the unclear performance benefits so far, do you think it's worth exploring additional use cases (many small queries, the full parsing pipeline, multi-threaded workloads) to better understand where this optimization might help? Or do the changes seem too invasive for the gains we're seeing?
