eyalleshem commented on issue #2036:
URL: https://github.com/apache/datafusion-sqlparser-rs/issues/2036#issuecomment-3613724666

   ## Progress Update: Borrowed Tokenizer Implementation & Performance Analysis
   
   I now have working code that borrows the following token types:
   * `Word` (identifiers and keywords)
   * `SingleQuotedString`
   * `Comment` (single-line and multiline)
   
   The implementation is available in [this commit](https://github.com/apache/datafusion-sqlparser-rs/commit/f8c666c4f9af412e4737800aa17db61b78ad02aa) (not yet opened as a PR; it needs some rebase work first).
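
   For illustration, here is a minimal sketch of what a borrowed token can look like. The names are hypothetical (not the types from the commit); the idea is to hold `Cow<'a, str>` so the tokenizer can hand out slices of the input and only allocate when, e.g., unescaping forces an owned copy:

   ```rust
   use std::borrow::Cow;

   /// Hypothetical, simplified token type; illustrative only.
   #[derive(Debug)]
   enum Token<'a> {
       /// Identifier or keyword, borrowed directly from the source text.
       Word(Cow<'a, str>),
       /// 'single quoted string' body; borrowed when it contains no escapes.
       SingleQuotedString(Cow<'a, str>),
       /// -- single-line or /* multiline */ comment body.
       Comment(Cow<'a, str>),
   }

   fn word_token(source: &str, start: usize, end: usize) -> Token<'_> {
       // Zero allocations: the token just points into `source`.
       Token::Word(Cow::Borrowed(&source[start..end]))
   }
   ```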
   
   ### Benchmarking Methodology
   
   To measure the impact, I created a large SQL query (~30K characters, tokenizing into ~12K tokens) that heavily exercises the borrowed token types. The test SQL is available [here](https://gist.github.com/eyalleshem/4314670906cd8feb5f8f2d523b1f9399).
   
   I built a benchmark that tokenizes the query 1000 times, using:
   - [dhat](https://github.com/nnethercote/dhat-rs) for memory allocation 
profiling
   - Simple timers for runtime measurements (with dhat disabled to avoid 
profiling overhead)
   
   I also created a benchmark example (`examples/benchmark_test.rs`) that uses dhat to measure allocation impact and help quantify the performance improvements from this commit.
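
   Roughly, the harness looks like this; a simplified sketch assuming the current `Tokenizer` API and the standard dhat-rs setup (the file path is a placeholder, not the exact contents of `examples/benchmark_test.rs`):

   ```rust
   use sqlparser::dialect::GenericDialect;
   use sqlparser::tokenizer::Tokenizer;
   use std::time::Instant;

   // Route every allocation through dhat so it can count blocks and bytes.
   #[global_allocator]
   static ALLOC: dhat::Alloc = dhat::Alloc;

   fn main() {
       // Enable for allocation profiling; comment out for the timing runs,
       // since the profiler itself adds measurable overhead.
       let _profiler = dhat::Profiler::new_heap();

       // Placeholder path for the ~12K-token test query from the gist.
       let sql = std::fs::read_to_string("large_query.sql").unwrap();
       let dialect = GenericDialect {};

       let start = Instant::now();
       for _ in 0..1000 {
           let tokens = Tokenizer::new(&dialect, &sql).tokenize().unwrap();
           std::hint::black_box(&tokens); // keep the optimizer honest
       }
       println!("avg per query: {:?}", start.elapsed() / 1000);
   }
   ```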
   
   ### Results
   
   The numbers are interesting but somewhat disappointing:
   
   #### Memory (analyzed by dhat)
   
   **Before changes:**
   ```
   dhat: Total:     4,440,724,375 bytes in 11,641,632 blocks
   dhat: At t-gmax: 2,929,791 bytes in 5,111 blocks
   ```
   
   **After changes:**
   ```
   dhat: Total:     4,333,007,766 bytes in 511,513 blocks
   dhat: At t-gmax: 2,884,856 bytes in 102 blocks
   ```
   
   We reduced allocation operations dramatically:
   - **Total allocations**: 11,641,632 → 511,513 blocks (95.6% reduction)
   - **Peak allocations**: 5,111 → 102 blocks (98% reduction)
   - **Total memory allocated**: 4,440,724,375 → 4,333,007,766 bytes (only ~2.4% reduction)
   
   #### Runtime
   
   Runtime increased from **2,915 µs** to **3,743 µs** per query (~28% slower).
   
   ### Analysis
   
   #### Why is total memory improvement negligible?
   
   Looking at the dhat analysis, the dominant memory usage comes from `buf.push` in [`tokenize_with_location_into_buf`](https://github.com/apache/datafusion-sqlparser-rs/blob/main/src/tokenizer.rs#L904).
   
   From the dhat output (a 2-iteration run):
   ```
   Total:    8,650,016 bytes (99.9%, 278,144,506.25/s) in 28 blocks (2.73%, 900.35/s)
   Avg size: 308,929.14 bytes, avg lifetime: 474.39 µs (1.53% of program duration)
   Max:      2,883,568 bytes in 2 blocks, avg size: 1,441,784 bytes
   ```
   
   The math checks out: ~12,300 tokens push the buffer's amortized growth to a capacity of 16,384 entries; at 88 bytes per entry that is 1,441,792 bytes per buffer, and × 2 token arrays gives 2,883,584 bytes, closely matching dhat's reported maximum of 2,883,568 bytes.
   
   **The borrowed tokens optimization only affects the strings _within_ each 
token struct, not the vector allocation itself, which dominates memory usage.**
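
   To make the amortized-growth arithmetic concrete, here is a standalone sketch (not code from the crate) that pushes ~12,300 entries of the 88-byte size dhat observed and prints each capacity jump; with Rust's current doubling growth strategy the buffer ends up sized for 16,384 entries, i.e. 1,441,792 bytes:

   ```rust
   fn main() {
       // Stand-in for a token struct of the size dhat observed (88 bytes).
       let mut buf: Vec<[u8; 88]> = Vec::new();
       let mut last_cap = 0;
       for _ in 0..12_300 {
           buf.push([0u8; 88]);
           if buf.capacity() != last_cap {
               last_cap = buf.capacity();
               println!("capacity: {:>6} entries = {:>9} bytes",
                        last_cap, last_cap * 88);
           }
       }
       // With the current doubling strategy, the last line printed is:
       // capacity:  16384 entries =   1441792 bytes
   }
   ```

   Pre-sizing with `Vec::with_capacity` would eliminate the intermediate reallocations, but not the buffer itself, and that buffer is what dominates total bytes.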
   
   #### About the performance regression
   
   I haven't done runtime profiling yet to identify specific hotspots, but the change adds position-tracking logic without removing the old peekable-char logic. I expected that the 95%+ reduction in allocation operations would cut allocator work and system calls enough to offset this overhead, but that didn't materialize.
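
   To make that "double bookkeeping" concrete, here is a hypothetical sketch (not the crate's actual internals) of a scan loop that still advances a peekable char iterator while also maintaining the byte offsets needed to hand out borrowed slices, so every character updates both:

   ```rust
   use std::iter::Peekable;
   use std::str::Chars;

   /// Hypothetical scanner state: the old peekable-char machinery plus the
   /// new byte-offset tracking needed to hand out borrowed `&str` slices.
   struct Scanner<'a> {
       source: &'a str,
       chars: Peekable<Chars<'a>>, // old logic: per-char iteration
       byte_pos: usize,            // new logic: position tracking for slices
   }

   impl<'a> Scanner<'a> {
       fn new(source: &'a str) -> Self {
           Scanner { source, chars: source.chars().peekable(), byte_pos: 0 }
       }

       fn next_char(&mut self) -> Option<char> {
           let ch = self.chars.next()?;
           // Both bookkeeping schemes are updated on every single character.
           self.byte_pos += ch.len_utf8();
           Some(ch)
       }

       /// Borrow the slice scanned since `start` without copying it.
       fn slice_from(&self, start: usize) -> &'a str {
           &self.source[start..self.byte_pos]
       }
   }
   ```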
   
   One theory: the benchmark runs on a fresh heap that isn't fragmented yet, 
making allocations cheaper (less searching for free blocks, less OS 
interaction). The allocation reduction might show more benefit in a real-world 
scenario with a fragmented heap, but I'm not certain about this.
   
   ### Next Steps
   
   I plan to investigate the 28% performance regression to understand where the 
overhead is coming from before making a final decision on this approach.
   
   ### Questions for Maintainers
   
   Given the radical nature of these tokenization changes and the unclear 
performance benefits so far, do you think it's worth exploring additional use 
cases (many small queries, full parsing pipeline, multi-threaded workloads) to 
better understand where this optimization might help? Or do the changes seem 
too invasive for the gains we're seeing?
   

