xitep commented on issue #2036:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/issues/2036#issuecomment-3626483130

   hello @eyalsatori,
   
   your comment made me compare a older, naive solution i developed earlier as 
part of some educational effort against  the hashmap/unicase setup you 
described. i, too, have been looking how to avoid `str.to_uppercase()` to avoid 
allocations.
   
   determining if a given string is a keyword is a very narrowed task due to 
the list of keywords being known at compile time, all of them being pretty 
short and ascii only. So, I thought of converting the string uppercase but keep 
it on the stack. Unfortunately, it required "unsafe":
   
   ```rust
   // KEYWORDS_MAX_LEN ... max length (in bytes) of the longest keyword
   struct MaybeKeyword([u8; KEYWORDS_MAX_LEN], usize);
   
   impl Deref for MaybeKeyword {
       type Target = [u8];
   
       fn deref(&self) -> &Self::Target {
           &self.0[..self.1]
       }
   }
   
   impl FromStr for MaybeKeyword {
       type Err = ();
   
       #[inline]
       fn from_str(s: &str) -> Result<Self, Self::Err> {
           let len = s.len();
           if len > KEYWORDS_MAX_LEN {
               Err(())
           } else {
               // ~ ensure `s` is uppercase (with minimal overhead)
               let mut upper = const { [MaybeUninit::<u8>::uninit(); 
KEYWORDS_MAX_LEN] };
               for (t, s) in upper.iter_mut().zip(s.as_bytes().iter().copied()) 
{
                   t.write(s.to_ascii_uppercase());
               }
               #[rustfmt::skip]
               let upper = unsafe {
                   std::mem::transmute::<[MaybeUninit<u8>; KEYWORDS_MAX_LEN], 
[u8; KEYWORDS_MAX_LEN]>(upper)
               };
               Ok(MaybeKeyword(upper, len))
           }
       }
   }
   ```
   
   with this i compared the following five "is keyword" implementations:
   ```
   // lookup via `unicase::Ascii` in
   1.    dict_unicase_ascii: &'static HashMap<unicase::Ascii<&'static str>, 
Keyword>,
   
   // lookup via `unicase::UniCase` in
   2.    dict_unicase_utf8: &'static HashMap<unicase::UniCase<&'static str>, 
Keyword>,
   
   // lookup `&*MaybeKeyword::from_str(s)?` in
   3.    dict_bytes_hashmap: &'static HashMap<&'static [u8], Keyword>,
   4.    dict_bytes_ahashmap: &'static ahash::AHashMap<&'static [u8], Keyword>,
   
   // `match` based impl via `let x = &*MaybeKeyword::from_str(s)?;`:
   5.    match x { b"ABORT" => Keyword::ABORT, b"ABS" => ... }````
   ```
   
   surprisingly, i got the following results on my local machine (using the 
keyword set from sqlparser-rs.) "a" is a benchmark over 100% hits (ie. all 
tested strings are keywords.) "b" is a benchmark over words where only approx 
1/5th are keywords:
   
   ```
   1a. from_str_unicase_ascii (all keywords)
                           time:   [35.219 µs 35.275 µs 35.330 µs]
   2a. from_str_unicase_utf8 (all keywords)
                           time:   [40.839 µs 40.924 µs 41.030 µs]
   3a. from_str_bytes_hashmap (all keywords)
                           time:   [25.688 µs 25.813 µs 25.943 µs]
   4a. from_str_bytes_ahashmap (all keywords)
                           time:   [24.229 µs 24.353 µs 24.470 µs]
   5a. from_str_match_based (all keywords)
                           time:   [18.622 µs 18.719 µs 18.815 µs]
   
   
   1b. from_str_unicase_ascii (many not-keywords)
                           time:   [90.389 µs 90.779 µs 91.263 µs]
   2b. from_str_unicase_utf8 (many not-keywords)
                           time:   [111.81 µs 112.14 µs 112.46 µs]
   3b. from_str_bytes_hashmap (many not-keywords)
                           time:   [50.167 µs 50.423 µs 50.668 µs]
   4b. from_str_bytes_ahashmap (many not-keywords)
                           time:   [43.907 µs 44.231 µs 44.553 µs]
   5.b from_str_match_based (many not-keywords)
                           time:   [52.387 µs 52.637 µs 52.902 µs]
   ```
   
   the plain matched based impl is quite competitive. its biggest advantage is 
that it doesn't require any lazy initialisation and, hence, can be used without 
any locks. (in the benchmarks i explicitly avoided acquiring the locks for the 
maps.) i thought about sharing this as a kind of inspiration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to