zhuqi-lucas commented on code in PR #16630: URL: https://github.com/apache/datafusion/pull/16630#discussion_r2179270018
########## datafusion/physical-plan/src/sorts/cursor.rs: ########## @@ -288,6 +288,64 @@ impl CursorArray for StringViewArray { } } +/// Todo use arrow-rs side api after: <https://github.com/apache/arrow-rs/pull/7748> released +/// Builds a 128-bit composite key for an inline value: +/// +/// - High 96 bits: the inline data in big-endian byte order (for correct lexicographical sorting). +/// - Low 32 bits: the length in big-endian byte order, acting as a tiebreaker so shorter strings +/// (or those with fewer meaningful bytes) always numerically sort before longer ones. +/// +/// This function extracts the length and the 12-byte inline string data from the raw +/// little-endian `u128` representation, converts them to big-endian ordering, and packs them +/// into a single `u128` value suitable for fast, branchless comparisons. +/// +/// ### Why include length? +/// +/// A pure 96-bit content comparison can’t distinguish between two values whose inline bytes +/// compare equal—either because one is a true prefix of the other or because zero-padding +/// hides extra bytes. By tucking the 32-bit length into the lower bits, a single `u128` compare +/// handles both content and length in one go. +/// +/// Example: comparing "bar" (3 bytes) vs "bar\0" (4 bytes) +/// +/// | String | Bytes 0–4 (length LE) | Bytes 4–16 (data + padding) | +/// |------------|-----------------------|---------------------------------| +/// | `"bar"` | `03 00 00 00` | `62 61 72` + 9 × `00` | +/// | `"bar\0"`| `04 00 00 00` | `62 61 72 00` + 8 × `00` | +/// +/// Both inline parts become `62 61 72 00…00`, so they tie on content. The length field +/// then differentiates: +/// +/// ```text +/// key("bar") = 0x0000000000000000000062617200000003 +/// key("bar\0") = 0x0000000000000000000062617200000004 +/// ⇒ key("bar") < key("bar\0") +/// ``` +#[inline(always)] +pub fn inline_key_fast(raw: u128) -> u128 { Review Comment: Thank you @alamb for good point, addressed in latest PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org