zhuqi-lucas commented on code in PR #16630:
URL: https://github.com/apache/datafusion/pull/16630#discussion_r2179270018
##########
datafusion/physical-plan/src/sorts/cursor.rs:
##########
@@ -288,6 +288,64 @@ impl CursorArray for StringViewArray {
}
}
+/// Todo use arrow-rs side api after:
<https://github.com/apache/arrow-rs/pull/7748> released
+/// Builds a 128-bit composite key for an inline value:
+///
+/// - High 96 bits: the inline data in big-endian byte order (for correct
lexicographical sorting).
+/// - Low 32 bits: the length in big-endian byte order, acting as a
tiebreaker so shorter strings
+/// (or those with fewer meaningful bytes) always numerically sort before
longer ones.
+///
+/// This function extracts the length and the 12-byte inline string data from
the raw
+/// little-endian `u128` representation, converts them to big-endian ordering,
and packs them
+/// into a single `u128` value suitable for fast, branchless comparisons.
+///
+/// ### Why include length?
+///
+/// A pure 96-bit content comparison can’t distinguish between two values
whose inline bytes
+/// compare equal—either because one is a true prefix of the other or because
zero-padding
+/// hides extra bytes. By tucking the 32-bit length into the lower bits, a
single `u128` compare
+/// handles both content and length in one go.
+///
+/// Example: comparing "bar" (3 bytes) vs "bar\0" (4 bytes)
+///
+/// | String | Bytes 0–4 (length LE) | Bytes 4–16 (data + padding) |
+/// |------------|-----------------------|---------------------------------|
+/// | `"bar"` | `03 00 00 00` | `62 61 72` + 9 × `00` |
+/// | `"bar\0"`| `04 00 00 00` | `62 61 72 00` + 8 × `00` |
+///
+/// Both inline parts become `62 61 72 00…00`, so they tie on content. The
length field
+/// then differentiates:
+///
+/// ```text
+/// key("bar") = 0x0000000000000000000062617200000003
+/// key("bar\0") = 0x0000000000000000000062617200000004
+/// ⇒ key("bar") < key("bar\0")
+/// ```
+#[inline(always)]
+pub fn inline_key_fast(raw: u128) -> u128 {
Review Comment:
Thank you @alamb for good point, addressed in latest PR!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]