Re: [PR] [fix](function) Improve numerical robustness of cosine_distance / cosine_similarity [doris]

via GitHub Wed, 27 May 2026 01:31:03 -0700


github-actions[bot] commented on code in PR #62840:
URL: https://github.com/apache/doris/pull/62840#discussion_r3309532958



##########
be/src/exprs/function/array/function_array_distance.cpp:
##########
@@ -31,15 +39,32 @@ float CosineDistance::distance(const float* x, const float* 
y, size_t d) {
         squared_x += x[i] * x[i];
         squared_y += y[i] * y[i];
     }
-    if (squared_x == 0 or squared_y == 0) {
+
+    if (squared_x == 0 || squared_y == 0) {
         return 2.0f;
     }
-    return 1 - dot_prod / sqrt(squared_x * squared_y);
+
+    // Accumulate the norm in double and take a single square root. Computing
+    // (double)squared_x * (double)squared_y cannot overflow for finite float 
inputs,
+    // whereas the float expression sqrt(squared_x * squared_y) overflows to 
+inf for
+    // large-magnitude vectors and would silently yield a distance of 1.0.
+    const double norm = std::sqrt(static_cast<double>(squared_x) * 
static_cast<double>(squared_y));
+    // Clamp the cosine to [-1, 1] before mapping to a distance. 
Floating-point rounding
+    // can push the ratio slightly outside [-1, 1] (e.g. 1.0000001 for 
identical vectors),
+    // which would otherwise produce a tiny negative distance.
+    const float cosine = std::clamp(static_cast<float>(dot_prod / norm), 
-1.0f, 1.0f);
+    return 1.0f - cosine;
 }
 FAISS_PRAGMA_IMPRECISE_FUNCTION_END
 
 FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
 float CosineSimilarity::distance(const float* x, const float* y, size_t d) {
+    if (d == 0) {
+        return 0.0f;
+    }
+
+    DCHECK(x != nullptr && y != nullptr);
+
     float dot_prod = 0;

Review Comment:
   This still overflows before the new `double` norm is computed because 
`dot_prod`, `squared_x`, and `squared_y` are accumulated as `float` and the 
products are evaluated as `float`. For a valid finite FLOAT input such as 
`cosine_similarity([1e20], [1e20])`, `x[i] * x[i]` and `x[i] * y[i]` become 
`+inf`; then `dot_prod / norm` is `inf / inf`, producing `NaN` (`std::clamp` 
does not sanitize NaN). The expected result for parallel vectors is still `1.0` 
similarity / `0.0` distance. Please widen the accumulators and the per-element 
products to `double` before accumulation in both cosine functions, and add a 
test that covers this overflow-at-accumulation case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [fix](function) Improve numerical robustness of cosine_distance / cosine_similarity [doris]

Reply via email to