yx-keith commented on code in PR #62840:
URL: https://github.com/apache/doris/pull/62840#discussion_r3309635908


##########
be/src/exprs/function/array/function_array_distance.cpp:
##########
@@ -31,15 +39,32 @@ float CosineDistance::distance(const float* x, const float* 
y, size_t d) {
         squared_x += x[i] * x[i];
         squared_y += y[i] * y[i];
     }
-    if (squared_x == 0 or squared_y == 0) {
+
+    if (squared_x == 0 || squared_y == 0) {
         return 2.0f;
     }
-    return 1 - dot_prod / sqrt(squared_x * squared_y);
+
+    // Accumulate the norm in double and take a single square root. Computing
+    // (double)squared_x * (double)squared_y cannot overflow for finite float 
inputs,
+    // whereas the float expression sqrt(squared_x * squared_y) overflows to 
+inf for
+    // large-magnitude vectors and would silently yield a distance of 1.0.
+    const double norm = std::sqrt(static_cast<double>(squared_x) * 
static_cast<double>(squared_y));
+    // Clamp the cosine to [-1, 1] before mapping to a distance. 
Floating-point rounding
+    // can push the ratio slightly outside [-1, 1] (e.g. 1.0000001 for 
identical vectors),
+    // which would otherwise produce a tiny negative distance.
+    const float cosine = std::clamp(static_cast<float>(dot_prod / norm), 
-1.0f, 1.0f);
+    return 1.0f - cosine;
 }
 FAISS_PRAGMA_IMPRECISE_FUNCTION_END
 
 FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
 float CosineSimilarity::distance(const float* x, const float* y, size_t d) {
+    if (d == 0) {
+        return 0.0f;
+    }
+
+    DCHECK(x != nullptr && y != nullptr);
+
     float dot_prod = 0;

Review Comment:
   Thanks for the catch. The per-element overflow only triggers when |x[i]| > 
sqrt(FLT_MAX) ≈ 1.84e19, which is even more pathological than the outer-product 
overflow this PR addresses. Widening accumulators to double would halve SIMD 
throughput (~5-20% end-to-end regression in vector search workloads) while only 
covering a stricter subset of overflow scenarios that have never been observed 
in real embeddings. We prefer to keep the current scope; happy to file a 
follow-up if a real-world case is reported.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to