yx-keith commented on code in PR #62840:
URL: https://github.com/apache/doris/pull/62840#discussion_r3309635908
##########
be/src/exprs/function/array/function_array_distance.cpp:
##########
@@ -31,15 +39,32 @@ float CosineDistance::distance(const float* x, const float*
y, size_t d) {
squared_x += x[i] * x[i];
squared_y += y[i] * y[i];
}
- if (squared_x == 0 or squared_y == 0) {
+
+ if (squared_x == 0 || squared_y == 0) {
return 2.0f;
}
- return 1 - dot_prod / sqrt(squared_x * squared_y);
+
+ // Accumulate the norm in double and take a single square root. Computing
+ // (double)squared_x * (double)squared_y cannot overflow for finite float
inputs,
+ // whereas the float expression sqrt(squared_x * squared_y) overflows to
+inf for
+ // large-magnitude vectors and would silently yield a distance of 1.0.
+ const double norm = std::sqrt(static_cast<double>(squared_x) *
static_cast<double>(squared_y));
+ // Clamp the cosine to [-1, 1] before mapping to a distance.
Floating-point rounding
+ // can push the ratio slightly outside [-1, 1] (e.g. 1.0000001 for
identical vectors),
+ // which would otherwise produce a tiny negative distance.
+ const float cosine = std::clamp(static_cast<float>(dot_prod / norm),
-1.0f, 1.0f);
+ return 1.0f - cosine;
}
FAISS_PRAGMA_IMPRECISE_FUNCTION_END
FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
float CosineSimilarity::distance(const float* x, const float* y, size_t d) {
+ if (d == 0) {
+ return 0.0f;
+ }
+
+ DCHECK(x != nullptr && y != nullptr);
+
float dot_prod = 0;
Review Comment:
Thanks for the catch. The per-element overflow only triggers when |x[i]| >
sqrt(FLT_MAX) ≈ 1.84e19, which is even more pathological than the outer-product
overflow this PR addresses. Widening accumulators to double would halve SIMD
throughput (~5-20% end-to-end regression in vector search workloads) while only
covering a stricter subset of overflow scenarios that have never been observed
in real embeddings. We prefer to keep the current scope; happy to file a
follow-up if a real-world case is reported.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]