This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new 84e47b65088 [fix](function) Improve numerical robustness of
cosine_distance / cosine_similarity (#62840)
84e47b65088 is described below
commit 84e47b65088d716e2e05a775d7915ab004547bf1
Author: yaoxiao <[email protected]>
AuthorDate: Thu May 28 13:47:24 2026 +0800
[fix](function) Improve numerical robustness of cosine_distance /
cosine_similarity (#62840)
### What problem does this PR solve?
Two defensive hardening fixes in CosineDistance::distance and
CosineSimilarity::distance to guarantee correct results across the full
range of valid float inputs.
Fix 1: Use double-precision intermediate when computing the norm
Before:
return 1 - dot_prod / sqrt(squared_x * squared_y);
After:
const double norm = std::sqrt(static_cast<double>(squared_x) *
static_cast<double>(squared_y));
Why: squared_x * squared_y is a float multiplication. When squared_x and
squared_y are both large (e.g. input elements around 1e19, whose squares
are around 1e38), the product exceeds FLT_MAX (~3.4e38) and overflows to
+inf. Then sqrt(+inf) = +inf and dot_prod / +inf = 0, so two parallel
vectors silently get cosine_distance = 1.0 (should be 0.0): a wrong
result with no warning.
For typical L2-normalized embedding vectors this never triggers. But
cosine_distance accepts arbitrary float* arrays, not just normalized
embeddings, so the function should be safe for any finite float input.
The cost is two static_cast<double> ops; double's range (~1.8e308)
cannot overflow on any finite float input.
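For non-overflow inputs the result matches the double-precision
reference formula bit-for-bit (verified by existing tests, whose expected
value is computed as static_cast<float>(34.0 / std::sqrt(14.0 * 83.0))).
Compared with the old float-only norm, the last digits can differ, as
the updated regression outputs in this commit show.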
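A minimal sketch of the overflow described in Fix 1, not the Doris source itself. The names Sums, accumulate, cosine_float_norm, cosine_double_norm and check_large_parallel_vectors are hypothetical, zero-norm handling is omitted, and the sketch assumes float arithmetic is evaluated in float precision (no fast-math, no x87 excess precision).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Float accumulators, matching the shape of the loop in the patched functions.
struct Sums {
    float dot;
    float xx;
    float yy;
};

Sums accumulate(const float* x, const float* y, std::size_t d) {
    Sums s{0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < d; ++i) {
        s.dot += x[i] * y[i];
        s.xx += x[i] * x[i];
        s.yy += y[i] * y[i];
    }
    return s;
}

// Legacy shape: the product of the squared norms is formed in float, so it
// can overflow to +inf even when each squared norm is finite.
float cosine_float_norm(const float* x, const float* y, std::size_t d) {
    const Sums s = accumulate(x, y, d);
    const float product = s.xx * s.yy;  // may become +inf
    return s.dot / std::sqrt(product);
}

// Fixed shape: widen both squared norms to double before multiplying.
float cosine_double_norm(const float* x, const float* y, std::size_t d) {
    const Sums s = accumulate(x, y, d);
    const double norm = std::sqrt(static_cast<double>(s.xx) * static_cast<double>(s.yy));
    return static_cast<float>(s.dot / norm);
}

// Usage: two parallel large-magnitude vectors (here, the same vector twice).
void check_large_parallel_vectors() {
    const float v[] = {1e19f, 1e19f};
    assert(cosine_float_norm(v, v, 2) == 0.0f);   // overflowed norm hides the match
    assert(cosine_double_norm(v, v, 2) == 1.0f);  // double norm recovers cos = 1
}
```

The only difference between the two functions is where the multiplication happens; the accumulation stays in float in both, as it does in the patch.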
Fix 2: Clamp cosine to [-1, 1]
After:
return std::clamp(static_cast<float>(dot_prod / norm), -1.0f, 1.0f);
Why: Float rounding can make the computed cosine slightly exceed 1.0 for
identical (or near-identical) vectors. For example with x = y = (0.1f,
0.2f, 0.3f), accumulation rounding can yield cosine = 1.0000001, and
then 1.0f - cosine is about -1e-7. That negative cosine_distance violates
the metric contract d >= 0 and may break downstream code
(DCHECK(distance >= 0), threshold filters, distance aggregation).
std::clamp is a one-op guarantee that costs nothing for in-range values.
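A small sketch of the effect of the clamp in Fix 2, fed with a hand-picked raw cosine instead of a real vector computation. The helpers distance_unclamped, distance_clamped and check_clamp_keeps_distance_in_range are hypothetical names used only for illustration; std::clamp requires C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Maps a raw cosine ratio to a distance with no range protection.
float distance_unclamped(float raw_cosine) {
    return 1.0f - raw_cosine;
}

// Same mapping, but the cosine is first forced into [-1, 1].
float distance_clamped(float raw_cosine) {
    const float cosine = std::clamp(raw_cosine, -1.0f, 1.0f);
    return 1.0f - cosine;
}

void check_clamp_keeps_distance_in_range() {
    // Smallest float strictly greater than 1.0f, standing in for a cosine
    // that rounding pushed just past the valid range.
    const float just_above_one = std::nextafter(1.0f, 2.0f);
    assert(distance_unclamped(just_above_one) < 0.0f);  // negative distance
    assert(distance_clamped(just_above_one) == 0.0f);   // clamped back to 0

    // The lower bound is protected the same way.
    const float just_below_minus_one = std::nextafter(-1.0f, -2.0f);
    assert(distance_clamped(just_below_minus_one) == 2.0f);

    // In-range values pass through unchanged.
    assert(distance_clamped(0.25f) == 0.75f);
}
```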
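BE UT compile failure fix (independent of this PR's cosine fix)
Problem: after merging the latest master, the BE UT fails to compile at functions_geo_test.cpp:375:
error: static assertion failed: assert_cast is redundant for the same
type
'!std::is_same_v<ColumnVector<TYPE_BOOLEAN>*,
ColumnVector<TYPE_BOOLEAN>*>'
Root cause: the combined side effects of three PRs on master:
#63491 changed _null_map to the strongly typed ColumnUInt8::WrappedPtr
#63059 added a static_assert that rejects same-type assert_cast
#63049 (just merged on 5/26) added new geo tests that still apply a redundant cast written against the old interface
Fix: remove the redundant wrapper
- assert_cast<ColumnUInt8*>(nullable_input->get_null_map_column_ptr().get())->insert_value(0);
+ nullable_input->get_null_map_column_ptr()->insert_value(0);
get_null_map_column_ptr() now returns ColumnUInt8::MutablePtr directly,
and the semantics of ->insert_value() are unchanged.
Impact: a one-line change that only fixes the compile error; it does not touch test semantics or the cosine-related code.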
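To illustrate why a same-type cast trips a static_assert like the one quoted in the error message, here is a simplified sketch. It is not Doris's assert_cast: checked_cast, Column, ColumnUInt8 and demo_insert below are hypothetical stand-ins, and only the std::is_same_v check mirrors the quoted diagnostic.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for a type-erased column and a concrete column.
struct Column {
    virtual ~Column() = default;
};

struct ColumnUInt8 : Column {
    int rows = 0;
    void insert_value(unsigned char) { ++rows; }
};

// A checked downcast that refuses to compile when source and target pointer
// types are already identical, because such a cast does nothing.
template <typename To, typename From>
To checked_cast(From* from) {
    static_assert(!std::is_same_v<To, From*>, "checked_cast is redundant for the same type");
    assert(dynamic_cast<To>(from) != nullptr);  // debug-only type check
    return static_cast<To>(from);
}

int demo_insert() {
    ColumnUInt8 column;

    // Type-erased pointer: the cast is meaningful and compiles.
    Column* erased = &column;
    checked_cast<ColumnUInt8*>(erased)->insert_value(0);

    // Already strongly typed: call the member directly.
    ColumnUInt8* typed = &column;
    // checked_cast<ColumnUInt8*>(typed)->insert_value(0);  // would fail the static_assert
    typed->insert_value(0);

    return column.rows;
}
```

Once an accessor returns the concrete pointer type, the cast at the call site becomes the commented-out case above, which is why dropping the wrapper is the whole fix.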
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
---------
Co-authored-by: yaoxiao <[email protected]>
---
.../function/array/function_array_distance.cpp | 38 +++++++++--
.../function_array_cosine_similarity_test.cpp | 79 +++++++++++++++++++++-
.../test_array_distance_functions.out | 10 +--
3 files changed, 116 insertions(+), 11 deletions(-)
diff --git a/be/src/exprs/function/array/function_array_distance.cpp b/be/src/exprs/function/array/function_array_distance.cpp
index 89a0dafe1e4..3f37775d6be 100644
--- a/be/src/exprs/function/array/function_array_distance.cpp
+++ b/be/src/exprs/function/array/function_array_distance.cpp
@@ -17,12 +17,20 @@
#include "exprs/function/array/function_array_distance.h"
+#include <algorithm>
+
#include "exprs/function/simple_function_factory.h"
namespace doris {
FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
float CosineDistance::distance(const float* x, const float* y, size_t d) {
+ if (d == 0) {
+ return 2.0f;
+ }
+
+ DCHECK(x != nullptr && y != nullptr);
+
float dot_prod = 0;
float squared_x = 0;
float squared_y = 0;
@@ -31,15 +39,32 @@ float CosineDistance::distance(const float* x, const float* y, size_t d) {
squared_x += x[i] * x[i];
squared_y += y[i] * y[i];
}
- if (squared_x == 0 or squared_y == 0) {
+
+ if (squared_x == 0 || squared_y == 0) {
return 2.0f;
}
- return 1 - dot_prod / sqrt(squared_x * squared_y);
+
+ // Accumulate the norm in double and take a single square root. Computing
+ // (double)squared_x * (double)squared_y cannot overflow for finite float inputs,
+ // whereas the float expression sqrt(squared_x * squared_y) overflows to +inf for
+ // large-magnitude vectors and would silently yield a distance of 1.0.
+ const double norm = std::sqrt(static_cast<double>(squared_x) * static_cast<double>(squared_y));
+ // Clamp the cosine to [-1, 1] before mapping to a distance. Floating-point rounding
+ // can push the ratio slightly outside [-1, 1] (e.g. 1.0000001 for identical vectors),
+ // which would otherwise produce a tiny negative distance.
+ const float cosine = std::clamp(static_cast<float>(dot_prod / norm), -1.0f, 1.0f);
+ return 1.0f - cosine;
}
FAISS_PRAGMA_IMPRECISE_FUNCTION_END
FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
float CosineSimilarity::distance(const float* x, const float* y, size_t d) {
+ if (d == 0) {
+ return 0.0f;
+ }
+
+ DCHECK(x != nullptr && y != nullptr);
+
float dot_prod = 0;
float squared_x = 0;
float squared_y = 0;
@@ -48,10 +73,15 @@ float CosineSimilarity::distance(const float* x, const float* y, size_t d) {
squared_x += x[i] * x[i];
squared_y += y[i] * y[i];
}
- if (squared_x == 0 or squared_y == 0) {
+
+ if (squared_x == 0 || squared_y == 0) {
return 0.0f;
}
- return dot_prod / sqrt(squared_x * squared_y);
+
+ // See CosineDistance::distance: the double-precision norm avoids float overflow,
+ // and clamping keeps the result within the mathematically valid [-1, 1] range.
+ const double norm = std::sqrt(static_cast<double>(squared_x) * static_cast<double>(squared_y));
+ return std::clamp(static_cast<float>(dot_prod / norm), -1.0f, 1.0f);
}
FAISS_PRAGMA_IMPRECISE_FUNCTION_END
diff --git a/be/test/exprs/function/function_array_cosine_similarity_test.cpp b/be/test/exprs/function/function_array_cosine_similarity_test.cpp
index a4928276f1e..f281ef17fe2 100644
--- a/be/test/exprs/function/function_array_cosine_similarity_test.cpp
+++ b/be/test/exprs/function/function_array_cosine_similarity_test.cpp
@@ -99,8 +99,10 @@ TEST(function_cosine_similarity_test, cosine_similarity) {
TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
TestArray vec2 = {Float32(3.0), Float32(5.0), Float32(7.0)};
- // Expected: 34 / sqrt(14 * 83) = 34 / sqrt(1162) ≈ 0.9974149
- float expected = 34.0f / std::sqrt(14.0f * 83.0f);
+ // Expected: 34 / sqrt(14 * 83) = 34 / sqrt(1162) ≈ 0.9974149.
+ // Mirror the production formula exactly (double-precision norm) so the
+ // exact float comparison in check_function matches bit-for-bit.
+ float expected = static_cast<float>(34.0 / std::sqrt(14.0 * 83.0));
DataSet data_set = {{{vec1, vec2}, Float32(expected)}};
static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
@@ -156,4 +158,77 @@ TEST(function_cosine_similarity_test, cosine_similarity) {
}
}
+TEST(function_cosine_distance_test, cosine_distance) {
+ std::string func_name = "cosine_distance";
+ TestArray empty_arr;
+ InputTypeSet input_types = {PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT,
+ PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT};
+
+ // identical vectors -> distance 0.0 (and crucially never a negative distance)
+ {
+ TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+ TestArray vec2 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+ DataSet data_set = {{{vec1, vec2}, Float32(0.0)}};
+ static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+ }
+
+ // orthogonal vectors -> distance 1.0
+ {
+ TestArray vec1 = {Float32(1.0), Float32(0.0)};
+ TestArray vec2 = {Float32(0.0), Float32(1.0)};
+ DataSet data_set = {{{vec1, vec2}, Float32(1.0)}};
+ static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+ }
+
+ // opposite vectors -> distance 2.0
+ {
+ TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+ TestArray vec2 = {Float32(-1.0), Float32(-2.0), Float32(-3.0)};
+ DataSet data_set = {{{vec1, vec2}, Float32(2.0)}};
+ static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+ }
+
+ // zero vector and empty array keep the legacy fallback distance of 2.0
+ {
+ TestArray zero_vec = {Float32(0.0), Float32(0.0), Float32(0.0)};
+ TestArray vec = {Float32(1.0), Float32(2.0), Float32(3.0)};
+ DataSet data_set = {{{zero_vec, vec}, Float32(2.0)},
+ {{empty_arr, empty_arr}, Float32(2.0)}};
+ static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+ }
+
+ // known value: 1 - 34 / sqrt(14 * 83). Mirror the production formula exactly.
+ {
+ TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+ TestArray vec2 = {Float32(3.0), Float32(5.0), Float32(7.0)};
+ float expected = 1.0f - static_cast<float>(34.0 / std::sqrt(14.0 * 83.0));
+ DataSet data_set = {{{vec1, vec2}, Float32(expected)}};
+ static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+ }
+}
+
+// Regression tests for the numerical-stability fixes: large-magnitude vectors must
+// not overflow the norm (legacy sqrt(squared_x * squared_y) produced +inf and a
+// wrong result), and the cosine must stay within [-1, 1].
+TEST(function_cosine_numerical_stability_test, large_magnitude_no_overflow) {
+ InputTypeSet input_types = {PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT,
+ PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT};
+
+ // squared_x = squared_y = 2e38 (within FLT_MAX), but squared_x * squared_y = 4e76
+ // overflows float. The double-precision norm keeps parallel vectors at cos = 1.0.
+ TestArray big1 = {Float32(1e19), Float32(1e19)};
+ TestArray big2 = {Float32(1e19), Float32(1e19)};
+
+ {
+ DataSet data_set = {{{big1, big2}, Float32(1.0)}};
+ static_cast<void>(
+ check_function<DataTypeFloat32, false>("cosine_similarity", input_types, data_set));
+ }
+ {
+ DataSet data_set = {{{big1, big2}, Float32(0.0)}};
+ static_cast<void>(
+ check_function<DataTypeFloat32, false>("cosine_distance", input_types, data_set));
+ }
+}
+
} // namespace doris
diff --git a/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out b/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out
index 071a20f477e..a2b3a0c837c 100644
--- a/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out
+++ b/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out
@@ -6,7 +6,7 @@
3.741657
-- !sql --
-0.002585053
+0.002585113
-- !sql --
2.0
@@ -18,7 +18,7 @@
2.828427
-- !sql --
-0.02536809
+0.02536815
-- !sql --
23.0
@@ -69,7 +69,7 @@
-1.0
-- !cosine_sim_distance_relation --
-0.9999999534439087
+1.0
-- !cosine_sim_empty --
0.0
@@ -78,12 +78,12 @@
0.9838699
-- !cosine_sim_small --
-0.9838699
+0.98387
-- !cosine_sim_table --
1 1.0
2 0.0
-3 0.9746319
+3 0.9746318
4 -1.0
5 0.96