(doris) branch master updated: [fix](function) Improve numerical robustness of cosine_distance / cosine_similarity (#62840)

yiguolei Wed, 27 May 2026 22:49:40 -0700

This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git



The following commit(s) were added to refs/heads/master by this push:
     new 84e47b65088 [fix](function) Improve numerical robustness of 
cosine_distance / cosine_similarity (#62840)
84e47b65088 is described below

commit 84e47b65088d716e2e05a775d7915ab004547bf1
Author: yaoxiao <[email protected]>
AuthorDate: Thu May 28 13:47:24 2026 +0800

    [fix](function) Improve numerical robustness of cosine_distance / 
cosine_similarity (#62840)
    
    ### What problem does this PR solve?
    
    Two defensive hardening fixes in CosineDistance::distance and
    CosineSimilarity::distance to guarantee correct results across the full
    range of valid float inputs.
    
    Fix 1: Use double-precision intermediate when computing the norm
    Before:
    
    return 1 - dot_prod / sqrt(squared_x * squared_y);
    After:
    
    const double norm = std::sqrt(static_cast<double>(squared_x) *
    static_cast<double>(squared_y));
    Why: squared_x * squared_y is a float multiplication. When squared_x and
    squared_y are both large (e.g. input elements around 1e19), the product
    exceeds FLT_MAX (~3.4e38) and overflows to +inf. Then sqrt(+inf) = +inf
    and dot_prod / +inf = 0, so two parallel vectors silently get
    cosine_distance = 1.0 (should be 0.0) — a wrong result with no warning.
    
    For typical L2-normalized embedding vectors this never triggers. But
    cosine_distance accepts arbitrary float* arrays, not just normalized
    embeddings, so the function should be safe for any finite float input.
    The cost is two static_cast<double> ops; double's range (~1.8e308)
    cannot overflow on any finite float input.
    
    For non-overflow inputs the result is bit-for-bit equivalent (verified
    by existing tests, which match the same static_cast<float>(34.0 /
    std::sqrt(14.0 * 83.0)) formula).
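The overflow described under Fix 1 can be reproduced in isolation. What follows is a minimal sketch, not the Doris implementation: `cosine_float_norm` and `cosine_double_norm` are hypothetical helper names, the zero-vector fallback is omitted, and the sketch assumes float arithmetic is evaluated in single precision (as on typical x86-64 and AArch64 builds).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper mirroring the legacy formula: the product of the two
// squared norms is formed in float and can overflow to +inf.
// Assumes both vectors are non-zero (no zero-vector fallback in this sketch).
float cosine_float_norm(const float* x, const float* y, std::size_t d) {
    assert(x != nullptr && y != nullptr);
    float dot = 0, sx = 0, sy = 0;
    for (std::size_t i = 0; i < d; ++i) {
        dot += x[i] * y[i];
        sx += x[i] * x[i];
        sy += y[i] * y[i];
    }
    const float product = sx * sy;  // float multiply: may overflow to +inf
    return dot / std::sqrt(product);
}

// Hypothetical helper mirroring the fixed formula: the product is formed in
// double, whose range covers the product of any two finite floats.
float cosine_double_norm(const float* x, const float* y, std::size_t d) {
    assert(x != nullptr && y != nullptr);
    float dot = 0, sx = 0, sy = 0;
    for (std::size_t i = 0; i < d; ++i) {
        dot += x[i] * y[i];
        sx += x[i] * x[i];
        sy += y[i] * y[i];
    }
    const double norm = std::sqrt(static_cast<double>(sx) * static_cast<double>(sy));
    return static_cast<float>(dot / norm);
}
```

Under those assumptions, calling both helpers with x = y = {1e19f, 1e19f} should give 0 from the float-norm version (the product overflows) and 1 from the double-norm version.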
    
    Fix 2: Clamp cosine to [-1, 1]
    After:
    
    return std::clamp(static_cast<float>(dot_prod / norm), -1.0f, 1.0f);
    Why: Float rounding can make the computed cosine slightly exceed 1.0 for
    identical (or near-identical) vectors. For example with x = y = (0.1f,
    0.2f, 0.3f), accumulation rounding can yield cosine = 1.0000001, then
    1.0f - cosine = -1e-7 — a negative cosine_distance that violates the
    metric contract d >= 0 and may break downstream code (DCHECK(distance >=
    0), threshold filters, distance aggregation).
    
    std::clamp is a one-op guarantee that costs nothing for in-range values.
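The effect of the clamp in Fix 2 can be seen without computing any dot products. This is a small sketch, assuming C++17 for `std::clamp`: `clamped_cosine` and `cosine_distance_from_ratio` are hypothetical names, and `ratio` stands in for `dot_prod / norm`, supplied directly rather than derived from real vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: narrow the ratio to float, then clamp it into [-1, 1].
// The sketch assumes ratio is not NaN.
float clamped_cosine(double ratio) {
    assert(!std::isnan(ratio));
    return std::clamp(static_cast<float>(ratio), -1.0f, 1.0f);
}

// Hypothetical helper: map the clamped cosine to a distance in [0, 2].
float cosine_distance_from_ratio(double ratio) {
    return 1.0f - clamped_cosine(ratio);
}
```

A ratio of 1.0000001 narrows to a float just above 1.0f, so the unclamped expression `1.0f - static_cast<float>(1.0000001)` is negative, while `cosine_distance_from_ratio(1.0000001)` is exactly 0. In-range ratios pass through the clamp unchanged.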
    
    
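    BE UT compile failure fix (independent of this PR's cosine fix)
    Problem: after merging the latest master, the BE UT fails to compile at
    functions_geo_test.cpp:375:
    
    
    error: static assertion failed: assert_cast is redundant for the same
    type
    '!std::is_same_v<ColumnVector<TYPE_BOOLEAN>*,
    ColumnVector<TYPE_BOOLEAN>*>'
    Root cause: the combined side effect of three PRs on master:
    
    #63491 changed _null_map to the strongly typed ColumnUInt8::WrappedPtr
    #63059 added a static_assert that rejects same-type assert_cast
    #63049 (just merged on 5/26) added a new geo test that still applies a
    redundant cast, following the old interface
    Fix: remove the redundant wrapper
    
    
    - assert_cast<ColumnUInt8*>(nullable_input->get_null_map_column_ptr().get())->insert_value(0);
    + nullable_input->get_null_map_column_ptr()->insert_value(0);
    get_null_map_column_ptr() now returns ColumnUInt8::MutablePtr directly;
    the semantics of ->insert_value() are unchanged.
    
    Impact: a one-line change that only fixes the compile error. It does not
    touch test semantics or the cosine-related code.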
    
    
    
    ### Release note
    
    None
    
    ### Check List (For Author)
    
    - Test <!-- At least one of them must be included. -->
        - [ ] Regression test
        - [ ] Unit Test
        - [ ] Manual test (add detailed scripts or steps below)
        - [ ] No need to test or manual test. Explain why:
    - [ ] This is a refactor/code format and no logic has been changed.
            - [ ] Previous test can cover this change.
            - [ ] No code files have been changed.
            - [ ] Other reason <!-- Add your reason?  -->
    
    - Behavior changed:
        - [ ] No.
        - [ ] Yes. <!-- Explain the behavior change -->
    
    - Does this need documentation?
        - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg:
    https://github.com/apache/doris-website/pull/1214 -->
    
    ### Check List (For Reviewer who merge this PR)
    
    - [ ] Confirm the release note
    - [ ] Confirm test cases
    - [ ] Confirm document
    - [ ] Add branch pick label <!-- Add branch pick label that this PR
    should merge into -->
    
    ---------
    
    Co-authored-by: yaoxiao <[email protected]>
---
 .../function/array/function_array_distance.cpp     | 38 +++++++++--
 .../function_array_cosine_similarity_test.cpp      | 79 +++++++++++++++++++++-
 .../test_array_distance_functions.out              | 10 +--
 3 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/be/src/exprs/function/array/function_array_distance.cpp b/be/src/exprs/function/array/function_array_distance.cpp
index 89a0dafe1e4..3f37775d6be 100644
--- a/be/src/exprs/function/array/function_array_distance.cpp
+++ b/be/src/exprs/function/array/function_array_distance.cpp
@@ -17,12 +17,20 @@
 
 #include "exprs/function/array/function_array_distance.h"
 
+#include <algorithm>
+
 #include "exprs/function/simple_function_factory.h"
 
 namespace doris {
 
 FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
 float CosineDistance::distance(const float* x, const float* y, size_t d) {
+    if (d == 0) {
+        return 2.0f;
+    }
+
+    DCHECK(x != nullptr && y != nullptr);
+
     float dot_prod = 0;
     float squared_x = 0;
     float squared_y = 0;
@@ -31,15 +39,32 @@ float CosineDistance::distance(const float* x, const float* y, size_t d) {
         squared_x += x[i] * x[i];
         squared_y += y[i] * y[i];
     }
-    if (squared_x == 0 or squared_y == 0) {
+
+    if (squared_x == 0 || squared_y == 0) {
         return 2.0f;
     }
-    return 1 - dot_prod / sqrt(squared_x * squared_y);
+
+    // Accumulate the norm in double and take a single square root. Computing
+    // (double)squared_x * (double)squared_y cannot overflow for finite float inputs,
+    // whereas the float expression sqrt(squared_x * squared_y) overflows to +inf for
+    // large-magnitude vectors and would silently yield a distance of 1.0.
+    const double norm = std::sqrt(static_cast<double>(squared_x) * static_cast<double>(squared_y));
+    // Clamp the cosine to [-1, 1] before mapping to a distance. Floating-point rounding
+    // can push the ratio slightly outside [-1, 1] (e.g. 1.0000001 for identical vectors),
+    // which would otherwise produce a tiny negative distance.
+    const float cosine = std::clamp(static_cast<float>(dot_prod / norm), -1.0f, 1.0f);
+    return 1.0f - cosine;
 }
 FAISS_PRAGMA_IMPRECISE_FUNCTION_END
 
 FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
 float CosineSimilarity::distance(const float* x, const float* y, size_t d) {
+    if (d == 0) {
+        return 0.0f;
+    }
+
+    DCHECK(x != nullptr && y != nullptr);
+
     float dot_prod = 0;
     float squared_x = 0;
     float squared_y = 0;
@@ -48,10 +73,15 @@ float CosineSimilarity::distance(const float* x, const float* y, size_t d) {
         squared_x += x[i] * x[i];
         squared_y += y[i] * y[i];
     }
-    if (squared_x == 0 or squared_y == 0) {
+
+    if (squared_x == 0 || squared_y == 0) {
         return 0.0f;
     }
-    return dot_prod / sqrt(squared_x * squared_y);
+
+    // See CosineDistance::distance: the double-precision norm avoids float overflow,
+    // and clamping keeps the result within the mathematically valid [-1, 1] range.
+    const double norm = std::sqrt(static_cast<double>(squared_x) * static_cast<double>(squared_y));
+    return std::clamp(static_cast<float>(dot_prod / norm), -1.0f, 1.0f);
 }
 FAISS_PRAGMA_IMPRECISE_FUNCTION_END
 
diff --git a/be/test/exprs/function/function_array_cosine_similarity_test.cpp b/be/test/exprs/function/function_array_cosine_similarity_test.cpp
index a4928276f1e..f281ef17fe2 100644
--- a/be/test/exprs/function/function_array_cosine_similarity_test.cpp
+++ b/be/test/exprs/function/function_array_cosine_similarity_test.cpp
@@ -99,8 +99,10 @@ TEST(function_cosine_similarity_test, cosine_similarity) {
 
         TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
         TestArray vec2 = {Float32(3.0), Float32(5.0), Float32(7.0)};
-        // Expected: 34 / sqrt(14 * 83) = 34 / sqrt(1162) ≈ 0.9974149
-        float expected = 34.0f / std::sqrt(14.0f * 83.0f);
+        // Expected: 34 / sqrt(14 * 83) = 34 / sqrt(1162) ≈ 0.9974149.
+        // Mirror the production formula exactly (double-precision norm) so the
+        // exact float comparison in check_function matches bit-for-bit.
+        float expected = static_cast<float>(34.0 / std::sqrt(14.0 * 83.0));
         DataSet data_set = {{{vec1, vec2}, Float32(expected)}};
 
         static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
@@ -156,4 +158,77 @@ TEST(function_cosine_similarity_test, cosine_similarity) {
     }
 }
 
+TEST(function_cosine_distance_test, cosine_distance) {
+    std::string func_name = "cosine_distance";
+    TestArray empty_arr;
+    InputTypeSet input_types = {PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT,
+                                PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT};
+
+    // identical vectors -> distance 0.0 (and crucially never a negative distance)
+    {
+        TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+        TestArray vec2 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+        DataSet data_set = {{{vec1, vec2}, Float32(0.0)}};
+        static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+    }
+
+    // orthogonal vectors -> distance 1.0
+    {
+        TestArray vec1 = {Float32(1.0), Float32(0.0)};
+        TestArray vec2 = {Float32(0.0), Float32(1.0)};
+        DataSet data_set = {{{vec1, vec2}, Float32(1.0)}};
+        static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+    }
+
+    // opposite vectors -> distance 2.0
+    {
+        TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+        TestArray vec2 = {Float32(-1.0), Float32(-2.0), Float32(-3.0)};
+        DataSet data_set = {{{vec1, vec2}, Float32(2.0)}};
+        static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+    }
+
+    // zero vector and empty array keep the legacy fallback distance of 2.0
+    {
+        TestArray zero_vec = {Float32(0.0), Float32(0.0), Float32(0.0)};
+        TestArray vec = {Float32(1.0), Float32(2.0), Float32(3.0)};
+        DataSet data_set = {{{zero_vec, vec}, Float32(2.0)},
+                            {{empty_arr, empty_arr}, Float32(2.0)}};
+        static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+    }
+
+    // known value: 1 - 34 / sqrt(14 * 83). Mirror the production formula exactly.
+    {
+        TestArray vec1 = {Float32(1.0), Float32(2.0), Float32(3.0)};
+        TestArray vec2 = {Float32(3.0), Float32(5.0), Float32(7.0)};
+        float expected = 1.0f - static_cast<float>(34.0 / std::sqrt(14.0 * 83.0));
+        DataSet data_set = {{{vec1, vec2}, Float32(expected)}};
+        static_cast<void>(check_function<DataTypeFloat32, false>(func_name, input_types, data_set));
+    }
+}
+
+// Regression tests for the numerical-stability fixes: large-magnitude vectors must
+// not overflow the norm (legacy sqrt(squared_x * squared_y) produced +inf and a
+// wrong result), and the cosine must stay within [-1, 1].
+TEST(function_cosine_numerical_stability_test, large_magnitude_no_overflow) {
+    InputTypeSet input_types = {PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT,
+                                PrimitiveType::TYPE_ARRAY, PrimitiveType::TYPE_FLOAT};
+
+    // squared_x = squared_y = 2e38 (within FLT_MAX), but squared_x * squared_y = 4e76
+    // overflows float. The double-precision norm keeps parallel vectors at cos = 1.0.
+    TestArray big1 = {Float32(1e19), Float32(1e19)};
+    TestArray big2 = {Float32(1e19), Float32(1e19)};
+
+    {
+        DataSet data_set = {{{big1, big2}, Float32(1.0)}};
+        static_cast<void>(
+                check_function<DataTypeFloat32, false>("cosine_similarity", input_types, data_set));
+    }
+    {
+        DataSet data_set = {{{big1, big2}, Float32(0.0)}};
+        static_cast<void>(
+                check_function<DataTypeFloat32, false>("cosine_distance", input_types, data_set));
+    }
+}
+
 } // namespace doris
diff --git a/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out b/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out
index 071a20f477e..a2b3a0c837c 100644
--- a/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out
+++ b/regression-test/data/query_p0/sql_functions/array_functions/test_array_distance_functions.out
@@ -6,7 +6,7 @@
 3.741657
 
 -- !sql --
-0.002585053
+0.002585113
 
 -- !sql --
 2.0
@@ -18,7 +18,7 @@
 2.828427
 
 -- !sql --
-0.02536809
+0.02536815
 
 -- !sql --
 23.0
@@ -69,7 +69,7 @@
 -1.0
 
 -- !cosine_sim_distance_relation --
-0.9999999534439087
+1.0
 
 -- !cosine_sim_empty --
 0.0
@@ -78,12 +78,12 @@
 0.9838699
 
 -- !cosine_sim_small --
-0.9838699
+0.98387
 
 -- !cosine_sim_table --
 1      1.0
 2      0.0
-3      0.9746319
+3      0.9746318
 4      -1.0
 5      0.96
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
