Copilot commented on code in PR #118:
URL: https://github.com/apache/datasketches-rust/pull/118#discussion_r3205884033


##########
datasketches/src/hash_value/value.rs:
##########
@@ -0,0 +1,88 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Shared value wrapper and hashing strategy support.
+
+use std::fmt;
+use std::hash::Hash;
+use std::hash::Hasher;
+use std::marker::PhantomData;
+use std::ops::Deref;
+use std::ops::DerefMut;
+
+/// A value wrapper that hashes its inner value with strategy `S`.
+///
+/// Most users should prefer the strategy-specific constructors.
+#[derive(Copy, Clone, Ord, PartialOrd, Eq, PartialEq)]
+pub struct Value<T, S> {
+    value: T,
+    strategy: PhantomData<fn() -> S>,
+}

Review Comment:
   `Value<T, S>` is derived as `Copy`, but several public wrappers use 
non-`Copy` inner types (e.g., `RawBytes<Vec<u8>>`, `RawBytes<String>`). This 
will prevent the crate from compiling because `#[derive(Copy)]` adds a `T: 
Copy` bound. Remove the `Copy` derive (and consider whether `Ord`/`Eq` derives 
are needed for all intended `T`).



##########
datasketches/src/hash_value/canonical_float.rs:
##########
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Canonical floating-point hash value wrappers.
+//!
+//! [`CanonicalFloat`] maps `f32` and `f64` values through a canonical `f64` 
bit pattern before
+//! hashing. Signed zero values hash the same, all NaN values use one 
canonical NaN bit pattern,
+//! and equal `f32`/`f64` values hash the same.
+//!
+//! This strategy is compatible with how other datasketches hashes 
floating-point numbers.
+
+use std::hash::Hash;
+use std::hash::Hasher;
+
+use super::value::HashStrategy;
+use super::value::Value;
+
+/// A floating-point value wrapper that uses canonical floating-point hashing.
+///
+/// See the [module level documentation](super) for more.
+pub type CanonicalFloat<T> = Value<T, CanonicalFloatStrategy>;
+
+/// Hashing strategy for [`CanonicalFloat`].
+#[doc(hidden)]
+pub struct CanonicalFloatStrategy;
+
+/// Create a canonical hashable value from a `f32` value.
+///
+/// `f32` values are converted to `f64` before hashing, so `from_f32(5.0)` 
hashes the same as
+/// `from_f64(5.0)`, but `from_f32(3.5)` hashes differently from 
`from_f64(3.5)`. Signed zero
+/// values hash the same, and all NaN values use one canonical NaN bit pattern.

Review Comment:
   The docs claim `from_f32(3.5)` hashes differently from `from_f64(3.5)`, but 
`3.5` is exactly representable in binary so both conversions produce the same 
`f64` value (and should hash the same under this implementation). Use a value 
that is not exactly representable in `f32` (e.g., `3.15`) or rephrase the 
statement to describe the real condition (values where `f32 as f64 != original 
f64`).
   



##########
datasketches/src/cpc/sketch.rs:
##########
@@ -191,18 +208,6 @@ impl CpcSketch {
         self.row_col_update(row_col);
     }
 

Review Comment:
   The PR removes the dedicated `update_f32`/`update_f64` methods for CPC 
sketches. Since CPC hashing is meant to match other datasketches 
implementations, consider retaining these methods as wrappers over 
`hash_value::canonical_float` (or deprecating them rather than removing) to 
avoid breaking callers and to reduce the chance that users accidentally hash 
floats with Rust's default `Hash` semantics.
   



##########
datasketches/tests/theta_sketch_test.rs:
##########
@@ -39,14 +40,30 @@ fn test_update_various_types() {
     sketch.update("string");
     sketch.update(42i64);
     sketch.update(42u64);
-    sketch.update_f64(3.15);
-    sketch.update_f64(3.15);
-    sketch.update_f32(3.15);
-    sketch.update_f32(3.15);
+    // where floating-point numbers has different representations
+    sketch.update(hash_value::canonical_float::from_f64(3.15));

Review Comment:
   Grammar in test comment: "floating-point numbers has" -> "floating-point 
numbers have" (also applies to the later comment about having the same 
representation).



##########
datasketches/src/theta/sketch.rs:
##########
@@ -107,48 +106,25 @@ impl ThetaSketch {
 
     /// Update the sketch with a hashable value.
     ///
-    /// For `f32`/`f64` values, use `update_f32`/`update_f64` instead.
+    /// You may use [`hash_value`](crate::hash_value) wrappers when matching 
other datasketches
+    /// implementations requires a specific value hashing strategy.
     ///
     /// # Examples
     ///
     /// ```
-    /// # use datasketches::theta::ThetaSketch;
-    /// let mut sketch = ThetaSketch::builder().build();
-    /// sketch.update("apple");
-    /// assert!(sketch.estimate() >= 1.0);
-    /// ```
-    pub fn update<T: Hash>(&mut self, value: T) {
-        self.table.try_insert(value);
-    }
-
-    /// Update the sketch with a f64 value.
-    ///
-    /// # Examples
+    /// use datasketches::hash_value;
+    /// use datasketches::theta::ThetaSketch;
     ///
-    /// ```
-    /// # use datasketches::theta::ThetaSketch;
     /// let mut sketch = ThetaSketch::builder().build();
-    /// sketch.update_f64(1.0);
+    /// sketch.update("apple");
     /// assert!(sketch.estimate() >= 1.0);
-    /// ```
-    pub fn update_f64(&mut self, value: f64) {
-        // Canonicalize double for compatibility with Java
-        let canonical = canonical_double(value);
-        self.update(canonical);
-    }
-
-    /// Update the sketch with a f32 value.
     ///
-    /// # Examples
-    ///
-    /// ```
-    /// # use datasketches::theta::ThetaSketch;
     /// let mut sketch = ThetaSketch::builder().build();
-    /// sketch.update_f32(1.0);
+    /// sketch.update(hash_value::raw_bytes::from_str("apple"));
     /// assert!(sketch.estimate() >= 1.0);
     /// ```
-    pub fn update_f32(&mut self, value: f32) {
-        self.update_f64(value as f64);
+    pub fn update<T: Hash>(&mut self, value: T) {
+        self.table.try_insert(value);
     }
 

Review Comment:
   Removing the dedicated `update_f32`/`update_f64` APIs makes it easy for 
callers to accidentally hash floats using Rust's default `Hash` semantics 
(distinct NaN payloads, signed zero, f32 vs f64 differences), which is exactly 
the class of compatibility issues this PR is addressing. Consider keeping 
`update_f32`/`update_f64` as thin convenience wrappers around 
`hash_value::canonical_float::{from_f32, from_f64}` (possibly deprecated) to 
preserve ergonomics and avoid accidental mismatches.
   



##########
datasketches/src/theta/sketch.rs:
##########
@@ -107,48 +106,25 @@ impl ThetaSketch {
 
     /// Update the sketch with a hashable value.
     ///
-    /// For `f32`/`f64` values, use `update_f32`/`update_f64` instead.
+    /// You may use [`hash_value`](crate::hash_value) wrappers when matching 
other datasketches
+    /// implementations requires a specific value hashing strategy.

Review Comment:
   Grammar: "implementations requires" should be "implementations require".
   



##########
datasketches/src/cpc/sketch.rs:
##########
@@ -172,7 +171,25 @@ impl CpcSketch {
 
     /// Update the sketch with a hashable value.
     ///
-    /// For `f32`/`f64` values, use `update_f32`/`update_f64` instead.
+    /// You may use [`hash_value`](crate::hash_value) wrappers when matching 
other datasketches
+    /// implementations requires a specific value hashing strategy.

Review Comment:
   Grammar: "implementations requires" should be "implementations require".
   



##########
datasketches/src/hll/mod.rs:
##########
@@ -206,12 +206,15 @@ impl Coupon {
 
     /// Compute the HLL coupon for a hashable value.
     ///
+    /// You may use [`hash_value`](crate::hash_value) wrappers when matching 
other datasketches
+    /// implementations requires a specific value hashing strategy.

Review Comment:
   Grammar: "implementations requires" should be "implementations require".
   



##########
datasketches/src/hll/sketch.rs:
##########
@@ -158,9 +158,12 @@ impl HllSketch {
 
     /// Update the sketch with a value.
     ///
-    /// Accepts any type that implements [`Hash`].  The value is hashed and 
converted to
+    /// Accepts any type that implements [`Hash`]. The value is hashed and 
converted to
     /// an internal coupon, which is then inserted into the sketch.
     ///
+    /// You may use [`hash_value`](crate::hash_value) wrappers when matching 
other datasketches
+    /// implementations requires a specific value hashing strategy.

Review Comment:
   Grammar: "implementations requires" should be "implementations require".
   



##########
datasketches/tests/theta_sketch_test.rs:
##########
@@ -39,14 +40,30 @@ fn test_update_various_types() {
     sketch.update("string");
     sketch.update(42i64);
     sketch.update(42u64);
-    sketch.update_f64(3.15);
-    sketch.update_f64(3.15);
-    sketch.update_f32(3.15);
-    sketch.update_f32(3.15);
+    // where floating-point numbers has different representations
+    sketch.update(hash_value::canonical_float::from_f64(3.15));
+    sketch.update(hash_value::canonical_float::from_f64(3.15));
+    sketch.update(hash_value::canonical_float::from_f32(3.15));
+    sketch.update(hash_value::canonical_float::from_f32(3.15));
     sketch.update([1u8, 2, 3]);
 
     assert!(!sketch.is_empty());
     assert_eq!(sketch.estimate(), 5.0);
+
+    let mut sketch = ThetaSketch::builder().lg_k(12).build();
+
+    sketch.update("string");
+    sketch.update(42i64);
+    sketch.update(42u64);
+    // where floating-point numbers has the same representation
+    sketch.update(hash_value::canonical_float::from_f64(5.0));

Review Comment:
   Grammar in test comment: "floating-point numbers has" -> "floating-point 
numbers have".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to