(datasketches-python) 01/01: Add get_PMF and get_CDF to t-digest, add t-digest docs, update tdigest merge to const input

jmalkin Fri, 10 Jan 2025 13:58:26 -0800

This is an automated email from the ASF dual-hosted git repository.

jmalkin pushed a commit to branch tdigest_docs
in repository https://gitbox.apache.org/repos/asf/datasketches-python.git


commit d2d103983bc4a08b4abe4fe2767b06fa2b8fb519
Author: Jon Malkin <[email protected]>
AuthorDate: Fri Jan 10 13:57:58 2025 -0800

    Add get_PMF and get_CDF to t-digest, add t-digest docs, update tdigest 
merge to const input
---
 CMakeLists.txt                    |  2 +-
 NOTICE                            |  4 ++--
 README.md                         |  5 +++-
 docs/source/quantiles/index.rst   | 14 +++++++----
 docs/source/quantiles/kll.rst     |  5 ----
 docs/source/quantiles/tdigest.rst | 50 +++++++++++++++++++++++++++++++++++++++
 src/tdigest_wrapper.cpp           | 29 ++++++++++++++++++++++-
 tests/tdigest_test.py             | 17 +++++++++++++
 8 files changed, 111 insertions(+), 15 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 4e9ba96..2e74346 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -117,7 +117,7 @@ cmake_policy(SET CMP0097 NEW)
 include(ExternalProject)
 ExternalProject_Add(datasketches
   GIT_REPOSITORY https://github.com/apache/datasketches-cpp.git
-  GIT_TAG 5.1.0
+  GIT_TAG 5.2.0
   GIT_SHALLOW true
   GIT_SUBMODULES ""
   INSTALL_DIR /tmp/datasketches
diff --git a/NOTICE b/NOTICE
index 1305168..1668e07 100644
--- a/NOTICE
+++ b/NOTICE
@@ -1,9 +1,9 @@
 Apache DataSketches Python
-Copyright 2024 The Apache Software Foundation
+Copyright 2025 The Apache Software Foundation
 
 Copyright 2015-2018 Yahoo Inc.
 Copyright 2019-2020 Verizon Media
-Copyright 2021 Yahoo Inc.
+Copyright 2021- Yahoo Inc.
 
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
diff --git a/README.md b/README.md
index 2641de8..9961496 100644
--- a/README.md
+++ b/README.md
@@ -75,10 +75,13 @@ The unit tests are mostly structured in a tutorial style 
and can be used as a re
   - `vector_of_kll_floats_sketches`
 - Kolmogorov-Smirnov Test
   - `ks_test` applied to a pair of matched-type Absolute Error quantiles 
sketches
-- Density
+- Kernel Density
   - `density_sketch`
 - Count-min sketch
   - `count_min_sketch`
+- t-digest
+  - tdigest_float
+  - tdigest_double
 
 ## Known Differences from C++
 
diff --git a/docs/source/quantiles/index.rst b/docs/source/quantiles/index.rst
index b1928f6..bf53ea3 100644
--- a/docs/source/quantiles/index.rst
+++ b/docs/source/quantiles/index.rst
@@ -10,17 +10,21 @@ in the stream.
 These sketches may be used to compute approximate histograms, Probability Mass 
Functions (PMFs), or
 Cumulative Distribution Functions (CDFs).
 
-The library provides three types of quantiles sketches, each of which has 
generic items as well as versions
-specific to a given numeric type (e.g. integer or floating point values). All 
three types provide error
-bounds on rank estimation with proven probabilistic error distributions.
+The library provides four types of quantiles sketches, three of which have 
generic items as well as versions
+specific to a given numeric type (e.g. integer or floating point values). 
Those three types provide error
+bounds on rank estimation with proven probabilistic error distributions. 
t-digest is a heuristic-based sketch
+that works only on numeric data, and while the error properties are not 
guaranteed, the sketch typically
+does a good job with small storage.
 
-  * KLL: Provides uniform rank estimation error over the entire range
+  * KLL: Provides uniform rank estimation error over the entire range.
   * REQ: Provides relative rank error estimates, which decreases approaching 
either the high or low end values.
+  * t-digest: Relative rank error estimates, heuristic-based without 
guarantees but quite compact with generally very good error properties.
   * Classic quantiles: Largely deprecated in favor of KLL, also provides 
uniform rank estimation error. Included largely for backwards compatibility 
with historic data.
 
 .. toctree::
   :maxdepth: 1
-   
+
   kll
   req
+  tdigest
   quantiles_depr
\ No newline at end of file
diff --git a/docs/source/quantiles/kll.rst b/docs/source/quantiles/kll.rst
index 0e54b44..ab7f0c4 100644
--- a/docs/source/quantiles/kll.rst
+++ b/docs/source/quantiles/kll.rst
@@ -14,10 +14,6 @@ The analysis is obtained using `get_quantile()` function or 
the
 inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and 
`get_cdf()`
 (Cumulative Distribution Function).
 
-As of May 2020, this implementation produces serialized sketches which are 
binary-compatible
-with the equivalent Java implementation only when template parameter `T = 
float`
-(32-bit single precision values).
-
 Given an input stream of `N` items, the `natural rank` of any specific
 item is defined as its index `(1 to N)` in inclusive mode
 or `(0 to N-1)` in exclusive mode
@@ -168,4 +164,3 @@ Additionally, the interval may be quite large for certain 
distributions.
     .. rubric:: Non-static Methods:
 
     .. automethod:: __init__
-
diff --git a/docs/source/quantiles/tdigest.rst 
b/docs/source/quantiles/tdigest.rst
new file mode 100644
index 0000000..1af4a9e
--- /dev/null
+++ b/docs/source/quantiles/tdigest.rst
@@ -0,0 +1,50 @@
+t-digest
+--------
+
+.. currentmodule:: datasketches
+
+The implementation in this library is based on the MergingDigest described in
+`Computing Extremely Accurate Quantiles Using t-Digests 
<https://arxiv.org/abs/1902.04023>`_ by Ted Dunning and Otmar Ertl.
+
+The implementation in this library has a few differences from the reference 
implementation associated with that paper:
+
+* Merge does not modify the input
+* Derialization similar to other sketches in this library, although reading 
the reference implementation format is supported
+
+Unlike all other algorithms in the library, t-digest is empirical and has no 
mathematical basis for estimating its error
+and its results are dependent on the input data. However, for many common data 
distributions, it can produce excellent results.
+t-digest also operates only on numeric data and, unlike the the quantiles 
family algorithms in the library which return quantile
+approximations from the input domain, t-digest interpolates values and will 
hold and return data points not seen in the input.
+
+The closest alternative to t-digest in this library is REQ sketch. It 
prioritizes one chosen side of the rank domain:
+either low rank accuracy or high rank accuracy. t-digest (in this 
implementation) prioritizes both ends of the rank domain
+and has lower accuracy towards the middle of the rank domain (median).
+
+Measurements show that t-digest is slightly biased (tends to underestimate low 
ranks and overestimate high ranks), while still
+doing very well close to the extremes. The effect seems to be more pronounced 
with more input values.
+
+.. autoclass:: tdigest_float
+    :members:
+    :undoc-members:
+    :exclude-members: deserialize
+
+    .. rubric:: Static Methods:
+
+    .. automethod:: deserialize
+
+    .. rubric:: Non-static Methods:
+
+    .. automethod:: __init__
+
+.. autoclass:: tdigest_double
+    :members:
+    :undoc-members:
+    :exclude-members: deserialize
+
+    .. rubric:: Static Methods:
+
+    .. automethod:: deserialize
+
+    .. rubric:: Non-static Methods:
+
+    .. automethod:: __init__
diff --git a/src/tdigest_wrapper.cpp b/src/tdigest_wrapper.cpp
index 059ee1d..43248fb 100644
--- a/src/tdigest_wrapper.cpp
+++ b/src/tdigest_wrapper.cpp
@@ -24,6 +24,7 @@
 #include <nanobind/nanobind.h>
 #include <nanobind/make_iterator.h>
 #include <nanobind/stl/string.h>
+#include <nanobind/stl/vector.h>
 #include <nanobind/ndarray.h>
 
 #include "tdigest.hpp"
@@ -44,7 +45,7 @@ void bind_tdigest(nb::module_ &m, const char* name) {
     .def("__copy__", [](const tdigest<T>& sk) { return tdigest<T>(sk); })
     .def("update", (void(tdigest<T>::*)(T)) &tdigest<T>::update, 
nb::arg("item"),
         "Updates the sketch with the given value")
-    .def("merge", (void(tdigest<T>::*)(tdigest<T>&)) &tdigest<T>::merge, 
nb::arg("sketch"),
+    .def("merge", (void(tdigest<T>::*)(const tdigest<T>&)) &tdigest<T>::merge, 
nb::arg("sketch"),
          "Merges the provided sketch into this one")
     .def("__str__", [](const tdigest<T>& sk) { return sk.to_string(); },
          "Produces a string summary of the sketch")
@@ -71,6 +72,32 @@ void bind_tdigest(nb::module_ &m, const char* name) {
     .def("get_serialized_size_bytes", &tdigest<T>::get_serialized_size_bytes,
          nb::arg("with_buffer")=false,
          "Returns the size of the serialized sketch, in bytes")
+    .def(
+        "get_pmf",
+        [](const tdigest<T>& sk, const std::vector<T>& split_points) {
+          return sk.get_PMF(split_points.data(), split_points.size());
+        },
+        nb::arg("split_points"),
+        "Returns an approximation to the Probability Mass Function (PMF) of 
the input stream "
+        "given a set of split points (values).\n"
+        "If the sketch is empty this returns an empty vector.\n"
+        "split_points is an array of m unique, monotonically increasing float 
values "
+        "that divide the real number line into m+1 consecutive disjoint 
intervals.\n"
+        "It is not necessary to include either the min or max values in these 
split points."
+    )
+    .def(
+        "get_cdf",
+        [](const tdigest<T>& sk, const std::vector<T>& split_points) {
+          return sk.get_CDF(split_points.data(), split_points.size());
+        },
+        nb::arg("split_points"),
+        "Returns an approximation to the Cumulative Distribution Function 
(CDF), which is the "
+        "cumulative analog of the PMF, of the input stream given a set of 
split points (values).\n"
+        "If the sketch is empty this returns an empty vector.\n"
+        "split_points is an array of m unique, monotonically increasing float 
values "
+        "that divide the real number line into m+1 consecutive disjoint 
intervals.\n"
+        "It is not necessary to include either the min or max values in these 
split points."
+    )
     ;
 
     add_serialization<T>(tdigest_class);
diff --git a/tests/tdigest_test.py b/tests/tdigest_test.py
index fab32c4..de36f69 100644
--- a/tests/tdigest_test.py
+++ b/tests/tdigest_test.py
@@ -50,6 +50,15 @@ class TdigestTest(unittest.TestCase):
       self.assertFalse(td.is_empty())
       self.assertEqual(td.get_total_weight(), n)
 
+      # we can get the PMF and CDF
+      pmf = td.get_pmf([-0.5, 0.0, 0.5])
+      self.assertEqual(len(pmf), 4)
+      self.assertAlmostEqual(sum(pmf), 1.0)
+
+      cdf = td.get_cdf([0.0])
+      self.assertEqual(len(cdf), 2)
+      self.assertAlmostEqual(cdf[0], 0.5, delta = 0.05)
+
       # we can define a new tdiget with a different distribution, then merge 
them
       td2 = tdigest_double()
       td2.update(np.random.normal(loc=2.0, size=n))
@@ -89,6 +98,14 @@ class TdigestTest(unittest.TestCase):
       self.assertFalse(td.is_empty())
       self.assertEqual(td.get_total_weight(), n)
 
+      pmf = td.get_pmf([-0.5, 0.0, 0.5])
+      self.assertEqual(len(pmf), 4)
+      self.assertAlmostEqual(sum(pmf), 1.0)
+
+      cdf = td.get_cdf([0.0])
+      self.assertEqual(len(cdf), 2)
+      self.assertAlmostEqual(cdf[0], 0.5, delta = 0.05)
+
       td2 = tdigest_float()
       td2.update(np.random.normal(loc=2.0, size=n))
       td.merge(td2)


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(datasketches-python) 01/01: Add get_PMF and get_CDF to t-digest, add t-digest docs, update tdigest merge to const input

Reply via email to