This is an automated email from the ASF dual-hosted git repository.
leerho pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git
The following commit(s) were added to refs/heads/master by this push:
new 8cff4668 Add Density Sketch Documentation
8cff4668 is described below
commit 8cff46682047ff572266a5c5fa784a091064e9ba
Author: Lee Rhodes <[email protected]>
AuthorDate: Sat Jan 24 16:47:55 2026 -0800
Add Density Sketch Documentation
---
_includes/toc.html | 7 ++++
docs/Density/DensitySketch.md | 66 ++++++++++++++++++++++++++++++++++++++
src/main/resources/docgen/toc.json | 6 ++++
3 files changed, 79 insertions(+)
diff --git a/_includes/toc.html b/_includes/toc.html
index 23550378..31def6e4 100644
--- a/_includes/toc.html
+++ b/_includes/toc.html
@@ -82,6 +82,13 @@
<li><a
href="{{site.docs_dir}}/Sampling/ReservoirSamplingSketches.html">•Reservoir
Sampling Sketches</a></li>
<li><a
href="{{site.docs_dir}}/Sampling/VarOptSamplingSketches.html">•VarOpt Sampling
Sketches</a></li>
</div>
+
+ <p id="density-&-coresets">
+ <a data-toggle="collapse" class="menu collapsed"
href="#collapse_density_&_coresets">Density & Coresets</a>
+ </p>
+ <div class="collapse" id="collapse_density_&_coresets">
+ <li><a href="{{site.docs_dir}}/Density/DensitySketch.html">•Density
Sketch</a></li>
+ </div>
</div>
<p id="system-integrations">
diff --git a/docs/Density/DensitySketch.md b/docs/Density/DensitySketch.md
new file mode 100644
index 00000000..357c0480
--- /dev/null
+++ b/docs/Density/DensitySketch.md
@@ -0,0 +1,66 @@
+---
+layout: doc_page
+---
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+## Contents
+<!-- TOC -->
+* [Density Sketch](#density-sketch)
+ * [Foundation Paper](#paper)
+ * [Key Highlights](#highlights)
+ * [Practical Implications](#implications)
+ * [Inspirational Code, Example and Tests](#inspiration)
+<!-- TOC -->
+
+<a id="density-sketch"></a>
+## DensitySketch
+**Quick summary:** This sketch builds a coreset from the given set of input
points as multi-dimensional vectors. Provides density estimate at a given point.
+
+<a id="paper"></a>
+### Our implementation is based on the following paper:<br>
+
+* Zohar Karnin, Edo Liberty "Discrepancy, Coresets, and Sketches in Machine
Learning"
+https://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf
+ * The paper explores the relationship between class discrepancy and the
efficiency of data reduction techniques like coresets and streaming sketches.
+ * The core contribution is a general framework for creating small,
representative subsets of data (coresets) that maintain the statistical
properties of the original large dataset.
+
+<a id="highlights"></a>
+#### Key Highlights:
+* **New Complexity Measure:** The authors define "class discrepancy" as a way
to characterize the coreset complexity of different function families, similar
to how Rademacher complexity is used for generalization.
+* **Improved Coreset Sizes:** They prove the existence of
$\epsilon$-approximation coresets of size $O(\sqrt{d}/\epsilon)$ for several
common machine learning problems, including:
+ * Logistic regression
+ * Sigmoid activation loss
+ * Matrix covariance
+ * Kernel density estimation
+* **Gaussian Kernel Resolution:** The paper resolves a long-standing open
problem by matching the lower bound for the coreset complexity of Gaussian
kernel density estimation at $O(\sqrt{d}/\epsilon)$.
+* **Streaming Algorithms:** It introduces an exponential improvement to the
"merge-and-reduce" trick, leading to better streaming sketches for any problem
with low discrepancy.
+* **Deterministic Algorithm:** The authors provide a simple, deterministic
algorithm for finding low-discrepancy sequences and coresets for any positive
semi-definite kernel.
+
+<a id="implications"></a>
+#### Practical Implications:
+The findings allow for significantly faster optimization in large-scale
machine learning. By reducing a massive dataset into a much smaller coreset,
researchers can perform complex calculations (like training a logistic
regression model) with a fraction of the computational cost while maintaining a
high level of accuracy.
+
+<a id="inspiration"></a>
+### Our implementations was inspired by the following implementation, example,
and tests by Edo Liberty:
+* **Code:**
https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde.py
+* **Example**
https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde_example_usage.ipynb
+* **Tests**
https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde_test.py
+
+
+*(Note: much of this overview was generated by Gemini AI, it may contain
errors.)*
diff --git a/src/main/resources/docgen/toc.json
b/src/main/resources/docgen/toc.json
index a11ba460..d451ee26 100644
--- a/src/main/resources/docgen/toc.json
+++ b/src/main/resources/docgen/toc.json
@@ -68,6 +68,12 @@
{"class":"Doc", "desc" : "VarOpt Sampling Sketches",
"dir" : "Sampling", "file": "VarOptSamplingSketches" },
]
},
+
+ { "class":"Dropdown", "desc" : "Density & Coresets", "array":
+ [
+ {"class":"Doc", "desc" : "Density Sketch",
"dir" : "Density", "file": "DensitySketch" },
+ ]
+ },
]
},
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]