This is an automated email from the ASF dual-hosted git repository.
leerho pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git
The following commit(s) were added to refs/heads/master by this push:
new 2d53e05f add material to eb-pps sampling sketches
2d53e05f is described below
commit 2d53e05f7267e7667e91afa6f3b0fe1de8d961dd
Author: Lee Rhodes <[email protected]>
AuthorDate: Sat Jan 24 15:13:41 2026 -0800
add material to eb-pps sampling sketches
---
docs/Sampling/EB-PPS_SamplingSketches.md | 127 +++++++++++++++++++++++++++++--
1 file changed, 121 insertions(+), 6 deletions(-)
diff --git a/docs/Sampling/EB-PPS_SamplingSketches.md
b/docs/Sampling/EB-PPS_SamplingSketches.md
index 0eee9068..14775e7a 100644
--- a/docs/Sampling/EB-PPS_SamplingSketches.md
+++ b/docs/Sampling/EB-PPS_SamplingSketches.md
@@ -22,6 +22,23 @@ layout: doc_page
## Contents
<!-- TOC -->
* [EB-PPS Sampling Sketches](#eb-pps-sampling-sketch)
+ * [Key Concepts of EB-PPS](#key-concepts-of-eb-pps)
+ * [Computational Advantages](#computational-advantages)
+ * [Applications in Machine Learning](#applications-in-ML)
+ * [EB-PPS versus VarOpt for large datasets](#eb-pps-vs-varopt1)
+ * [The fixed sample size versus PPS property trade-off](#sample-size-vs-pps-tradeoff)
+ * [The Core Conflict](#the-core-conflict)
+ * [Prioritizing Fixed Sample Size (e.g., VarOpt)](#fixed-sample-size)
+ * [Prioritizing the PPS Property (e.g., EB-PPS)](#pps-property)
+ * [Summary Table](#summary-table)
+ * [How EB-PPS works with varying weights](#eb-pps-weights)
+ * [EB-PPS versus VarOpt Table](#eb-pps-vs-varopt2)
+ * [Situations that make PPS vital for classifier
performance](#pps-for-classifiers)
+ * 1. [Training Bayes-Optimal Classifiers](#training-classifiers)
+ * 2. [Handling Severe Class Imbalance](#class-imbalance)
+ * 3. [Maintaining Probability Calibration](#probability-calibration)
+ * 4. [Legal and Ethical Fairness](#legal-ethical-fairness)
+
<!-- TOC -->
<a id="eb-pps-sampling-sketch"></a>
@@ -40,10 +57,9 @@ Key features and benefits of EB-PPS include:
* **Application in Machine Learning:** This method is particularly useful for
training classifiers on imbalanced datasets, where it helps in overcoming
issues with biased results.
* **Comparison with Other Methods:** In the above referenced paper, EB-PPS is
compared to other methods such as Conditional Poisson Sampling (CPS) and Pareto
PPS sampling, which also aim to have inclusion probabilities proportional to
given size measures.
-EB-PPS is advantageous for its simplicity, efficiency, and ability to balance
the need for exact PPS with the constraint of a fixed sample size.
-
-EB-PPS is a modern sampling scheme designed to resolve a fundamental conflict
in statistical sampling: the inability to simultaneously guarantee exact
**Probability Proportional to Size (PPS)** and a fixed sample size for all
datasets.
+EB-PPS is a modern sampling scheme designed to resolve a fundamental conflict
in statistical sampling: the inability to simultaneously guarantee exact
**Probability Proportional to Size (PPS)** and a fixed sample size for all
datasets. It is advantageous for its simplicity, efficiency, and ability to
balance the need for exact PPS with the constraint of a fixed sample size.
+<a id="key-concepts-of-eb-pps"></a>
### Key Concepts of EB-PPS
Traditional PPS schemes often prioritize keeping a fixed sample size *k*,
which sometimes forces them to violate the PPS property (where each item's
selection probability must be exactly proportional to its weight). EB-PPS flips
this trade-off:
@@ -51,19 +67,118 @@ Traditional PPS schemes often prioritize keeping a fixed
sample size *k*, which
* **Bounded Sample Size:** It uses the target *k* as a strict upper bound
rather than a fixed requirement.
* **Optimality:** If the sample size cannot be exactly *k* while maintaining
strict PPS, the algorithm produces a sample with the maximal possible expected
size and minimal variance.
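The bullets above can be made concrete with a small sketch. Assuming (as the description suggests) that strict PPS means every inclusion probability is *c·wᵢ* for a single scaling constant *c*, the largest feasible *c* is min(*k*/*W*, 1/*w<sub>max</sub>*), which keeps every probability ≤ 1 and the expected sample size as close to *k* as strict PPS allows. This is only an illustration of the probabilities implied by the bullets, not the streaming algorithm itself; see the referenced paper for the exact scheme.

```python
def strict_pps_probabilities(weights, k):
    """Inclusion probabilities p_i = c * w_i that are exactly
    proportional to weight, never exceed 1, and make the expected
    sample size as large as possible without exceeding k.
    Illustrative sketch only -- not the streaming algorithm."""
    total = sum(weights)
    c = min(k / total, 1.0 / max(weights))
    return [c * w for w in weights]

# One heavy item forces the expected sample size below k:
probs = strict_pps_probabilities([4, 1, 1, 1], k=2)
# probs == [1.0, 0.25, 0.25, 0.25]; expected size = 1.75 < k
```

When no single weight dominates (all *c·wᵢ* ≤ 1 at *c* = *k*/*W*), the expected size is exactly *k*; the shortfall appears only when strict PPS and the bound on *k* genuinely conflict.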
+<a id="computational-advantages"></a>
### Computational Advantages
-EB-PPS is particularly suited for high-volume data environments and 2026+ era
big data analytics:
+EB-PPS is particularly suited for high-volume data environments and modern-era big data analytics:
* **One-Pass Streaming:** It processes items sequentially and does not need to
know the total population size in advance.
-* **Efficiency:** It features an amortized processing time of *O(1)* **per
item**, making it computationally superior to older state-of-the-art algorithms
like VarOpt, which has *O(\log n)* complexity.
+* **Efficiency:** It features an amortized processing time of *O(1)* **per
item**, making it computationally superior to older state-of-the-art algorithms
like VarOpt, which has *O(log k)* complexity.
* **Memory Management:** By bounding the sample size, it prevents storage
overflows during real-time data ingestion.
+<a id="applications-in-ML"></a>
### Applications in Machine Learning
-A primary 2026 application for EB-PPS is training classifiers on imbalanced
datasets:
+A primary modern application for EB-PPS is training classifiers on imbalanced
datasets:
* **Bayes-Optimality:** By maintaining exact PPS, it allows standard training
techniques (designed for 0-1 loss) to produce Bayes-optimal classifiers even
when the true loss function is imbalanced.
* **Superior Performance:** Research indicates that smaller, well-curated
EB-PPS samples can outperform larger samples from traditional schemes (like
VarOpt) because they avoid the bias introduced by violating PPS.
+
+<a id="eb-pps-vs-varopt1"></a>
+### EB-PPS versus VarOpt for large datasets
+EB-PPS is more efficient than VarOpt for large, high-volume datasets due to
its superior computational complexity and optimized sample curation. Key
advantages include:
+
+* **Optimal Computational Complexity:** EB-PPS achieves an *O(1)* **amortized
processing time** per item. This is an improvement over VarOpt, which typically
requires *O(log k)* (or *O(log log k)* in some streaming implementations) time
per item to maintain its internal structures.
+* **Reduced Sample Size Requirements:** Because EB-PPS maintains strict PPS
properties, it can achieve superior results with smaller samples. Research
shows EB-PPS can produce classifiers with roughly 36% less loss than VarOpt
while using sample sizes that are **one-third to one-half as large.** Smaller
samples directly translate to faster downstream analytics and lower memory
overhead.
+* **Simplified Data Structures:** EB-PPS uses a single array of size *k* and
does not require complex heap-based management, making it easier to implement
and maintain in parallel systems.
+* **Avoidance of Bias-Correction Overhead:** VarOpt often violates exact PPS
to maintain a fixed sample size, requiring complex customized loss functions or
additional training procedures to correct this bias. EB-PPS avoids these
secondary computational costs by ensuring inclusion probabilities are "as
advertised" from the start.
+
+<a id="sample-size-vs-pps-tradeoff"></a>
+### The fixed sample size versus PPS property trade-off
+The primary trade-off in weighted sampling remains the tension between
maintaining a fixed sample size *k* and strictly adhering to the PPS property.
These two goals are mathematically incompatible in many real-world datasets.
+
+<a id="the-core-conflict"></a>
+#### The Core Conflict
+The fundamental issue arises when some items have such high weights that they
"crowd out" the rest of the sample:
+
+* **Incompatibility:** If an item's weight is disproportionately large
compared to the total weight of the population, a fixed *k* may force the
algorithm to sample lower-weight items more frequently than their weight allows
just to fill the *k* slots.
+* **Statistical Implication:** When a fixed size is prioritized, the inclusion
probabilities are no longer truly proportional to size (PPS), introducing
statistical bias.
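The arithmetic behind this incompatibility fits in a few lines. The weights below are hypothetical; the point is that forcing the expected sample size to equal *k* under exact proportionality can demand a "probability" greater than 1:

```python
# Hypothetical weights with one dominant item.
weights = [90, 5, 5]
k = 2
W = sum(weights)  # 100

# For E[sample size] == k under exact PPS, each p_i must be k*w_i/W:
naive = [k * w / W for w in weights]
# naive[0] == 1.8 -- not a valid probability, so any scheme that
# insists on exactly k items must distort the PPS probabilities.
```

This is why a fixed-*k* scheme like VarOpt must cap the heavy item's probability at 1 and inflate the others, breaking strict proportionality.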
+
+<a id="fixed-sample-size"></a>
+#### Prioritizing Fixed Sample Size (e.g., VarOpt)
+Most traditional PPS schemes, like VarOpt, prioritize keeping the sample size
exactly at *k*.
+
+* **Advantage:** Predictable resource consumption and downstream processing
time, as the analyst knows exactly how many data points will be returned.
+* **Disadvantage:** Violates the PPS property when large-weight items exist.
This requires complex bias-correction or specialized loss functions during
model training.
+
+<a id="pps-property"></a>
+#### Prioritizing the PPS Property (e.g., EB-PPS)
+Modern approaches like EB-PPS prioritize strict adherence to the PPS property
over a fixed count.
+
+* **Advantage:** Maintains "as advertised" inclusion probabilities. This
allows standard training algorithms to produce **Bayes-optimal classifiers**
directly, which is critical for imbalanced datasets.
+* **Disadvantage:** The sample size becomes a variable (though bounded by
*k*). The actual size may fall below *k* if strict PPS compliance necessitates
a smaller, more representative sample.
+
+<a id="summary-table"></a>
+#### Summary Table
+
+| Priority | Strategy | Consequence |
+|:------------------|:----------------------------|:-------------------------|
+| <b>Fixed Size</b> | Fill k slots at all costs. | PPS property is violated; statistical bias is introduced. |
+| <b>Strict PPS</b> | Use probabilities strictly proportional to weights. | Sample size becomes a variable upper-bounded by *k*. |
+
+<a id="eb-pps-weights"></a>
+### How EB-PPS works with varying weights
+The algorithm's efficiency stems from how it handles incoming items regardless
of their individual "size" or weight:
+
+* **Flat Data Structure:** EB-PPS uses only a single array of size *k* (where
*k* is the target sample size). Unlike VarOpt, which relies on a heap or
priority queue to manage items by weight (resulting in *O(log k)* costs),
EB-PPS performs simple updates to its array.
+* **Constant-Time Decision Logic:** When a new item arrives with a specific
weight, the algorithm performs a fixed number of arithmetic operations to
determine if it should be included. These operations do not scale with the
total population size or the sample size *k*.
+* **Amortized Rebalancing:** While most items are processed in a single step,
certain items may trigger a "reweighting" or rebalancing of the internal
sample. Because these events occur with predictable, low frequency relative to
the total number of items, the cost is spread out (amortized) to maintain an
average of *O(1)* **per item**.
+* **Weight Independence:** The computational cost remains constant whether the
item has a very large weight (making it almost certain to be included) or a
very small weight (making it rare). The algorithm only processes the current
item's weight against its internal state, never needing to re-sort or re-scan
the entire reservoir for every new entry.
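The constant-state flavor of this bookkeeping can be sketched in a few lines. The class below is hypothetical and deliberately incomplete: it tracks only the running total weight and maximum weight needed to derive a strict-PPS scaling constant in one pass, with O(1) work per item and no heap; the real EB-PPS algorithm additionally maintains the bounded sample array itself.

```python
class PpsProbabilityTracker:
    """Minimal one-pass sketch of EB-PPS-style bookkeeping:
    constant state per item, no priority queue, and no advance
    knowledge of the stream length. Illustration only."""

    def __init__(self, k):
        self.k = k
        self.total = 0.0   # running total weight W
        self.w_max = 0.0   # largest weight seen so far

    def update(self, weight):
        # O(1) work per item, independent of weight magnitude
        self.total += weight
        self.w_max = max(self.w_max, weight)

    def scale(self):
        # c such that p_i = c * w_i is a valid strict-PPS
        # probability with expected sample size <= k
        return min(self.k / self.total, 1.0 / self.w_max)

t = PpsProbabilityTracker(k=2)
for w in [100, 1, 1, 1]:
    t.update(w)
# t.scale() == min(2/103, 1/100) == 0.01
```

Note that `update` touches the same two scalars whether the weight is huge or tiny, which is the weight-independence property described above.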
+
+<a id="eb-pps-vs-varopt2"></a>
+#### EB-PPS versus VarOpt Table
+
+| Feature | EB-PPS | VarOpt |
+|:--------|:-------|:-------|
+| <b>Complexity per Item</b> | *O(1)* (Amortized) | *O(log k)* (Standard) |
+| <b>Data Structure</b> | Simple Array | Priority Queue / Heap |
+| <b>PPS Property</b> | <b>Strictly enforced</b> | Violated if necessary to keep *k* fixed |
+| <b>Sample Size *(k)*</b> | Upper bound (Variable) | Fixed (Strict) |
+
+<a id="pps-for-classifiers"></a>
+### Situations that make PPS vital for classifier performance
+Today, strict adherence to the Probability Proportional to Size (PPS)
property—as prioritized by schemes like EB-PPS—is considered vital for
classifier performance in the following high-stakes scenarios:
+
+<a id="training-classifiers"></a>
+#### 1. Training Bayes-Optimal Classifiers
+
+For a classifier to be truly "optimal," it must minimize expected risk based
on the data's true underlying distribution.
+
+* **The PPS Link:** Exact PPS sampling ensures that the training subset
maintains the statistical integrity of the original dataset's weights.
+* **Direct Training:** When PPS is strictly maintained, you can use standard
training algorithms (optimized for 0-1 loss) to produce Bayes-optimal decision
boundaries. If PPS is violated (as often happens with fixed-size schemes like
VarOpt), the resulting decision boundary shifts, leading to suboptimal
performance unless complex, custom loss-correction is applied.
+
+<a id="class-imbalance"></a>
+#### 2. Handling Severe Class Imbalance
+In datasets where the minority class is extremely rare (e.g., fraud detection
or rare disease diagnosis), small errors in inclusion probability can cause the
classifier to ignore critical but rare signals.
+
+* **Avoiding "Majority Bias":** Inaccurate sampling often leads a model to
simply predict the majority class for all instances to achieve high "accuracy"
while failing at its actual task.
+* **Exact Representation:** Strict PPS ensures that the weight assigned to
these rare cases is exactly preserved in the sample, forcing the model to learn
the minority class features correctly.
+
+<a id="probability-calibration"></a>
+#### 3. Maintaining Probability Calibration
+Calibration refers to the model's ability to provide accurate probability
estimates (e.g., "there is a 70% chance of malignancy") rather than just a 0/1
label.
+
+* **Clinical Utility:** In healthcare, over- or under-estimating risks can
lead to dangerous overtreatment or missed diagnoses.
+* **PPS Advantage:** Because EB-PPS does not distort the inclusion
probabilities to force a fixed sample size, the resulting model is inherently
better calibrated. The probabilities it outputs reflect the true risk levels of
the original population.
+
+<a id="legal-ethical-fairness"></a>
+#### 4. Legal and Ethical Fairness
+Today, algorithmic fairness is a major regulatory focus. Biased sampling is a
primary source of "AI bias" that leads to prejudiced outcomes in lending,
hiring, or healthcare.
+
+* **Predictive Parity:** Strict PPS allows for the construction of fair
Bayes-optimal classifiers that satisfy "predictive parity," ensuring that the
model's error rates and accuracy are consistent across different protected
groups (e.g., race, gender).
+* **Liability Mitigation:** Using sampling methods that guarantee statistical integrity (like EB-PPS) helps developers demonstrate that their models were trained on a representative, unbiased sample of the data, reducing legal liability for "erroneous training data".
+
+
*(Note: much of this overview was generated by Gemini AI; it may contain errors.)*
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]