(datasketches-website) branch asf-site updated: Automatic Site Publish by Buildbot

git-site-role Mon, 04 Mar 2024 11:08:28 -0800

This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 8f1921c3 Automatic Site Publish by Buildbot
8f1921c3 is described below

commit 8f1921c3925fa57dcc7372e8ecc864d700789e16
Author: buildbot <[email protected]>
AuthorDate: Mon Mar 4 19:08:20 2024 +0000

    Automatic Site Publish by Buildbot
---
 output/docs/Architecture/LargeScale.html |  2 +-
 output/docs/Theta/ThetaUpdateSpeed.html  | 28 ++++++++++++++--------------
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/output/docs/Architecture/LargeScale.html b/output/docs/Architecture/LargeScale.html
index 97a8be7c..a332b798 100644
--- a/output/docs/Architecture/LargeScale.html
+++ b/output/docs/Architecture/LargeScale.html
@@ -514,7 +514,7 @@
 -->
 <h2 id="designed-for-large-scale-computing-systems">Designed for Large-scale Computing Systems</h2>
 
-<h4 id="multiple-languages">Multiple Languages</h4>
+<h3 id="multiple-languages">Multiple Languages</h3>
 
 <ul>
   <li>The DataSketches library is now available in three languages, Java, C++, and Python. A fourth language, Go, is in development.</li>
diff --git a/output/docs/Theta/ThetaUpdateSpeed.html b/output/docs/Theta/ThetaUpdateSpeed.html
index 7463116b..efcda1fe 100644
--- a/output/docs/Theta/ThetaUpdateSpeed.html
+++ b/output/docs/Theta/ThetaUpdateSpeed.html
@@ -515,25 +515,25 @@
 <h2 id="theta-family-update-speed">Theta Family Update Speed</h2>
 
 <h3 id="resize-factor--x1">Resize Factor = X1</h3>
-<p>The following graph illustrates the update speed of 3 different sketches from the library: the Heap QuickSelect Sketch, the Off-Heap QuickSelect Sketch, and the Heap Alpha Sketch.
-The X-axis is the number of unique values presented to a sketch. The Y-axis is the average time to perform an update.  It is computed as the total time to update X-uniques divided by X-uniques.</p>
+<p>The following graph illustrates the update speed of 3 different sketches from the library: the Heap QuickSelect (QS) Sketch, the Off-Heap QuickSelect Sketch, and the Heap Alpha Sketch.
+The X-axis is the number of unique values presented to a sketch. The Y-axis is the average time to perform an update.  It is computed as the total time to update X-uniques, divided by X-uniques.</p>
 
-<p>The high values on the left are due to Java overhead and JVM warmup.  The humps in the middle of the graph are due to the internal hash table filling up and forcing an internal rebuild and reducing theta.  For this plot the sketches were configured with <i>k</i> = 4096.
-The sawtooth peaks on the QS plots represent successive reqbuilds.  The downward slope on the right side of the hump is the sketch speeding up because it is rejecting more and more incoming hash values due to the continued reduction in the value of theta.
-The Alpha sketch (in red) uses a more advanced hash table update algorithm that defers the first rebuild until after theta has started decreasing.  This is the little spike just to the right of the hump.
+<p>The high values on the left are due to Java overhead and JVM warmup.  The spikes starting at about 4K uniques are due to the internal hashtable filling up and forcing an internal hashtable rebuild, which also reduces theta.  For this plot the sketches were configured with <i>k</i> = 4096.
+The sawtooth peaks on the QuickSelect curves represent successive rebuilds.  The downward slope on the right side of the largest spike is the sketch speeding up because it is rejecting more and more incoming hash values due to the continued reduction in the value of theta.
+The Alpha sketch (in red) uses a more advanced hashtable update algorithm that defers the first rebuild until after theta has started decreasing.  This is the little spike just to the right of the local maximum (at about 16K) of the curve.
 As the number of uniques continue to increase the update speed of the sketch becomes asymptotic to the speed of the hash function itself, which is about 6 nanoseconds.</p>
 
 <p><img class="doc-img-full" src="/docs/img/theta/UpdateSpeed.png" alt="UpdateSpeed" /></p>
 
 <ul>
-  <li>The Heap Alpha Sketch (red) is the fastest sketch and primarily focused on real-time streaming environments and operates only on the Java heap.
-In this test setup and performing an “average” over all the test points from 8 to 8 million uniques the Alpha sketch update rate averages about 100 million updates per second.</li>
-  <li>The Heap QuickSelect sketch (blue) is next, also on-heap, averages about 81 million updates per second.</li>
-  <li>The Off-Heap QuickSelect sketch (green) runs off-heap in direct, native memory and averages about 63 million updates per second.</li>
+  <li>The Heap Alpha Sketch (red) is the fastest sketch of the Theta family and primarily focused on real-time streaming environments and operates only on the Java heap.
+Performing an “average” over all the test points from about 8 to 8 million uniques, the Alpha sketch update rate averages about 100 million updates per second.</li>
+  <li>The Heap QuickSelect sketch (blue) is next fastest.  It averages about 81 million updates per second.</li>
+  <li>The Off-Heap QuickSelect sketch (green) runs in direct, native memory and averages about 63 million updates per second.</li>
   <li>The notations in the second line of the title are abbreviations as follows:
     <ul>
-      <li>LgK = 12 : The sketch was configured with K = 2^12 or 4096 bins.</li>
-      <li>LgT = 12,3 : The test harness was configured to start with 2^23 trials on the left and logarithmically decrease the trials as the number of uniques increase down to 2^4 trials on the right.</li>
+      <li>LgK = 12 : The sketch was configured with K = 2^12 or 4096 nominal hash values.</li>
+      <li>LgT = 23,4 : The test harness was configured to start with 2^23 trials on the left and logarithmically decrease the trials as the number of uniques increase down to 2^4 trials on the right.</li>
       <li>RF = X1 : Resize Factor = 1. The sketch was configured to start at maximum possible size in memory. This means there is no need for the sketch to request (allocate) more memory for the life of the sketch. This is the overall fastest configuration at the expense of allocating the maximum memory upfront. Other RF values are discussed below.  This has no impact on space required when serializing the sketch in compact mode.</li>
     </ul>
   </li>
@@ -546,8 +546,8 @@ In this test setup and performing an “average” over all the test points from
 
 <ul>
   <li>The blue curve is the same as the blue curve in the graph at the top of this page.
-It was generated with <i>ResizeFactor = X1</i>, which means the sketch cache was initially created at full size, thus there is no resizing of the cache during the life of the sketch.  (There will be <i>rebuilding</i> but not resizing.)</li>
-  <li>The red curve is the same sketch but configured with <i>ResizeFactor = X2</i>, which means the sketch cache is initially created at a small size (enough for about 16 entries), and then when the sketch needs more space for the cache, it is resized by a factor of 2. Each of these resize events are clearly seen in the plot as sawtooth spikes in the speed performance where the sketch must pause, allocate a larger cache and then resize and reconstruct the hash table.  These spikes fall  [...]
+It was generated with <i>ResizeFactor = X1</i>, which means the sketch cache was initially created at full size, thus there is no resizing of the cache during the life of the sketch.  (There will be <i>rebuilding</i> but no resizing.)</li>
+  <li>The red curve is the same sketch but configured with <i>ResizeFactor = X2</i>, which means the sketch cache is initially created at a small size (enough for about 16 entries), and then when the sketch needs more space for the cache, it is resized by a factor of 2. Each of these resize events are clearly seen in the curve as sawtooth spikes where the sketch must pause, allocate a larger cache and reconstruct the hashtable.  These spikes fall at factors of 2 along the X-axis (with th [...]
   <li>The green curve is the same sketch but configured with <i>ResizeFactor = X8</i>.  An RF = X4 is also available but not shown.</li>
 </ul>
 
@@ -556,7 +556,7 @@ It was generated with <i>ResizeFactor = X1</i>, which means the sketch cache was
 <p>The tradeoff here is the classic memory size versus speed.  Suppose you have millions of sketches that need to be allocated and your input data is highly skewed (as is often the case).  Most of the sketches will only have a few entries and only a small fraction of all the sketches will actually go into estimation mode and require a full-sized cache.  The Resize Factor option allows a memory allocation that would be orders of magnitude smaller than would be required if all the sketches [...]
 
 <h3 id="how-these-graphs-were-generated">How these graphs were generated</h3>
-<p>The goal of these measurements was to measure the limits of how fast these sketches could update data from a continuous data stream not limited by system overhead, string or array processing. In order to remove random noise from the plots, each point on the graph represents an average of many trials.  For the low end of the graph the number of trials per point is 2^23 or 8M trials per point. At the high end of 8 million uniques per trial the number of trials per point is 2^4 or 16.</p>
+<p>The goal of these measurements was to measure the limits of how fast these sketches could update data from a continuous data stream not limited by system overhead, string or array processing. In order to remove random noise from the plots, each point on the graph represents an average of many trials.  For the low end of the graph the number of trials per point is 2^23 or 8M trials per point. At the high end, at 8 million uniques per trial, the number of trials per point is 2^4 or 16.</p>
 
 <p>It needs to be pointed out that these tests were designed to measure the maximum update speed under ideal conditions so “your mileage may vary”!
 Very few systems would actually be able to feed a single sketch at this rate so these plots represent an upper bound of performance, and not as realistic update rates in more complex systems environments. Nonetheless, this demonstrates that the sketches would consume very little of an overall system’s budget for updating, if there was one, and are quite suitable for real-time streams.</p>


