This is an automated email from the ASF dual-hosted git repository. alsay pushed a commit to branch tuple_readme in repository https://gitbox.apache.org/repos/asf/datasketches-bigquery.git
commit f30334c612f246351c3a4610749545673dba56d6
Author: AlexanderSaydakov <[email protected]>
AuthorDate: Tue Feb 11 19:03:24 2025 -0800

    tuple readme
---
 tuple/README.md          | 19 +++++++++++++++----
 tuple/README_template.md | 19 +++++++++++++++----
 2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/tuple/README.md b/tuple/README.md
index 66f0125..32f9a44 100644
--- a/tuple/README.md
+++ b/tuple/README.md
@@ -19,10 +19,21 @@

 # Apache DataSketches Tuple Sketches for Google BigQuery

-Tuple sketches extend the functionality of Theta sketches by
-allowing you to associate a summary value with each item in the set. This
-enables calculations like the sum, minimum, or maximum of values associated with
-the distinct items.
+Tuple sketches extend the functionality of Theta sketches by adding a Summary object associated
+with each distinct key retained by the sketch. When the identifier of an input pair (identifier, value) matches a unique
+key of the sketch, the associated Summary of that key can be modified based on user-defined policy.
+The set of all Summary values collected by the sketch represents a uniform random sample over the unique identifiers
+subset of all identifiers. This enables the use of common statistical computations of the Summary values, which can be extrapolated to the entire
+set of unique identifiers.
+
+The underlying C++ library supports Summary objects of any type (including complex types) and arbitrary policies
+of updating Summaries during the sketch building process, and combining these Summaries during union and intersection set operations.
+
+The current set of functions for BigQuery implements Summary objects as INT64 (unsigned in C++) with SUM, MIN, MAX, ONE (constant 1) policies (modes).
+This enables calculations like the sum, average, minimum, or maximum of the Summary values associated with the distinct keys.
+
+This implementation can serve as an example of how to implement Tuple sketch with a Summary type and policy of your choice.
+We are open to suggestions on what Summary types and policies to consider for inclusion here.

 Please visit [Tuple Sketches](https://datasketches.apache.org/docs/Tuple/TupleSketches.html)
diff --git a/tuple/README_template.md b/tuple/README_template.md
index 5efa530..a3abc98 100644
--- a/tuple/README_template.md
+++ b/tuple/README_template.md
@@ -19,10 +19,21 @@

 # Apache DataSketches Tuple Sketches for Google BigQuery

-Tuple sketches extend the functionality of Theta sketches by
-allowing you to associate a summary value with each item in the set. This
-enables calculations like the sum, minimum, or maximum of values associated with
-the distinct items.
+Tuple sketches extend the functionality of Theta sketches by adding a Summary object associated
+with each distinct key retained by the sketch. When the identifier of an input pair (identifier, value) matches a unique
+key of the sketch, the associated Summary of that key can be modified based on user-defined policy.
+The set of all Summary values collected by the sketch represents a uniform random sample over the unique identifiers
+subset of all identifiers. This enables the use of common statistical computations of the Summary values, which can be extrapolated to the entire
+set of unique identifiers.
+
+The underlying C++ library supports Summary objects of any type (including complex types) and arbitrary policies
+of updating Summaries during the sketch building process, and combining these Summaries during union and intersection set operations.
+
+The current set of functions for BigQuery implements Summary objects as INT64 (unsigned in C++) with SUM, MIN, MAX, ONE (constant 1) policies (modes).
+This enables calculations like the sum, average, minimum, or maximum of the Summary values associated with the distinct keys.
+
+This implementation can serve as an example of how to implement Tuple sketch with a Summary type and policy of your choice.
+We are open to suggestions on what Summary types and policies to consider for inclusion here.

 Please visit [Tuple Sketches](https://datasketches.apache.org/docs/Tuple/TupleSketches.html)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
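
The Summary update behaviour described in the README text of this commit can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the DataSketches C++ library or the BigQuery functions: the names `ToyTupleSketch`, `POLICIES`, and `hash_fraction` are hypothetical, the sampling threshold `theta` is fixed by the caller (a real sketch adapts it to a size limit), and keys are stored as raw identifiers rather than hashes.

```python
import hashlib

# Hypothetical policy table mirroring the SUM, MIN, MAX, ONE modes named in the README.
# Each policy combines the stored Summary with an incoming value for a matching key.
POLICIES = {
    "SUM": lambda old, new: old + new,
    "MIN": min,
    "MAX": max,
    "ONE": lambda old, new: 1,
}


def hash_fraction(identifier):
    """Map an identifier to a deterministic pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(str(identifier).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyTupleSketch:
    """Toy model: keeps one integer Summary per retained distinct identifier."""

    def __init__(self, mode="SUM", theta=1.0):
        self.mode = mode
        self.combine = POLICIES[mode]
        self.theta = theta      # assumed fixed sampling threshold in (0, 1]
        self.summaries = {}     # identifier -> Summary value

    def update(self, identifier, value):
        # Retain only identifiers whose hash falls below theta (uniform sample of keys).
        if hash_fraction(identifier) >= self.theta:
            return
        if identifier in self.summaries:
            # Matching key: modify its Summary according to the policy.
            self.summaries[identifier] = self.combine(self.summaries[identifier], value)
        else:
            self.summaries[identifier] = 1 if self.mode == "ONE" else value

    def estimate_distinct(self):
        # Extrapolate the retained-key count to the full set of unique identifiers.
        return len(self.summaries) / self.theta

    def estimate_summary_sum(self):
        # Extrapolate the sum of retained Summaries in the same way.
        return sum(self.summaries.values()) / self.theta


sketch = ToyTupleSketch(mode="SUM")
for identifier, value in [("a", 1), ("b", 2), ("a", 3)]:
    sketch.update(identifier, value)
print(sketch.summaries)               # {'a': 4, 'b': 2}
print(sketch.estimate_distinct())     # 2.0
print(sketch.estimate_summary_sum())  # 6.0
```

With `theta=1.0` every identifier is retained, so the results are exact. A smaller `theta` keeps only a sample of the keys, and dividing by `theta` scales the sample statistics back up to the whole set of unique identifiers.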
