Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2716447763 Oh, you mean the case when the union has a higher lgk. Of course, if the information is lost (sketches are in the estimation mode) then the bounds for the lower lgk

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
nikunjbhartia commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2716443879 I am just not sure how sketches with lower precision (larger error bounds), when merged with higher-precision ones, guarantee lower error bounds. The information i

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2716427774 I am not sure I understand your question. The documented bounds are valid for both sketch and union. -- This is an automated message from the Apache Git Serv

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
nikunjbhartia commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2716415942 Would the documented error bounds https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html still be valid in these cases?

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2716284977 Consider this extreme example: all sketches happened to be in exact mode (did not see enough distinct values to saturate). So the space/accuracy trade-off will

Re: Question about Datasketches HLL as replacement of Clearspring in Apache Cassandra

2025-03-11 Thread Jon Malkin
Just in general, unless explicit care was taken to make things mergeable (bit-level compatibility between hash functions, consistent handling of string inputs, etc.), the assumption is that different implementations of the same algorithm will not be mergeable. Within our library we designed things t

Re: Question about Datasketches HLL as replacement of Clearspring in Apache Cassandra

2025-03-11 Thread Lee Rhodes
Hello Štefan, We did a major study and comparison of the DataSketches HLL sketch to the Clearspring implementation of the HLL++ sketch back in 2017 and found that the Clearspring sketch had serious error problems, did not implement th

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2715812324 When we create a fresh union object we need to know lgk. Incoming compact theta sketches don't have any notion of lgk in them. This can be different for other

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
nikunjbhartia commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2715789631 I see what you mean. Just wondering why we need lg_k while merging in the first place? Shouldn't we be unconditionally downgrading the precision to the lowe

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2715758325 To clarify my previous comment. We tried partial signatures, but it did not work. Suppose we want to have theta_sketch_agg_union_lgk(sketch, lgk) delegati

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2715735878 Also I think we had some problem with passing a subset of parameters from functions with fewer parameters to full-signature functions. BQ said something like t

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
nikunjbhartia commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2715730680 I agree, having default params and function overloading would have made lives much easier :) Regarding the other comment: as an end user, I wouldn't kno

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
jmalkin commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2715721939 Or BQ can add default parameter values when things are unspecified? :D

Re: [I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2715715076 We could add a function, but we tried to avoid combinatorial explosion of them. One can pass NULL for the default parameters. So for lgk=14 I would suggest pas

[I] Consider adding theta_sketch_agg_int64_lgk() [datasketches-bigquery]

2025-03-11 Thread via GitHub
nikunjbhartia opened a new issue, #144: URL: https://github.com/apache/datasketches-bigquery/issues/144 Currently there are 2 methods to create theta sketches for int64: - theta_sketch_agg_int64(value INT64) - theta_sketch_agg_int64_lgk_seed_p(value INT64, params STRUCT NOT AGGREG

Re: [PR] better make targets [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov merged PR #143: URL: https://github.com/apache/datasketches-bigquery/pull/143

Re: [PR] better make targets [datasketches-bigquery]

2025-03-11 Thread via GitHub
AlexanderSaydakov commented on code in PR #143: URL: https://github.com/apache/datasketches-bigquery/pull/143#discussion_r1989857880 ## README.md: ## @@ -108,21 +108,24 @@ Run the following steps in this repo's root directory to install everything: ```bash gcloud auth appl

Re: [PR] better make targets [datasketches-bigquery]

2025-03-11 Thread via GitHub
jmalkin commented on code in PR #143: URL: https://github.com/apache/datasketches-bigquery/pull/143#discussion_r1989821983 ## README.md: ## @@ -108,21 +108,24 @@ Run the following steps in this repo's root directory to install everything: ```bash gcloud auth application-de

Question about Datasketches HLL as replacement of Clearspring in Apache Cassandra

2025-03-11 Thread Štefan Miklošovič
Hello Datasketches community, I am from Apache Cassandra where we use Clearspring (1) for estimating the cardinalities for rows in Cassandra's SSTables. We serialize the whole HyperLogLog from (1) (more or less) to the disk and then we deserialize it back and we merge all logs together to know the