Using Quantile sketches for additive metrics

2022-04-30 Thread leerho
Hi Vijay, Please ignore parts of my previous email. The solution is a bit more complicated. Of the three metrics only the Adspend is truly additive. Summing the category fields makes no sense. This means you have to design the implementation of the SummarySetOperations class so that it makes in
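
To make the additivity point concrete, here is a minimal sketch of a summary merge in which only the spend is summed. It is plain Python, not the DataSketches SummarySetOperations API. The class name `AdSummary`, its field names, and the choice to carry the category over unchanged (assuming it is a per-key attribute that is identical on both sides) are illustrative assumptions, not taken from the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdSummary:
    """Hypothetical per-key summary: one additive metric, one category field."""
    adspend: float   # additive: summing across sketches is meaningful
    category: str    # not additive: summing makes no sense


def union_summaries(a: AdSummary, b: AdSummary) -> AdSummary:
    """Merge two summaries that share the same key.

    Assumption: the category is the same on both sides, so it is
    carried over from `a` rather than combined arithmetically.
    """
    return AdSummary(adspend=a.adspend + b.adspend, category=a.category)


merged = union_summaries(AdSummary(1.5, "sports"), AdSummary(2.5, "sports"))
```

Only `adspend` changes in the merge; the category field passes through untouched.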

Re: Using Quantile sketches for additive metrics

2022-04-30 Thread leerho
Vijay, Sorry about the delay in getting back to you. There is some critical information missing from your description and that is the domain of what you are sketching. I presume that it is User-IDs, otherwise it doesn't make sense. If this is the case I think the solution can be achieved in a coup

Re: Frequent Distinct Tuples Sketch

2022-01-12 Thread leerho
I'd have to think about it more. But the FDT sketch was put in the library as an example. With tuple sketches you would have to write the code that encapsulates the tuple summary cells to do what you want and then extend the summary aggregator to do the proper merge operations. So in a sense th

Re: Frequent Distinct Tuples Sketch

2022-01-12 Thread leerho
Not directly. But the FDT sketch is really pretty simple to code yourself, and is in the library as primarily an example. Nonetheless, one of the reasons that only a few of our sketches have been adapted for Druid is that Druid requires that all sketches be capable of operating off-heap. Which is

Re: Ad impression counting and unique users counting using data sketches

2021-09-16 Thread leerho
fficient, use Roaring Bitmaps (http://roaringbitmap.org/) >> >> 3) Keep only a 5% sample of the raw records (random uniform sampling) and >> then extrapolate the query results on the sample by multiplying them by 20 >> >> I would like to note that the above 2 queries are only
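
Option 3 in the quoted message can be sketched as follows for a simple additive query (counting records that match a filter). The data, the seed, and all names are made up for illustration. This shows only the sample-and-scale arithmetic and makes no claim about how well the approach works on the poster's real workload.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical raw records: (user_id, campaign_id), four campaigns.
records = [(user_id, user_id % 4) for user_id in range(100_000)]

SCALE = 20  # a 5% sample means each sampled record stands for 20 records
sample = rng.sample(records, k=len(records) // SCALE)  # uniform, without replacement

# Query: how many impressions belong to campaign 0?
true_count = sum(1 for _, campaign in records if campaign == 0)
estimate = sum(1 for _, campaign in sample if campaign == 0) * SCALE

relative_error = abs(estimate - true_count) / true_count
```

For a plain count like this the scaled estimate is unbiased, but its error depends on how many matching records land in the sample.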

Re: Ad impression counting and unique users counting using data sketches

2021-09-15 Thread leerho
Hi Karik, The problem you describe is typical for on-line advertising and similar to ones we have worked on before. Solving this problem with sketches will provide approximate results in near-real time. However, doing so even with sketches may require considerably more resources than you may be p

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-26 Thread leerho
Hi Karl, I just want to explain the reasons you cannot create an UpdateSketch directly from a CompactSketch: The CompactSketch is by definition immutable and has the smallest footprint and simplest structure. It is produced as the result of all of the set operations because the set operations e

[NOTICE] URLs to our Repositories will be changing

2020-12-18 Thread leerho
Folks, Now that we have been approved for graduation by the ASF Board, the URLs to some of our assets will be changing as we transition to a Top-Level Project (TLP). For example: - GitHub Repositories, for example: https://github.com/apache/incubator-datasketches-java will become https:/

Re: [E] Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread leerho
On Thu, Nov 19, 2020 at 9:57 AM leerho wrote: > >> Hi Justin, the site you referenced returns an error 500 (internal server >> error). It might be down, or out-of-service. You might also c

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread leerho
e KLL quantiles algorithm that's >>> implemented in the library is implicitly performing a type of downsampling >>> internally and then summarizing the sample (this is a little bit of a >>> simplification). >>> >>> Something similar is true for frequen

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-18 Thread leerho
Sorry, if you presample your data all bets are off in terms of accuracy. On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro wrote: > Hi, I am new to DataSketches. > > I know DataSketches provides an *approximate* calculation of statistics > with *mathematically proven error bounds*. > > My question
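
One way to see why presampling undermines accuracy is to look at a non-additive query such as a distinct count. The sketch below uses invented data and hypothetical names and does not use the DataSketches library. It only shows that neither the raw distinct count of a 5% sample nor that count multiplied by 20 recovers the true number of distinct users.

```python
import random

rng = random.Random(11)  # fixed seed so the run is repeatable

NUM_USERS = 10_000
EVENTS_PER_USER = 10

# Hypothetical event stream: every user appears exactly 10 times.
events = [user for user in range(NUM_USERS) for _ in range(EVENTS_PER_USER)]

SCALE = 20  # 5% uniform presample
sample = rng.sample(events, k=len(events) // SCALE)

true_distinct = len(set(events))       # 10,000 users
sampled_distinct = len(set(sample))    # can never exceed the 5,000 sampled events
scaled_up = sampled_distinct * SCALE   # naive extrapolation, far too large here
```

The unscaled count is too low and the scaled count is too high, and the size of the gap depends on how often each user repeats, which the sample alone does not reveal.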

Re: [E] Re: HLL Union and lgK config

2020-09-15 Thread leerho
ngs >> in the Kafka Streams app to char[] will be a good first step. >> >> I'll give that a try and report back. >> >> Thanks everyone for your help in finding the source of this! >> >> Kind regards, >> Marko >> >> On Fri, 14 Aug 2020

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
I have placed a [DISCUSS] thread on our d...@datasketches.apache.org list if you wish to suggest some ideas! :) On Fri, Aug 14, 2020 at 4:06 PM leerho wrote: > The other option would be to deprecate the Hive SketchState update(...) > method and create a "newUpdate(...)" method that

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
The other option would be to deprecate the Hive SketchState update(...) method and create a "newUpdate(...)" method that has strings encoded with UTF-8. And also document the reason why. Any other ideas? On Fri, Aug 14, 2020 at 4:03 PM leerho wrote: > Yep! It turns out that there is
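
The encoding concern can be illustrated without the library. In the sketch below, SHA-256 stands in for the sketch's internal hash function, and the function and variable names are hypothetical. The assumption, inferred from the surrounding messages, is that one code path feeds the sketch UTF-8 bytes while another feeds it the 16-bit characters of a char[]. Under that assumption the same string hashes to two different values and is counted as two distinct items.

```python
import hashlib


def hash_item(data: bytes) -> int:
    """Stand-in 64-bit hash; the real sketches use a different hash function."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


user_id = "user-42"

# Path A: the string is encoded as UTF-8 before hashing.
utf8_hash = hash_item(user_id.encode("utf-8"))

# Path B: the string is hashed as 16-bit code units, as a char[] would be.
utf16_hash = hash_item(user_id.encode("utf-16-le"))

# The same logical item now looks like two different items.
distinct_hashes = {utf8_hash, utf16_hash}
```

Sketches built through different paths stay mergeable only if every producer feeds the hash the same byte representation.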

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
gs > in the Kafka Streams app to char[] will be a good first step. > > I'll give that a try and report back. > > Thanks everyone for your help in finding the source of this! > > Kind regards, > Marko > > On Fri, 14 Aug 2020 at 20:58, leerho wrote: > >> Hi Marko

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
counts, to take care of local times, etc..., these should be the correct >> values with excluded days: >> Without first day: 24890 >> Without first and second day: 22989 >> >> Thanks, >> Marko >> >> >> On Fri, 14 Aug 2020 at 17:08, leerho wrote: >

Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, I notice that the first two sketches are the result of union operations, while the remaining sketches are pure streaming sketches. Could you perform Jon's request again except excluding the first two sketches? Just to cover the bases, could you explain the types of the data items that ar

Re: HLL Union and lgK config

2020-08-13 Thread leerho
Marko, We are working to understand this problem. Thank you for sending us the actual sketches. That helps us a great deal! Cheers, Lee. On Thu, Aug 13, 2020 at 3:24 PM Jon Malkin wrote: > Hi Marko, > > Could you please let us know two more things: > 1) Which is the one particular sketch th

Re: Support for "advanced" SQL types (in HLL)

2020-07-03 Thread leerho
Csaba, These are some very thoughtful suggestions and I can see that some recommendations in this area would be useful. Our focus in our DataSketches team is really on the sketching algorithms and designing the core sketches to be very high performing, robust, accurate, and easy to integrate (e.g.

Re: Regarding error bounds and confidence of apache KLL implementation

2020-06-22 Thread leerho
aydakov < > sayda...@verizonmedia.com> wrote: > >> Adding the original poster just in case he is not subscribed to the list >> >> On Mon, Jun 22, 2020 at 7:18 PM leerho wrote: >> >>> I see a typo: What I called the Omega relation is actually Omicron (b

Re: Regarding error bounds and confidence of apache KLL implementation

2020-06-22 Thread leerho
read this over at some point and double-check both of > our work :-) > > On Mon, Jun 22, 2020 at 9:14 PM leerho wrote: > >> Hello Gourav, welcome to this forum! >> >> I want to make sure you have access to and have read the code >> documentation for the K

Re: Regarding error bounds and confidence of apache KLL implementation

2020-06-22 Thread leerho
Hello Gourav, welcome to this forum! I want to make sure you have access to and have read the code documentation for the KLL sketch in addition to the papers. Although the code documentation exists for both Java and C++, it is a little easier to access the Javadocs as they are accessible from the

Re: Tuple sketches question

2020-05-20 Thread leerho
Hi David, Thank you for reaching out to us. We are always interested in learning about new users and new uses of the library, especially with Tuple sketches, which we do not hear much feedback about. Let me try to address some of your questions: The Tuple Sketch is an "extension" of the Theta

Re: Public slack invitation

2020-05-20 Thread leerho
There is something wrong with that link. Meanwhile I have added your email & name on your behalf for the #datasketches channel on the-asf.slack.com workspace. Lee. On Wed, May 20, 2020 at 2:50 AM David Cromberge < david.crombe...@permutive.com> wrote: > Hello, > > I would like to join the sla

Re: Apache Impala integration with DataSketches HLL (C++)

2020-04-27 Thread leerho
Hi Gabor, My quick question would be that taking into account that the order of the > items provided to datasketches:hll_sketch is not deterministic is it normal > behaviour that for the same dataset I get a different estimate each time I > run my query? > I'm trying to figure out if this is due t

Re: Why are so many of the classes in org.apache.datasketches.cpc final?

2020-04-25 Thread leerho
amount. > > Ron > > On Apr 24, 2020, at 3:12 PM, leerho wrote: > > Hi Ron, > > Our mission is to develop a robust sketch library *product* that can be > used in production systems in many different environments and be high > performing and binary compatible across langu

Re: Why are so many of the classes in org.apache.datasketches.cpc final?

2020-04-24 Thread leerho
Hi Ron, Our mission is to develop a robust sketch library *product* that can be used in production systems in many different environments and be high performing and binary compatible across languages and systems. - To be able to achieve this mission with our very limited resources, we have

Welcome

2020-02-11 Thread leerho
This is our new location for all types of questions about the Apache DataSketches project. Questions submitted here can be about any of the DataSketches source component repositories: - Core Sketch Implementations: - Java , -