As a follow-up to the proposal below, it may not be necessary to provide a combiner and default set of values together, since there may be some redundancy here. It’s probably better for me to re-iterate the intention - to provide a meaningful way to interpret the presence of a key in a theta with some suitable default value.
> On 28 May 2020, at 15:39, David Cromberge <david.crombe...@permutive.com> > wrote: > > Lee, the array of bytes implementation is a better approach than adding > off-heap implementations for the integer, double and strings Tuple sketches. > > I originally overlooked the ArrayOfDoubles sketch for the purposes of > tracking engagement, since it implies that many values could be associated > with a hashed key, which doesn’t quite fit the use case. > > Having said that, I have now looked through the implementation and have > switched to the array of doubles sketch instead - after all, you pointed out > that it should suffice. > > I have run some initial benchmarks on sizing, and in compacted form I did not > get a reduction in the size of a compacted sketch generated from a real data > set, despite the benefits of using primitive doubles. I realise that this is > dependent on the test case and tuning / configuration parameters, so we could > add a TODO item to add a characterisation test for this, if there is not one > already. > > We don’t seem to have discussed how the Theta sketches may be included for > intersections and unions regarding an AOD sketch. For unions, the behaviour > is delegated to a merge operation on the sketch, which ultimately adds values > together for the same key. Concerning intersection, a combiner > implementation is used to determine how values should be combined. It is > noteworthy that in the druid extension, the values are summed together, with > a comment noting that this may not apply to all circumstances. > > I would propose a similar mechanism for both union and intersection on the > other sketches, where a default array of values can be provided for a tuple > sketch: > > public void update(final org.apache.datasketches.theta.Sketch sketchIn, > double[] defaultValues, ArrayOfDoublesCombiner c) {...} > > Of course, a factory could be provided that creates a combiner according to a > summary mode. The use case for this suggestion is to have context specific > behaviour with regard to merging / combining values in the case of unions and > intersections, which could be sourced from the user or user query. > > I would be interested to hear your thoughts on the > ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches, > David > > >> On 28 May 2020, at 04:32, leerho <lee...@gmail.com >> <mailto:lee...@gmail.com>> wrote: >> >> David, In fact, double values do just fine with integer data. They are >> twice as big at the primitive level, however, the dedicated ArrayOfDoubles >> implementation in total might actually be smaller, since it is not carrying >> all the object overhead that are required to do generics. Plus, it will be a >> whole lot faster! It is already fully implemented with all the set >> operations, off-heap memory, serialization /deserialization and a full test >> suite. >> >> Designing a similar dedicated ArrayOfIntegers would be a lot of work and >> wouldn't be my top priority for the next dedicated Tuple sketch to build. >> What would be more flexible would actually be a dedicated ArrayOfBytes >> implementation, because bytes are the foundation from which we can derive >> almost any summary we want. >> >> Think about it. >> >> Lee. >> >> >> >> On Wed, May 27, 2020 at 5:54 PM leerho <lee...@gmail.com >> <mailto:lee...@gmail.com>> wrote: >> David, This is good feedback. I didn't realize you wanted off-heap >> operation. That changes a lot of things. I have to go right now, but let >> me think about this. Attempting to leverage generics off-heap is messy. >> For your case it may be better to either leverage the ArrayOfDoubles >> implementation, which already operates off-heap, or think about emulating >> the AOD and creating a dedicated AOIntegers. >> >> Lee. >> >> On Wed, May 27, 2020 at 3:04 PM David Cromberge >> <david.crombe...@permutive.com <mailto:david.crombe...@permutive.com>> wrote: >> Hi Lee, >> >> Thanks for providing such a detailed plan of action for the Tuple sketch >> package. >> The enhancements that you have listed are interesting, and I will certainly >> check out your branch to get a clearer understanding of how the library is >> evolving. >> >> For what it’s worth, here is a record of my attempt to integrate Theta >> sketches into the Tuple sketch set operations: >> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5 >> >> <https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5> >> >> Although I have a cursory understanding of the library’s internals, I >> included the commit above because there were some interesting tradeoffs to >> the implementation, and it gave me a better appreciation for the the >> internal workings of the existing Tuple sketch as well as some of the finer >> points in your improvement work. To a lesser degree, it also serves to >> independently confirm your argument for adding new variants of the update >> methods! >> >> During implementation, I was also faced with the decision as to whether to >> duplicate the methods or to convert a Theta sketch to a Tuple sketch first >> and delegate to the existing methods. But, as you noted, this requires an >> additional iteration through the result set and incurs a performance >> penalty. Therefore, I also duplicated the existing update methods, with >> some changes for result extraction. To ensure correctness, I found it >> necessary to duplicate a large portion of the existing test cases as well - >> replicating so many of the existing tests was not ideal, but helped verify >> that the implementation was correct. >> It’s also worth mentioning that I had some difficulty implementing the AnotB >> functionality, and in fact the results were incorrect when the sketch >> crossed into estimation mode (see ignored tests). I’m pleased to have >> attempted the exercise because it will give much better context as I study >> your branch further - especially the AnotB refactoring. >> >> There is one addition that I would like to suggest to your list of TODO >> items - namely, off-heap implementations. I am considering using the >> integer tuple sketch for our engagement use case and would prefer to prevent >> memory pressure by loading many sketches onto the heap. I have noticed this >> come up in the past on #datasketches slack channel in a conversation with >> some Druid team members. It appears that the off-heap implementations were >> omitted from the library due to time constraints, and this is area where I >> could also potentially provide default implementations for the other tuple >> sketches. I think this is important to consider, because the existing >> ArrayOfDoubles implementation uses an abstract class for the parent. Making >> the other Tuple sketches abstract in a similar manner is potentially a >> breaking change as well. >> >> I am excited to collaborate on this together on this feature, and I would be >> happy to contribute in any possible way and coordinate through the project >> TODO page <https://github.com/apache/incubator-datasketches-java/projects/1>! >> >> David >> >> >> >>> On 27 May 2020, at 20:04, leerho <lee...@gmail.com >>> <mailto:lee...@gmail.com>> wrote: >>> >>> David, >>> >>> Thanks. I have been putting a lot of thought into it as well and decided >>> that it was time to make some other long-needed changes in the Tuple Family >>> of sketches as well including the package layout, which has been quite >>> cumbersome. I would suggest holding back on you actual implementation work >>> until you understand what I have changed so far and then we can strategize >>> on how to finish the work. I have checked in my changes so far into the >>> "Tuple_Theta_Extension" branch, which you can check out to see what I have >>> been up to :) >>> >>> The family of tuple sketches have evolved over time and somewhat >>> haphazardly. So the first think I decided to do was to do some rearranging >>> of the package structure to make future downstream improvements and >>> extensions easier. >>> >>> 1. The first problem is that the root tuple directory was cluttered with >>> two different groups of classes that made it difficult for anyone to figure >>> out what is going on. One group of classes form the base generic classes >>> of the tuple sketch on which the concrete extensions "adouble" (a single >>> double), "aninteger" (a single integer), and "strings" (array of strings) >>> depend. These three concrete extensions are already in their own sub >>> directories. >>> >>> The second, largest group of classes were a dedicated non-generic >>> implementation of the tuple sketch, which implemented an array of doubles. >>> All of these classes had "ArrayOfDoubles" in their name. These classes >>> shared no code with the root generic tuple classes except for a few methods >>> in the SerializerDeserializer and the Util classes. By making a few >>> methods public, I was able to move all of the "ArrayOfDoubles" classes into >>> their own subdirectory. This creates an incompatible API break, which will >>> force us to move to a 2.0.0 for the next version. Now the tuple root >>> directory is much cleaner and easier to navigate and understand. There are >>> several reasons for this separate dedicated implementation. First, we felt >>> that a configurable array of doubles would be a relatively common use case. >>> Second, we wanted a full concrete example of the tuple sketch as an >>> example of what it would look like including both on-heap and off-heap >>> variants. It is this ArrayOfDoubles implementation that has been >>> integrated into Druid, for example. >>> >>> 2. Now that the package directories are cleaned up I was able to focus on >>> what it would mean to allow Tuple sketches to perform set operations with >>> Theta sketches. >>> >>> One approach would be to just provide a converter to take in a Theta sketch >>> and produce a Tuple sketch with some default or configured summary and >>> leave everything else the way it is. But this is less efficient as it >>> requires more object creation and copying than a direct integration would. >>> It turns out that modifying the generic Union and Intersection classes only >>> required adding one method to each. I did some minor code cleanup and code >>> documentation at the same time. >>> >>> The AnotB operator is another story. We have never been really happy with >>> how this was implemented the first time. The current API is clumsy. So I >>> have taken the opportunity to redesign the API for this class. It still >>> has the current API methods but deprecated. With the new modified class >>> the user has several ways of performing AnotB. >>> >>> As stateless operations: >>> With Tuple: resultSk = aNotB(skTupleA, skTupleB); >>> With Theta: resultSk = aNotB(skTupleA, skThetaB); >>> As stateful, sequential operations: >>> void setA(skTupleA); >>> void notB(skTupleB); or void notB(skThetaB); //These are >>> interchangable. >>> ... >>> void notB(skTupleB); or void notB(skThetaB); //These are >>> interchangable. >>> resultSk = getResult(reset = false); // This allows getting an >>> intermediate result >>> void notB(skTupleB); or void notB(skThetaB); //Continue... >>> resultSK = getResult(reset = true); //This returns the result and clears >>> the internal state to empty. >>> This I think is pretty slick and flexible. >>> >>> Work yet to be done on main: >>> Reexamine the Union and Intersection APIs to add the option of an >>> intermediate result. >>> Update the other concrete extensions to take advantage of the above new >>> API: "aninteger", "strings". >>> Examine the dedicated "ArrayOfDoubes" implementation to see how hard it >>> would be to make the same changes as above. Implement. Test. >>> Work yet to be done on test: >>> >>> I did major redesign of the testing class for the AnotB generic class using >>> the "adouble" concrete extension. You can see this in >>> AdoubleAnotBTest.java. This is essentially a deep exhaustive test of the >>> base AnotB classes via the concrete extension. >>> With the deep testing using the "adouble" done, we still need to design new >>> tests for the "aninteger" and "strings" extensions. These can be shallow >>> tests. >>> If we decide to do the same API extensions on the ArrayOfDoubles classes, >>> those will need to be tested. >>> Work to be done on documentation: >>> The website documentation is still rather thin on the whole Tuple family. >>> Having someone that is a real user of these classes contribute to the >>> documentation to make it more understandable would be outstanding! >>> Work to be done on characterization. >>> The Tuple family has some characterization, but it is sparse and a lot more >>> would work here would give users a sense of the performance they could >>> expect. We have also found that characterization is a powerful way to find >>> statistical bugs that don't show up in unit tests. I could guide you >>> through how to set up the various "test harnesses", which is really pretty >>> simple, but the real thinking goes into the design of the test and >>> understanding the output. This is a great way to really understand how >>> these sketches behave and why. >>> Work to be done on code reviews: >>> Having independent set of eyes going over the code would also be a huge >>> contribution. >>> Once you have had a chance to study this we should talk about how you want >>> to contribute. Clearly a lot of what I have done so far required deep >>> understanding of the Tuple and Theta classes and was was much more >>> efficient for me to do. It would have been a hard slog for anyone new to >>> the library to undertake. >>> >>> Once we decide on a strategy, we should put kanban cards in the project >>> TODO page >>> <https://github.com/apache/incubator-datasketches-java/projects/1>. >>> >>> Please let me know what you think! >>> >>> Lee. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Wed, May 27, 2020 at 7:53 AM David Cromberge >>> <david.crombe...@permutive.com <mailto:david.crombe...@permutive.com>> >>> wrote: >>> Thank you Lee for your proposal regarding my use case and Tuple sketches. >>> >>> I have spent some time considering the proposal, and I have started >>> implementing a potential solution. >>> >>> At what stage of the pipeline should characterisation tests be proposed, >>> since they would obviously depend on a new SNAPSHOT version of the core >>> library being available? >>> >>> I would be grateful for any input about the characterisation workflow. >>> >>> Thank you, >>> David >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@datasketches.apache.org >>> <mailto:dev-unsubscr...@datasketches.apache.org> >>> For additional commands, e-mail: dev-h...@datasketches.apache.org >>> <mailto:dev-h...@datasketches.apache.org> >>> >> >