Re: [DISCUSS] Enhancements to Tuple Sketch.

David Cromberge Thu, 28 May 2020 07:46:24 -0700

As a follow-up to the proposal below, it may not be necessary to provide a 
combiner and default set of values together, since there may be some redundancy 
here.
It’s probably better for me to re-iterate the intention - to provide a 
meaningful way to interpret the presence of a key in a theta with some suitable 
default value.


> On 28 May 2020, at 15:39, David Cromberge <david.crombe...@permutive.com> 
> wrote:
> 
> Lee, the array of bytes implementation is a better approach than adding 
> off-heap implementations for the integer, double and strings Tuple sketches.
> 
> I originally overlooked the ArrayOfDoubles sketch for the purposes of 
> tracking engagement, since it implies that many values could be associated 
> with a hashed key, which doesn’t quite fit the use case.  
> 
> Having said that, I have now looked through the implementation and have 
> switched to the array of doubles sketch instead - after all, you pointed out 
> that it should suffice. 
> 
> I have run some initial benchmarks on sizing, and in compacted form I did not 
> get a reduction in the size of a compacted sketch generated from a real data 
> set, despite the benefits of using primitive doubles.  I realise that this is 
> dependent on the test case and tuning / configuration parameters, so we could 
> add a TODO item to add a characterisation test for this, if there is not one 
> already.
> 
> We don’t seem to have discussed how the Theta sketches may be included for 
> intersections and unions regarding an AOD sketch.   For unions, the behaviour 
> is delegated to a merge operation on the sketch, which ultimately adds values 
> together for the same key.  Concerning intersection, a combiner 
> implementation is used to determine how values should be combined.  It is 
> noteworthy that in the druid extension, the values are summed together, with 
> a comment noting that this may not apply to all circumstances.
> 
> I would propose a similar mechanism for both union and intersection on the 
> other sketches, where a default array of values can be provided for a tuple 
> sketch:
> 
>       public void update(final org.apache.datasketches.theta.Sketch sketchIn, 
> double[] defaultValues, ArrayOfDoublesCombiner c) {...}
> 
> Of course, a factory could be provided that creates a combiner according to a 
> summary mode.  The use case for this suggestion is to have context specific 
> behaviour with regard to merging / combining values in the case of unions and 
> intersections, which could be sourced from the user or user query.
> 
> I would be interested to hear your thoughts on the 
> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches,
> David
> 
> 
>> On 28 May 2020, at 04:32, leerho <lee...@gmail.com 
>> <mailto:lee...@gmail.com>> wrote:
>> 
>> David,  In fact, double values do just fine with integer data.  They are 
>> twice as big at the primitive level, however, the dedicated ArrayOfDoubles 
>> implementation in total might actually be smaller, since it is not carrying 
>> all the object overhead that are required to do generics. Plus, it will be a 
>> whole lot faster! It is already fully implemented with all the set 
>> operations, off-heap memory, serialization /deserialization and a full test 
>> suite.  
>> 
>> Designing a similar dedicated ArrayOfIntegers would be a lot of work and 
>> wouldn't be my top priority for the next dedicated Tuple sketch to build.  
>> What would be more flexible would actually be a dedicated ArrayOfBytes 
>> implementation, because bytes are the foundation from which we can derive 
>> almost any summary we want.  
>> 
>> Think about it.
>> 
>> Lee.
>> 
>> 
>> 
>> On Wed, May 27, 2020 at 5:54 PM leerho <lee...@gmail.com 
>> <mailto:lee...@gmail.com>> wrote:
>> David,  This is good feedback.  I didn't realize you wanted off-heap 
>> operation.  That changes a lot of things.  I have to go right now, but let 
>> me think about this.  Attempting to leverage generics off-heap is messy.  
>> For your case it may be better to either leverage the ArrayOfDoubles 
>> implementation, which already operates off-heap, or think about emulating 
>> the AOD and creating a dedicated AOIntegers.  
>> 
>> Lee.
>> 
>> On Wed, May 27, 2020 at 3:04 PM David Cromberge 
>> <david.crombe...@permutive.com <mailto:david.crombe...@permutive.com>> wrote:
>> Hi Lee,
>> 
>> Thanks for providing such a detailed plan of action for the Tuple sketch 
>> package.
>> The enhancements that you have listed are interesting, and I will certainly 
>> check out your branch to get a clearer understanding of how the library is 
>> evolving.
>> 
>> For what it’s worth, here is a record of my attempt to integrate Theta 
>> sketches into the Tuple sketch set operations:
>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>>  
>> <https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5>
>> 
>> Although I have a cursory understanding of the library’s internals, I 
>> included the commit above because there were some interesting tradeoffs to 
>> the implementation, and it gave me a better appreciation for the the 
>> internal workings of the existing Tuple sketch as well as some of the finer 
>> points in your improvement work.   To a lesser degree, it also serves to 
>> independently confirm your argument for adding new variants of the update 
>> methods!
>> 
>> During implementation, I was also faced with the decision as to whether to 
>> duplicate the methods or to convert a Theta sketch to a Tuple sketch first 
>> and delegate to the existing methods.  But, as you noted, this requires an 
>> additional iteration through the result set and incurs a performance 
>> penalty.  Therefore, I also duplicated the existing update methods, with 
>> some changes for result extraction.  To ensure correctness, I found it 
>> necessary to duplicate a large portion of the existing test cases as well - 
>> replicating so many of the existing tests was not ideal, but helped verify 
>> that the implementation was correct.
>> It’s also worth mentioning that I had some difficulty implementing the AnotB 
>> functionality, and in fact the results were incorrect when the sketch 
>> crossed into estimation mode (see ignored tests).  I’m pleased to have 
>> attempted the exercise because it will give much better context as I study 
>> your branch further - especially the AnotB refactoring.
>> 
>> There is one addition that I would like to suggest to your list of TODO 
>> items - namely, off-heap implementations.  I am considering using the 
>> integer tuple sketch for our engagement use case and would prefer to prevent 
>> memory pressure by loading many sketches onto the heap.  I have noticed this 
>> come up in the past on #datasketches slack channel in a conversation with 
>> some Druid team members.  It appears that the off-heap implementations were 
>> omitted from the library due to time constraints, and this is area where I 
>> could also potentially provide default implementations for the other tuple 
>> sketches.  I think this is important to consider, because the existing 
>> ArrayOfDoubles implementation uses an abstract class for the parent.  Making 
>> the other Tuple sketches abstract in a similar manner is potentially a 
>> breaking change as well.
>> 
>> I am excited to collaborate on this together on this feature, and I would be 
>> happy to contribute in any possible way and coordinate through the project 
>> TODO page <https://github.com/apache/incubator-datasketches-java/projects/1>!
>> 
>> David
>> 
>> 
>> 
>>> On 27 May 2020, at 20:04, leerho <lee...@gmail.com 
>>> <mailto:lee...@gmail.com>> wrote:
>>> 
>>> David,
>>> 
>>> Thanks.  I have been putting a lot of thought into it as well and decided 
>>> that it was time to make some other long-needed changes in the Tuple Family 
>>> of sketches as well including the package layout, which has been quite 
>>> cumbersome.  I would suggest holding back on you actual implementation work 
>>> until you understand what I have changed so far and then we can strategize 
>>> on how to finish the work.   I have checked in my changes so far into the 
>>> "Tuple_Theta_Extension" branch, which you can check out to see what I have 
>>> been up to :)
>>> 
>>> The family of tuple sketches have evolved over time and somewhat 
>>> haphazardly.  So the first think I decided to do was to do some rearranging 
>>> of the package structure to make future downstream improvements and 
>>> extensions easier.  
>>> 
>>> 1.  The first problem is that the root tuple directory was cluttered with 
>>> two different groups of classes that made it difficult for anyone to figure 
>>> out what is going on.  One group of classes form the base generic classes 
>>> of the tuple sketch on which the concrete extensions "adouble" (a single 
>>> double), "aninteger" (a single integer), and "strings" (array of strings) 
>>> depend.  These three concrete extensions are already in their own sub 
>>> directories.  
>>> 
>>> The second, largest group of classes were a dedicated non-generic 
>>> implementation of the tuple sketch, which implemented an array of doubles.  
>>> All of these classes had "ArrayOfDoubles" in their name.  These classes 
>>> shared no code with the root generic tuple classes except for a few methods 
>>> in the SerializerDeserializer and the Util classes.  By making a few 
>>> methods public, I was able to move all of the "ArrayOfDoubles" classes into 
>>> their own subdirectory.  This creates an incompatible API break, which will 
>>> force us to move to a 2.0.0 for the next version.   Now the tuple root 
>>> directory is much cleaner and easier to navigate and understand.  There are 
>>> several reasons for this separate dedicated implementation. First, we felt 
>>> that a configurable array of doubles would be a relatively common use case. 
>>>  Second, we wanted a full concrete example of the tuple sketch as an 
>>> example of what it would look like including both on-heap and off-heap 
>>> variants.   It is this ArrayOfDoubles implementation that has been 
>>> integrated into Druid, for example. 
>>> 
>>> 2. Now that the package directories are cleaned up I was able to focus on 
>>> what it would mean to allow Tuple sketches to perform set operations with 
>>> Theta sketches.  
>>> 
>>> One approach would be to just provide a converter to take in a Theta sketch 
>>> and produce a Tuple sketch with some default or configured summary and 
>>> leave everything else the way it is.  But this is less efficient as it 
>>> requires more object creation and copying than a direct integration would.  
>>> It turns out that modifying the generic Union and Intersection classes only 
>>> required adding one method to each.  I did some minor code cleanup and code 
>>> documentation at the same time.  
>>> 
>>> The AnotB operator is another story.  We have never been really happy with 
>>> how this was implemented the first time.  The current API is clumsy.  So I 
>>> have taken the opportunity to redesign the API for this class.  It still 
>>> has the current API methods but deprecated.  With the new modified class 
>>> the user has several ways of performing AnotB.
>>> 
>>> As stateless operations:
>>> With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>>> With Theta: resultSk = aNotB(skTupleA, skThetaB);
>>> As stateful, sequential operations:
>>> void setA(skTupleA);
>>> void notB(skTupleB);   or   void notB(skThetaB);   //These are 
>>> interchangable.
>>> ...
>>> void notB(skTupleB);   or   void notB(skThetaB);   //These are 
>>> interchangable.
>>> resultSk = getResult(reset = false);  // This allows getting an 
>>> intermediate result
>>> void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
>>> resultSK = getResult(reset = true); //This returns the result and clears 
>>> the internal state to empty.
>>> This I think is pretty slick and flexible.  
>>> 
>>> Work yet to be done on main:
>>> Reexamine the Union and Intersection APIs to add the option of an 
>>> intermediate result.
>>> Update the other concrete extensions to take advantage of the above new 
>>> API: "aninteger", "strings".
>>> Examine the dedicated "ArrayOfDoubes" implementation to see how hard it 
>>> would be to make the same changes as above.  Implement. Test.
>>> Work yet to be done on test:
>>> 
>>> I did major redesign of the testing class for the AnotB generic class using 
>>> the "adouble" concrete extension.  You can see this in 
>>> AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the 
>>> base AnotB classes via the concrete extension.   
>>> With the deep testing using the "adouble" done, we still need to design new 
>>> tests for the "aninteger" and "strings" extensions.  These can be shallow 
>>> tests.
>>> If we decide to do the same API extensions on the ArrayOfDoubles classes, 
>>> those will need to be tested.
>>> Work to be done on documentation:
>>> The website documentation is still rather thin on the whole Tuple family.  
>>> Having someone that is a real user of these classes contribute to the 
>>> documentation to make it more understandable would be outstanding!
>>> Work to be done on characterization.
>>> The Tuple family has some characterization, but it is sparse and a lot more 
>>> would work here would give users a sense of the performance they could 
>>> expect.  We have also found that characterization is a powerful way to find 
>>> statistical bugs that don't show up in unit tests.   I could guide you 
>>> through how to set up the various "test harnesses", which is really pretty 
>>> simple, but the real thinking goes into the design of the test and 
>>> understanding the output.  This is a great way to really understand how 
>>> these sketches behave and why.  
>>> Work to be done on code reviews:
>>> Having independent set of eyes going over the code would also be a huge 
>>> contribution.  
>>> Once you have had a chance to study this we should talk about how you want 
>>> to contribute.  Clearly a lot of what I have done so far required deep 
>>> understanding of the Tuple and Theta classes and was was much more 
>>> efficient for me to do.  It would have been a hard slog for anyone new to 
>>> the library to undertake.
>>> 
>>> Once we decide on a strategy, we should put kanban cards in the project 
>>> TODO page 
>>> <https://github.com/apache/incubator-datasketches-java/projects/1>.
>>> 
>>> Please let me know what you think!
>>> 
>>> Lee.
>>> 
>>>    
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge 
>>> <david.crombe...@permutive.com <mailto:david.crombe...@permutive.com>> 
>>> wrote:
>>> Thank you Lee for your proposal regarding my use case and Tuple sketches.
>>> 
>>> I have spent some time considering the proposal, and I have started 
>>> implementing a potential solution.
>>> 
>>> At what stage of the pipeline should characterisation tests be proposed, 
>>> since they would obviously depend on a new SNAPSHOT version of the core 
>>> library being available? 
>>> 
>>> I would be grateful for any input about the characterisation workflow.
>>> 
>>> Thank you,
>>> David
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@datasketches.apache.org 
>>> <mailto:dev-unsubscr...@datasketches.apache.org>
>>> For additional commands, e-mail: dev-h...@datasketches.apache.org 
>>> <mailto:dev-h...@datasketches.apache.org>
>>> 
>> 
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to