Re: [DISCUSS] Enhancements to Tuple Sketch.

David Cromberge Wed, 27 May 2020 15:05:09 -0700

Hi Lee,

Thanks for providing such a detailed plan of action for the Tuple sketch 
package.
The enhancements that you have listed are interesting, and I will certainly 
check out your branch to get a clearer understanding of how the library is 
evolving.


For what it’s worth, here is a record of my attempt to integrate Theta sketches 
into the Tuple sketch set operations:
https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
 
<https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5>

Although I have a cursory understanding of the library’s internals, I included 
the commit above because there were some interesting tradeoffs to the 
implementation, and it gave me a better appreciation for the the internal 
workings of the existing Tuple sketch as well as some of the finer points in 
your improvement work.   To a lesser degree, it also serves to independently 
confirm your argument for adding new variants of the update methods!

During implementation, I was also faced with the decision as to whether to 
duplicate the methods or to convert a Theta sketch to a Tuple sketch first and 
delegate to the existing methods.  But, as you noted, this requires an 
additional iteration through the result set and incurs a performance penalty.  
Therefore, I also duplicated the existing update methods, with some changes for 
result extraction.  To ensure correctness, I found it necessary to duplicate a 
large portion of the existing test cases as well - replicating so many of the 
existing tests was not ideal, but helped verify that the implementation was 
correct.
It’s also worth mentioning that I had some difficulty implementing the AnotB 
functionality, and in fact the results were incorrect when the sketch crossed 
into estimation mode (see ignored tests).  I’m pleased to have attempted the 
exercise because it will give much better context as I study your branch 
further - especially the AnotB refactoring.

There is one addition that I would like to suggest to your list of TODO items - 
namely, off-heap implementations.  I am considering using the integer tuple 
sketch for our engagement use case and would prefer to prevent memory pressure 
by loading many sketches onto the heap.  I have noticed this come up in the 
past on #datasketches slack channel in a conversation with some Druid team 
members.  It appears that the off-heap implementations were omitted from the 
library due to time constraints, and this is area where I could also 
potentially provide default implementations for the other tuple sketches.  I 
think this is important to consider, because the existing ArrayOfDoubles 
implementation uses an abstract class for the parent.  Making the other Tuple 
sketches abstract in a similar manner is potentially a breaking change as well.

I am excited to collaborate on this together on this feature, and I would be 
happy to contribute in any possible way and coordinate through the project TODO 
page <https://github.com/apache/incubator-datasketches-java/projects/1>!

David



> On 27 May 2020, at 20:04, leerho <[email protected]> wrote:
> 
> David,
> 
> Thanks.  I have been putting a lot of thought into it as well and decided 
> that it was time to make some other long-needed changes in the Tuple Family 
> of sketches as well including the package layout, which has been quite 
> cumbersome.  I would suggest holding back on you actual implementation work 
> until you understand what I have changed so far and then we can strategize on 
> how to finish the work.   I have checked in my changes so far into the 
> "Tuple_Theta_Extension" branch, which you can check out to see what I have 
> been up to :)
> 
> The family of tuple sketches have evolved over time and somewhat haphazardly. 
>  So the first think I decided to do was to do some rearranging of the package 
> structure to make future downstream improvements and extensions easier.  
> 
> 1.  The first problem is that the root tuple directory was cluttered with two 
> different groups of classes that made it difficult for anyone to figure out 
> what is going on.  One group of classes form the base generic classes of the 
> tuple sketch on which the concrete extensions "adouble" (a single double), 
> "aninteger" (a single integer), and "strings" (array of strings) depend.  
> These three concrete extensions are already in their own sub directories.  
> 
> The second, largest group of classes were a dedicated non-generic 
> implementation of the tuple sketch, which implemented an array of doubles.  
> All of these classes had "ArrayOfDoubles" in their name.  These classes 
> shared no code with the root generic tuple classes except for a few methods 
> in the SerializerDeserializer and the Util classes.  By making a few methods 
> public, I was able to move all of the "ArrayOfDoubles" classes into their own 
> subdirectory.  This creates an incompatible API break, which will force us to 
> move to a 2.0.0 for the next version.   Now the tuple root directory is much 
> cleaner and easier to navigate and understand.  There are several reasons for 
> this separate dedicated implementation. First, we felt that a configurable 
> array of doubles would be a relatively common use case.  Second, we wanted a 
> full concrete example of the tuple sketch as an example of what it would look 
> like including both on-heap and off-heap variants.   It is this 
> ArrayOfDoubles implementation that has been integrated into Druid, for 
> example. 
> 
> 2. Now that the package directories are cleaned up I was able to focus on 
> what it would mean to allow Tuple sketches to perform set operations with 
> Theta sketches.  
> 
> One approach would be to just provide a converter to take in a Theta sketch 
> and produce a Tuple sketch with some default or configured summary and leave 
> everything else the way it is.  But this is less efficient as it requires 
> more object creation and copying than a direct integration would.  It turns 
> out that modifying the generic Union and Intersection classes only required 
> adding one method to each.  I did some minor code cleanup and code 
> documentation at the same time.  
> 
> The AnotB operator is another story.  We have never been really happy with 
> how this was implemented the first time.  The current API is clumsy.  So I 
> have taken the opportunity to redesign the API for this class.  It still has 
> the current API methods but deprecated.  With the new modified class the user 
> has several ways of performing AnotB.
> 
> As stateless operations:
> With Tuple: resultSk = aNotB(skTupleA, skTupleB);
> With Theta: resultSk = aNotB(skTupleA, skThetaB);
> As stateful, sequential operations:
> void setA(skTupleA);
> void notB(skTupleB);   or   void notB(skThetaB);   //These are interchangable.
> ...
> void notB(skTupleB);   or   void notB(skThetaB);   //These are interchangable.
> resultSk = getResult(reset = false);  // This allows getting an intermediate 
> result
> void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
> resultSK = getResult(reset = true); //This returns the result and clears the 
> internal state to empty.
> This I think is pretty slick and flexible.  
> 
> Work yet to be done on main:
> Reexamine the Union and Intersection APIs to add the option of an 
> intermediate result.
> Update the other concrete extensions to take advantage of the above new API: 
> "aninteger", "strings".
> Examine the dedicated "ArrayOfDoubes" implementation to see how hard it would 
> be to make the same changes as above.  Implement. Test.
> Work yet to be done on test:
> 
> I did major redesign of the testing class for the AnotB generic class using 
> the "adouble" concrete extension.  You can see this in AdoubleAnotBTest.java. 
>  This is essentially a deep exhaustive test of the base AnotB classes via the 
> concrete extension.   
> With the deep testing using the "adouble" done, we still need to design new 
> tests for the "aninteger" and "strings" extensions.  These can be shallow 
> tests.
> If we decide to do the same API extensions on the ArrayOfDoubles classes, 
> those will need to be tested.
> Work to be done on documentation:
> The website documentation is still rather thin on the whole Tuple family.  
> Having someone that is a real user of these classes contribute to the 
> documentation to make it more understandable would be outstanding!
> Work to be done on characterization.
> The Tuple family has some characterization, but it is sparse and a lot more 
> would work here would give users a sense of the performance they could 
> expect.  We have also found that characterization is a powerful way to find 
> statistical bugs that don't show up in unit tests.   I could guide you 
> through how to set up the various "test harnesses", which is really pretty 
> simple, but the real thinking goes into the design of the test and 
> understanding the output.  This is a great way to really understand how these 
> sketches behave and why.  
> Work to be done on code reviews:
> Having independent set of eyes going over the code would also be a huge 
> contribution.  
> Once you have had a chance to study this we should talk about how you want to 
> contribute.  Clearly a lot of what I have done so far required deep 
> understanding of the Tuple and Theta classes and was was much more efficient 
> for me to do.  It would have been a hard slog for anyone new to the library 
> to undertake.
> 
> Once we decide on a strategy, we should put kanban cards in the project TODO 
> page <https://github.com/apache/incubator-datasketches-java/projects/1>.
> 
> Please let me know what you think!
> 
> Lee.
> 
>    
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Wed, May 27, 2020 at 7:53 AM David Cromberge 
> <[email protected] <mailto:[email protected]>> wrote:
> Thank you Lee for your proposal regarding my use case and Tuple sketches.
> 
> I have spent some time considering the proposal, and I have started 
> implementing a potential solution.
> 
> At what stage of the pipeline should characterisation tests be proposed, 
> since they would obviously depend on a new SNAPSHOT version of the core 
> library being available? 
> 
> I would be grateful for any input about the characterisation workflow.
> 
> Thank you,
> David
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected] 
> <mailto:[email protected]>
> For additional commands, e-mail: [email protected] 
> <mailto:[email protected]>
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to