The request for this enhancement was brought to our attention thanks to David Cromberge, who I hope will join us here. You can find the early exploratory discussion on our Slack #datasketches channel.
There are real use cases where users have created a large history of Theta Sketches and want to leverage both the ability of Tuple Sketches to associate user-defined attributes with the hash keys and the inherent ability of the Theta and Tuple families to perform set operations. Currently, however, it is not possible to mix Theta Sketches and Tuple Sketches in set operations.

*Background*

The Tuple Sketch is an "extension" of the Theta Sketch theoretically, but not programmatically. The underlying algorithms for updating with new items, merging, and getting estimates and error bounds are identical between the two sketches, and their behaviour is identical with respect to unique-count estimates and set operations. However, they share hardly any code. The reason is that the Theta Sketch has been highly optimized for performance, and rewriting it so that the Tuple Sketch could be a programmatic extension would either impact the performance of the Theta Sketch or greatly increase the complexity of the Theta Sketch code, which is very heavily used. It would be nice if we could, but we haven't had the time to figure out a way to do that without extensive rewriting of both sketches.

So currently there is no mixing of Theta Sketch objects with Tuple Sketch objects, and there is no mechanism for easily converting Theta Sketches to Tuple Sketches or vice versa.

*Proposal*

Add a new update method to the Tuple Union operator with a signature something like this:

    public void update(final org.apache.datasketches.theta.Sketch sketchIn,
        final S summary) { ... }

The first parameter is the Theta sketch. The Summary, S, can be dynamic, or the user can supply SummaryFactory.newSummary() if the default summary is desired. The SummarySetOperations, which specifies how the merge is to be performed, is already specified when the Union is created.

Similar new methods would be added to the Intersection operator.
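To make the intended semantics concrete, here is a toy model of what such a union update might do. It is only a sketch under stated assumptions, not the DataSketches code: ToyTupleUnion, updateTuple, updateTheta, retained and summaryOf are hypothetical names, a plain HashMap stands in for the sketch's hash table, a BinaryOperator stands in for SummarySetOperations, and summaries are assumed to be immutable so that one instance can be shared across keys.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Toy model only: all names are hypothetical and this is NOT the library implementation.
final class ToyTupleUnion<S> {
  private final Map<Long, S> entries = new HashMap<>(); // retained hash -> summary
  private final BinaryOperator<S> summaryMerge;         // stand-in for SummarySetOperations
  private long thetaLong = Long.MAX_VALUE;              // threshold; MAX_VALUE keeps every hash

  ToyTupleUnion(final BinaryOperator<S> summaryMerge) {
    this.summaryMerge = summaryMerge;
  }

  // Tuple-style input: a hash that already carries its own summary.
  void updateTuple(final long hash, final S summary) {
    if (hash < thetaLong) {
      entries.merge(hash, summary, summaryMerge);
    }
  }

  // Proposed-style input: Theta hashes carry no summaries, so the caller
  // supplies one summary that is attached to every incoming hash.
  void updateTheta(final long[] thetaHashes, final long inputTheta, final S summary) {
    thetaLong = Math.min(thetaLong, inputTheta);      // a union keeps the smallest theta seen
    for (final long hash : thetaHashes) {
      if (hash < thetaLong) {
        entries.merge(hash, summary, summaryMerge);   // existing key: merge; new key: insert
      }
    }
    entries.keySet().removeIf(h -> h >= thetaLong);   // drop entries above the new theta
  }

  int retained() {
    return entries.size();
  }

  S summaryOf(final long hash) {
    return entries.get(hash); // null if the hash is not retained
  }
}
```

In this model, a hash present on both sides has its existing summary merged with the supplied one, while a hash seen only in the Theta input simply receives the supplied summary. A real implementation would presumably copy a mutable summary per key rather than share one instance.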
However, the AnotB operator doesn't interact with the summaries at all, since Tuple not Theta -> Tuple, and Theta not Tuple -> Theta.

Given these additions, it is not clear that we need a separate way to convert a filled Theta Sketch plus a default Summary into a Tuple Sketch.

Derived classes with defined summary types will need to extend these new methods if they wish to leverage the new capabilities.

This adds only three methods to the base Tuple classes. However:

- We will want to update the classes that extend these base classes as well.
- The bulk of the work, I think, will be going through the System Adaptors (Hive, Pig, etc.) to see if we want to add this functionality to those systems.
- On top of this, we will need to add unit tests and a few characterization tests.

Because of other restructuring of classes within the Tuple packages and subpackages, which affects current APIs, a major version bump to 2.0.0 would be justified.

Comments please.

Lee.