The request for this enhancement was brought to our attention thanks to
David Cromberge, who I hope will join us here.  You can find the early
exploratory discussion on our Slack #datasketches channel.

There are real use cases where users have built up a large history of Theta
Sketches and want to leverage the ability of Tuple Sketches to associate
user-defined attributes with the hash keys, as well as the inherent ability
of the Theta and Tuple families to perform set operations.  Currently,
however, it is not possible to mix Theta Sketches and Tuple Sketches or to
perform set operations between them.

*Background*
The Tuple Sketch is an "extension" of the Theta Sketch theoretically, but
not programmatically.  The underlying sketch algorithms for updating with
new items, merging, and getting estimates and error bounds are identical
between the two sketches.  Their behaviour is identical with respect to
unique counting estimates and set operations.  However, they share hardly
any code.  The reason is that the Theta Sketch has been highly optimized
for performance, and rewriting it so that the Tuple Sketch could be a
programmatic extension would either impact the performance of the Theta
Sketch or greatly increase the complexity of the Theta Sketch code, which
is very heavily used.  It would be nice if we could, but we haven't had the
time to figure out a way to do that without extensive rewriting of both
sketches.

So currently, there is no mixing of Theta Sketch objects with Tuple Sketch
objects, and there is no mechanism for easily converting Theta Sketches to
Tuple Sketches or vice versa.

*Proposal*

Add a new update method to the Tuple Union operator with a signature
something like this:

public void update(final org.apache.datasketches.theta.Sketch sketchIn,
    final S summary) { ... }

The first parameter is the Theta Sketch. The Summary, S, can be dynamic, or
the user can supply SummaryFactory.newSummary() if the default summary is
desired. The SummarySetOperations, which specifies how the merge is to be
performed, is already specified when the Union is created.
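
To make the intended semantics concrete, here is a minimal toy model in
plain Java.  It is only a sketch under stated assumptions: ToyTupleUnion is
a hypothetical class, not part of the library; a Theta Sketch is reduced to
an array of hash keys; the theta threshold and sampling are ignored; and a
BinaryOperator stands in for SummarySetOperations.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Toy model only: none of these types are DataSketches classes.
final class ToyTupleUnion<S> {
  // hash key -> summary, as retained by the union
  private final Map<Long, S> entries = new HashMap<>();
  // Stand-in for SummarySetOperations: fixed when the union is created.
  private final BinaryOperator<S> summaryMerge;

  ToyTupleUnion(final BinaryOperator<S> summaryMerge) {
    this.summaryMerge = summaryMerge;
  }

  // Existing behaviour: merge in Tuple entries, combining summaries on key collision.
  void update(final Map<Long, S> tupleEntries) {
    tupleEntries.forEach((hash, summary) -> entries.merge(hash, summary, summaryMerge));
  }

  // Analogue of the proposed method: every Theta hash key is paired with the
  // supplied summary and then merged exactly like a Tuple entry.
  // The same summary object is shared across keys here, which is only safe
  // for immutable summaries; a real implementation would likely need a copy.
  void update(final long[] thetaHashes, final S summary) {
    for (final long hash : thetaHashes) {
      entries.merge(hash, summary, summaryMerge);
    }
  }

  S summaryFor(final long hash) { return entries.get(hash); }

  int retained() { return entries.size(); }
}
```

With Integer summaries and Integer::sum as the merge rule, updating with
the Tuple entries {1 -> 5, 2 -> 7} and then with the Theta keys {2, 3} and
summary 1 leaves three retained keys, and key 2 carries the merged summary 8.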

Similar new methods would be added to the Intersection operator.

The AnotB operator, however, doesn't interact with the summaries at all,
since Tuple not Theta -> Tuple, and Theta not Tuple -> Theta.
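
The reason no summary is needed can be seen in a toy model where a Theta
Sketch is just a set of hash keys and a Tuple Sketch is a map from hash key
to summary.  ToyAnotB and its method names are hypothetical, and the theta
threshold is again ignored; this only illustrates that the difference is
decided by keys alone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are DataSketches classes.
final class ToyAnotB {
  private ToyAnotB() {}

  // Tuple not Theta -> Tuple: surviving keys keep A's summaries untouched.
  static <S> Map<Long, S> tupleNotTheta(final Map<Long, S> a, final Set<Long> b) {
    final Map<Long, S> out = new HashMap<>(a);
    out.keySet().removeAll(b);
    return out;
  }

  // Theta not Tuple -> Theta: B's summaries are never consulted.
  static <S> Set<Long> thetaNotTuple(final Set<Long> a, final Map<Long, S> b) {
    final Set<Long> out = new HashSet<>(a);
    out.removeAll(b.keySet());
    return out;
  }
}
```

In both directions the result has the type of A, and no summary is created
or merged.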

Given these additions, it is not clear that we need a separate way to
convert a filled Theta Sketch plus a default Summary into a Tuple Sketch.

Derived classes with defined summary types will need to extend these new
methods if they wish to leverage the new capabilities.

This adds only three methods to the base Tuple classes.

However:

   - We will want to update the classes that extend these base classes as
   well.
   - The bulk of the work, I think, will be going through the System
   Adaptors (Hive, Pig, etc.) to see if we want to add this functionality to
   those systems.
   - On top of this will be added unit tests, and a few characterization
   tests.

Because of other restructuring of classes within the Tuple packages and
subpackages, which affects current APIs, a major version bump to 2.0.0
would be justified.

Comments please.

Lee.
