Csaba,

These are some very thoughtful suggestions, and I can see that some recommendations in this area would be useful.

Our focus in our DataSketches team is really on the sketching algorithms and designing the core sketches to be very high performing, robust, accurate, and easy to integrate (e.g., few, if any, dependencies). We developed the Hive, Pig, Druid and PostgreSQL adaptors as examples of how the core sketches can be adapted for use in large systems. However, we do not consider ourselves to be experts in these systems and want to encourage the respective communities to contribute improvements to these adaptors as needed to improve their usefulness.

The suggestions you make here are important, and fall into the category of improving the Hive adaptor: making it more useful by offering explicit options for how various Hive types could be (should be?) converted into one of the Java types that the underlying sketches accept. (By the way, the underlying HLL sketch accepts *int, long, double, String, byte[], char[], int[], long[]*.)

The more complex Hive types, DATE, TIMESTAMP and DECIMAL, can be easily converted to one of the HLL (or Theta) input types. BOOLEAN and SMALLINT have such small entropy that I fail to see the usefulness of devoting a sketch to counting unique booleans (only 2) or even SMALLINT values (only 65K).

Since you are faced with this standardization issue across systems, you are in a much better position to recommend what the solution should be. Whether it is just a documentation issue or actually adding methods for these types is up to you. We would look forward to a PR from you as to how to proceed. If you are unsure, I suggest you post this issue to the Hive, Impala and DataSketches forums and get some feedback from the respective communities.

Cheers,

Lee.

On Wed, Jul 1, 2020 at 6:49 AM Csaba Ringhofer <csringho...@cloudera.com> wrote:

> Hi!
>
> This came up while trying to ensure HLL sketch interoperability between
> Apache Hive and Apache Impala.
>
> Currently in Hive the following types are not supported by ds_hll_sketch():
> - BOOLEAN
> - SMALLINT
> - DECIMAL
> - TIMESTAMP
> - DATE
>
> These types vary in complexity and usefulness, e.g. BOOLEAN and SMALLINT
> seem straightforward, while DECIMAL, DATE and TIMESTAMP are often
> represented in several different ways, so choosing which byte sequence to
> hash is not self-evident. It is likely that different projects will do this
> differently, as hashing the native representation is the easiest and
> fastest.
>
> Did these questions already come up in other projects, e.g. how to hash a
> DATE type in a HLL sketch?
>
> If it is a goal to support these in an interoperable way (e.g. a sketch
> created by Hive can be used for estimation by Impala), then it would be
> useful to come up with some recommendations on what exactly to hash.
> Some examples to highlight the possible problems:
> DATE:
> - int32 days since unix epoch (proleptic gregorian)
> - string in YYYYMMDD format
> TIMESTAMP (nanosecond precision):
> - int128 nanoseconds since unix epoch (UTC, proleptic gregorian)
> - string in YYYYMMDD HHmmss.sssssssss format
> DECIMAL(precision, scale):
> - minimum number of bytes needed to represent range (two's complement)
> - minimum power of 2 bytes needed to represent range (two's complement)
>
> Regards,
> Csaba
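
To make the DATE, TIMESTAMP and DECIMAL options listed above concrete, here is a minimal Java sketch of one possible set of conversions into types the sketches accept (long and byte[]). It is an illustration under stated assumptions, not what the Hive adaptor does today and not an agreed standard: the class and method names (HiveTypeCanonicalizer, dateToLong, timestampToBytes, decimalToBytes) are hypothetical, and the byte arrays are minimal for the individual value (as produced by BigInteger.toByteArray()), not padded to the width implied by the declared precision.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.Instant;
import java.time.LocalDate;
import java.util.Arrays;

// Hypothetical helper; none of these names exist in Hive or DataSketches.
class HiveTypeCanonicalizer {

    // DATE -> days since the Unix epoch (proleptic Gregorian), widened to long.
    static long dateToLong(LocalDate date) {
        return date.toEpochDay();
    }

    // TIMESTAMP -> nanoseconds since the Unix epoch (UTC) as big-endian
    // two's complement bytes. BigInteger is used because the full
    // nanosecond range does not fit into a 64-bit long.
    static byte[] timestampToBytes(Instant ts) {
        BigInteger nanos = BigInteger.valueOf(ts.getEpochSecond())
                .multiply(BigInteger.valueOf(1_000_000_000L))
                .add(BigInteger.valueOf(ts.getNano()));
        return nanos.toByteArray();
    }

    // DECIMAL(precision, scale) -> unscaled value at the declared scale as
    // big-endian two's complement bytes, so that 1.5 and 1.50 produce the
    // same bytes. setScale throws ArithmeticException if the value has more
    // fractional digits than the declared scale allows.
    static byte[] decimalToBytes(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().toByteArray();
    }

    public static void main(String[] args) {
        // Each result would then be passed to the sketch's update method
        // for long or byte[] input.
        System.out.println(dateToLong(LocalDate.of(2020, 7, 1)));
        System.out.println(Arrays.toString(
                timestampToBytes(Instant.parse("2020-07-01T06:49:00Z"))));
        System.out.println(Arrays.toString(
                decimalToBytes(new BigDecimal("1.5"), 2)));
    }
}
```

Whatever representation is chosen, every system feeding the same sketch has to use exactly the same one; otherwise equal values hash differently and are counted as distinct.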