Csaba,
These are some very thoughtful suggestions and I can see that some
recommendations in this area would be useful.

The focus of our DataSketches team is really on the sketching algorithms
and on designing the core sketches to be very high performing, robust,
accurate, and easy to integrate (e.g., few, if any, dependencies).

We developed the Hive, Pig, Druid and PostgreSQL adaptors as examples of
how the core sketches can be adapted for use in large systems.  However, we
do not consider ourselves to be experts in these systems and want to
encourage the respective communities to contribute improvements to these
adaptors as needed to improve their usefulness.

The suggestions you make here are important, and fall into the category of
improving the Hive adaptor: making it more useful by stating explicitly
how the various Hive types could be (should be?) converted into
one of the Java types that the underlying sketches accept. (By the way,
the underlying HLL sketch accepts *int, long, double, String, byte[],
char[], int[], long[]*.) The more complex Hive types (DATE, TIMESTAMP,
DECIMAL) can be easily converted to one of the HLL (or Theta) input types.
BOOLEAN and SMALLINT have so few possible values that I fail to see the
usefulness of devoting a sketch to counting unique booleans (only 2) or
even SMALLINT (only 65K).
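
As a minimal sketch of what such conversions might look like, the Java
below maps DATE, TIMESTAMP and DECIMAL values onto input types from the
list above (long, long[], byte[]). The class HiveTypeCanonicalizer and its
methods are hypothetical and are not part of DataSketches or the Hive
adaptor. The chosen encodings (days since the Unix epoch, a seconds/nanos
pair in UTC, and the unscaled two's-complement bytes at a fixed scale) are
assumptions for illustration, not an agreed standard.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDate;
import java.util.Arrays;

// Hypothetical helper: NOT part of DataSketches or the Hive adaptor.
// Each method returns a type that the HLL sketch is said to accept.
public class HiveTypeCanonicalizer {

    // DATE -> days since 1970-01-01 in the ISO (proleptic Gregorian) calendar.
    static long dateToLong(LocalDate date) {
        return date.toEpochDay();
    }

    // TIMESTAMP -> {seconds since epoch, nanosecond-of-second}.
    // Assumption: the timestamp has already been normalized to UTC.
    // Two longs are used because nanoseconds since the epoch do not fit
    // in a single long for the full timestamp range.
    static long[] timestampToLongs(Instant ts) {
        return new long[] { ts.getEpochSecond(), ts.getNano() };
    }

    // DECIMAL(precision, scale) -> big-endian two's-complement bytes of the
    // unscaled value at the declared scale, so 1.5 and 1.50 encode alike.
    // setScale(int) throws ArithmeticException if rounding would be needed.
    static byte[] decimalToBytes(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(dateToLong(LocalDate.of(2020, 7, 1)));
        System.out.println(Arrays.toString(
            timestampToLongs(Instant.parse("2020-07-01T06:49:00.5Z"))));
        System.out.println(Arrays.equals(
            decimalToBytes(new BigDecimal("1.5"), 2),
            decimalToBytes(new BigDecimal("1.50"), 2)));
    }
}
```

Whatever encoding is chosen, every system that feeds the same sketch would
need to produce exactly the same long, long[] or byte[] for the same
logical value, since the sketch hashes the input as given.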

Since you are faced with this standardization issue across systems, you are
in a much better position to recommend what the solution should be. Whether
it is just a documentation issue or actually adding methods for these types
is up to you. We would look forward to a PR from you proposing how to
proceed. If you are unsure, I suggest you post this issue to the Hive,
Impala and DataSketches forums and get some feedback from the respective
communities.

Cheers,

Lee.


On Wed, Jul 1, 2020 at 6:49 AM Csaba Ringhofer <csringho...@cloudera.com>
wrote:

> Hi!
>
> This came up while trying to ensure HLL sketch interoperability between
> Apache Hive and Apache Impala.
>
> Currently in Hive the following types are not supported by ds_hll_sketch():
>   - BOOLEAN
>   - SMALLINT
>   - DECIMAL
>   - TIMESTAMP
>   - DATE
>
> These types vary in complexity and usefulness, e.g. BOOLEAN and SMALLINT
> seem straightforward, while DECIMAL, DATE and TIMESTAMP are often
> represented in several different ways, so choosing which byte sequence to
> hash is not self-evident. It is likely that different projects will do this
> differently, as hashing the native representation is the easiest and
> fastest.
>
> Did these questions already come up in other projects, e.g. how to hash a
> DATE type in an HLL sketch?
>
> If it is a goal to support these in an interoperable way (e.g. a sketch
> created by Hive can be used for estimation by Impala), then it would be
> useful to come up with some recommendations on what exactly to hash.
> Some examples to highlight the possible problems:
> DATE:
>  - int32 days since unix epoch (proleptic gregorian)
>  - string in YYYYMMDD format
> TIMESTAMP (nanosecond precision):
>  - int128 nanoseconds since unix epoch (UTC, proleptic gregorian)
>  - string in YYYYMMDD HHmmss.sssssssss format
> DECIMAL(precision, scale):
>  - minimum number of bytes needed to represent range (two's complement)
>  - minimum power of 2 bytes needed to represent range  (two's complement)
>
> Regards,
> Csaba
>
>
