Hi!

This came up while trying to ensure HLL sketch interoperability between
Apache Hive and Apache Impala.

Currently in Hive the following types are not supported by ds_hll_sketch():
  - BOOLEAN
  - SMALLINT
  - DECIMAL
  - TIMESTAMP
  - DATE

These types vary in complexity and usefulness. BOOLEAN and SMALLINT seem
straightforward, while DECIMAL, DATE and TIMESTAMP are often represented
in several different ways, so choosing which byte sequence to hash is not
self-evident. Different projects are likely to choose differently, since
hashing each project's own native representation is the easiest and
fastest option.

Have these questions already come up in other projects, e.g. how to hash a
DATE type in an HLL sketch?

If it is a goal to support these in an interoperable way (e.g. a sketch
created by Hive can be used for estimation by Impala), then it would be
useful to come up with some recommendations on what exactly to hash.
Some examples to highlight the possible problems:
DATE:
 - int32 days since unix epoch (proleptic gregorian)
 - string in YYYYMMDD format
TIMESTAMP (nanosecond precision):
 - int128 nanoseconds since unix epoch (UTC, proleptic gregorian)
 - string in YYYYMMDD HHmmss.sssssssss format
DECIMAL(precision, scale):
 - minimum number of bytes needed to represent range (two's complement)
 - minimum power of 2 bytes needed to represent range (two's complement)
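
To make the problem concrete, here is a rough Python sketch of how the
candidate encodings above produce different byte sequences for the same
logical value. It is only an illustration, not what Hive or Impala actually
do: the function names are made up, little-endian byte order is an
assumption, and the hash step itself is left out, since different input
bytes already imply different hashes.

```python
import datetime
import struct

EPOCH = datetime.date(1970, 1, 1)


def date_as_int32_days(d):
    # Days since the Unix epoch as a 4-byte signed integer.
    # Python's date arithmetic is proleptic Gregorian.
    # Little-endian byte order is an assumption.
    return struct.pack("<i", (d - EPOCH).days)


def date_as_string(d):
    # ASCII string in YYYYMMDD format.
    return d.strftime("%Y%m%d").encode("ascii")


def min_bytes_for_precision(precision):
    # The largest unscaled magnitude for DECIMAL(precision, scale) is
    # 10**precision - 1; one extra bit is needed for the sign.
    bits = (10 ** precision - 1).bit_length() + 1
    return (bits + 7) // 8


def next_power_of_two(n):
    # Smallest power of two that is >= n.
    p = 1
    while p < n:
        p *= 2
    return p


def decimal_as_bytes(unscaled, precision, pad_to_power_of_two):
    # Two's complement encoding of the unscaled value, with the width
    # derived from the declared precision (not from the value itself).
    width = min_bytes_for_precision(precision)
    if pad_to_power_of_two:
        width = next_power_of_two(width)
    return unscaled.to_bytes(width, "little", signed=True)


d = datetime.date(2020, 1, 1)
print(date_as_int32_days(d).hex())   # 4 bytes
print(date_as_string(d))             # 8 bytes

# 123.45 stored as DECIMAL(10, 2) has the unscaled value 12345.
print(decimal_as_bytes(12345, 10, False).hex())  # minimum width
print(decimal_as_bytes(12345, 10, True).hex())   # power-of-2 width
```

The two DATE encodings differ in both length and content, and the two
DECIMAL encodings differ only in padding, but either difference is enough
for two engines to hash the same value to different sketch entries.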

Regards,
Csaba
