On Mon, 21 Jun 2021 23:50:29 -0400
Ying Zhou <yzhou7...@gmail.com> wrote:
> Hi,
> 
> In data people use there are often bounded numbers, mostly integers with 
> clear and fixed upper and lower bounds but also decimals and floats as well 
> e.g. test scores, numerous codes in older databases, max temperature of a 
> city, latitudes, longitudes, numerous IDs etc. I wonder whether we should 
> include such types in Arrow (and more importantly in Parquet & Avro where 
> size matters a lot more).

You are expressing two separate concerns here:
1. expressing the semantics (and perhaps enforcing them, e.g. returning
   an error when an addition gives a result out of bounds)
2. improving performance / resource usage

I would reject concern #2.  In Arrow, we probably don't want to
standardize integers with a non-power-of-two bit width.  In Parquet,
integer compression already takes advantage of the actual magnitude of
the values (using e.g. DELTA_BINARY_PACKED:
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5).
Declaring the expected magnitude up front would probably not bring any
further gains.
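
To illustrate why, here is a rough Python sketch of the idea behind
delta bit-packing.  It is deliberately simplified (a single block, no
miniblocks or header, and packed_bit_width is a made-up helper name),
so it is not the actual Parquet implementation.  It only shows that the
bit width comes from the values themselves, not from the declared type:

```python
def packed_bit_width(values):
    """Bits per delta needed to pack one block of integers (simplified)."""
    # Encode differences between consecutive values, not the values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return 0
    # Store each delta relative to the smallest delta in the block,
    # so that every stored number is non-negative.
    min_delta = min(deltas)
    return max((d - min_delta).bit_length() for d in deltas)


# Hypothetical test scores, all within 0..100.
scores = [87, 92, 78, 100, 65, 88, 91, 73]
print(packed_bit_width(scores))
```

The width is computed per block from the data, so a column of small
values stored as int32 or int64 is packed narrowly without the type
having to declare any bounds.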

As for concern #1, I have no strong opinion.  Perhaps that could be
expressed as custom metadata, or perhaps as a dedicated
parametric BoundInteger datatype.
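
For what it's worth, the "enforcing" side of concern #1 could look
roughly like the following Python sketch.  The BoundedInt class and its
fields are purely hypothetical names for illustration; nothing like
this exists in Arrow today:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedInt:
    """Hypothetical integer that must stay within [lower, upper]."""
    value: int
    lower: int
    upper: int

    def __post_init__(self):
        # Reject any value outside the declared bounds.
        if not (self.lower <= self.value <= self.upper):
            raise ValueError(
                f"{self.value} is outside [{self.lower}, {self.upper}]")

    def __add__(self, other: int) -> "BoundedInt":
        # The result is re-validated on construction, so an
        # out-of-bounds sum raises instead of being stored silently.
        return BoundedInt(self.value + other, self.lower, self.upper)


score = BoundedInt(95, lower=0, upper=100)
print((score + 3).value)
```

With this sketch, score + 10 raises ValueError rather than producing
105.  The custom metadata approach would instead just record the
bounds alongside an ordinary integer type and leave the checking to
consumers.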

Regards

Antoine.

