On Mon, 21 Jun 2021 23:50:29 -0400
Ying Zhou <yzhou7...@gmail.com> wrote:
> Hi,
>
> In data people use there are often bounded numbers, mostly integers with
> clear and fixed upper and lower bounds but also decimals and floats as well
> e.g. test scores, numerous codes in older databases, max temperature of a
> city, latitudes, longitudes, numerous IDs etc. I wonder whether we should
> include such types in Arrow (and more importantly in Parquet & Avro where
> size matters a lot more).

You are expressing two separate concerns here:

1. expressing the semantics (and perhaps enforcing them, e.g. return an
   error when an addition gives a result out of bounds)
2. improving performance / resource usage

I would reject concern #2. In Arrow, we probably don't want to
standardize integers with a non-power of two bitwidth. In Parquet,
integer compression already takes advantage of actual magnitude (using
e.g. DELTA_BINARY_PACKED:
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5).
Additional information about the expected magnitude would probably not
bring any additional gains.

As for concern #1, I have no strong opinion. Perhaps that could be
expressed as custom metadata, or perhaps as a dedicated parametric
BoundInteger datatype.

Regards

Antoine.
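
P.S. To illustrate why declared bounds would probably not help on the Parquet side, here is a rough sketch in plain Python. It is not the actual DELTA_BINARY_PACKED implementation: it ignores blocks, miniblocks and headers, and only estimates the per-value bit width after taking deltas and subtracting the minimum delta. The function names `packed_bit_width` and `bounded_bit_width` are made up for this example.

```python
def packed_bit_width(values):
    """Estimate bits per value after delta encoding (simplified sketch).

    Takes deltas between consecutive values, subtracts the minimum delta
    so every remainder is non-negative, and returns the width of the
    largest remainder. Real encoders do this per block of values.
    """
    if len(values) < 2:
        return 0
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    return max((d - min_delta).bit_length() for d in deltas)


def bounded_bit_width(lower, upper):
    """Bits per value if we only knew the declared bounds [lower, upper]."""
    return (upper - lower).bit_length()


# Test scores declared to lie in [0, 100] (made-up sample data).
scores = [87, 92, 78, 95, 88, 91, 60, 100]
print(packed_bit_width(scores), bounded_bit_width(0, 100))

# Nearly sequential IDs declared to lie in [0, 2_000_000] (made-up sample data).
ids = [1000000, 1000001, 1000003, 1000004]
print(packed_bit_width(ids), bounded_bit_width(0, 2_000_000))
```

On the sample scores, both estimates come out at 7 bits, so the declared bound buys nothing. On the nearly sequential IDs, the data-driven estimate is far smaller than the bound-based one, because the width is derived from the values actually stored.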