To add onto Antoine's two points:

> 1. expressing the semantics (and perhaps enforcing them, e.g. return an
> error when an addition gives a result out of bounds)

There was a proposed extension type to capture Range/Interval
(https://issues.apache.org/jira/browse/ARROW-12637). I can imagine having a
kernel that takes a scalar range/interval and applies it to integers (and then
maybe tracks the metadata). There have also been some discussions on structured
metadata for data statistics, but no one has put in the effort to formalize a
proposal on this.

> 2. improving performance / resource usage

As Antoine noted, Parquet encoding already deals with this well. For Arrow, at
some point we might introduce alternative encodings that could save space
(https://github.com/apache/arrow/pull/4815 is an old proposal) and that could
be used to reduce the bit-width requirements. As noted by Antoine, I don't
expect Arrow to support non-power-of-2 integers, though. There have also been
some proposals to support lower bit-width Decimal types, which could also help
for things like temperature.

On Tue, Jun 22, 2021 at 7:02 AM Alessandro Molina
<alessan...@ursacomputing.com> wrote:

> On Tue, Jun 22, 2021 at 12:27 PM Antoine Pitrou <anto...@python.org>
> wrote:
>
> > On Mon, 21 Jun 2021 23:50:29 -0400
> > Ying Zhou <yzhou7...@gmail.com> wrote:
> > > Hi,
> > >
> > > In data people use there are often bounded numbers, mostly integers with
> > > clear and fixed upper and lower bounds but also decimals and floats as well
> > > e.g. test scores, numerous codes in older databases, max temperature of a
> > > city, latitudes, longitudes, numerous IDs etc. I wonder whether we should
> > > include such types in Arrow (and more importantly in Parquet & Avro where
> > > size matters a lot more).
> >
> > You are expressing two separate concerns here:
> > 1. expressing the semantics (and perhaps enforcing them, e.g. return an
> > error when an addition gives a result out of bounds)
> >
>
> I wonder if DictionaryArray could be a foundation for such semantics. It
> doesn't seem unreasonable to have a check that prevents you from adding
> values that are outside of the values accepted by the dictionary. Seems
> reasonable to implement most things like test scores, temperatures etc...
> Probably unreasonable for things with a bigger domain of valid values like
> coordinates and floats in general.
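
To make the dictionary idea a bit more concrete, here is a rough sketch in
plain Python. It is not Arrow's API: the class BoundedDictionary and its
methods (encode, checked_add, index_bits) are hypothetical names made up for
illustration. It only shows how a dictionary of allowed values could double as
a domain check, and how small the indices can get for a small domain.

```python
# Hypothetical sketch, NOT Arrow's API: a dictionary of allowed values that
# also acts as a domain check when encoding or adding values.

class BoundedDictionary:
    """Dictionary-encodes values and rejects anything outside the dictionary."""

    def __init__(self, allowed):
        # Deduplicate while keeping first-seen order so indices are stable.
        self.values = list(dict.fromkeys(allowed))
        self._index = {v: i for i, v in enumerate(self.values)}

    def encode(self, data):
        """Return dictionary indices, raising on out-of-domain values."""
        indices = []
        for v in data:
            if v not in self._index:
                raise ValueError(f"{v!r} is outside the allowed domain")
            indices.append(self._index[v])
        return indices

    def checked_add(self, a, b):
        """Add two values; raise if the result is not in the dictionary."""
        result = a + b
        if result not in self._index:
            raise ValueError(f"{a!r} + {b!r} = {result!r} is out of bounds")
        return result

    def index_bits(self):
        """Minimum number of bits needed to index into the dictionary."""
        return max(1, (len(self.values) - 1).bit_length())


scores = BoundedDictionary(range(0, 101))  # test scores 0..100
print(scores.encode([0, 57, 100]))         # indices into the dictionary
print(scores.index_bits())                 # 101 entries fit in 7 bits
```

With 101 allowed values the indices need 7 bits, so in practice they would be
rounded up to an 8-bit integer. For something like coordinates the dictionary
itself would be far too large, which matches the concern above.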