If you need bounded types in an application that is built on Arrow and
Parquet, you can certainly implement an Arrow extension type (on top
of FixedSizeBinary, for example).

On Tue, Jun 22, 2021 at 5:27 AM Antoine Pitrou <anto...@python.org> wrote:
>
> On Mon, 21 Jun 2021 23:50:29 -0400
> Ying Zhou <yzhou7...@gmail.com> wrote:
> > Hi,
> >
> > In the data people use, there are often bounded numbers: mostly integers with
> > clear, fixed upper and lower bounds, but also decimals and floats, e.g.
> > test scores, numerous codes in older databases, the maximum temperature of a
> > city, latitudes, longitudes, numerous IDs, etc. I wonder whether we should
> > include such types in Arrow (and, more importantly, in Parquet & Avro, where
> > size matters a lot more).
>
> You are expressing two separate concerns here:
> 1. expressing the semantics (and perhaps enforcing them, e.g. return an
>    error when an addition gives a result out of bounds)
> 2. improving performance / resource usage
>
> I would reject concern #2.  In Arrow, we probably don't want to
> standardize integers with a non-power-of-two bit width.  In Parquet,
> integer compression already takes advantage of actual magnitude (using
> e.g. DELTA_BINARY_PACKED:
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5).
> Additional information about the expected magnitude would probably not
> bring any additional gains.
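
To illustrate the point about magnitude, here is a simplified sketch of how
the packed bit width falls out of the data rather than the declared type. It
treats the whole input as one block, whereas the real DELTA_BINARY_PACKED
encoding works on blocks and miniblocks with a bit width per miniblock; the
function name delta_bit_width is made up for this example.

```python
def delta_bit_width(values):
    """Bits per packed delta, treating the input as a single block.

    Simplification: the real encoding splits the data into blocks and
    miniblocks and stores a separate bit width for each miniblock.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return 0
    min_delta = min(deltas)
    # Deltas are stored relative to the minimum delta, so they are non-negative.
    return max(d - min_delta for d in deltas).bit_length()


scores = [87, 92, 75, 100, 64, 81]
print(delta_bit_width(scores))  # 6
```

The result is the same whether the column is declared as int8 or int64, which
is why a declared bound adds little here.
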
>
> As for concern #1, I have no strong opinion.  Perhaps that could be
> expressed as custom metadata, or perhaps as a dedicated
> parametric BoundInteger datatype.
>
> Regards
>
> Antoine.
>
>
