Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

Chao Sun Thu, 21 Apr 2022 15:29:07 -0700

Any update on this proposal? I think this will be a useful addition
too. I can potentially help on the Rust side implementation.


Chao

On Tue, Mar 8, 2022 at 1:00 PM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:
>
> Agreed.
>
> Also, I would like to revise my previous comment about the small risk.
> While prototyping this I did hit some bumps. They primarily came from two
> reasons:
>
> * I was unable to find arrow/json files in the arrow-testing generated
> files with a non-default decimal bitwidth (I think we only have the
> on-the-fly generated file in archery)
> * the FFI interface has a default decimal of 128 (`d:{precision},{scale}`)
> and implementations may not support the 256 case (e.g. Rust has no native
> i256). For these cases, this could be the first non-default decimal
> implementation.
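
A minimal sketch of the bitwidth handling described in the second bullet, assuming the decimal format string as documented for the Arrow C data interface: `d:precision,scale` means 128 bits, and an optional third field carries any other bitwidth. The names `DecimalFormat` and `parse_decimal_format` are hypothetical and do not belong to any Arrow library.

```python
from typing import NamedTuple


class DecimalFormat(NamedTuple):
    """Hypothetical container for a parsed decimal format string."""
    precision: int
    scale: int
    bit_width: int


def parse_decimal_format(fmt: str) -> DecimalFormat:
    """Parse 'd:P,S' (implicit 128 bits) or 'd:P,S,W' (explicit bitwidth W)."""
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")
    parts = fmt[2:].split(",")
    if len(parts) == 2:
        precision, scale = (int(p) for p in parts)
        bit_width = 128  # default when no bitwidth field is present
    elif len(parts) == 3:
        precision, scale, bit_width = (int(p) for p in parts)
    else:
        raise ValueError(f"malformed decimal format string: {fmt!r}")
    return DecimalFormat(precision, scale, bit_width)


print(parse_decimal_format("d:19,10"))
print(parse_decimal_format("d:12,2,64"))
```

A consumer that only supports some widths could check the returned `bit_width` and reject the rest, which is the "if they support it" case discussed in this thread.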
>
> So, maybe we follow the standard procedure?
>
> Best,
> Jorge
>
>
>
> On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > >
> > > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > > it’ll help achieve better performance on TPC-H (and maybe other
> > > benchmarks). The decimal columns need only 12 digits of precision, for
> > > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > > decimal to be faster.
> >
> >
> > We should be careful here.  If this assumes loading from Parquet or other
> > file formats currently in the library, arbitrarily changing the type to
> > load the minimum data-length possible could break users; this should
> > probably be a configuration option.  This also reminds me that I think
> > there is some technical debt with decimals and Parquet [1].
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-12022
> >
> > On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky <
> > krassovskysa...@gmail.com>
> > wrote:
> >
> > > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > > it’ll help achieve better performance on TPC-H (and maybe other
> > > benchmarks). The decimal columns need only 12 digits of precision, for
> > > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > > decimal to be faster.
> > >
> > > Sasha Krassovsky
> > >
> > > > > On March 8, 2022, at 09:01, Micah Kornfield <emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > > 
> > > >>
> > > >>
> > > >> Do we want to keep the historical "C++ and Java" requirement or
> > > >> do we want to make it a more flexible "two independent official
> > > >> implementations", which could be for example C++ and Rust, Rust and
> > > >> Java, etc.
> > > >
> > > >
> > > > I think flexibility here is a good idea, I'd like to hear other
> > opinions.
> > > >
> > > > For this particular case if there aren't volunteers to help out in
> > > another
> > > > implementation I'm willing to help with Java (I don't have bandwidth to
> > > > do both C++ and Java).
> > > >
> > > > Cheers,
> > > > -Micah
> > > >
> > > >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >>
> > > >>
> > > >> Le 07/03/2022 à 20:26, Micah Kornfield a écrit :
> > > >>>>
> > > >>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> > > >>>> from an integration perspective, as implementations already need to
> > > read
> > > >>>> the bitwidth to select the appropriate physical representation (if
> > > they
> > > >>>> support it).
> > > >>>
> > > >>> I think there are two reasons for having implementations first.
> > > > >>> 1.  Lower risk of bugs in implementation/spec.
> > > >>> 2.  A mechanism to ensure that there is some boot-strapped coverage
> > in
> > > >>> commonly used reference implementations.
> > > >>
> > > >> That sounds reasonable.
> > > >>
> > > >> Another question that came to my mind is: traditionally, we've
> > mandated
> > > >> implementations in the two reference Arrow implementations (C++ and
> > > >> Java).  However, our implementation landscape is now much richer than
> > it
> > > >> used to be (for example, there is a tremendous activity on the Rust
> > > >> side).  Do we want to keep the historical "C++ and Java" requirement
> > or
> > > >> do we want to make it a more flexible "two independent official
> > > >> implementations", which could be for example C++ and Rust, Rust and
> > > >> Java, etc.
> > > >>
> > > >> (by "independent" I mean that one should not be based on the other,
> > for
> > > >> example it should not be "C++ and Python" :-))
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >>>
> > > > >>> I agree that 1 is fairly low-risk.
> > > >>>
> > > >>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> > > >>> jorgecarlei...@gmail.com> wrote:
> > > >>>
> > > >>>> +1 adding 32 and 64 bit decimals.
> > > >>>>
> > > >>>> +0 to release it without integration tests - both IPC and the C data
> > > >>>> interface use a variable bit width to declare the appropriate size
> > for
> > > >>>> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a
> > low
> > > >> risk
> > > >>>> from an integration perspective, as implementations already need to
> > > read
> > > >>>> the bitwidth to select the appropriate physical representation (if
> > > they
> > > >>>> support it).
> > > >>>>
> > > >>>> Best,
> > > >>>> Jorge
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou <anto...@python.org>
> > wrote:
> > > >>>>
> > > >>>>>
> > > >>>>> Le 03/03/2022 à 18:05, Micah Kornfield a écrit :
> > > >>>>>> I think this makes sense to add these.  Typically when adding new
> > > >>>> types,
> > > >>>>>> we've waited  on the official vote until there are two reference
> > > >>>>>> implementations demonstrating compatibility.
> > > >>>>>
> > > >>>>> You are right, I had forgotten about that.  Though in this case, it
> > > >>>>> might be argued we are just relaxing the constraints on an existing
> > > >> type.
> > > >>>>>
> > > >>>>> What do others think?
> > > >>>>>
> > > >>>>> Regards
> > > >>>>>
> > > >>>>> Antoine.
> > > >>>>>
> > > >>>>>
> > > >>>>>>
> > > >>>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou <anto...@python.org
> > >
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> Hello,
> > > >>>>>>>
> > > >>>>>>> Currently, the Arrow format specification restricts the bitwidth
> > of
> > > >>>>>>> decimal numbers to either 128 or 256 bits.
> > > >>>>>>>
> > > >>>>>>> However, there is interest in allowing other bitwidths, at least
> > 32
> > > >>>> and
> > > >>>>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
> > > >>>>>>> datatype would allow for precisions of up to 18 digits
> > > (respectively
> > > >> 9
> > > >>>>>>> digits), which are sufficient for some applications which are
> > > mainly
> > > >>>>>>> looking for exact computations rather than sheer precision.
> > > >> Obviously,
> > > >>>>>>> smaller datatypes are cheaper to store in memory and cheaper to
> > run
> > > >>>>>>> computations on.
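
The 18-digit and 9-digit limits quoted above can be checked with a short sketch, assuming the unscaled decimal value is stored as a signed two's-complement integer of the given width. The helper names `max_decimal_precision` and `smallest_bit_width` are hypothetical, chosen only for illustration.

```python
def max_decimal_precision(bit_width: int) -> int:
    """Largest p such that every p-digit integer fits in a signed bit_width-bit value."""
    max_value = 2 ** (bit_width - 1) - 1
    precision = 0
    # 10**(p + 1) - 1 is the largest integer with p + 1 decimal digits.
    while 10 ** (precision + 1) - 1 <= max_value:
        precision += 1
    return precision


def smallest_bit_width(precision: int, widths=(32, 64, 128, 256)) -> int:
    """Smallest candidate width whose maximum precision covers `precision`."""
    for bits in widths:
        if max_decimal_precision(bits) >= precision:
            return bits
    raise ValueError(f"no candidate width holds {precision} digits")


for bits in (32, 64, 128, 256):
    print(bits, max_decimal_precision(bits))
```

Under that assumption, `smallest_bit_width(12)` returns 64, which matches the remark elsewhere in this thread that 12-digit TPC-H columns fit in a 64-bit decimal.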
> > > >>>>>>>
> > > >>>>>>> For example, the Spark documentation mentions that some decimal
> > > types
> > > >>>>>>> may fit in a Java int (32 bits) or long (64 bits):
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > >
> > https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> > > >>>>>>>
> > > >>>>>>> ... and a draft PR had even been filed for initial support in the
> > > C++
> > > >>>>>>> implementation (https://github.com/apache/arrow/pull/8578).
> > > >>>>>>>
> > > >>>>>>> I am therefore proposing that we relax the wording in the Arrow
> > > >> format
> > > >>>>>>> specification to also allow 32- and 64-bit decimal types.
> > > >>>>>>>
> > > >>>>>>> This is a preliminary discussion to gather opinions and potential
> > > >>>>>>> counter-arguments against this proposal. If no strong
> > > >> counter-argument
> > > >>>>>>> emerges, we will probably run a vote in a week or two.
> > > >>>>>>>
> > > >>>>>>> Best regards
> > > >>>>>>>
> > > >>>>>>> Antoine.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> >
