Any update on this proposal? I think this will be a useful addition too. I can potentially help with the Rust implementation.
Chao

On Tue, Mar 8, 2022 at 1:00 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> Agreed.
>
> Also, I would like to revise my previous comment about the small risk.
> While prototyping this I did hit some bumps. They primarily came from two
> reasons:
>
> * I was unable to find arrow/json files in the arrow-testing generated
>   files with a non-default decimal bitwidth (I think we only have the
>   on-the-fly generated file in archery)
> * the FFI interface has a default decimal of 128 (`d:{precision}:{scale}`)
>   and implementations may not support the 256 case (e.g. Rust has no native
>   i256). For these cases, this could be the first non-default decimal
>   implementation.
>
> So, maybe we follow the standard procedure?
>
> Best,
> Jorge
>
>
> On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > > it’ll help achieve better performance on TPC-H (and maybe other
> > > benchmarks). The decimal columns need only 12 digits of precision, for
> > > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > > decimal to be faster.
> >
> >
> > We should be careful here. If this assumes loading from Parquet or other
> > file formats currently in the library, arbitrarily changing the type to
> > load the minimum data-length possible could break users, this should
> > probably be a configuration option. This also reminds me I think there is
> > some technical debt with decimals and parquet.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-12022
> >
> > On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> >
> > > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > > it’ll help achieve better performance on TPC-H (and maybe other
> > > benchmarks).
> > > The decimal columns need only 12 digits of precision, for
> > > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > > decimal to be faster.
> > >
> > > Sasha Krassovsky
> > >
> > > > On March 8, 2022, at 09:01, Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > >>
> > > >> Do we want to keep the historical "C++ and Java" requirement or
> > > >> do we want to make it a more flexible "two independent official
> > > >> implementations", which could be for example C++ and Rust, Rust and
> > > >> Java, etc.
> > > >
> > > >
> > > > I think flexibility here is a good idea, I'd like to hear other opinions.
> > > >
> > > > For this particular case if there aren't volunteers to help out in another
> > > > implementation I'm willing to help with Java (I don't have bandwidth to
> > > > do both C++ and Java).
> > > >
> > > > Cheers,
> > > > -Micah
> > > >
> > > >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >>
> > > >>
> > > >> On 07/03/2022 at 20:26, Micah Kornfield wrote:
> > > >>>>
> > > >>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> > > >>>> from an integration perspective, as implementations already need to read
> > > >>>> the bitwidth to select the appropriate physical representation (if they
> > > >>>> support it).
> > > >>>
> > > >>> I think there are two reasons for having implementations first.
> > > >>> 1. Lower risk of bugs in implementation/spec.
> > > >>> 2. A mechanism to ensure that there is some boot-strapped coverage in
> > > >>> commonly used reference implementations.
> > > >>
> > > >> That sounds reasonable.
> > > >>
> > > >> Another question that came to my mind is: traditionally, we've mandated
> > > >> implementations in the two reference Arrow implementations (C++ and
> > > >> Java).
> > > >> However, our implementation landscape is now much richer than it
> > > >> used to be (for example, there is tremendous activity on the Rust
> > > >> side). Do we want to keep the historical "C++ and Java" requirement or
> > > >> do we want to make it a more flexible "two independent official
> > > >> implementations", which could be for example C++ and Rust, Rust and
> > > >> Java, etc.
> > > >>
> > > >> (by "independent" I mean that one should not be based on the other, for
> > > >> example it should not be "C++ and Python" :-))
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>>
> > > >>> I agree 1 is fairly low-risk.
> > > >>>
> > > >>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > > >>>
> > > >>>> +1 adding 32 and 64 bit decimals.
> > > >>>>
> > > >>>> +0 to release it without integration tests - both IPC and the C data
> > > >>>> interface use a variable bit width to declare the appropriate size for
> > > >>>> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
> > > >>>> from an integration perspective, as implementations already need to read
> > > >>>> the bitwidth to select the appropriate physical representation (if they
> > > >>>> support it).
> > > >>>>
> > > >>>> Best,
> > > >>>> Jorge
> > > >>>>
> > > >>>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou <anto...@python.org> wrote:
> > > >>>>
> > > >>>>>
> > > >>>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
> > > >>>>>> I think it makes sense to add these. Typically when adding new types,
> > > >>>>>> we've waited on the official vote until there are two reference
> > > >>>>>> implementations demonstrating compatibility.
> > > >>>>>
> > > >>>>> You are right, I had forgotten about that. Though in this case, it
> > > >>>>> might be argued we are just relaxing the constraints on an existing type.
> > > >>>>>
> > > >>>>> What do others think?
> > > >>>>>
> > > >>>>> Regards
> > > >>>>>
> > > >>>>> Antoine.
> > > >>>>>
> > > >>>>>>
> > > >>>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> Hello,
> > > >>>>>>>
> > > >>>>>>> Currently, the Arrow format specification restricts the bitwidth of
> > > >>>>>>> decimal numbers to either 128 or 256 bits.
> > > >>>>>>>
> > > >>>>>>> However, there is interest in allowing other bitwidths, at least 32 and
> > > >>>>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
> > > >>>>>>> datatype would allow for precisions of up to 18 digits (respectively 9
> > > >>>>>>> digits), which are sufficient for some applications which are mainly
> > > >>>>>>> looking for exact computations rather than sheer precision. Obviously,
> > > >>>>>>> smaller datatypes are cheaper to store in memory and cheaper to run
> > > >>>>>>> computations on.
> > > >>>>>>>
> > > >>>>>>> For example, the Spark documentation mentions that some decimal types
> > > >>>>>>> may fit in a Java int (32 bits) or long (64 bits):
> > > >>>>>>>
> > > >>>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> > > >>>>>>>
> > > >>>>>>> ... and a draft PR had even been filed for initial support in the C++
> > > >>>>>>> implementation (https://github.com/apache/arrow/pull/8578).
> > > >>>>>>>
> > > >>>>>>> I am therefore proposing that we relax the wording in the Arrow format
> > > >>>>>>> specification to also allow 32- and 64-bit decimal types.
> > > >>>>>>>
> > > >>>>>>> This is a preliminary discussion to gather opinions and potential
> > > >>>>>>> counter-arguments against this proposal. If no strong counter-argument
> > > >>>>>>> emerges, we will probably run a vote in a week or two.
> > > >>>>>>>
> > > >>>>>>> Best regards
> > > >>>>>>>
> > > >>>>>>> Antoine.
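
For readers following the numbers in this thread: the precision limits quoted in the original proposal (9 digits for 32 bits, 18 digits for 64 bits) can be checked with a few lines of arithmetic. The sketch below assumes decimals are stored as signed two's-complement integers of the given bit width. The helper names `max_precision` and `min_bitwidth` are hypothetical and are not part of any Arrow API. It is only meant to illustrate the reasoning, not how any implementation selects a type.

```python
# Illustrative sketch only: hypothetical helpers, not an Arrow API.
# Assumption: a decimal of width N bits stores its unscaled value as a
# signed two's-complement integer, so the largest value is 2**(N-1) - 1.

BITWIDTHS = (32, 64, 128, 256)  # the set of widths discussed in the thread


def max_precision(bitwidth: int) -> int:
    """Largest p such that every p-digit integer fits in `bitwidth` signed bits."""
    largest = 2 ** (bitwidth - 1) - 1
    # If `largest` has d digits, all (d - 1)-digit numbers fit, but not all
    # d-digit numbers do, so the usable precision is d - 1.
    return len(str(largest)) - 1


def min_bitwidth(precision: int) -> int:
    """Smallest width in BITWIDTHS able to hold `precision` decimal digits."""
    for bitwidth in BITWIDTHS:
        if precision <= max_precision(bitwidth):
            return bitwidth
    raise ValueError(f"precision {precision} exceeds the largest width considered")


if __name__ == "__main__":
    for bitwidth in BITWIDTHS:
        print(bitwidth, max_precision(bitwidth))
    # The 12-digit case mentioned for TPC-H above
    print(min_bitwidth(12))
```

Under these assumptions a 12-digit column is too wide for 32 bits and fits in 64, which matches the TPC-H remark earlier in the thread. Whether a reader should actually narrow the type on load is a separate question, as noted above regarding Parquet.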