I’d also like to chime in in favor of 32- and 64-bit decimals, because they’ll help achieve better performance on TPC-H (and perhaps other benchmarks). The decimal columns there need only 12 digits of precision, for which a 64-bit decimal is sufficient; using a 128-bit decimal for them is wasteful. You could technically use a float instead, but I expect a 64-bit decimal to be faster.
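To make the arithmetic behind this concrete (an editorial sketch, not part of the original thread): the largest 12-digit unscaled value fits comfortably in a signed 64-bit integer, but not in 32 bits, which is why 64-bit is the natural width for these columns.

```python
# Sketch: check which signed integer widths can hold every 12-digit
# unscaled decimal value (e.g. a DECIMAL(12, 2) column stores the
# unscaled integer; the scale is metadata).
largest_12_digit = 10 ** 12 - 1  # 999_999_999_999

fits_in_64 = largest_12_digit <= 2 ** 63 - 1
fits_in_32 = largest_12_digit <= 2 ** 31 - 1

print(fits_in_64)  # True: a 64-bit decimal suffices
print(fits_in_32)  # False: 32 bits are too small
```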
Sasha Krassovsky

> On Mar 8, 2022, at 09:01, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Do we want to keep the historical "C++ and Java" requirement or
>> do we want to make it a more flexible "two independent official
>> implementations", which could be for example C++ and Rust, Rust and
>> Java, etc.
>
> I think flexibility here is a good idea; I'd like to hear other opinions.
>
> For this particular case, if there aren't volunteers to help out in another
> implementation, I'm willing to help with Java (I don't have the bandwidth to
> do both C++ and Java).
>
> Cheers,
> -Micah
>
>> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>> On 07/03/2022 at 20:26, Micah Kornfield wrote:
>>>>
>>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
>>>> from an integration perspective, as implementations already need to read
>>>> the bitwidth to select the appropriate physical representation (if they
>>>> support it).
>>>
>>> I think there are two reasons for having implementations first:
>>> 1. Lower risk of bugs in the implementation/spec.
>>> 2. A mechanism to ensure that there is some bootstrapped coverage in
>>> commonly used reference implementations.
>>
>> That sounds reasonable.
>>
>> Another question that came to my mind is: traditionally, we've mandated
>> implementations in the two reference Arrow implementations (C++ and
>> Java). However, our implementation landscape is now much richer than it
>> used to be (for example, there is tremendous activity on the Rust
>> side). Do we want to keep the historical "C++ and Java" requirement, or
>> do we want to make it a more flexible "two independent official
>> implementations", which could be for example C++ and Rust, Rust and
>> Java, etc.?
>>
>> (By "independent" I mean that one should not be based on the other; for
>> example, it should not be "C++ and Python" :-))
>>
>> Regards
>>
>> Antoine.
>>
>>> I agree 1 is fairly low-risk.
>>>
>>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
>>> jorgecarlei...@gmail.com> wrote:
>>>
>>>> +1 adding 32- and 64-bit decimals.
>>>>
>>>> +0 to release it without integration tests - both IPC and the C data
>>>> interface use a variable bit width to declare the appropriate size for
>>>> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
>>>> from an integration perspective, as implementations already need to read
>>>> the bitwidth to select the appropriate physical representation (if they
>>>> support it).
>>>>
>>>> Best,
>>>> Jorge
>>>>
>>>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou <anto...@python.org> wrote:
>>>>
>>>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
>>>>>> I think it makes sense to add these. Typically when adding new types,
>>>>>> we've waited on the official vote until there are two reference
>>>>>> implementations demonstrating compatibility.
>>>>>
>>>>> You are right, I had forgotten about that. Though in this case, it
>>>>> might be argued we are just relaxing the constraints on an existing type.
>>>>>
>>>>> What do others think?
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou <anto...@python.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Currently, the Arrow format specification restricts the bitwidth of
>>>>>>> decimal numbers to either 128 or 256 bits.
>>>>>>>
>>>>>>> However, there is interest in allowing other bitwidths, at least 32 and
>>>>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
>>>>>>> datatype would allow for precisions of up to 18 digits (respectively 9
>>>>>>> digits), which are sufficient for some applications that are mainly
>>>>>>> looking for exact computations rather than sheer precision. Obviously,
>>>>>>> smaller datatypes are cheaper to store in memory and cheaper to run
>>>>>>> computations on.
>>>>>>>
>>>>>>> For example, the Spark documentation mentions that some decimal types
>>>>>>> may fit in a Java int (32 bits) or long (64 bits):
>>>>>>>
>>>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
>>>>>>>
>>>>>>> ... and a draft PR had even been filed for initial support in the C++
>>>>>>> implementation (https://github.com/apache/arrow/pull/8578).
>>>>>>>
>>>>>>> I am therefore proposing that we relax the wording in the Arrow format
>>>>>>> specification to also allow 32- and 64-bit decimal types.
>>>>>>>
>>>>>>> This is a preliminary discussion to gather opinions and potential
>>>>>>> counter-arguments against this proposal. If no strong counter-argument
>>>>>>> emerges, we will probably run a vote in a week or two.
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> Antoine.
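[Editorial note, not part of the thread: the precision figures quoted in the proposal (9 digits for 32-bit, 18 for 64-bit, and the existing 38 and 76 for 128- and 256-bit) follow from the largest power of ten whose values all fit in a signed two's-complement integer of the given width. A quick sketch:]

```python
def max_decimal_precision(bit_width: int) -> int:
    """Largest digit count p such that every p-digit unscaled value
    fits in a signed two's-complement integer of the given width."""
    limit = 2 ** (bit_width - 1) - 1
    p = 0
    while 10 ** (p + 1) - 1 <= limit:
        p += 1
    return p

for bits in (32, 64, 128, 256):
    print(bits, max_decimal_precision(bits))
# 32 -> 9, 64 -> 18, 128 -> 38, 256 -> 76
```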