I’d also like to chime in in favor of 32- and 64-bit decimals, because they’ll help achieve better performance on TPC-H (and perhaps other benchmarks). The decimal columns there need only 12 digits of precision, for which a 64-bit decimal is sufficient; using a 128-bit decimal for them is wasteful. You could technically use a float instead, but I expect a 64-bit decimal to be faster.
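To make the arithmetic behind this concrete (an editorial sketch, not part of the original thread): the largest 12-digit unscaled value fits comfortably in a signed 64-bit integer, but not in 32 bits, which is why 64-bit is the natural width for these columns.

```python
# Sketch: check which signed integer widths can hold every 12-digit
# unscaled decimal value (e.g. a DECIMAL(12, 2) column stores the
# unscaled integer; the scale is metadata).
largest_12_digit = 10 ** 12 - 1  # 999_999_999_999

fits_in_64 = largest_12_digit <= 2 ** 63 - 1
fits_in_32 = largest_12_digit <= 2 ** 31 - 1

print(fits_in_64)  # True: a 64-bit decimal suffices
print(fits_in_32)  # False: 32 bits are too small
```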
Sasha Krassovsky

> On Mar 8, 2022, at 09:01, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Do we want to keep the historical "C++ and Java" requirement or
>> do we want to make it a more flexible "two independent official
>> implementations", which could be for example C++ and Rust, Rust and
>> Java, etc.
>
> I think flexibility here is a good idea; I'd like to hear other opinions.
>
> For this particular case, if there aren't volunteers to help out in another
> implementation, I'm willing to help with Java (I don't have the bandwidth to
> do both C++ and Java).
>
> Cheers,
> -Micah
>
>> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>> On 07/03/2022 at 20:26, Micah Kornfield wrote:
>>>>
>>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
>>>> from an integration perspective, as implementations already need to read
>>>> the bitwidth to select the appropriate physical representation (if they
>>>> support it).
>>>
>>> I think there are two reasons for having implementations first:
>>> 1. Lower risk of bugs in the implementation/spec.
>>> 2. A mechanism to ensure that there is some bootstrapped coverage in
>>> commonly used reference implementations.
>>
>> That sounds reasonable.
>>
>> Another question that came to my mind is: traditionally, we've mandated
>> implementations in the two reference Arrow implementations (C++ and
>> Java). However, our implementation landscape is now much richer than it
>> used to be (for example, there is tremendous activity on the Rust
>> side). Do we want to keep the historical "C++ and Java" requirement, or
>> do we want to make it a more flexible "two independent official
>> implementations", which could be for example C++ and Rust, Rust and
>> Java, etc.?
>>
>> (By "independent" I mean that one should not be based on the other; for
>> example, it should not be "C++ and Python" :-))
>>
>> Regards
>>
>> Antoine.
>>
>>> I agree 1 is fairly low-risk.
>>>
>>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
>>> jorgecarlei...@gmail.com> wrote:
>>>
>>>> +1 adding 32- and 64-bit decimals.
>>>>
>>>> +0 to release it without integration tests - both IPC and the C data
>>>> interface use a variable bit width to declare the appropriate size for
>>>> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
>>>> from an integration perspective, as implementations already need to read
>>>> the bitwidth to select the appropriate physical representation (if they
>>>> support it).
>>>>
>>>> Best,
>>>> Jorge
>>>>
>>>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou <anto...@python.org> wrote:
>>>>
>>>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
>>>>>> I think it makes sense to add these. Typically when adding new types,
>>>>>> we've waited on the official vote until there are two reference
>>>>>> implementations demonstrating compatibility.
>>>>>
>>>>> You are right, I had forgotten about that. Though in this case, it
>>>>> might be argued we are just relaxing the constraints on an existing type.
>>>>>
>>>>> What do others think?
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou <anto...@python.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Currently, the Arrow format specification restricts the bitwidth of
>>>>>>> decimal numbers to either 128 or 256 bits.
>>>>>>>
>>>>>>> However, there is interest in allowing other bitwidths, at least 32 and
>>>>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
>>>>>>> datatype would allow for precisions of up to 18 digits (respectively 9
>>>>>>> digits), which are sufficient for some applications that are mainly
>>>>>>> looking for exact computations rather than sheer precision. Obviously,
>>>>>>> smaller datatypes are cheaper to store in memory and cheaper to run
>>>>>>> computations on.
>>>>>>>
>>>>>>> For example, the Spark documentation mentions that some decimal types
>>>>>>> may fit in a Java int (32 bits) or long (64 bits):
>>>>>>>
>>>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
>>>>>>>
>>>>>>> ... and a draft PR had even been filed for initial support in the C++
>>>>>>> implementation (https://github.com/apache/arrow/pull/8578).
>>>>>>>
>>>>>>> I am therefore proposing that we relax the wording in the Arrow format
>>>>>>> specification to also allow 32- and 64-bit decimal types.
>>>>>>>
>>>>>>> This is a preliminary discussion to gather opinions and potential
>>>>>>> counter-arguments against this proposal. If no strong counter-argument
>>>>>>> emerges, we will probably run a vote in a week or two.
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> Antoine.
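[Editorial note, not part of the thread: the precision figures quoted in the proposal (9 digits for 32-bit, 18 for 64-bit, and the existing 38 and 76 for 128- and 256-bit) follow from the largest power of ten whose values all fit in a signed two's-complement integer of the given width. A quick sketch:]

```python
def max_decimal_precision(bit_width: int) -> int:
    """Largest digit count p such that every p-digit unscaled value
    fits in a signed two's-complement integer of the given width."""
    limit = 2 ** (bit_width - 1) - 1
    p = 0
    while 10 ** (p + 1) - 1 <= limit:
        p += 1
    return p

for bits in (32, 64, 128, 256):
    print(bits, max_decimal_precision(bits))
# 32 -> 9, 64 -> 18, 128 -> 38, 256 -> 76
```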