Hi all,

I’d like to propose the following changes to the in-memory Decimal128
format and solicit feedback.

   1. When converting to and from an array of bytes, the input bytes are
   *assumed* to be in big-endian order and the output bytes are *guaranteed*
   to be in big-endian order. Additionally, the number of bytes is not assumed
   to be 16 anymore, and the length must be passed in to the Decimal128
   constructor.
   2. The byte width of the underlying FixedSizeBinary builder is the
   minimum number of bytes necessary to represent a Decimal128 of a given
   precision. For example, if my decimal type has precision 21, the byte width
   of my FixedSizeBinary builder will be 9 (a sketch of this calculation
   follows this list).
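
To make both rules concrete, here's a minimal C++ sketch. The helper names
(MinBytesForPrecision, ToBigEndianBytes) are hypothetical and not the actual
API in the PR; it just illustrates the intended behavior: the byte width is
the smallest number of bytes that can hold a two's-complement value with the
given number of decimal digits, and values are serialized into exactly that
many big-endian bytes.

#include <cmath>
#include <cstdint>
#include <vector>

// Smallest byte width able to hold 10^precision - 1 plus a sign bit.
int32_t MinBytesForPrecision(int32_t precision) {
  const double bits = std::ceil(precision * std::log2(10.0)) + 1.0;
  return static_cast<int32_t>(std::ceil(bits / 8.0));
}

// Big-endian encoding of a 128-bit value (split into high/low 64-bit halves)
// into `byte_width` bytes. Assumes the value already fits in `byte_width`
// bytes of two's complement.
std::vector<uint8_t> ToBigEndianBytes(int64_t high, uint64_t low,
                                      int32_t byte_width) {
  std::vector<uint8_t> out(byte_width);
  for (int32_t i = byte_width - 1; i >= 0; --i) {
    const int32_t shift = (byte_width - 1 - i) * 8;  // 0 for the last byte
    const uint64_t word = shift < 64 ? low : static_cast<uint64_t>(high);
    out[i] = static_cast<uint8_t>((word >> (shift % 64)) & 0xFF);
  }
  return out;
}

// MinBytesForPrecision(21) == 9, so a precision-21 column gets a 9-byte
// FixedSizeBinary builder, and ToBigEndianBytes(high, low, 9) yields the
// corresponding 9 big-endian bytes.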

Both of these changes are motivated by the Parquet format's decimal
specification and by the Impala Parquet reader/writer.

Pros:

   1. Parquet writes from Arrow arrays store only the minimum number of
   bytes on disk, saving space, and no additional logic is required in
   parquet-cpp.
   2. parquet-cpp would be able to easily read Parquet files written by
   Impala and to easily write Impala-compatible Parquet files.

Cons:

   1. Additional work is required in the Java implementation, since decimals
   there are currently always 16 bytes.
   2. Integration tests are broken because decimal byte widths are no longer
   reliably 16 bytes.
   3. Possibly worse in-memory performance, because array-element byte
   widths are no longer powers of two.

Possible alternatives:

   - Implement the logic only in parquet-cpp
      - I originally went down this road, but it was significantly more
      complicated to do this in parquet-cpp than it was to change the Arrow
      decimal format. I’d be willing to push harder on this if changing the
      Arrow decimal format in the proposed way is not a viable option.

I have a WIP PR up that implements this on the Arrow side, but I’m happy to
can it (or not) based on this discussion
(https://github.com/apache/arrow/pull/1108).

Thanks,
Phillip