Re: [Format][C++] Offering limited support for unsigned dictionary indices

Paul Taylor Fri, 26 Jun 2020 13:23:35 -0700

If positive integers are expected, I'm in favor of supporting unsignedindex types. I was surprised at Arrow C++ restriction on signed indicesin the RAPIDS thread, perhaps it's newer than when I ported the logic in JS.

Based on the flatbuffer schemas, dictionary indices could technically beany Arrow type, which I assumed was to allow for morecomplex/exotic/custom indexing schemes in the future. JS will allow youto specify _any_ Arrow type as the dictionary codes, though using anon-numeric type without a custom enumerator is UB.

I'm also curious about how the restriction on dictionary index typesinteracts with streaming delta dictionaries. In theory, you could have astreaming data source produce enough delta dictionaries such that thetotal dictionary size grows beyond 2^31-1 elements.

I think that's a valid use-case of delta dictionaries assuming Arrowaggregates the dictionaries into multiple RecordBatches (or aChunkedArray), which is what JS does. But if that were allowed, we wouldhave to allow 64-bit (signed or unsigned) dictionary index types.


Paul


On 6/26/20 5:58 AM, Wes McKinney wrote:

hi folks,

At the moment, using unsigned integers for dictionary indices/codes
isn't exactly forbidden by the metadata [1], which says that the
indices must be "positive integers". Meanwhile, the columnar format
specification says

"When a field is dictionary encoded, the values are represented by an
array of signed integers representing the index of the value in the
dictionary. The memory layout for a dictionary-encoded array is the
same as that of a primitive signed integer layout."

I was looking at a discussion in RAPIDS about this topic [2]

When we drafted the columnar specification for this, the intention as
I recall was only to support signed integer indices. The rationale
basically is that:

* Better cross platform / language support for signed (e.g. certainly
in the JVM)
* Supporting 4 instead of 8 index types is less burdensome for the
developer and means less code to generate to support them
* Unsigned wraparound bugs could pass silently

I think it would be feasible to support the unsigned indices with the
following caveats:

* Signed integers are recommended as the "compatible" and preferred choice
* Most algorithms in the reference libraries should choose signed over
unsigned when generating indices
* Libraries may choose to promote unsigned to signed (e.g. in Java) if
they don't support unsigned well

I can't say I'm thrilled about having to maintain extra code for the
unsigned case, but it also seems like it would not do great harm
overall. Additionally, if you are certain that the indices are all
non-negative, then you can use the same code to process both intX and
uintX -- we use this trick in the Take implementation in C++ to
generate half as much binary code.

Thoughts?

Thanks
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
[2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Reply via email to