I'll propose a patch to Columnar.rst about this. I think it's OK to allow uint64 indices (so we don't have to revisit the issue in a year or two or...) but the specification should say that they are not preferred.
On Fri, Jun 26, 2020 at 7:20 PM Paul Taylor <ptaylor.apa...@gmail.com> wrote: > > Responding to this comment from GitHub[1]: > > > If we had to make a bet about what % of dictionaries empirically are > > between 128 and 255 elements, I would bet that the percentage is > > small. If it turned out that 40% of dictionaries fell in that range > > then I would agree that this makes sense. > > I agree, the case where you'd use uint8 vs. int16 isn't very motivating, > since those are probably small tables or temporary allocations inside > larger operations. > > The case where you'd use uint16 vs. int32 is more motivating, since > dictionaries between 32767 and 65535 elements could easily back larger > columns and saving up to 1GiB on the keys can quickly add up. > > > I would recommend that the specification advise > > against use of 64-bit indices at all unless that are actually needed > > to represent the data (i.e. dictionaries have more than INT32_MAX / > > UINT32_MAX elements > > Agreed. > > > but doesn't strike me as being a common occurrence > > This is somewhat common in certain graph operations on large datasets, > but I concede I may not be a representative sample of Arrow users :) > > Paul > > 1. https://github.com/rapidsai/cudf/pull/5501#issuecomment-649936352 > > On 6/26/20 1:53 PM, Wes McKinney wrote: > > I think that situations where you need uint64 indices are likely to be > > exceedingly esoteric. I would recommend that the specification advise > > against use of 64-bit indices at all unless that are actually needed > > to represent the data (i.e. dictionaries have more than INT32_MAX / > > UINT32_MAX elements, but doesn't strike me as being a common > > occurrence) > > > > IIUC, the only change that would be necessary would be a revision to > > Columnar.rst and perhaps some language augmentation in Schema.fbs, and > > we might want to add some integration tests for unsigned indices to > > probe whether each language supports them. The changes needed to > > support unsigned integers in C++ are probably not that invasive but I > > haven't taken a close look at it yet > > > > On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor <ptaylor.apa...@gmail.com> > > wrote: > >> If positive integers are expected, I'm in favor of supporting unsigned > >> index types. I was surprised at Arrow C++ restriction on signed indices > >> in the RAPIDS thread, perhaps it's newer than when I ported the logic in > >> JS. > >> > >> Based on the flatbuffer schemas, dictionary indices could technically be > >> any Arrow type, which I assumed was to allow for more > >> complex/exotic/custom indexing schemes in the future. JS will allow you > >> to specify _any_ Arrow type as the dictionary codes, though using a > >> non-numeric type without a custom enumerator is UB. > >> > >> I'm also curious about how the restriction on dictionary index types > >> interacts with streaming delta dictionaries. In theory, you could have a > >> streaming data source produce enough delta dictionaries such that the > >> total dictionary size grows beyond 2^31-1 elements. > >> > >> I think that's a valid use-case of delta dictionaries assuming Arrow > >> aggregates the dictionaries into multiple RecordBatches (or a > >> ChunkedArray), which is what JS does. But if that were allowed, we would > >> have to allow 64-bit (signed or unsigned) dictionary index types. > >> > >> Paul > >> > >> > >> On 6/26/20 5:58 AM, Wes McKinney wrote: > >>> hi folks, > >>> > >>> At the moment, using unsigned integers for dictionary indices/codes > >>> isn't exactly forbidden by the metadata [1], which says that the > >>> indices must be "positive integers". Meanwhile, the columnar format > >>> specification says > >>> > >>> "When a field is dictionary encoded, the values are represented by an > >>> array of signed integers representing the index of the value in the > >>> dictionary. The memory layout for a dictionary-encoded array is the > >>> same as that of a primitive signed integer layout." > >>> > >>> I was looking at a discussion in RAPIDS about this topic [2] > >>> > >>> When we drafted the columnar specification for this, the intention as > >>> I recall was only to support signed integer indices. The rationale > >>> basically is that: > >>> > >>> * Better cross platform / language support for signed (e.g. certainly > >>> in the JVM) > >>> * Supporting 4 instead of 8 index types is less burdensome for the > >>> developer and means less code to generate to support them > >>> * Unsigned wraparound bugs could pass silently > >>> > >>> I think it would be feasible to support the unsigned indices with the > >>> following caveats: > >>> > >>> * Signed integers are recommended as the "compatible" and preferred choice > >>> * Most algorithms in the reference libraries should choose signed over > >>> unsigned when generating indices > >>> * Libraries may choose to promote unsigned to signed (e.g. in Java) if > >>> they don't support unsigned well > >>> > >>> I can't say I'm thrilled about having to maintain extra code for the > >>> unsigned case, but it also seems like it would not do great harm > >>> overall. Additionally, if you are certain that the indices are all > >>> non-negative, then you can use the same code to process both intX and > >>> uintX -- we use this trick in the Take implementation in C++ to > >>> generate half as much binary code. > >>> > >>> Thoughts? > >>> > >>> Thanks > >>> Wes > >>> > >>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280 > >>> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509