Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-28 Thread Wes McKinney
I opened a PR https://github.com/apache/arrow/pull/7567 We should probably vote on this so I will wait another day or so before opening a vote so we can get this change through for the release On Sun, Jun 28, 2020 at 9:38 AM Wes McKinney wrote: > > I'll propose a patch to Columnar.rst about thi

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-28 Thread Wes McKinney
I'll propose a patch to Columnar.rst about this. I think it's OK to allow uint64 indices (so we don't have to revisit the issue in a year or two or...) but the specification should say that they are not preferred. On Fri, Jun 26, 2020 at 7:20 PM Paul Taylor wrote: > > Responding to this comment f

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Paul Taylor
Responding to this comment from GitHub[1]: If we had to make a bet about what % of dictionaries empirically are between 128 and 255 elements, I would bet that the percentage is small. If it turned out that 40% of dictionaries fell in that range then I would agree that this makes sense. I agr

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Wes McKinney
I think that situations where you need uint64 indices are likely to be exceedingly esoteric. I would recommend that the specification advise against use of 64-bit indices at all unless that are actually needed to represent the data (i.e. dictionaries have more than INT32_MAX / UINT32_MAX elements,

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Paul Taylor
If positive integers are expected, I'm in favor of supporting unsigned index types. I was surprised at Arrow C++ restriction on signed indices in the RAPIDS thread, perhaps it's newer than when I ported the logic in JS. Based on the flatbuffer schemas, dictionary indices could technically be a

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Micah Kornfield
I think in the interest of not having the spec fork we should probably do this. It is partially our fault for not providing better documentation in Schema.fbs (and potentially more thorough integration tests). Maybe we should explicitly disallow uint64 which provides the biggest headache for the

[Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Wes McKinney
hi folks, At the moment, using unsigned integers for dictionary indices/codes isn't exactly forbidden by the metadata [1], which says that the indices must be "positive integers". Meanwhile, the columnar format specification says "When a field is dictionary encoded, the values are represented by