I opened a PR
https://github.com/apache/arrow/pull/7567
We should probably vote on this so I will wait another day or so
before opening a vote so we can get this change through for the
release
On Sun, Jun 28, 2020 at 9:38 AM Wes McKinney wrote:
>
> I'll propose a patch to Columnar.rst about thi
I'll propose a patch to Columnar.rst about this. I think it's OK to
allow uint64 indices (so we don't have to revisit the issue in a year
or two or...) but the specification should say that they are not
preferred.
On Fri, Jun 26, 2020 at 7:20 PM Paul Taylor wrote:
>
> Responding to this comment f
Responding to this comment from GitHub[1]:
If we had to make a bet about what % of dictionaries empirically are
between 128 and 255 elements, I would bet that the percentage is
small. If it turned out that 40% of dictionaries fell in that range
then I would agree that this makes sense.
I agr
I think that situations where you need uint64 indices are likely to be
exceedingly esoteric. I would recommend that the specification advise
against use of 64-bit indices at all unless that are actually needed
to represent the data (i.e. dictionaries have more than INT32_MAX /
UINT32_MAX elements,
If positive integers are expected, I'm in favor of supporting unsigned
index types. I was surprised at Arrow C++ restriction on signed indices
in the RAPIDS thread, perhaps it's newer than when I ported the logic in JS.
Based on the flatbuffer schemas, dictionary indices could technically be
a
I think in the interest of not having the spec fork we should probably do
this. It is partially our fault for not providing better documentation in
Schema.fbs (and potentially more thorough integration tests).
Maybe we should explicitly disallow uint64 which provides the biggest
headache for the
hi folks,
At the moment, using unsigned integers for dictionary indices/codes
isn't exactly forbidden by the metadata [1], which says that the
indices must be "positive integers". Meanwhile, the columnar format
specification says
"When a field is dictionary encoded, the values are represented by