Hi all,

Sorry for the late reply!

I would lean towards signed integers because we don't use unsigned
integers anywhere in the existing specification (other than as a data
type). While they are allowed as dictionary index values, the spec
specifically discourages their use [1]. If the times have changed and
this is no longer the case perhaps there should be a wider effort to
support unsigned values in other places that extend beyond a single
type?

-dewey

[1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L453-L457

On Tue, Sep 19, 2023 at 5:26 PM Benjamin Kietzman <bengil...@gmail.com> wrote:
>
> Hello again all,
>
> It seems there hasn't been much interest in this point so I'm leaning
> toward keeping unsigned integers. If anyone has a concern please respond
> here and/or on the PR [1].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
>
> On Thu, Sep 14, 2023 at 9:31 AM David Li <lidav...@apache.org> wrote:
>
> > I think Java was usually raised as the odd child out when this has come up
> > before. Since Java 8 there are standard library methods to manipulate
> > signed integers as if they were unsigned, so in principle Java shouldn't be
> > a blocker anymore.
> >
> > That said, ByteBuffer is still indexed by int so in practice Java wouldn't
> > be able to handle more than 2 GB in a single buffer, at least until we can
> > use the Java 21+ APIs (MemorySegment is finally indexed by (signed) long).
> >
> > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > > Hello all,
> > >
> > > Utf8View was recently accepted [1] and I've opened a PR to add the
> > > spec/schema changes [2]. In review [3], it was requested that signed 32
> > bit
> > > integers be used for the fields of view structs instead of 32 bit
> > unsigned.
> > >
> > > This divergence has been discussed on the ML previously [4], but in light
> > > of my reviewer's request for a change it should be raised again for
> > focused
> > > discussion. (At this stage, I don't *think* the change would require
> > > another vote.) I'll enumerate the motivations for signed and unsigned as
> > I
> > > understand them.
> > >
> > > Signed:
> > > - signed integers are conventional in the arrow format
> > > - unsigned integers may cause some difficulty of implementation in
> > > languages which don't natively support them
> > >
> > > Unsigned:
> > > - unsigned integers are used by engines which already implement Utf8View
> > >
> > > My own bias is toward compatibility with existing implementers, but using
> > > signed integers will only affect the case of arrays which include data
> > > buffers larger than 2GB. For reference, the default buffer size in velox
> > is
> > > 32KB so such a massive data buffer would only occur when a single slot
> > of a
> > > string array has 2.1GB of characters. This seems sufficiently unlikely
> > that
> > > I wouldn't consider it a blocker.
> > >
> > > Sincerely,
> > > Ben Kietzman
> > >
> > > [1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
> > > [2] https://github.com/apache/arrow/pull/37526
> > > [3] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > > [4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > [5]
> > >
> > https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50
> >

Reply via email to