Hi Ben,

I'm open to the idea of using unsigned if it allows compatibility with an
existing implementation or two. Could we name which ones it would be
compatible with? Links to implementation code would be very welcome, if
available.

As I understood it originally, the current implementations use raw pointers
rather than offsets into buffers, so these values already weren't
compatible to begin with. For example, it seems like Velox uses raw
pointers into buffers [1]. So unless I'm missing something, it seems like
these implementations will have to map the pointers to offsets anyways, so
maybe it's not much more trouble to convert to signed integers on the way.

Best,

Will Jones

[1]
https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78

On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman <bengil...@gmail.com>
wrote:

> Hello again all,
>
> It seems there hasn't been much interest in this point so I'm leaning
> toward keeping unsigned integers. If anyone has a concern please respond
> here and/or on the PR [1].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
>
> On Thu, Sep 14, 2023 at 9:31 AM David Li <lidav...@apache.org> wrote:
>
> > I think Java was usually raised as the odd child out when this has come
> up
> > before. Since Java 8 there are standard library methods to manipulate
> > signed integers as if they were unsigned, so in principle Java shouldn't
> be
> > a blocker anymore.
> >
> > That said, ByteBuffer is still indexed by int so in practice Java
> wouldn't
> > be able to handle more than 2 GB in a single buffer, at least until we
> can
> > use the Java 21+ APIs (MemorySegment is finally indexed by (signed)
> long).
> >
> > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > > Hello all,
> > >
> > > Utf8View was recently accepted [1] and I've opened a PR to add the
> > > spec/schema changes [2]. In review [3], it was requested that signed 32
> > bit
> > > integers be used for the fields of view structs instead of 32 bit
> > unsigned.
> > >
> > > This divergence has been discussed on the ML previously [4], but in
> light
> > > of my reviewer's request for a change it should be raised again for
> > focused
> > > discussion. (At this stage, I don't *think* the change would require
> > > another vote.) I'll enumerate the motivations for signed and unsigned
> as
> > I
> > > understand them.
> > >
> > > Signed:
> > > - signed integers are conventional in the arrow format
> > > - unsigned integers may cause some difficulty of implementation in
> > > languages which don't natively support them
> > >
> > > Unsigned:
> > > - unsigned integers are used by engines which already implement
> Utf8View
> > >
> > > My own bias is toward compatibility with existing implementers, but
> using
> > > signed integers will only affect the case of arrays which include data
> > > buffers larger than 2GB. For reference, the default buffer size in
> velox
> > is
> > > 32KB so such a massive data buffer would only occur when a single slot
> > of a
> > > string array has 2.1GB of characters. This seems sufficiently unlikely
> > that
> > > I wouldn't consider it a blocker.
> > >
> > > Sincerely,
> > > Ben Kietzman
> > >
> > > [1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
> > > [2] https://github.com/apache/arrow/pull/37526
> > > [3] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > > [4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > [5]
> > >
> >
> https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50
> >
>

Reply via email to