Hello all,

Thanks for the input!

@Will

> Could we name which ones it would be compatible with?

UmbraDB [1], Velox [2], and DuckDB [3] all use unsigned integers for size.

> As I understood it originally, the current implementations use raw
pointers

Yes, these implementations use raw pointers exclusively and for
transmission via IPC
those will require conversion. The compatibility we'd achieve by using
unsigned
integers is mostly one of documentation and validation; the above engines
would
accept a string as long as `2**32 - 1`, but with signed integers arrow
would reject
it. This is a distant edge case so I'd say it's acceptable to exclude it.
(For completeness it should be noted that conversion between signed and
unsigned is
typically a trivial operation and shouldn't be considered as adding to any
runtime cost.)

@Dewey
> ... the spec specifically discourages their use.
> If the times have changed and this is no longer the case perhaps there
> should be a wider effort to support unsigned values in other places

I'm not sure which signed integers you're thinking of. At the risk of
constructing a
strawman, I would say that for example we shouldn't extend the offsets in
the List type
to support unsigned integers. This would require adding a new type named
UnsignedOffsetsList or so, and I don't think that would offer a useful
difference from
the existing List type. I would recommend waiting for implementers to
motivate such a
change by proposing new cases where unsigned integers would be useful to
support
(as with unsigned dictionary indices).

> I would lean towards signed integers because we don't use unsigned
> integers anywhere in the existing specification

It's true this is the convention in the spec. Since you and @pitrou have
both mentioned
it (and we haven't heard strong opinions in support of unsigned sizes), I
think that's
sufficient grounds to avoid the exception.

Sincerely,
Ben Kietzman

[1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf (section 3.1,
second paragraph)
[2]
https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L255
[3]
https://github.com/duckdb/duckdb/blob/5a29c99891dcc71d19ce222ee3a68bf08686680e/src/include/duckdb/common/types/string_type.hpp#L210

On Tue, Sep 19, 2023 at 4:50 PM Will Jones <will.jones...@gmail.com> wrote:

> Hi Ben,
>
> I'm open to the idea of using unsigned if it allows compatibility with an
> existing implementation or two. Could we name which ones it would be
> compatible with? Links to implementation code would be very welcome, if
> available.
>
> As I understood it originally, the current implementations use raw pointers
> rather than offsets into buffers, so these values already weren't
> compatible to begin with. For example, it seems like Velox uses raw
> pointers into buffers [1]. So unless I'm missing something, it seems like
> these implementations will have to map the pointers to offsets anyways, so
> maybe it's not much more trouble to convert to signed integers on the way.
>
> Best,
>
> Will Jones
>
> [1]
>
> https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78
>
> On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman <bengil...@gmail.com>
> wrote:
>
> > Hello again all,
> >
> > It seems there hasn't been much interest in this point so I'm leaning
> > toward keeping unsigned integers. If anyone has a concern please respond
> > here and/or on the PR [1].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> >
> > On Thu, Sep 14, 2023 at 9:31 AM David Li <lidav...@apache.org> wrote:
> >
> > > I think Java was usually raised as the odd child out when this has come
> > up
> > > before. Since Java 8 there are standard library methods to manipulate
> > > signed integers as if they were unsigned, so in principle Java
> shouldn't
> > be
> > > a blocker anymore.
> > >
> > > That said, ByteBuffer is still indexed by int so in practice Java
> > wouldn't
> > > be able to handle more than 2 GB in a single buffer, at least until we
> > can
> > > use the Java 21+ APIs (MemorySegment is finally indexed by (signed)
> > long).
> > >
> > > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > > > Hello all,
> > > >
> > > > Utf8View was recently accepted [1] and I've opened a PR to add the
> > > > spec/schema changes [2]. In review [3], it was requested that signed
> 32
> > > bit
> > > > integers be used for the fields of view structs instead of 32 bit
> > > unsigned.
> > > >
> > > > This divergence has been discussed on the ML previously [4], but in
> > light
> > > > of my reviewer's request for a change it should be raised again for
> > > focused
> > > > discussion. (At this stage, I don't *think* the change would require
> > > > another vote.) I'll enumerate the motivations for signed and unsigned
> > as
> > > I
> > > > understand them.
> > > >
> > > > Signed:
> > > > - signed integers are conventional in the arrow format
> > > > - unsigned integers may cause some difficulty of implementation in
> > > > languages which don't natively support them
> > > >
> > > > Unsigned:
> > > > - unsigned integers are used by engines which already implement
> > Utf8View
> > > >
> > > > My own bias is toward compatibility with existing implementers, but
> > using
> > > > signed integers will only affect the case of arrays which include
> data
> > > > buffers larger than 2GB. For reference, the default buffer size in
> > velox
> > > is
> > > > 32KB so such a massive data buffer would only occur when a single
> slot
> > > of a
> > > > string array has 2.1GB of characters. This seems sufficiently
> unlikely
> > > that
> > > > I wouldn't consider it a blocker.
> > > >
> > > > Sincerely,
> > > > Ben Kietzman
> > > >
> > > > [1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
> > > > [2] https://github.com/apache/arrow/pull/37526
> > > > [3]
> https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > > > [4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > > [5]
> > > >
> > >
> >
> https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50
> > >
> >
>

Reply via email to