Re: [LAST CALL][DISCUSS] Unsigned integers in Utf8View

Matt Topol Wed, 20 Sep 2023 14:42:29 -0700

Just to chime in (and add yet another voice into the mix here), I'd have a
preference for it being signed integers for the same reasons as most
everyone else: consistency with everything else in the spec. Since we use
signed integers everywhere, I'd prefer to keep it consistent rather than
introduce an exception.


--Matt

On Wed, Sep 20, 2023 at 11:19 AM Benjamin Kietzman <bengil...@gmail.com>
wrote:

> Hello all,
>
> Thanks for the input!
>
> @Will
>
> > Could we name which ones it would be compatible with?
>
> UmbraDB [1], Velox [2], and DuckDB [3] all use unsigned integers for size.
>
> > As I understood it originally, the current implementations use raw
> pointers
>
> Yes, these implementations use raw pointers exclusively and for
> transmission via IPC
> those will require conversion. The compatibility we'd achieve by using
> unsigned
> integers is mostly one of documentation and validation; the above engines
> would
> accept a string as long as `2**32 - 1`, but with signed integers arrow
> would reject
> it. This is a distant edge case so I'd say it's acceptable to exclude it.
> (For completeness it should be noted that conversion between signed and
> unsigned is
> typically a trivial operation and shouldn't be considered as adding to any
> runtime cost.)
>
> @Dewey
> > ... the spec specifically discourages their use.
> > If the times have changed and this is no longer the case perhaps there
> > should be a wider effort to support unsigned values in other places
>
> I'm not sure which signed integers you're thinking of. At the risk of
> constructing a
> strawman, I would say that for example we shouldn't extend the offsets in
> the List type
> to support unsigned integers. This would require adding a new type named
> UnsignedOffsetsList or so, and I don't think that would offer a useful
> difference from
> the existing List type. I would recommend waiting for implementers to
> motivate such a
> change by proposing new cases where unsigned integers would be useful to
> support
> (as with unsigned dictionary indices).
>
> > I would lean towards signed integers because we don't use unsigned
> > integers anywhere in the existing specification
>
> It's true this is the convention in the spec. Since you and @pitrou have
> both mentioned
> it (and we haven't heard strong opinions in support of unsigned sizes), I
> think that's
> sufficient grounds to avoid the exception.
>
> Sincerely,
> Ben Kietzman
>
> [1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf (section 3.1,
> second paragraph)
> [2]
>
> https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L255
> [3]
>
> https://github.com/duckdb/duckdb/blob/5a29c99891dcc71d19ce222ee3a68bf08686680e/src/include/duckdb/common/types/string_type.hpp#L210
>
> On Tue, Sep 19, 2023 at 4:50 PM Will Jones <will.jones...@gmail.com>
> wrote:
>
> > Hi Ben,
> >
> > I'm open to the idea of using unsigned if it allows compatibility with an
> > existing implementation or two. Could we name which ones it would be
> > compatible with? Links to implementation code would be very welcome, if
> > available.
> >
> > As I understood it originally, the current implementations use raw
> pointers
> > rather than offsets into buffers, so these values already weren't
> > compatible to begin with. For example, it seems like Velox uses raw
> > pointers into buffers [1]. So unless I'm missing something, it seems like
> > these implementations will have to map the pointers to offsets anyways,
> so
> > maybe it's not much more trouble to convert to signed integers on the
> way.
> >
> > Best,
> >
> > Will Jones
> >
> > [1]
> >
> >
> https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78
> >
> > On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman <bengil...@gmail.com>
> > wrote:
> >
> > > Hello again all,
> > >
> > > It seems there hasn't been much interest in this point so I'm leaning
> > > toward keeping unsigned integers. If anyone has a concern please
> respond
> > > here and/or on the PR [1].
> > >
> > > Sincerely,
> > > Ben Kietzman
> > >
> > > [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > >
> > > On Thu, Sep 14, 2023 at 9:31 AM David Li <lidav...@apache.org> wrote:
> > >
> > > > I think Java was usually raised as the odd child out when this has
> come
> > > up
> > > > before. Since Java 8 there are standard library methods to manipulate
> > > > signed integers as if they were unsigned, so in principle Java
> > shouldn't
> > > be
> > > > a blocker anymore.
> > > >
> > > > That said, ByteBuffer is still indexed by int so in practice Java
> > > wouldn't
> > > > be able to handle more than 2 GB in a single buffer, at least until
> we
> > > can
> > > > use the Java 21+ APIs (MemorySegment is finally indexed by (signed)
> > > long).
> > > >
> > > > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > > > > Hello all,
> > > > >
> > > > > Utf8View was recently accepted [1] and I've opened a PR to add the
> > > > > spec/schema changes [2]. In review [3], it was requested that
> signed
> > 32
> > > > bit
> > > > > integers be used for the fields of view structs instead of 32 bit
> > > > unsigned.
> > > > >
> > > > > This divergence has been discussed on the ML previously [4], but in
> > > light
> > > > > of my reviewer's request for a change it should be raised again for
> > > > focused
> > > > > discussion. (At this stage, I don't *think* the change would
> require
> > > > > another vote.) I'll enumerate the motivations for signed and
> unsigned
> > > as
> > > > I
> > > > > understand them.
> > > > >
> > > > > Signed:
> > > > > - signed integers are conventional in the arrow format
> > > > > - unsigned integers may cause some difficulty of implementation in
> > > > > languages which don't natively support them
> > > > >
> > > > > Unsigned:
> > > > > - unsigned integers are used by engines which already implement
> > > Utf8View
> > > > >
> > > > > My own bias is toward compatibility with existing implementers, but
> > > using
> > > > > signed integers will only affect the case of arrays which include
> > data
> > > > > buffers larger than 2GB. For reference, the default buffer size in
> > > velox
> > > > is
> > > > > 32KB so such a massive data buffer would only occur when a single
> > slot
> > > > of a
> > > > > string array has 2.1GB of characters. This seems sufficiently
> > unlikely
> > > > that
> > > > > I wouldn't consider it a blocker.
> > > > >
> > > > > Sincerely,
> > > > > Ben Kietzman
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
> > > > > [2] https://github.com/apache/arrow/pull/37526
> > > > > [3]
> > https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > > > > [4]
> https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > > > [5]
> > > > >
> > > >
> > >
> >
> https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50
> > > >
> > >
> >
>

Re: [LAST CALL][DISCUSS] Unsigned integers in Utf8View

Reply via email to