Option 3 is what the columnar specification currently intends, for
the reasons that Jacques cites. In particular, a value can be made
null only by altering the validity bitmap. It might be helpful to add
some language to make clear that the contents "underneath" a null can
be anything. The same is true of the other memory layouts, including
primitive.
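
To illustrate what "anything underneath a null" means, here is a minimal
sketch in plain Python. It is a simplified model of a variable-width layout
(validity bitmap, offsets, data), not the Arrow API; the function names
is_valid and to_values are made up for this example. Under option 3, the two
layouts below decode to the same logical values even though one of them keeps
stray bytes in the null slot.

```python
# Hypothetical, simplified model of a variable-width (VarChar-like) layout.
# This is not the Arrow API; all names are illustrative assumptions.

def is_valid(validity: bytes, i: int) -> bool:
    """Return True if bit i of a least-significant-bit-first bitmap is set."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1


def to_values(validity: bytes, offsets: list, data: bytes) -> list:
    """Decode slots to Python values, consulting only the bitmap for nullness."""
    out = []
    for i in range(len(offsets) - 1):
        if is_valid(validity, i):
            out.append(data[offsets[i]:offsets[i + 1]])
        else:
            out.append(None)  # whatever bytes sit under a null slot are ignored
    return out


# Three slots: "ab", null, "cd". Bit 1 is cleared, so slot 1 is null.
validity = bytes([0b00000101])

# Case A: the null slot takes no space (start index == end index).
a = to_values(validity, [0, 2, 2, 4], b"abcd")

# Case B: the null slot still spans two arbitrary bytes ("XY").
b = to_values(validity, [0, 2, 4, 6], b"abXYcd")
```

Both a and b come out as the same list of values, which is why a reader must
never infer nullness from the offsets or the data bytes alone.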

On Thu, Aug 29, 2019 at 12:50 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> Hi Jacques and Ravindra,
>
> Thanks for your valuable feedback.
>
> Please let me talk more about contiguous memory:
> For some operations (like memory segment comparison, hash code computation,
> etc.), if we choose option 1 or 2, we can get the result with a single
> call, without any reference to the validity buffer.
>
> With option 3, we need to split the memory into continuous regions
> separated by undefined regions (based on validity buffer), and then we
> calculate the result for each region and finally combine them. This is less
> efficient.
>
> Ravindra's idea sounds interesting, especially when most values are null or
> non-null.
>
> What do you think?
>
> Best,
> Liya Fan
>
> On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura <ravin...@dremio.com>
> wrote:
>
> > On Wed, Aug 28, 2019 at 12:32 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > > Dear all,
> > >
> > > In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
> > > we are faced with a problem:
> > >
> > > Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
> > > supposed to take no space in the data buffer. In particular, for a null
> > > value, we have
> > >
> > > start index == end index
> > >
> > > Where start index and end index are the start/end positions of the value
> > > in the data buffer. This problem is also related to the ListVector.
> > >
> > > However, it seems that for some scenarios, a null value can take
> > > non-empty space (please see this comment
> > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > >
> > > Since this is an important issue, we should make it clear in the
> > > specification. Otherwise, some unexpected problems may occur in client
> > > code.
> > >
> > > It seems we are faced with 3 options:
> > >
> > > 1. a null value always takes no space.
> > > 2. a null value can take non-empty space, and the content of the
> > > non-empty space is always 0.
> > > 3. a null value can take non-empty space, and the content of the
> > > non-empty space is undefined.
> > >
> > > Option 1 makes the data buffer of a VariableWidthVector a continuous
> > > region (not interleaved by undefined regions). So optimization can be
> > > applied. However, it may lead to memory copy/move (as indicated in the
> > > above comment
> > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > >
> > > Option 3 can address the above problem of memory copy/move. However, it
> > > splits memory into non-contiguous regions, so optimizations cannot be
> > > performed. In addition, it may cause unexpected problems in client code.
> > >
> >
> > We could still apply the optimisation for the contiguous "valid regions",
> > e.g. if the entire vector (called an array in C++) is valid, then compare the
> > data buffers. If there are only two null entries in the vector, compare the
> > three consecutive regions in the data buffer, and so on.
> >
> >
> >
> > >
> > > Option 2 seems like a trade-off between the two. However, it is not
> > > suitable for ListVector.
> > >
> > > Please give your valuable feedback.
> > >
> > > Best,
> > > Liya Fan
> > >
> >
> >
> > --
> > Thanks and regards,
> > Ravindra.
> >
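
Ravindra's suggestion of comparing the contiguous valid regions can be
sketched roughly as follows. This is plain Python over a simplified
(validity, offsets, data) tuple, not the Arrow API; valid_runs and
vectors_equal are hypothetical names chosen for this example. It assumes both
validity bitmaps cover at least as many slots as the offsets imply.

```python
# Hypothetical, simplified model; not the Arrow API.

def is_valid(validity, i):
    """Return True if bit i of a least-significant-bit-first bitmap is set."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1


def valid_runs(validity, n):
    """Yield (start, end) slot ranges, end exclusive, of consecutive valid slots."""
    start = None
    for i in range(n):
        if is_valid(validity, i):
            if start is None:
                start = i
        elif start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, n


def vectors_equal(va, vb):
    """Compare two (validity, offsets, data) tuples, ignoring bytes under nulls."""
    (val_a, off_a, dat_a), (val_b, off_b, dat_b) = va, vb
    n = len(off_a) - 1
    if n != len(off_b) - 1:
        return False
    # Nullness must agree slot by slot.
    if any(is_valid(val_a, i) != is_valid(val_b, i) for i in range(n)):
        return False
    for s, e in valid_runs(val_a, n):
        # Value lengths inside the run must match, otherwise "ab","c" and
        # "a","bc" would look identical in the data buffer.
        lens_a = [off_a[i + 1] - off_a[i] for i in range(s, e)]
        lens_b = [off_b[i + 1] - off_b[i] for i in range(s, e)]
        if lens_a != lens_b:
            return False
        # One bulk comparison per valid run instead of one per value.
        if dat_a[off_a[s]:off_a[e]] != dat_b[off_b[s]:off_b[e]]:
            return False
    return True


compact = (bytes([0b101]), [0, 2, 2, 4], b"abcd")    # null takes no space
padded = (bytes([0b101]), [0, 2, 4, 6], b"abXYcd")   # null spans "XY"
same = vectors_equal(compact, padded)
```

When the whole vector is valid there is a single run, so this degenerates to
one comparison of the data buffers; with k isolated nulls there are at most
k + 1 runs.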
