Option 3 is what the columnar specification currently intends, for the reasons that Jacques cites. In particular, a value can be made null by altering only the validity bitmap. It might be helpful to add some language to make clear that the contents "underneath" a null can be anything. The same is true of the other memory layouts, including primitive.
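To make the point concrete, here is a minimal sketch in plain Python. It is not the Arrow API: `read_values` and the three buffers are hypothetical names made up for illustration, and the validity bitmap is modelled as a list of booleans rather than packed bits. It only shows that a reader consults the validity buffer first and never interprets the bytes spanned by a null slot.

```python
from typing import List, Optional


def read_values(validity: List[bool], offsets: List[int],
                data: bytes) -> List[Optional[bytes]]:
    """Decode a variable-width array (hypothetical helper, not Arrow API).

    A slot is null if and only if its validity entry is False; whatever
    bytes lie between its offsets are never looked at.
    """
    assert len(offsets) == len(validity) + 1
    out: List[Optional[bytes]] = []
    for i, valid in enumerate(validity):
        if valid:
            out.append(data[offsets[i]:offsets[i + 1]])
        else:
            out.append(None)  # data[offsets[i]:offsets[i + 1]] is ignored
    return out


# Slot 1 is null, yet its offsets still span three bytes of arbitrary content.
validity = [True, False, True]
offsets = [0, 2, 5, 7]
data = b"ab???cd"
values = read_values(validity, offsets, data)
```

Replacing the three `?` bytes with zeros, or with anything else, leaves `values` unchanged, which is the sense in which the contents underneath a null are undefined.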
On Thu, Aug 29, 2019 at 12:50 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> Hi Jacques and Ravindra,
>
> Thanks for your valuable feedback.
>
> Please let me talk more about contiguous memory:
> For some operations (like memory segment comparison, hash code computation,
> etc.), if we we chose option 1 or 2, we can get the result with a single
> call, without any reference to the validity buffer.
>
> With option 3, we need to split the memory into continuous regions
> separated by undefined regions (based on validity buffer), and then we
> calculate the result for each region and finally combine them. This is less
> efficient.
>
> Ravindra's idea sounds interesting, especially when most values are null or
> non-null.
>
> What do you think?
>
> Best,
> Liya Fan
>
> On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura <ravin...@dremio.com>
> wrote:
>
> > On Wed, Aug 28, 2019 at 12:32 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > > Dear all,
> > >
> > > In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
> > > we are faced with a problem:
> > >
> > > Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
> > > supposed to take no space in the data buffer. In particular, for a null
> > > value, we have
> > >
> > > start index == end index
> > >
> > > Where start index and end index are the start/end positions of the value
> > > in the data buffer. This problem is also related to the ListVector.
> > >
> > > However, it seems that for some scenarios, a null value can take
> > > non-empty space (please see this comment
> > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > >
> > > Since this is an important issue, we should make it clear in the
> > > specification. Otherwise, some unexpected problems may occur in client
> > > code.
> > >
> > > It seems we are faced with 3 options:
> > >
> > > 1. a null value always takes no space.
> > > 2. a null value can take non-empty space, and the content of the
> > > non-empty space is always 0.
> > > 3. a null value can take non-empty space, and the content of the
> > > non-empty space is undefined.
> > >
> > > Option 1 makes the data buffer of a VariableWidthVector a continuous
> > > region (not interleaved by undefined regions). So optimization can be
> > > applied. However, it may lead to memory copy/move (as indicated in the
> > > above comment
> > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491)
> > >
> > > Option 3 can address the above problem of memory copy/move. However, it
> > > splits memory into un-continuous regions, so optimizations cannot be
> > > performed. In addition, it may cause unexpected problems in client code.
> > >
> >
> > We could still apply the optimisation for the contiguous "valid regions".
> > eg. if the entire vector is valid (called array in cpp), then compare data
> > buffers. If there are only two null entries in the vector, compare the
> > three consecutive regions in the data buffer, ..
> >
> > >
> > > Option 2 seems like a trade-off between the two. However, it is not
> > > suitable for ListVector.
> > >
> > > Please give your valuable feedback.
> > >
> > > Best,
> > > Liya Fan
> > >
> >
> > --
> > Thanks and regards,
> > Ravindra.
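
For reference, the region-wise comparison that Ravindra describes in the thread could look roughly like the sketch below. This is plain Python under simplifying assumptions, not Arrow code: `valid_runs` and `arrays_equal` are hypothetical names, validity is a list of booleans rather than a packed bitmap, and offsets are assumed to be non-decreasing. Each run of consecutive non-null slots is compared with one bulk slice comparison, and the bytes under null slots are never read.

```python
from itertools import groupby
from typing import Iterator, List, Tuple


def valid_runs(validity: List[bool]) -> Iterator[Tuple[int, int]]:
    """Yield [start, end) slot ranges of consecutive non-null slots."""
    i = 0
    for is_valid, group in groupby(validity):
        n = len(list(group))
        if is_valid:
            yield i, i + n
        i += n


def arrays_equal(validity_a: List[bool], offsets_a: List[int], data_a: bytes,
                 validity_b: List[bool], offsets_b: List[int], data_b: bytes) -> bool:
    """Logical equality of two variable-width arrays (hypothetical helper)."""
    if validity_a != validity_b:
        return False
    for start, end in valid_runs(validity_a):
        # Value lengths must agree slot by slot within the run.
        lens_a = [offsets_a[i + 1] - offsets_a[i] for i in range(start, end)]
        lens_b = [offsets_b[i + 1] - offsets_b[i] for i in range(start, end)]
        if lens_a != lens_b:
            return False
        # One bulk comparison covers all values in the run.
        if (data_a[offsets_a[start]:offsets_a[end]]
                != data_b[offsets_b[start]:offsets_b[end]]):
            return False
    return True
```

With no nulls this reduces to a single comparison of the whole data buffer; with two nulls in the interior it performs three, matching the example given in the thread. The cost grows with the number of runs, so it helps most when nulls are rare or clustered.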