Hi Jacques and Ravindra,

Thanks for your valuable feedback.

Let me say more about contiguous memory. For some operations (memory segment comparison, hash code computation, etc.), if we choose option 1 or 2, we can get the result with a single call, without consulting the validity buffer.

With option 3, we need to use the validity buffer to split the data buffer into contiguous defined regions separated by undefined regions, compute the result for each region, and finally combine the results. This is less efficient.

Ravindra's idea sounds interesting, especially when most values are null or most are non-null.

What do you think?

Best,
Liya Fan

On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura <ravin...@dremio.com> wrote:

> On Wed, Aug 28, 2019 at 12:32 PM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Dear all,
> >
> > In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
> > we are faced with a problem:
> >
> > Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
> > supposed to take no space in the data buffer. In particular, for a null
> > value, we have
> >
> > start index == end index
> >
> > Where start index and end index are the start/end positions of the value in
> > the data buffer. This problem is also related to the ListVector.
> >
> > However, it seems that for some scenarios, a null value can take non-empty
> > space (please see this comment
> > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> >
> > Since this is an important issue, we should make it clear in the
> > specification. Otherwise, some unexpected problems may occur in client
> > code.
> >
> > It seems we are faced with 3 options:
> >
> > 1. a null value always takes no space.
> > 2. a null value can take non-empty space, and the content of the non-empty
> > space is always 0.
> > 3. a null value can take non-empty space, and the content of the non-empty
> > space is undefined.
> >
> > Option 1 makes the data buffer of a VariableWidthVector a continuous region
> > (not interleaved by undefined regions). So optimization can be applied.
> > However, it may lead to memory copy/move (as indicated in the above comment
> > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491)
> >
> > Option 3 can address the above problem of memory copy/move. However, it
> > splits memory into un-continuous regions, so optimizations cannot be
> > performed. In addition, it may cause unexpected problems in client code.
> >
>
> We could still apply the optimisation for the contiguous "valid regions".
> eg. if the entire vector is valid (called array in cpp), then compare data
> buffers. If there are only two null entries in the vector, compare the
> three consecutive regions in the data buffer, ..
>
> >
> > Option 2 seems like a trade-off between the two. However, it is not
> > suitable for ListVector.
> >
> > Please give your valuable feedback.
> >
> > Best,
> > Liya Fan
> >
>
> --
> Thanks and regards,
> Ravindra.
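
For reference, here is a minimal sketch of the region-by-region comparison discussed above, in plain Java with no Arrow dependency. The class RegionCompare, the method dataEquals, and the array-based layout (byte[] data, int[] offsets with valueCount + 1 entries, boolean[] valid) are hypothetical stand-ins for the data, offset, and validity buffers, not the Arrow API. It assumes both vectors have identical validity, and it compares only the data bytes, so a full equality check would also have to compare per-value lengths.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Arrow.
final class RegionCompare {

    // Compares the data bytes of two variable-width vectors that share the
    // same validity. Bytes occupied by null slots (option 3) are skipped.
    static boolean dataEquals(byte[] dataA, int[] offA,
                              byte[] dataB, int[] offB,
                              boolean[] valid) {
        int n = valid.length;
        int i = 0;
        while (i < n) {
            if (!valid[i]) {
                i++; // null slot: its bytes are undefined, ignore them
                continue;
            }
            int runStart = i;
            while (i < n && valid[i]) {
                i++;
            }
            // Values runStart..i-1 are all valid, so their bytes form one
            // contiguous region in each data buffer: one range comparison
            // per run (Arrays.equals range overload, Java 9+).
            if (!Arrays.equals(dataA, offA[runStart], offA[i],
                               dataB, offB[runStart], offB[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean[] valid = {true, false, true};
        // The null slot occupies two bytes with arbitrary content.
        byte[] a = {'a', 'b', 'X', 'X', 'c', 'd'};
        int[] offA = {0, 2, 4, 6};
        // The null slot occupies no space.
        byte[] b = {'a', 'b', 'c', 'd'};
        int[] offB = {0, 2, 2, 4};
        System.out.println(dataEquals(a, offA, b, offB, valid));
    }
}
```

With no nulls the loop degenerates to a single range comparison, which is the option 1 / option 2 fast path; with k isolated nulls it makes at most k + 1 comparisons, matching the "two null entries, three regions" example in the quoted reply.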