Hi Wes, Thanks a lot for the comments. You are right. This can be applied to the data encoding/compression, and I think this is one of the building blocks for encoding/compression.
In the short term, it will provide conversions between the two memory layouts of run length vectors. In the mid term, it can help to reduce the network traffic for varchar/varbinary vectors. In the long term, it will provide compression for more scenarios. The basic idea is based on the observations that, a vector usually has a smaller width after converting to a delta vector. For example, for a varchar vector with a large number of elements, the offset buffer will use 4 bytes for each element. However it is likely that, the strings in the vectors are not big (less than 65536 in length). So by converting the offset buffer to a delta vector, we can use a int vector with 2 byte width. Best, Liya Fan On Thu, Sep 5, 2019 at 3:05 AM Wes McKinney <wesmck...@gmail.com> wrote: > hi, > > Having utility algorithms to perform data transformations seems fine > if there is a use for them and maintaining the code in the Arrow > libraries makes sense. > > I don't understand point #2 "We can transform them to delta vectors > before IPC". It sounds like you are proposing a data compression > technique. Should this be a part of the > sparseness/encoding/compression discussion? > > - Wes > > On Sun, Sep 1, 2019 at 10:14 PM Fan Liya <liya.fa...@gmail.com> wrote: > > > > Dear all, > > > > We want to support a feature for conversions between delta vector and > > partial sum vector. Please give your valuable feedback. > > > > Best, > > > > Liya Fan > > > > What is a delta vector/partial sum vector? > > > > Given an integer vector a with length n, its partial sum vector is > another > > integer vector b with length n + 1, with values defined as: > > > > b(0) = initial sum > > b(i ) = a(0) + a(1) + ... + a(i - 1) i = 1, 2, ..., n > > > > Given an integer vector with length n + 1, its delta vector is another > > integer vector b with length n, with values defined as: > > > > b(i ) = a(i ) - a(i - 1), i = 0, 1, ... , n -1 > > > > In this issue, we provide utilities to convert between vector and partial > > sum vector. It is interesting to note that the two operations > corresponding > > to the discrete integration and differentian. > > > > These conversions have wide applications. For example, > > > > 1. > > > > The run-length vector proposed by Micah is based on the partial sum > > vector, while the deduplication functionality is based on delta > vector. > > This issue provides conversions between them. > > 2. > > > > The current VarCharVector/VarBinaryVector implementations are based on > > partial sum vector. We can transform them to delta vectors before > IPC, to > > reduce network traffic. > > 3. > > > > Converting to delta can be considered as a way for data compression. > To > > further reduce the data volume, the operation can be applied more than > > once, to further reduce data volume. > > > > Points to discuss: > > The API should be provided at the level of vector or ArrowBuf, or both? > > 1. If it is based on vector, there can be performance overhead due to > > virtual method calls. > > 2. If it is base on ArrowBuf, some underlying details (type width) are > > exposed to the end user, which is not compliant with the principle of > > encapsulation. >