Hi Wes,

Thanks a lot for the comments.
You are right. This can be applied to the data encoding/compression, and I
think this is one of the building blocks for encoding/compression.

In the short term, it will provide conversions between the two memory
layouts of run length vectors.
In the mid term, it can help to reduce the network traffic for
varchar/varbinary vectors.
In the long term, it will provide compression for more scenarios.

The basic idea is based on the observations that, a vector usually has a
smaller width after converting to a delta vector.

For example, for a varchar vector with a large number of elements, the
offset buffer will use 4 bytes for each element.
However it is likely that, the strings in the vectors are not big (less
than 65536 in length). So by converting the offset buffer to a delta
vector, we can use a int vector with 2 byte width.

Best,
Liya Fan


On Thu, Sep 5, 2019 at 3:05 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi,
>
> Having utility algorithms to perform data transformations seems fine
> if there is a use for them and maintaining the code in the Arrow
> libraries makes sense.
>
> I don't understand point #2 "We can transform them to delta vectors
> before IPC". It sounds like you are proposing a data compression
> technique. Should this be a part of the
> sparseness/encoding/compression discussion?
>
> - Wes
>
> On Sun, Sep 1, 2019 at 10:14 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Dear all,
> >
> > We want to support a feature for conversions between delta vector and
> > partial sum vector. Please give your valuable feedback.
> >
> > Best,
> >
> > Liya Fan
> >
> > What is a delta vector/partial sum vector?
> >
> > Given an integer vector a with length n, its partial sum vector is
> another
> > integer vector b with length n + 1, with values defined as:
> >
> > b(0) = initial sum
> > b(i ) = a(0) + a(1) + ... + a(i - 1) i = 1, 2, ..., n
> >
> > Given an integer vector with length n + 1, its delta vector is another
> > integer vector b with length n, with values defined as:
> >
> > b(i ) = a(i ) - a(i - 1), i = 0, 1, ... , n -1
> >
> > In this issue, we provide utilities to convert between vector and partial
> > sum vector. It is interesting to note that the two operations
> corresponding
> > to the discrete integration and differentian.
> >
> > These conversions have wide applications. For example,
> >
> >    1.
> >
> >    The run-length vector proposed by Micah is based on the partial sum
> >    vector, while the deduplication functionality is based on delta
> vector.
> >    This issue provides conversions between them.
> >    2.
> >
> >    The current VarCharVector/VarBinaryVector implementations are based on
> >    partial sum vector. We can transform them to delta vectors before
> IPC, to
> >    reduce network traffic.
> >    3.
> >
> >    Converting to delta can be considered as a way for data compression.
> To
> >    further reduce the data volume, the operation can be applied more than
> >    once, to further reduce data volume.
> >
> > Points to discuss:
> > The API should be provided at the level of vector or ArrowBuf, or both?
> > 1. If it is based on vector, there can be performance overhead due to
> > virtual method calls.
> > 2. If it is base on ArrowBuf, some underlying details (type width) are
> > exposed to the end user, which is not compliant with the principle of
> > encapsulation.
>

Reply via email to