Hi,

Having utility algorithms to perform data transformations seems fine
if there is a use for them and maintaining the code in the Arrow
libraries makes sense.

I don't understand point #2 "We can transform them to delta vectors
before IPC". It sounds like you are proposing a data compression
technique. Should this be a part of the
sparseness/encoding/compression discussion?

- Wes

On Sun, Sep 1, 2019 at 10:14 PM Fan Liya <liya.fa...@gmail.com> wrote:
>
> Dear all,
>
> We want to support a feature for conversions between delta vector and
> partial sum vector. Please give your valuable feedback.
>
> Best,
>
> Liya Fan
>
> What is a delta vector/partial sum vector?
>
> Given an integer vector a with length n, its partial sum vector is another
> integer vector b with length n + 1, with values defined as:
>
> b(0) = initial sum
> b(i) = b(0) + a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n
>
> Given an integer vector a with length n + 1, its delta vector is another
> integer vector b with length n, with values defined as:
>
> b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1
>
> In this issue, we provide utilities to convert between delta vectors and
> partial sum vectors. It is interesting to note that the two operations
> correspond to discrete integration and differentiation.
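
The two conversions quoted above can be sketched on plain int arrays. This is only an illustration, not Arrow API: the class and method names (DeltaPartialSum, toPartialSum, toDelta) are hypothetical, and it assumes that the initial sum is carried into every partial sum and that the delta is taken as a(i + 1) - a(i).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Arrow: plain arrays stand in for vectors.
final class DeltaPartialSum {

    // Discrete integration: sums[0] = initialSum,
    // sums[i + 1] = sums[i] + deltas[i]. The result has length n + 1.
    static int[] toPartialSum(int[] deltas, int initialSum) {
        int[] sums = new int[deltas.length + 1];
        sums[0] = initialSum;
        for (int i = 0; i < deltas.length; i++) {
            // addExact throws ArithmeticException on int overflow.
            sums[i + 1] = Math.addExact(sums[i], deltas[i]);
        }
        return sums;
    }

    // Discrete differentiation: deltas[i] = sums[i + 1] - sums[i].
    // The result has length n; the initial sum sums[0] is not recoverable
    // from the deltas and has to be kept separately.
    static int[] toDelta(int[] sums) {
        if (sums.length == 0) {
            throw new IllegalArgumentException("need at least one partial sum");
        }
        int[] deltas = new int[sums.length - 1];
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] = Math.subtractExact(sums[i + 1], sums[i]);
        }
        return deltas;
    }

    public static void main(String[] args) {
        int[] lengths = {3, 0, 5};                 // e.g. value lengths
        int[] offsets = toPartialSum(lengths, 0);  // e.g. an offsets buffer
        System.out.println(Arrays.toString(offsets));          // [0, 3, 3, 8]
        System.out.println(Arrays.toString(toDelta(offsets))); // [3, 0, 5]
    }
}
```

In this sketch the round trip is lossless only if the initial sum travels alongside the deltas, since toDelta discards it.
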
>
> These conversions have wide applications. For example,
>
>    1. The run-length vector proposed by Micah is based on the partial sum
>       vector, while the deduplication functionality is based on the delta
>       vector. This issue provides conversions between them.
>
>    2. The current VarCharVector/VarBinaryVector implementations are based on
>       the partial sum vector. We can transform them to delta vectors before
>       IPC, to reduce network traffic.
>
>    3. Converting to delta can be considered a form of data compression. The
>       operation can be applied more than once to further reduce the data
>       volume.
>
> Points to discuss:
> Should the API be provided at the level of vector or ArrowBuf, or both?
> 1. If it is based on vector, there can be performance overhead due to
> virtual method calls.
> 2. If it is based on ArrowBuf, some underlying details (type width) are
> exposed to the end user, which violates the principle of
> encapsulation.
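
The encapsulation concern in the second discussion point can be made concrete with a rough sketch of what a buffer-level signature might look like. Everything here is an assumption for illustration: java.nio.ByteBuffer stands in for ArrowBuf, and BufferDelta.toDelta with its typeWidth parameter is a hypothetical name, not an existing Arrow method.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical buffer-level utility; ByteBuffer is a stand-in for ArrowBuf.
final class BufferDelta {

    // The caller has to pass typeWidth (4 for int, 8 for long), which is the
    // detail a vector-level API would keep hidden behind typed accessors.
    static ByteBuffer toDelta(ByteBuffer sums, int typeWidth) {
        if (typeWidth != 4 && typeWidth != 8) {
            throw new IllegalArgumentException("unsupported type width: " + typeWidth);
        }
        int count = sums.remaining() / typeWidth; // number of partial sums
        if (count == 0) {
            throw new IllegalArgumentException("need at least one partial sum");
        }
        ByteBuffer deltas =
            ByteBuffer.allocate((count - 1) * typeWidth).order(sums.order());
        int base = sums.position();
        for (int i = 0; i + 1 < count; i++) {
            int at = base + i * typeWidth; // byte offset of element i
            if (typeWidth == 4) {
                deltas.putInt(sums.getInt(at + 4) - sums.getInt(at));
            } else {
                deltas.putLong(sums.getLong(at + 8) - sums.getLong(at));
            }
        }
        deltas.flip(); // make the written deltas readable from position 0
        return deltas;
    }

    public static void main(String[] args) {
        ByteBuffer sums = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        sums.putInt(0).putInt(3).putInt(3).putInt(8);
        sums.flip();
        ByteBuffer deltas = toDelta(sums, 4);
        while (deltas.hasRemaining()) {
            System.out.println(deltas.getInt()); // 3, then 0, then 5
        }
    }
}
```

The loop body avoids per-element virtual dispatch, but the caller must know the element width, which is the trade-off the two points describe.
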
