Hi Micah,

Thanks for your comments.

I am aware that you have invested lots of time and effort in reviewing the
algorithm related code.
We really appreciate it. Thank you so much.

I agree with you that the plan document is a good idea.
In general, the algorithms are driven by applications, so it is difficult
to give a precise plan.
However, I am going to prepare a document about the
requirement/design/implementation of the algorithms.

Hope that will make discussions/code review more efficient.

Best,
Liya Fan

On Thu, Sep 5, 2019 at 11:46 AM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi Wes,
>
> Thanks a lot for the comments.
> You are right. This can be applied to the data encoding/compression, and I
> think this is one of the building blocks for encoding/compression.
>
> In the short term, it will provide conversions between the two memory
> layouts of run length vectors.
> In the mid term, it can help to reduce the network traffic for
> varchar/varbinary vectors.
> In the long term, it will provide compression for more scenarios.
>
> The basic idea is based on the observations that, a vector usually has a
> smaller width after converting to a delta vector.
>
> For example, for a varchar vector with a large number of elements, the
> offset buffer will use 4 bytes for each element.
> However it is likely that, the strings in the vectors are not big (less
> than 65536 in length). So by converting the offset buffer to a delta
> vector, we can use a int vector with 2 byte width.
>
> Best,
> Liya Fan
>
>
> On Thu, Sep 5, 2019 at 3:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi,
>>
>> Having utility algorithms to perform data transformations seems fine
>> if there is a use for them and maintaining the code in the Arrow
>> libraries makes sense.
>>
>> I don't understand point #2 "We can transform them to delta vectors
>> before IPC". It sounds like you are proposing a data compression
>> technique. Should this be a part of the
>> sparseness/encoding/compression discussion?
>>
>> - Wes
>>
>> On Sun, Sep 1, 2019 at 10:14 PM Fan Liya <liya.fa...@gmail.com> wrote:
>> >
>> > Dear all,
>> >
>> > We want to support a feature for conversions between delta vector and
>> > partial sum vector. Please give your valuable feedback.
>> >
>> > Best,
>> >
>> > Liya Fan
>> >
>> > What is a delta vector/partial sum vector?
>> >
>> > Given an integer vector a with length n, its partial sum vector is
>> another
>> > integer vector b with length n + 1, with values defined as:
>> >
>> > b(0) = initial sum
>> > b(i ) = a(0) + a(1) + ... + a(i - 1) i = 1, 2, ..., n
>> >
>> > Given an integer vector with length n + 1, its delta vector is another
>> > integer vector b with length n, with values defined as:
>> >
>> > b(i ) = a(i ) - a(i - 1), i = 0, 1, ... , n -1
>> >
>> > In this issue, we provide utilities to convert between vector and
>> partial
>> > sum vector. It is interesting to note that the two operations
>> corresponding
>> > to the discrete integration and differentian.
>> >
>> > These conversions have wide applications. For example,
>> >
>> >    1.
>> >
>> >    The run-length vector proposed by Micah is based on the partial sum
>> >    vector, while the deduplication functionality is based on delta
>> vector.
>> >    This issue provides conversions between them.
>> >    2.
>> >
>> >    The current VarCharVector/VarBinaryVector implementations are based
>> on
>> >    partial sum vector. We can transform them to delta vectors before
>> IPC, to
>> >    reduce network traffic.
>> >    3.
>> >
>> >    Converting to delta can be considered as a way for data compression.
>> To
>> >    further reduce the data volume, the operation can be applied more
>> than
>> >    once, to further reduce data volume.
>> >
>> > Points to discuss:
>> > The API should be provided at the level of vector or ArrowBuf, or both?
>> > 1. If it is based on vector, there can be performance overhead due to
>> > virtual method calls.
>> > 2. If it is base on ArrowBuf, some underlying details (type width) are
>> > exposed to the end user, which is not compliant with the principle of
>> > encapsulation.
>>
>

Reply via email to