Hi Micah,

Thanks for your comments.
I am aware that you have invested lots of time and effort in reviewing the
algorithm-related code. We really appreciate it. Thank you so much.

I agree with you that a plan document is a good idea. In general, the
algorithms are driven by applications, so it is difficult to give a precise
plan. However, I am going to prepare a document about the
requirements/design/implementation of the algorithms. I hope that will make
discussions/code review more efficient.

Best,
Liya Fan

On Thu, Sep 5, 2019 at 11:46 AM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi Wes,
>
> Thanks a lot for the comments.
> You are right. This can be applied to data encoding/compression, and I
> think this is one of the building blocks for encoding/compression.
>
> In the short term, it will provide conversions between the two memory
> layouts of run-length vectors.
> In the mid term, it can help to reduce the network traffic for
> varchar/varbinary vectors.
> In the long term, it will provide compression for more scenarios.
>
> The basic idea is based on the observation that a vector usually has a
> smaller width after being converted to a delta vector.
>
> For example, for a varchar vector with a large number of elements, the
> offset buffer uses 4 bytes per element. However, it is likely that the
> strings in the vector are not big (less than 65536 in length), so by
> converting the offset buffer to a delta vector, we can use an int vector
> with a 2-byte width.
>
> Best,
> Liya Fan
>
>
> On Thu, Sep 5, 2019 at 3:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi,
>>
>> Having utility algorithms to perform data transformations seems fine
>> if there is a use for them and maintaining the code in the Arrow
>> libraries makes sense.
>>
>> I don't understand point #2 "We can transform them to delta vectors
>> before IPC". It sounds like you are proposing a data compression
>> technique. Should this be a part of the
>> sparseness/encoding/compression discussion?
>>
>> - Wes
>>
>> On Sun, Sep 1, 2019 at 10:14 PM Fan Liya <liya.fa...@gmail.com> wrote:
>> >
>> > Dear all,
>> >
>> > We want to support a feature for conversions between delta vectors
>> > and partial sum vectors. Please give your valuable feedback.
>> >
>> > Best,
>> >
>> > Liya Fan
>> >
>> > What is a delta vector/partial sum vector?
>> >
>> > Given an integer vector a with length n, its partial sum vector is
>> > another integer vector b with length n + 1, with values defined as:
>> >
>> > b(0) = initial sum
>> > b(i) = b(0) + a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n
>> >
>> > Given an integer vector a with length n + 1, its delta vector is
>> > another integer vector b with length n, with values defined as:
>> >
>> > b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1
>> >
>> > In this issue, we provide utilities to convert between delta vectors
>> > and partial sum vectors. It is interesting to note that the two
>> > operations correspond to discrete integration and differentiation.
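To make the definitions above concrete, here is a minimal sketch of the two
conversions at the vector level, using Arrow's IntVector. The class and
method names are made up for illustration; this is not the proposed API.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.IntVector;

public class DeltaPartialSumSketch {

  // Builds the partial sum vector b (length n + 1) from a delta vector a
  // (length n): b(0) = initialSum, b(i) = b(0) + a(0) + ... + a(i - 1).
  static IntVector toPartialSum(IntVector delta, int initialSum, BufferAllocator allocator) {
    int n = delta.getValueCount();
    IntVector partialSum = new IntVector("partialSum", allocator);
    partialSum.allocateNew(n + 1);
    int running = initialSum;
    partialSum.set(0, running);
    for (int i = 0; i < n; i++) {
      running += delta.get(i);
      partialSum.set(i + 1, running);
    }
    partialSum.setValueCount(n + 1);
    return partialSum;
  }

  // Builds the delta vector b (length n) from a partial sum vector a
  // (length n + 1): b(i) = a(i + 1) - a(i).
  static IntVector toDelta(IntVector partialSum, BufferAllocator allocator) {
    int n = partialSum.getValueCount() - 1;
    IntVector delta = new IntVector("delta", allocator);
    delta.allocateNew(n);
    for (int i = 0; i < n; i++) {
      delta.set(i, partialSum.get(i + 1) - partialSum.get(i));
    }
    delta.setValueCount(n);
    return delta;
  }
}

Applying toDelta to the output of toPartialSum recovers the original vector,
which is what makes the pair usable as a lossless layout conversion.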
>> > These conversions have wide applications. For example,
>> >
>> > 1. The run-length vector proposed by Micah is based on the partial
>> > sum vector, while the deduplication functionality is based on the
>> > delta vector. This issue provides conversions between them.
>> >
>> > 2. The current VarCharVector/VarBinaryVector implementations are
>> > based on the partial sum vector. We can transform them to delta
>> > vectors before IPC, to reduce network traffic.
>> >
>> > 3. Converting to delta can be considered a form of data compression.
>> > To further reduce the data volume, the operation can be applied more
>> > than once.
>> >
>> > Points to discuss:
>> > Should the API be provided at the level of vectors, of ArrowBufs, or
>> > both?
>> > 1. If it is based on vectors, there can be performance overhead due
>> > to virtual method calls.
>> > 2. If it is based on ArrowBufs, some underlying details (the type
>> > width) are exposed to the end user, which is not compliant with the
>> > principle of encapsulation.
>>
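Regarding the points to discuss, a rough sketch of what the two API shapes
might look like. The signatures are hypothetical, purely for illustration,
and assume a recent Arrow Java where ArrowBuf lives in
org.apache.arrow.memory.

import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.IntVector;

// Hypothetical signatures only -- not the actual proposal.
public interface DeltaConversionApi {

  // Option 1: vector-level. The type width is implied by the vector class,
  // but every element access goes through the vector's get/set methods.
  IntVector toDelta(IntVector partialSum, BufferAllocator allocator);

  // Option 2: ArrowBuf-level. Works directly on the raw buffer, avoiding
  // per-element virtual calls, but the caller must supply the type width
  // and value count, exposing layout details to the end user.
  void toDelta(ArrowBuf partialSum, ArrowBuf deltaOut, int typeWidth, long valueCount);
}

One possible middle ground could be to keep the buffer-level routine
internal and expose only the vector-level API on top of it, so end users
never see the type width while the hot loop still works on raw buffers.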