Thanks for the information and ideas, I need to check them out (especially one with structures). PR proposal for RLE is very interesting since internally people express interest in this feature. For intern, I thought to ask to work primarily on data structures level (like array adapter or something like that). So I haven't thought about communication layer, but it is a useful feature per se. However it might have limited value in terms of contribution to Arrow and, hence, not that attractive for an intern.
On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > I made a proposal a while ago that covers a form of RLE encoding [1]. I > haven't had time to work on it, since it is a substantial effort to > implement. > > I wouldn't expect an intern to be able to complete the work necessary to > get this merged over the course of a normal 3 month internship. > > [1] https://github.com/apache/arrow/pull/4815/files > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > Would it be an option to use a StructArray for that? One array with the > > values, and one with the repetitions: > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) -> > > > > StructArray([ > > "values": Int32([1, 2, 3, 1, 2]), > > "repetitions": UInt32([1, 3, 5, 1, 1]), > > ]) > > > > It does not have the same API, but I think that the physical operations > > would be different, anyways: ("array + 2" would only operate on > "values"). > > I think that a small struct / object with some operator overloading would > > address this, and writing something on the metadata would allow others to > > consume it, a-la extension type? > > > > On a related note, such encoding would address DataFusion's issue of > > representing scalars / constant arrays: a constant array would be > > represented as a repetition. Currently we just unpack (i.e. allocate) a > > constant array when we want to transfer through a RecordBatch. > > > > Best, > > Jorge > > > > > > > > > > On Thu, Mar 25, 2021, 10:03 Kirill Lykov <lykov.kir...@gmail.com> wrote: > > > > > Thanks for the answer. > > > I asked about it because we need it and I was about writing a summer > > intern > > > proposal for a student to work on it. > > > Looks like it could work fine. > > > > > > On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > > > > The SparseTensor stuff is something else entirely (that's matrices > > > > where the entries are mostly 0) > > > > > > > > There isn't anything to help you right now aside from dictionary > > > > encoding — if your dictionary has 256 elements or less, you can use > > > > uint8 index type and thus have 1 byte per value. We've discussed > > > > implementing RLE in the project and so if we do that in the future > > > > then a random access data structure could be built on top of RLE (in > > > > principle) > > > > > > > > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera < > > niranda.per...@gmail.com > > > > > > > > wrote: > > > > > > > > > > Hi Lykov, > > > > > > > > > > I believe there's an arrow sparse tensor abstraction. > > > > > > > > > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov <lykov.kir...@gmail.com> > > > wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > I wonder if there is an existing way to store floats/ints with > many > > > > > > repetitions in some container (not sure about terminology). > > > > > > For example, I might have data like A=[1, 2, 2, 2, 3, 3, 3, 3, 3, > > 3, > > > > 1, 2] > > > > > > and i would like to store only B=[1, 2, 3, 1, 2] but from user > > > > > > perspective it behaves like container A. I know I can use > > dictionary > > > > but as > > > > > > far I understand it will store internally indices of the chosen > > > > elements so > > > > > > it makes sense more for binary data or structures. > > > > > > > > > > > > -- > > > > > > Best regards, > > > > > > Kirill Lykov > > > > > > > > > > > > > > > > > > > -- > > > Best regards, > > > Kirill Lykov > > > > > > -- Best regards, Kirill Lykov