I’ve also heard interest from folks in the academic database community about adding compressed (sparse) in memory structures to the Arrow format / specification, so I think it makes more sense to try to figure things out at the specification / protocol level and then work on an implementation. I agree this seems above and beyond what I would think an intern could accomplish in a 10-12 week period given the process that has been involved with other significant additions to Arrow over the last several years.
On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov <lykov.kir...@gmail.com> wrote: > Thanks for the information and ideas, I need to check them out (especially > one with structures). > PR proposal for RLE is very interesting since internally people express > interest in this feature. > For intern, I thought to ask to work primarily on data structures level > (like array adapter or something like that). > So I haven't thought about communication layer, but it is a useful feature > per se. > However it might have limited value in terms of contribution to Arrow and, > hence, not that attractive for an intern. > > On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > I made a proposal a while ago that covers a form of RLE encoding [1]. I > > haven't had time to work on it, since it is a substantial effort to > > implement. > > > > I wouldn't expect an intern to be able to complete the work necessary to > > get this merged over the course of a normal 3 month internship. > > > > [1] https://github.com/apache/arrow/pull/4815/files > > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão < > > jorgecarlei...@gmail.com> wrote: > > > > > Would it be an option to use a StructArray for that? One array with the > > > values, and one with the repetitions: > > > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) -> > > > > > > StructArray([ > > > "values": Int32([1, 2, 3, 1, 2]), > > > "repetitions": UInt32([1, 3, 5, 1, 1]), > > > ]) > > > > > > It does not have the same API, but I think that the physical operations > > > would be different, anyways: ("array + 2" would only operate on > > "values"). > > > I think that a small struct / object with some operator overloading > would > > > address this, and writing something on the metadata would allow others > to > > > consume it, a-la extension type? > > > > > > On a related note, such encoding would address DataFusion's issue of > > > representing scalars / constant arrays: a constant array would be > > > represented as a repetition. Currently we just unpack (i.e. allocate) a > > > constant array when we want to transfer through a RecordBatch. > > > > > > Best, > > > Jorge > > > > > > > > > > > > > > > On Thu, Mar 25, 2021, 10:03 Kirill Lykov <lykov.kir...@gmail.com> > wrote: > > > > > > > Thanks for the answer. > > > > I asked about it because we need it and I was about writing a summer > > > intern > > > > proposal for a student to work on it. > > > > Looks like it could work fine. > > > > > > > > On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > > > > > > The SparseTensor stuff is something else entirely (that's matrices > > > > > where the entries are mostly 0) > > > > > > > > > > There isn't anything to help you right now aside from dictionary > > > > > encoding — if your dictionary has 256 elements or less, you can use > > > > > uint8 index type and thus have 1 byte per value. We've discussed > > > > > implementing RLE in the project and so if we do that in the future > > > > > then a random access data structure could be built on top of RLE > (in > > > > > principle) > > > > > > > > > > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera < > > > niranda.per...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > > > > Hi Lykov, > > > > > > > > > > > > I believe there's an arrow sparse tensor abstraction. > > > > > > > > > > > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov <lykov.kir...@gmail.com > > > > > > wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I wonder if there is an existing way to store floats/ints with > > many > > > > > > > repetitions in some container (not sure about terminology). > > > > > > > For example, I might have data like A=[1, 2, 2, 2, 3, 3, 3, 3, > 3, > > > 3, > > > > > 1, 2] > > > > > > > and i would like to store only B=[1, 2, 3, 1, 2] but from user > > > > > > > perspective it behaves like container A. I know I can use > > > dictionary > > > > > but as > > > > > > > far I understand it will store internally indices of the chosen > > > > > elements so > > > > > > > it makes sense more for binary data or structures. > > > > > > > > > > > > > > -- > > > > > > > Best regards, > > > > > > > Kirill Lykov > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Best regards, > > > > Kirill Lykov > > > > > > > > > > > > -- > Best regards, > Kirill Lykov >