Re: sparse data array

Wes McKinney Sat, 27 Mar 2021 10:35:58 -0700

I’ve also heard interest from folks in the academic database community
about adding compressed (sparse) in-memory structures to the Arrow format /
specification, so I think it makes more sense to try to figure things out
at the specification / protocol level and then work on an implementation. I
agree this seems above and beyond what I would think an intern could
accomplish in a 10-12 week period given the process that has been involved
with other significant additions to Arrow over the last several years.


On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov <[email protected]> wrote:

> Thanks for the information and ideas, I need to check them out (especially
> the one with structures).
> The PR proposal for RLE is very interesting, since people here have
> expressed interest in this feature internally.
> For the intern, I was thinking of asking them to work primarily at the data
> structure level (an array adapter or something like that).
> So I haven't thought about the communication layer, but it is a useful
> feature per se.
> However, it might have limited value in terms of contribution to Arrow and
> hence not be that attractive for an intern.
>
> On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield <[email protected]>
> wrote:
>
> > I made a proposal a while ago that covers a form of RLE encoding [1].  I
> > haven't had time to work on it, since it is a substantial effort to
> > implement.
> >
> > I wouldn't expect an intern to be able to complete the work necessary to
> > get this merged over the course of a normal 3 month internship.
> >
> > [1] https://github.com/apache/arrow/pull/4815/files
> >
> > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > [email protected]> wrote:
> >
> > > Would it be an option to use a StructArray for that? One array with the
> > > values, and one with the repetitions:
> > >
> > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > >
> > > StructArray([
> > >     "values": Int32([1, 2, 3, 1, 2]),
> > >     "repetitions": UInt32([1, 3, 6, 1, 1]),
> > > ])
> > >
> > > It does not have the same API, but I think that the physical operations
> > > would be different anyway ("array + 2" would only operate on "values").
> > > I think that a small struct / object with some operator overloading would
> > > address this, and writing something on the metadata would allow others to
> > > consume it, a la extension type?
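
A minimal sketch of the values-plus-repetitions layout described above, written in plain Python rather than against any Arrow API. The names `rle_encode`, `rle_decode`, and `add_scalar` are hypothetical helpers introduced only for illustration; the point is that a scalar operation such as "array + 2" touches the "values" list and leaves "repetitions" unchanged.

```python
from itertools import groupby


def rle_encode(data):
    """Split data into parallel 'values' and 'repetitions' lists."""
    values, repetitions = [], []
    for value, run in groupby(data):
        values.append(value)
        repetitions.append(sum(1 for _ in run))  # length of this run
    return {"values": values, "repetitions": repetitions}


def rle_decode(encoded):
    """Expand the run-length form back into the full sequence."""
    out = []
    for value, count in zip(encoded["values"], encoded["repetitions"]):
        out.extend([value] * count)
    return out


def add_scalar(encoded, scalar):
    """Element-wise 'array + scalar': only 'values' is rewritten."""
    return {
        "values": [v + scalar for v in encoded["values"]],
        "repetitions": list(encoded["repetitions"]),
    }


a = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]
encoded = rle_encode(a)
shifted = add_scalar(encoded, 2)
print(encoded["values"], encoded["repetitions"])
print(rle_decode(shifted))
```

The scalar addition does work proportional to the number of runs, not the number of logical elements, which is where the saving would come from on highly repetitive data.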
> > >
> > > On a related note, such encoding would address DataFusion's issue of
> > > representing scalars / constant arrays: a constant array would be
> > > represented as a repetition. Currently we just unpack (i.e. allocate) a
> > > constant array when we want to transfer through a RecordBatch.
> > >
> > > Best,
> > > Jorge
> > >
> > >
> > >
> > >
> > > On Thu, Mar 25, 2021, 10:03 Kirill Lykov <[email protected]> wrote:
> > >
> > > > Thanks for the answer.
> > > > I asked about it because we need it, and I was about to write a summer
> > > > intern proposal for a student to work on it.
> > > > Looks like it could work fine.
> > > >
> > > > On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney <[email protected]> wrote:
> > > >
> > > > > The SparseTensor stuff is something else entirely (that's matrices
> > > > > where the entries are mostly 0)
> > > > >
> > > > > There isn't anything to help you right now aside from dictionary
> > > > > encoding: if your dictionary has 256 elements or fewer, you can use a
> > > > > uint8 index type and thus have 1 byte per value. We've discussed
> > > > > implementing RLE in the project, so if we do that in the future, a
> > > > > random access data structure could (in principle) be built on top of
> > > > > RLE.
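
One way random access on top of RLE could work, sketched in plain Python under stated assumptions: the run lengths are converted to cumulative run ends, and a binary search over those ends locates the run that contains a logical index. The class name `RunLengthArray` and its attributes are hypothetical and do not correspond to anything in Arrow.

```python
from bisect import bisect_right
from itertools import accumulate


class RunLengthArray:
    """Read-only sequence backed by run values and cumulative run ends."""

    def __init__(self, values, repetitions):
        if len(values) != len(repetitions):
            raise ValueError("values and repetitions must have equal length")
        self.values = list(values)
        # run_ends[i] is the logical index one past the end of run i
        self.run_ends = list(accumulate(repetitions))

    def __len__(self):
        return self.run_ends[-1] if self.run_ends else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # First run whose end is strictly greater than index
        run = bisect_right(self.run_ends, index)
        return self.values[run]


arr = RunLengthArray([1, 2, 3, 1, 2], [1, 3, 6, 1, 1])
print(len(arr), arr[0], arr[4], arr[10])
print(list(arr))
```

Each lookup costs O(log r) for r runs, compared with O(1) for a plain array, so this trades access time for space.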
> > > > >
> > > > > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera <[email protected]> wrote:
> > > > > >
> > > > > > Hi Lykov,
> > > > > >
> > > > > > I believe there's an arrow sparse tensor abstraction.
> > > > > >
> > > > > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I wonder if there is an existing way to store floats/ints with many
> > > > > > > repetitions in some container (not sure about the terminology).
> > > > > > > For example, I might have data like
> > > > > > > A=[1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]
> > > > > > > and I would like to store only B=[1, 2, 3, 1, 2], but from the user's
> > > > > > > perspective it should behave like container A. I know I can use a
> > > > > > > dictionary, but as far as I understand it stores the indices of the
> > > > > > > chosen elements internally, so it makes more sense for binary data
> > > > > > > or structures.
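
To illustrate the point about dictionary encoding, here is a rough sketch in plain Python (not Arrow's implementation). The helper `dictionary_encode` is hypothetical; it stores each distinct value once and keeps one unsigned 8-bit index per element, so the index buffer is as long as the original data even when values repeat in long runs.

```python
from array import array


def dictionary_encode(data):
    """Return (dictionary, indices) with one uint8 index per element."""
    dictionary = []          # distinct values, in first-seen order
    positions = {}           # value -> position in dictionary
    indices = array("B")     # typecode "B" is an unsigned 1-byte integer
    for value in data:
        if value not in positions:
            if len(dictionary) == 256:
                raise ValueError("more than 256 distinct values")
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


a = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]
dictionary, indices = dictionary_encode(a)
print(dictionary, list(indices), indices.itemsize)
```

This shrinks wide values down to 1 byte each when there are at most 256 distinct values, but it does not collapse runs the way the B=[1, 2, 3, 1, 2] representation would.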
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Kirill Lykov
> > > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Kirill Lykov
> > > >
> > >
> >
>
>
> --
> Best regards,
> Kirill Lykov
>
