Re: sparse data array

bobtins Tue, 30 Mar 2021 17:09:03 -0700

>From your response, I'm inferring that in order to introduce this kind of 
>compression, support in the spec is needed, similar to how compression types 
>and parameters are enumerated in 
>https://github.com/apache/arrow/blob/master/format/Message.fbs. Any change in 
>the spec is a Big Deal (and it should be).


I can observe that in a project like Arrow, there is always a tension between 
compatibility and extensibility, and it makes me wonder if it would be helpful 
to add capabilities without changing the spec. An extension type can be defined 
in terms of one of the built-in layouts, but it would define semantics (such as 
compression) that would be used to interpret that layout. 

> > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > [email protected]> wrote:
> > >
> > > > Would it be an option to use a StructArray for that? One array with the
> > > > values, and one with the repetitions:
> > > >
> > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > >
> > > > StructArray([
> > > >     "values": Int32([1, 2, 3, 1, 2]),
> > > >     "repetitions": UInt32([1, 3, 5, 1, 1]),
> > > > ])
> > > >
> > > > It does not have the same API, but I think that the physical operations
> > > > would be different, anyways: ("array + 2" would only operate on
> > > "values").
> > > > I think that a small struct / object with some operator overloading
> > would
> > > > address this, and writing something on the metadata would allow others
> > to
> > > > consume it, a-la extension type?
> > > >
> > > > On a related note, such encoding would address DataFusion's issue of
> > > > representing scalars / constant arrays: a constant array would be
> > > > represented as a repetition. Currently we just unpack (i.e. allocate) a
> > > > constant array when we want to transfer through a RecordBatch.
> > > >

I just reread the whole thread, and realized Jorge was saying a similar thing, 
that this new type could be built on an existing layout. But I guess I'm also 
imagining some more general capabilities:

* Define an extension type in terms of an existing layout
* Register an extension type implementation
* Enumerate available extension types

It would make life more complicated, for sure, but it would allow things like 
compression to evolve more quickly. I'm thinking about the in-memory 
implementation that I built, where I did various things to save memory. 

For example, integers are stored in blocks of 4096 values, with each block the 
minimum size to hold all the values. You access the value n with the expression 
"block[n >> 12][n % 4096]".
Take an example with 1M int32 values. Value 0 is 1e9 but all the others are 0 
to 9. 
Normally you would use 4M bytes to store these, but you could instead have 1 
block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage for block 
offsets, so a savings of almost 75%. If you could have uint4 blocks you could 
save about 87%.

On the other hand, I had the luxury of total control over the implementation, 
so it was easy for me to try something and make it work. For Arrow, having the 
memory layout standardized makes it easy to implement various SIMD and GPU 
optimizations, which all go out the window if you allow arbitrary new data 
access semantics. 

So if this idea is hopelessly naive and half-baked and would result in a train 
wreck, please feel free to enlighten me/rip me a new one--I'm here to learn.



On 2021/03/27 17:35:32, Wes McKinney <[email protected]> wrote: 
> I’ve also heard interest from folks in the academic database community
> about adding compressed (sparse) in memory structures to the Arrow format /
> specification, so I think it makes more sense to try to figure things out
> at the specification / protocol level and then work on an implementation. I
> agree this seems above and beyond what I would think an intern could
> accomplish in a 10-12 week period given the process that has been involved
> with other significant additions to Arrow over the last several years.
> 
> On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov <[email protected]> wrote:
> 
> > Thanks for the information and ideas, I need to check them out (especially
> > one with structures).
> > PR proposal for RLE is very interesting since internally people express
> > interest in this feature.
> > For intern, I thought to ask to work primarily on data structures level
> > (like array adapter or something like that).
> > So I haven't thought about communication layer, but it is a useful feature
> > per se.
> > However it might have limited value in terms of contribution to Arrow and,
> > hence, not that attractive for an intern.
> >
> > On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > I made a proposal a while ago that covers a form of RLE encoding [1].  I
> > > haven't had time to work on it, since it is a substantial effort to
> > > implement.
> > >
> > > I wouldn't expect an intern to be able to complete the work necessary to
> > > get this merged over the course of a normal 3 month internship.
> > >
> > > [1] https://github.com/apache/arrow/pull/4815/files
> > >
> > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > [email protected]> wrote:
> > >
> > > > Would it be an option to use a StructArray for that? One array with the
> > > > values, and one with the repetitions:
> > > >
> > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > >
> > > > StructArray([
> > > >     "values": Int32([1, 2, 3, 1, 2]),
> > > >     "repetitions": UInt32([1, 3, 5, 1, 1]),
> > > > ])
> > > >
> > > > It does not have the same API, but I think that the physical operations
> > > > would be different, anyways: ("array + 2" would only operate on
> > > "values").
> > > > I think that a small struct / object with some operator overloading
> > would
> > > > address this, and writing something on the metadata would allow others
> > to
> > > > consume it, a-la extension type?
> > > >
> > > > On a related note, such encoding would address DataFusion's issue of
> > > > representing scalars / constant arrays: a constant array would be
> > > > represented as a repetition. Currently we just unpack (i.e. allocate) a
> > > > constant array when we want to transfer through a RecordBatch.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Mar 25, 2021, 10:03 Kirill Lykov <[email protected]>
> > wrote:
> > > >
> > > > > Thanks for the answer.
> > > > > I asked about it because we need it and I was about writing a summer
> > > > intern
> > > > > proposal for a student to work on it.
> > > > > Looks like it could work fine.
> > > > >
> > > > > On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney <[email protected]>
> > > > wrote:
> > > > >
> > > > > > The SparseTensor stuff is something else entirely (that's matrices
> > > > > > where the entries are mostly 0)
> > > > > >
> > > > > > There isn't anything to help you right now aside from dictionary
> > > > > > encoding — if your dictionary has 256 elements or less, you can use
> > > > > > uint8 index type and thus have 1 byte per value. We've discussed
> > > > > > implementing RLE in the project and so if we do that in the future
> > > > > > then a random access data structure could be built on top of RLE
> > (in
> > > > > > principle)
> > > > > >
> > > > > > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera <
> > > > [email protected]
> > > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi Lykov,
> > > > > > >
> > > > > > > I believe there's an arrow sparse tensor abstraction.
> > > > > > >
> > > > > > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov <[email protected]
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I wonder if there is an existing way to store floats/ints with
> > > many
> > > > > > > > repetitions in some container (not sure about terminology).
> > > > > > > > For example, I might have data like A=[1, 2, 2, 2, 3, 3, 3, 3,
> > 3,
> > > > 3,
> > > > > > 1, 2]
> > > > > > > > and i would like to store only B=[1, 2, 3, 1, 2] but from user
> > > > > > > > perspective it behaves like container A. I know I can use
> > > > dictionary
> > > > > > but as
> > > > > > > > far I understand it will store internally indices of the chosen
> > > > > > elements so
> > > > > > > > it makes sense more for binary data or structures.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Kirill Lykov
> > > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Kirill Lykov
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Kirill Lykov
> >
>

Re: sparse data array

Reply via email to