Re: sparse data array

bobtins Wed, 31 Mar 2021 11:04:31 -0700

I appreciate the feedback. I realize it's a tricky nut to crack; there's always 
going to be a desire to use compression to improve scaling, and I was trying to 
identify a connecting thread between various requests for compression 
enhancements on this list and my own experience. I'll look at the spec again 
and put it on my back burner.


On 2021/03/31 04:03:07, Micah Kornfield <[email protected]> wrote: 
> Hi Bob,
> 
> 
> > I can observe that in a project like Arrow, there is always a tension
> > between compatibility and extensibility, and it makes me wonder if it would
> > be helpful to add capabilities without changing the spec. An extension type
> > can be defined in terms of one of the built-in layouts, but it would define
> > semantics (such as compression) that would be used to interpret that layout.
> 
> 
> I'm not sure if this is referring to existing extension types [1] but I
> believe  they are insufficient for this purpose.  The compression
> techniques being discussed don't work well, because compression violates
> the fundamental assumptions of the existing protocols.  Each array is
> expected to have an equal number of slots.  So an array compressed as a
> struct, would cause misalignment with non-encoded arrays.
> 
> For example, integers are stored in blocks of 4096 values, with each block
> > the minimum size to hold all the values. You access the value n with the
> > expression "block[n >> 12][n % 4096]".
> > Take an example with 1M int32 values. Value 0 is 1e9 but all the others
> > are 0 to 9.
> > Normally you would use 4M bytes to store these, but you could instead have
> > 1 block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage for
> > block offsets, so a savings of almost 75%. If you could have uint4 blocks
> > you could save about 87%.
> 
> 
> This is difficult with the existing RecordBatch stream approach  since
> schema is fixed ahead of time.  One could theoretically standardize a
> notion of schema replacement in communications.  The I linked takes a
> different approach and allows for adjusting encodings on a per message
> basis.  Both are potentially viable.
> 
> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> 
> On Tue, Mar 30, 2021 at 5:09 PM bobtins <[email protected]> wrote:
> 
> > From your response, I'm inferring that in order to introduce this kind of
> > compression, support in the spec is needed, similar to how compression
> > types and parameters are enumerated in
> > https://github.com/apache/arrow/blob/master/format/Message.fbs. Any
> > change in the spec is a Big Deal (and it should be).
> >
> > I can observe that in a project like Arrow, there is always a tension
> > between compatibility and extensibility, and it makes me wonder if it would
> > be helpful to add capabilities without changing the spec. An extension type
> > can be defined in terms of one of the built-in layouts, but it would define
> > semantics (such as compression) that would be used to interpret that
> > layout.
> >
> > > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Would it be an option to use a StructArray for that? One array
> > with the
> > > > > > values, and one with the repetitions:
> > > > > >
> > > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > > > >
> > > > > > StructArray([
> > > > > >     "values": Int32([1, 2, 3, 1, 2]),
> > > > > >     "repetitions": UInt32([1, 3, 5, 1, 1]),
> > > > > > ])
> > > > > >
> > > > > > It does not have the same API, but I think that the physical
> > operations
> > > > > > would be different, anyways: ("array + 2" would only operate on
> > > > > "values").
> > > > > > I think that a small struct / object with some operator overloading
> > > > would
> > > > > > address this, and writing something on the metadata would allow
> > others
> > > > to
> > > > > > consume it, a-la extension type?
> > > > > >
> > > > > > On a related note, such encoding would address DataFusion's issue
> > of
> > > > > > representing scalars / constant arrays: a constant array would be
> > > > > > represented as a repetition. Currently we just unpack (i.e.
> > allocate) a
> > > > > > constant array when we want to transfer through a RecordBatch.
> > > > > >
> >
> > I just reread the whole thread, and realized Jorge was saying a similar
> > thing, that this new type could be built on an existing layout. But I guess
> > I'm also imagining some more general capabilities:
> >
> > * Define an extension type in terms of an existing layout
> > * Register an extension type implementation
> > * Enumerate available extension types
> >
> > It would make life more complicated, for sure, but it would allow things
> > like compression to evolve more quickly. I'm thinking about the in-memory
> > implementation that I built, where I did various things to save memory.
> >
> > For example, integers are stored in blocks of 4096 values, with each block
> > the minimum size to hold all the values. You access the value n with the
> > expression "block[n >> 12][n % 4096]".
> > Take an example with 1M int32 values. Value 0 is 1e9 but all the others
> > are 0 to 9.
> > Normally you would use 4M bytes to store these, but you could instead have
> > 1 block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage for
> > block offsets, so a savings of almost 75%. If you could have uint4 blocks
> > you could save about 87%.
> >
> > On the other hand, I had the luxury of total control over the
> > implementation, so it was easy for me to try something and make it work.
> > For Arrow, having the memory layout standardized makes it easy to implement
> > various SIMD and GPU optimizations, which all go out the window if you
> > allow arbitrary new data access semantics.
> >
> > So if this idea is hopelessly naive and half-baked and would result in a
> > train wreck, please feel free to enlighten me/rip me a new one--I'm here to
> > learn.
> >
> >
> >
> > On 2021/03/27 17:35:32, Wes McKinney <[email protected]> wrote:
> > > I’ve also heard interest from folks in the academic database community
> > > about adding compressed (sparse) in memory structures to the Arrow
> > format /
> > > specification, so I think it makes more sense to try to figure things out
> > > at the specification / protocol level and then work on an
> > implementation. I
> > > agree this seems above and beyond what I would think an intern could
> > > accomplish in a 10-12 week period given the process that has been
> > involved
> > > with other significant additions to Arrow over the last several years.
> > >
> > > On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov <[email protected]>
> > wrote:
> > >
> > > > Thanks for the information and ideas, I need to check them out
> > (especially
> > > > one with structures).
> > > > PR proposal for RLE is very interesting since internally people express
> > > > interest in this feature.
> > > > For intern, I thought to ask to work primarily on data structures level
> > > > (like array adapter or something like that).
> > > > So I haven't thought about communication layer, but it is a useful
> > feature
> > > > per se.
> > > > However it might have limited value in terms of contribution to Arrow
> > and,
> > > > hence, not that attractive for an intern.
> > > >
> > > > On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield <
> > [email protected]>
> > > > wrote:
> > > >
> > > > > I made a proposal a while ago that covers a form of RLE encoding
> > [1].  I
> > > > > haven't had time to work on it, since it is a substantial effort to
> > > > > implement.
> > > > >
> > > > > I wouldn't expect an intern to be able to complete the work
> > necessary to
> > > > > get this merged over the course of a normal 3 month internship.
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/4815/files
> > > > >
> > > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Would it be an option to use a StructArray for that? One array
> > with the
> > > > > > values, and one with the repetitions:
> > > > > >
> > > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > > > >
> > > > > > StructArray([
> > > > > >     "values": Int32([1, 2, 3, 1, 2]),
> > > > > >     "repetitions": UInt32([1, 3, 5, 1, 1]),
> > > > > > ])
> > > > > >
> > > > > > It does not have the same API, but I think that the physical
> > operations
> > > > > > would be different, anyways: ("array + 2" would only operate on
> > > > > "values").
> > > > > > I think that a small struct / object with some operator overloading
> > > > would
> > > > > > address this, and writing something on the metadata would allow
> > others
> > > > to
> > > > > > consume it, a-la extension type?
> > > > > >
> > > > > > On a related note, such encoding would address DataFusion's issue
> > of
> > > > > > representing scalars / constant arrays: a constant array would be
> > > > > > represented as a repetition. Currently we just unpack (i.e.
> > allocate) a
> > > > > > constant array when we want to transfer through a RecordBatch.
> > > > > >
> > > > > > Best,
> > > > > > Jorge
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Mar 25, 2021, 10:03 Kirill Lykov <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > > Thanks for the answer.
> > > > > > > I asked about it because we need it and I was about writing a
> > summer
> > > > > > intern
> > > > > > > proposal for a student to work on it.
> > > > > > > Looks like it could work fine.
> > > > > > >
> > > > > > > On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney <
> > [email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > > The SparseTensor stuff is something else entirely (that's
> > matrices
> > > > > > > > where the entries are mostly 0)
> > > > > > > >
> > > > > > > > There isn't anything to help you right now aside from
> > dictionary
> > > > > > > > encoding — if your dictionary has 256 elements or less, you
> > can use
> > > > > > > > uint8 index type and thus have 1 byte per value. We've
> > discussed
> > > > > > > > implementing RLE in the project and so if we do that in the
> > future
> > > > > > > > then a random access data structure could be built on top of
> > RLE
> > > > (in
> > > > > > > > principle)
> > > > > > > >
> > > > > > > > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Lykov,
> > > > > > > > >
> > > > > > > > > I believe there's an arrow sparse tensor abstraction.
> > > > > > > > >
> > > > > > > > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov <
> > [email protected]
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I wonder if there is an existing way to store floats/ints
> > with
> > > > > many
> > > > > > > > > > repetitions in some container (not sure about terminology).
> > > > > > > > > > For example, I might have data like A=[1, 2, 2, 2, 3, 3,
> > 3, 3,
> > > > 3,
> > > > > > 3,
> > > > > > > > 1, 2]
> > > > > > > > > > and i would like to store only B=[1, 2, 3, 1, 2] but from
> > user
> > > > > > > > > > perspective it behaves like container A. I know I can use
> > > > > > dictionary
> > > > > > > > but as
> > > > > > > > > > far I understand it will store internally indices of the
> > chosen
> > > > > > > > elements so
> > > > > > > > > > it makes sense more for binary data or structures.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Kirill Lykov
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Kirill Lykov
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Kirill Lykov
> > > >
> > >
> >
>

Re: sparse data array

Reply via email to