Hi Bob,

> I can observe that in a project like Arrow, there is always a tension
> between compatibility and extensibility, and it makes me wonder if it would
> be helpful to add capabilities without changing the spec. An extension type
> can be defined in terms of one of the built-in layouts, but it would define
> semantics (such as compression) that would be used to interpret that layout.


I'm not sure if this is referring to the existing extension types [1], but
I believe they are insufficient for this purpose.  The compression
techniques being discussed don't fit there, because compression violates a
fundamental assumption of the existing protocols: every array in a record
batch is expected to have the same number of slots.  An array compressed
as a struct would therefore have fewer slots than the unencoded arrays
next to it, causing misalignment.
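
To make the slot mismatch concrete, here is a rough pyarrow sketch using
Jorge's example from further down the thread (illustrative only, not a
proposed API):

import pyarrow as pa

# The same 12 logical values, stored plainly and RLE-style as a struct.
plain = pa.array([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2], type=pa.int32())  # 12 slots
rle = pa.StructArray.from_arrays(
    [pa.array([1, 2, 3, 1, 2], type=pa.int32()),
     pa.array([1, 3, 6, 1, 1], type=pa.uint32())],
    names=["values", "repetitions"])                                      # 5 slots
print(len(plain), len(rle))  # 12 vs 5
# Placing them side by side fails, because every column in a batch/table
# must have the same length:
# pa.table({"plain": plain, "rle": rle})  ->  length mismatch error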

> For example, integers are stored in blocks of 4096 values, with each block
> the minimum size to hold all the values. You access the value n with the
> expression "block[n >> 12][n % 4096]".
> Take an example with 1M int32 values. Value 0 is 1e9 but all the others
> are 0 to 9.
> Normally you would use 4M bytes to store these, but you could instead have
> 1 block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage for
> block offsets, so a savings of almost 75%. If you could have uint4 blocks
> you could save about 87%.
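
For concreteness, the blocked layout described above might look roughly
like the Python below (purely illustrative; pack_blocks/get are made-up
helpers, not anything in Arrow):

import numpy as np

BLOCK = 4096  # 2**12 values per block, so n >> 12 selects the block

def pack_blocks(values):
    # Store each block of 4096 values at the narrowest int width that fits it.
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = np.asarray(values[start:start + BLOCK])
        for dtype in (np.int8, np.int16, np.int32, np.int64):
            info = np.iinfo(dtype)
            if info.min <= chunk.min() and chunk.max() <= info.max:
                blocks.append(chunk.astype(dtype))
                break
    return blocks

def get(blocks, n):
    return blocks[n >> 12][n % 4096]  # the access expression quoted above

data = np.random.randint(0, 10, size=1 << 20, dtype=np.int32)  # ~1M small ints
data[0] = 10**9                                                # one outlier
blocks = pack_blocks(data)
saved = 1 - sum(b.nbytes for b in blocks) / data.nbytes
print(get(blocks, 0), get(blocks, 123456), round(saved, 2))    # savings ~0.75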


This is difficult with the existing RecordBatch stream approach, since the
schema is fixed ahead of time.  One could theoretically standardize a
notion of schema replacement in the communication protocols.  The proposal
I linked earlier in this thread takes a different approach and allows
adjusting encodings on a per-message basis.  Both are potentially viable.
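
As a rough illustration of the first point, the IPC stream writer pins the
schema at construction time, so a batch that switches a column to (say) a
struct-encoded representation no longer matches it (pyarrow sketch,
illustrative only):

import pyarrow as pa

schema = pa.schema([("a", pa.int32())])
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, schema)  # schema is written once, up front
writer.write_batch(
    pa.record_batch([pa.array([1, 2, 3], type=pa.int32())], names=["a"]))

# An encoded version of the same column has a different type, e.g.
# struct<values: int32, repetitions: uint32>, so the batch no longer matches
# the declared schema and the writer will not accept it:
encoded = pa.record_batch(
    [pa.StructArray.from_arrays(
        [pa.array([1, 2, 3], type=pa.int32()),
         pa.array([1, 1, 1], type=pa.uint32())],
        names=["values", "repetitions"])],
    names=["a"])
# writer.write_batch(encoded)  # raises: batch schema does not match stream schema
writer.close()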

[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types

On Tue, Mar 30, 2021 at 5:09 PM bobtins <bobti...@gmail.com> wrote:

> From your response, I'm inferring that in order to introduce this kind of
> compression, support in the spec is needed, similar to how compression
> types and parameters are enumerated in
> https://github.com/apache/arrow/blob/master/format/Message.fbs. Any
> change in the spec is a Big Deal (and it should be).
>
> I can observe that in a project like Arrow, there is always a tension
> between compatibility and extensibility, and it makes me wonder if it would
> be helpful to add capabilities without changing the spec. An extension type
> can be defined in terms of one of the built-in layouts, but it would define
> semantics (such as compression) that would be used to interpret that
> layout.
>
> > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Would it be an option to use a StructArray for that? One array
> with the
> > > > > values, and one with the repetitions:
> > > > >
> > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > > >
> > > > > StructArray([
> > > > >     "values": Int32([1, 2, 3, 1, 2]),
> > > > >     "repetitions": UInt32([1, 3, 5, 1, 1]),
> > > > > ])
> > > > >
> > > > > It does not have the same API, but I think that the physical
> operations
> > > > > would be different, anyways: ("array + 2" would only operate on
> > > > "values").
> > > > > I think that a small struct / object with some operator overloading
> > > would
> > > > > address this, and writing something on the metadata would allow
> others
> > > to
> > > > > consume it, a-la extension type?
> > > > >
> > > > > On a related note, such encoding would address DataFusion's issue
> of
> > > > > representing scalars / constant arrays: a constant array would be
> > > > > represented as a repetition. Currently we just unpack (i.e.
> allocate) a
> > > > > constant array when we want to transfer through a RecordBatch.
> > > > >
>
> I just reread the whole thread, and realized Jorge was saying a similar
> thing, that this new type could be built on an existing layout. But I guess
> I'm also imagining some more general capabilities:
>
> * Define an extension type in terms of an existing layout
> * Register an extension type implementation
> * Enumerate available extension types
>
> It would make life more complicated, for sure, but it would allow things
> like compression to evolve more quickly. I'm thinking about the in-memory
> implementation that I built, where I did various things to save memory.
>
> For example, integers are stored in blocks of 4096 values, with each block
> the minimum size to hold all the values. You access the value n with the
> expression "block[n >> 12][n % 4096]".
> Take an example with 1M int32 values. Value 0 is 1e9 but all the others
> are 0 to 9.
> Normally you would use 4M bytes to store these, but you could instead have
> 1 block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage for
> block offsets, so a savings of almost 75%. If you could have uint4 blocks
> you could save about 87%.
>
> On the other hand, I had the luxury of total control over the
> implementation, so it was easy for me to try something and make it work.
> For Arrow, having the memory layout standardized makes it easy to implement
> various SIMD and GPU optimizations, which all go out the window if you
> allow arbitrary new data access semantics.
>
> So if this idea is hopelessly naive and half-baked and would result in a
> train wreck, please feel free to enlighten me/rip me a new one--I'm here to
> learn.
>
>
>
> On 2021/03/27 17:35:32, Wes McKinney <wesmck...@gmail.com> wrote:
> > I’ve also heard interest from folks in the academic database community
> > about adding compressed (sparse) in memory structures to the Arrow
> format /
> > specification, so I think it makes more sense to try to figure things out
> > at the specification / protocol level and then work on an
> implementation. I
> > agree this seems above and beyond what I would think an intern could
> > accomplish in a 10-12 week period given the process that has been
> involved
> > with other significant additions to Arrow over the last several years.
> >
> > On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov <lykov.kir...@gmail.com>
> wrote:
> >
> > > Thanks for the information and ideas, I need to check them out
> (especially
> > > one with structures).
> > > PR proposal for RLE is very interesting since internally people express
> > > interest in this feature.
> > > For intern, I thought to ask to work primarily on data structures level
> > > (like array adapter or something like that).
> > > So I haven't thought about communication layer, but it is a useful
> feature
> > > per se.
> > > However it might have limited value in terms of contribution to Arrow
> and,
> > > hence, not that attractive for an intern.
> > >
> > > On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > I made a proposal a while ago that covers a form of RLE encoding
> [1].  I
> > > > haven't had time to work on it, since it is a substantial effort to
> > > > implement.
> > > >
> > > > I wouldn't expect an intern to be able to complete the work
> necessary to
> > > > get this merged over the course of a normal 3 month internship.
> > > >
> > > > [1] https://github.com/apache/arrow/pull/4815/files
> > > >
> > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Would it be an option to use a StructArray for that? One array
> with the
> > > > > values, and one with the repetitions:
> > > > >
> > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > > >
> > > > > StructArray([
> > > > >     "values": Int32([1, 2, 3, 1, 2]),
> > > > >     "repetitions": UInt32([1, 3, 5, 1, 1]),
> > > > > ])
> > > > >
> > > > > It does not have the same API, but I think that the physical
> operations
> > > > > would be different, anyways: ("array + 2" would only operate on
> > > > "values").
> > > > > I think that a small struct / object with some operator overloading
> > > would
> > > > > address this, and writing something on the metadata would allow
> others
> > > to
> > > > > consume it, a-la extension type?
> > > > >
> > > > > On a related note, such encoding would address DataFusion's issue
> of
> > > > > representing scalars / constant arrays: a constant array would be
> > > > > represented as a repetition. Currently we just unpack (i.e.
> allocate) a
> > > > > constant array when we want to transfer through a RecordBatch.
> > > > >
> > > > > Best,
> > > > > Jorge
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Mar 25, 2021, 10:03 Kirill Lykov <lykov.kir...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Thanks for the answer.
> > > > > > I asked about it because we need it and I was about writing a
> summer
> > > > > intern
> > > > > > proposal for a student to work on it.
> > > > > > Looks like it could work fine.
> > > > > >
> > > > > > On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney <
> wesmck...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > The SparseTensor stuff is something else entirely (that's
> matrices
> > > > > > > where the entries are mostly 0)
> > > > > > >
> > > > > > > There isn't anything to help you right now aside from
> dictionary
> > > > > > > encoding — if your dictionary has 256 elements or less, you
> can use
> > > > > > > uint8 index type and thus have 1 byte per value. We've
> discussed
> > > > > > > implementing RLE in the project and so if we do that in the
> future
> > > > > > > then a random access data structure could be built on top of
> RLE
> > > (in
> > > > > > > principle)
> > > > > > >
> > > > > > > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera <
> > > > > niranda.per...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Lykov,
> > > > > > > >
> > > > > > > > I believe there's an arrow sparse tensor abstraction.
> > > > > > > >
> > > > > > > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov <
> lykov.kir...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I wonder if there is an existing way to store floats/ints
> with
> > > > many
> > > > > > > > > repetitions in some container (not sure about terminology).
> > > > > > > > > For example, I might have data like A=[1, 2, 2, 2, 3, 3,
> 3, 3,
> > > 3,
> > > > > 3,
> > > > > > > 1, 2]
> > > > > > > > > and i would like to store only B=[1, 2, 3, 1, 2] but from
> user
> > > > > > > > > perspective it behaves like container A. I know I can use
> > > > > dictionary
> > > > > > > but as
> > > > > > > > > far I understand it will store internally indices of the
> chosen
> > > > > > > elements so
> > > > > > > > > it makes sense more for binary data or structures.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > > Kirill Lykov
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Kirill Lykov
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Kirill Lykov
> > >
> >
>
