Re: sparse data array

2021-04-01 Thread Micah Kornfield
BTW the closest thing there would be in the spec for achieving this is Dense integers, but unfortunately, those have a 5 byte overhead per slot. So you would need very few integers of 8 bytes and a lot of integers that fit into either 1 or 2 bytes to get any space savings. On Wed, Mar 31, 2021 at

Re: sparse data array

2021-03-31 Thread bobtins
I appreciate the feedback. I realize it's a tricky nut to crack; there's always going to be a desire to use compression to improve scaling, and I was trying to identify a connecting thread between various requests for compression enhancements on this list and my own experience. I'll look at the

Re: sparse data array

2021-03-30 Thread Micah Kornfield
Hi Bob, > I can observe that in a project like Arrow, there is always a tension > between compatibility and extensibility, and it makes me wonder if it would > be helpful to add capabilities without changing the spec. An extension type > can be defined in terms of one of the built-in layouts, but

Re: sparse data array

2021-03-30 Thread bobtins
>From your response, I'm inferring that in order to introduce this kind of >compression, support in the spec is needed, similar to how compression types >and parameters are enumerated in >https://github.com/apache/arrow/blob/master/format/Message.fbs. Any change in >the spec is a Big Deal (and

RE: Re: sparse data array

2021-03-30 Thread Jacob Quinn
> > > On a related note, such encoding would address DataFusion's issue of > > representing scalars / constant arrays: a constant array would be > > represented as a repetition. Currently we just unpack (i.e. allocate) a > > constant array when we want to transfer through a RecordBatch. > In the J

Re: sparse data array

2021-03-27 Thread Wes McKinney
I’ve also heard interest from folks in the academic database community about adding compressed (sparse) in memory structures to the Arrow format / specification, so I think it makes more sense to try to figure things out at the specification / protocol level and then work on an implementation. I ag

Re: sparse data array

2021-03-27 Thread Kirill Lykov
Thanks for the information and ideas, I need to check them out (especially one with structures). PR proposal for RLE is very interesting since internally people express interest in this feature. For intern, I thought to ask to work primarily on data structures level (like array adapter or something

Re: sparse data array

2021-03-26 Thread Micah Kornfield
I made a proposal a while ago that covers a form of RLE encoding [1]. I haven't had time to work on it, since it is a substantial effort to implement. I wouldn't expect an intern to be able to complete the work necessary to get this merged over the course of a normal 3 month internship. [1] http

Re: sparse data array

2021-03-25 Thread Jorge Cardoso Leitão
Would it be an option to use a StructArray for that? One array with the values, and one with the repetitions: Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) -> StructArray([ "values": Int32([1, 2, 3, 1, 2]), "repetitions": UInt32([1, 3, 5, 1, 1]), ]) It does not have the same API, but I thi

Re: sparse data array

2021-03-25 Thread Kirill Lykov
Thanks for the answer. I asked about it because we need it and I was about writing a summer intern proposal for a student to work on it. Looks like it could work fine. On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney wrote: > The SparseTensor stuff is something else entirely (that's matrices > where

Re: sparse data array

2021-03-24 Thread Wes McKinney
The SparseTensor stuff is something else entirely (that's matrices where the entries are mostly 0) There isn't anything to help you right now aside from dictionary encoding — if your dictionary has 256 elements or less, you can use uint8 index type and thus have 1 byte per value. We've discussed i

Re: sparse data array

2021-03-24 Thread Niranda Perera
Hi Lykov, I believe there's an arrow sparse tensor abstraction. On Wed, Mar 24, 2021, 05:05 Kirill Lykov wrote: > Hi, > > I wonder if there is an existing way to store floats/ints with many > repetitions in some container (not sure about terminology). > For example, I might have data like A=[1,

sparse data array

2021-03-24 Thread Kirill Lykov
Hi, I wonder if there is an existing way to store floats/ints with many repetitions in some container (not sure about terminology). For example, I might have data like A=[1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2] and i would like to store only B=[1, 2, 3, 1, 2] but from user perspective it behaves like