BTW the closest thing there would be in the spec for achieving this is
Dense integers, but unfortunately, those have a 5 byte overhead per slot.
So you would need very few integers of 8 bytes and a lot of integers that
fit into either 1 or 2 bytes to get any space savings.
On Wed, Mar 31, 2021 at
I appreciate the feedback. I realize it's a tricky nut to crack; there's always
going to be a desire to use compression to improve scaling, and I was trying to
identify a connecting thread between various requests for compression
enhancements on this list and my own experience. I'll look at the
Hi Bob,
> I can observe that in a project like Arrow, there is always a tension
> between compatibility and extensibility, and it makes me wonder if it would
> be helpful to add capabilities without changing the spec. An extension type
> can be defined in terms of one of the built-in layouts, but
>From your response, I'm inferring that in order to introduce this kind of
>compression, support in the spec is needed, similar to how compression types
>and parameters are enumerated in
>https://github.com/apache/arrow/blob/master/format/Message.fbs. Any change in
>the spec is a Big Deal (and
>
> > On a related note, such encoding would address DataFusion's issue of
> > representing scalars / constant arrays: a constant array would be
> > represented as a repetition. Currently we just unpack (i.e. allocate) a
> > constant array when we want to transfer through a RecordBatch.
>
In the J
I’ve also heard interest from folks in the academic database community
about adding compressed (sparse) in memory structures to the Arrow format /
specification, so I think it makes more sense to try to figure things out
at the specification / protocol level and then work on an implementation. I
ag
Thanks for the information and ideas, I need to check them out (especially
one with structures).
PR proposal for RLE is very interesting since internally people express
interest in this feature.
For intern, I thought to ask to work primarily on data structures level
(like array adapter or something
I made a proposal a while ago that covers a form of RLE encoding [1]. I
haven't had time to work on it, since it is a substantial effort to
implement.
I wouldn't expect an intern to be able to complete the work necessary to
get this merged over the course of a normal 3 month internship.
[1] http
Would it be an option to use a StructArray for that? One array with the
values, and one with the repetitions:
Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
StructArray([
"values": Int32([1, 2, 3, 1, 2]),
"repetitions": UInt32([1, 3, 5, 1, 1]),
])
It does not have the same API, but I thi
Thanks for the answer.
I asked about it because we need it and I was about writing a summer intern
proposal for a student to work on it.
Looks like it could work fine.
On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney wrote:
> The SparseTensor stuff is something else entirely (that's matrices
> where
The SparseTensor stuff is something else entirely (that's matrices
where the entries are mostly 0)
There isn't anything to help you right now aside from dictionary
encoding — if your dictionary has 256 elements or less, you can use
uint8 index type and thus have 1 byte per value. We've discussed
i
Hi Lykov,
I believe there's an arrow sparse tensor abstraction.
On Wed, Mar 24, 2021, 05:05 Kirill Lykov wrote:
> Hi,
>
> I wonder if there is an existing way to store floats/ints with many
> repetitions in some container (not sure about terminology).
> For example, I might have data like A=[1,
Hi,
I wonder if there is an existing way to store floats/ints with many
repetitions in some container (not sure about terminology).
For example, I might have data like A=[1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]
and i would like to store only B=[1, 2, 3, 1, 2] but from user
perspective it behaves like
13 matches
Mail list logo