Hi all,

Is there an efficient way to merge a set of sorted tables/ record batches
(like in the merge step of a merge sort)?
Ex: Simplest case would be to merge these 2 tables, based on A

Table 0


Table 1


A B

A B
0 3 x
0 1 x
1 3 x
1 2 x
2 4 x
2 2 x
3 5 x
3 5 x

I could think of a couple of ways
1. Use a min heap with A values/ indices and append (unsafe) each column
value to a corresponding ArrayBuilder. (This might not be very cache
efficient, because we are going across the columns, and worst-case would be
appending 1 element at a time)
2. Use a min heap with A values/ indices and create a result index array
(indices would be offset with the prefix sum). and then concat tables and
call `take` using the result index array. (I believe concat would create a
chunked array for each column, but IINM take on chunked arrays would merge
chunks before calling take[1] . So, this would also be problematic IMO)

Any thoughts on these approaches? Is there a better way to do this?

[1]
https://github.com/apache/arrow/blob/542e81b6dea62f90817b117b1cb1b2de953f293e/cpp/src/arrow/compute/kernels/vector_selection.cc#L2016

Best
-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>

Reply via email to