Hi all, Is there an efficient way to merge a set of sorted tables/ record batches (like in the merge step of a merge sort)? Ex: Simplest case would be to merge these 2 tables, based on A
Table 0 Table 1 A B A B 0 3 x 0 1 x 1 3 x 1 2 x 2 4 x 2 2 x 3 5 x 3 5 x I could think of a couple of ways 1. Use a min heap with A values/ indices and append (unsafe) each column value to a corresponding ArrayBuilder. (This might not be very cache efficient, because we are going across the columns, and worst-case would be appending 1 element at a time) 2. Use a min heap with A values/ indices and create a result index array (indices would be offset with the prefix sum). and then concat tables and call `take` using the result index array. (I believe concat would create a chunked array for each column, but IINM take on chunked arrays would merge chunks before calling take[1] . So, this would also be problematic IMO) Any thoughts on these approaches? Is there a better way to do this? [1] https://github.com/apache/arrow/blob/542e81b6dea62f90817b117b1cb1b2de953f293e/cpp/src/arrow/compute/kernels/vector_selection.cc#L2016 Best -- Niranda Perera https://niranda.dev/ @n1r44 <https://twitter.com/N1R44>