Assuming the C++ implementation, Jacek's suggestion (#3 below) is probably best. Here is some extra context:
1. You can slice larger RecordBatches [1].
2. You can make a larger RecordBatch [2] from the columns of smaller RecordBatches [3], likely using the appropriate type of Builder [4] and with a bit of resistance from the various types.
3. As Jacek said, you can wrap the smaller RecordBatches together as a Table [5], combine the chunks [6], and then convert back to RecordBatches using a TableBatchReader [7] if necessary (a rough sketch of this approach is at the end of this message).
4. I didn't see anything useful in the Compute API for concatenating arbitrary Arrays or RecordBatches, but you can use Selection functions [8] instead of slicing for anything that's too big.

[1]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch5SliceE7int64_t7int64_t
[2]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatch4MakeENSt10shared_ptrI6SchemaEE7int64_tNSt6vectorINSt10shared_ptrI5ArrayEEEE
[3]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch6columnEi
[4]: https://arrow.apache.org/docs/cpp/arrays.html#building-an-array
[5]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5Table17FromRecordBatchesENSt10shared_ptrI6SchemaEERKNSt6vectorINSt10shared_ptrI11RecordBatchEEEE
[6]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow5Table13CombineChunksEP10MemoryPool
[7]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow16TableBatchReaderE
[8]: https://arrow.apache.org/docs/cpp/compute.html#selections

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka <jacek.plis...@gmail.com> wrote:

> Hi!
>
> I think some code is needed for clarity. You can concatenate tables (and
> combine_chunks afterwards) or arrays. Then pass the concatenated one.
>
> Regards,
>
> Jacek
>
> On Wed, Nov 22, 2023 at 19:54, Lee, David (PAG) <david....@blackrock.com.invalid>
> wrote:
>
> > I've got 36 million rows of data which end up as 3,000 record batches
> > ranging from 12k to 300k rows each. I'm assuming these batches are
> > created by the multithreaded CSV file reader.
> >
> > Is there any way to reorganize the data into something like 36 batches of
> > 1 million rows each?
> >
> > What I'm seeing when we try to load this data using the ADBC Snowflake
> > driver is that each individual batch is executed as a bind array insert in
> > the Snowflake Go driver. 3,000 bind array inserts are taking 3 hours.
> >
> > This message may contain information that is confidential or privileged.
> > If you are not the intended recipient, please advise the sender immediately
> > and delete this message. See
> > http://www.blackrock.com/corporate/compliance/email-disclaimers for
> > further information. Please refer to
> > http://www.blackrock.com/corporate/compliance/privacy-policy for more
> > information about BlackRock’s Privacy Policy.
> >
> > For a list of BlackRock's office addresses worldwide, see
> > http://www.blackrock.com/corporate/about-us/contacts-locations.
> >
> > © 2023 BlackRock, Inc. All rights reserved.
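P.S. In case a sketch helps, here is a rough, untested C++ outline of option 3. The function name and the 1-million-row target are just placeholders I picked for the example; only the Table::FromRecordBatches, Table::CombineChunks, and TableBatchReader calls come from the docs linked above.

#include <cstdint>
#include <memory>
#include <vector>

#include <arrow/api.h>

// Arbitrary target chosen for this example: ~1 million rows per output batch.
constexpr int64_t kTargetRows = 1 << 20;

// Coalesce many small RecordBatches into fewer, larger ones.
arrow::Result<std::vector<std::shared_ptr<arrow::RecordBatch>>>
CoalesceBatches(const std::vector<std::shared_ptr<arrow::RecordBatch>>& small_batches) {
  // Wrap the small batches as a Table [5]; this is zero-copy.
  ARROW_ASSIGN_OR_RAISE(auto table, arrow::Table::FromRecordBatches(small_batches));

  // Concatenate each column's chunks into one contiguous chunk [6];
  // this is where the data is actually copied.
  ARROW_ASSIGN_OR_RAISE(table, table->CombineChunks());

  // Read the table back out as RecordBatches of at most kTargetRows rows [7].
  arrow::TableBatchReader reader(*table);
  reader.set_chunksize(kTargetRows);

  std::vector<std::shared_ptr<arrow::RecordBatch>> large_batches;
  ARROW_RETURN_NOT_OK(reader.ReadAll(&large_batches));
  return large_batches;
}

Note that CombineChunks is the only step that copies, so expect memory use to roughly double while both the original and combined tables are alive.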