Assuming the C++ implementation, Jacek's suggestion (#3 below) is probably best. Here is some extra context:
1. You can slice larger RecordBatches [1].
2. You can make a larger RecordBatch [2] from the columns of smaller RecordBatches [3], likely using the appropriate type of Builder [4] and with a bit of resistance from the various types.
3. As Jacek said, you can wrap the smaller RecordBatches together as a Table [5], combine the chunks [6], and then convert back to RecordBatches using a TableBatchReader [7] if necessary (a rough sketch of this approach is at the end of this message).
4. I didn't see anything useful in the Compute API for concatenating arbitrary Arrays or RecordBatches, but you can use Selection functions [8] instead of slicing for anything that's too big.

[1]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch5SliceE7int64_t7int64_t
[2]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatch4MakeENSt10shared_ptrI6SchemaEE7int64_tNSt6vectorINSt10shared_ptrI5ArrayEEEE
[3]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch6columnEi
[4]: https://arrow.apache.org/docs/cpp/arrays.html#building-an-array
[5]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5Table17FromRecordBatchesENSt10shared_ptrI6SchemaEERKNSt6vectorINSt10shared_ptrI11RecordBatchEEEE
[6]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow5Table13CombineChunksEP10MemoryPool
[7]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow16TableBatchReaderE
[8]: https://arrow.apache.org/docs/cpp/compute.html#selections

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka <jacek.plis...@gmail.com> wrote:

> Hi!
>
> I think some code is needed for clarity. You can concatenate tables (and
> combine_chunks afterwards) or arrays. Then pass the concatenated one.
>
> Regards,
>
> Jacek
>
> On Wed, Nov 22, 2023 at 19:54, Lee, David (PAG) <david....@blackrock.com.invalid>
> wrote:
>
> > I've got 36 million rows of data which end up as 3,000 record batches
> > ranging from 12k to 300k rows each. I'm assuming these batches are
> > created by the multithreaded CSV file reader.
> >
> > Is there any way to reorganize the data into something like 36 batches of
> > 1 million rows each?
> >
> > What I'm seeing when we try to load this data using the ADBC Snowflake
> > driver is that each individual batch is executed as a bind array insert in
> > the Snowflake Go driver. 3,000 bind array inserts are taking 3 hours.
> >
> > This message may contain information that is confidential or privileged.
> > If you are not the intended recipient, please advise the sender immediately
> > and delete this message. See
> > http://www.blackrock.com/corporate/compliance/email-disclaimers for
> > further information. Please refer to
> > http://www.blackrock.com/corporate/compliance/privacy-policy for more
> > information about BlackRock’s Privacy Policy.
> >
> > For a list of BlackRock's office addresses worldwide, see
> > http://www.blackrock.com/corporate/about-us/contacts-locations.
> >
> > © 2023 BlackRock, Inc. All rights reserved.
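P.S. In case a sketch helps, here is a rough, untested C++ outline of option 3. The function name and the 1-million-row target are just placeholders I picked for the example; only the Table::FromRecordBatches, Table::CombineChunks, and TableBatchReader calls come from the docs linked above.

#include <cstdint>
#include <memory>
#include <vector>

#include <arrow/api.h>

// Arbitrary target chosen for this example: ~1 million rows per output batch.
constexpr int64_t kTargetRows = 1 << 20;

// Coalesce many small RecordBatches into fewer, larger ones.
arrow::Result<std::vector<std::shared_ptr<arrow::RecordBatch>>>
CoalesceBatches(const std::vector<std::shared_ptr<arrow::RecordBatch>>& small_batches) {
  // Wrap the small batches as a Table [5]; this is zero-copy.
  ARROW_ASSIGN_OR_RAISE(auto table, arrow::Table::FromRecordBatches(small_batches));

  // Concatenate each column's chunks into one contiguous chunk [6];
  // this is where the data is actually copied.
  ARROW_ASSIGN_OR_RAISE(table, table->CombineChunks());

  // Read the table back out as RecordBatches of at most kTargetRows rows [7].
  arrow::TableBatchReader reader(*table);
  reader.set_chunksize(kTargetRows);

  std::vector<std::shared_ptr<arrow::RecordBatch>> large_batches;
  ARROW_RETURN_NOT_OK(reader.ReadAll(&large_batches));
  return large_batches;
}

Note that CombineChunks is the only step that copies, so expect memory use to roughly double while both the original and combined tables are alive.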