As far as I understand, that bundles the Arrays into a ChunkedArray, which only Table interacts with. It doesn't make a longer Array, and depending on what the ADBC Snowflake driver is doing, that may or may not reduce the number of invocations that are happening.
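To make the distinction concrete, here is a minimal PyArrow sketch (assuming the Python bindings; the toy arrays are made up for illustration):

import pyarrow as pa

a = pa.array([1, 2, 3])
b = pa.array([4, 5, 6])

# Grouping the Arrays as a ChunkedArray keeps them as separate chunks:
chunked = pa.chunked_array([a, b])
print(chunked.num_chunks)   # 2 -- still two Arrays underneath

# Only an explicit concatenation copies them into one contiguous Array:
combined = pa.concat_arrays([a, b])
print(len(combined))        # 6 -- a single Int64Array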
Also, it's not portable across implementations, since ChunkedArray is not part of the specification, though I am optimistic that if you pass a ChunkedArray to a different implementation, the C++ implementation could consolidate it into a single Array.

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Wednesday, November 22nd, 2023 at 11:58, Jacek Pliszka <jacek.plis...@gmail.com> wrote:

> Re 4: you create a ChunkedArray from the Arrays.
>
> BR
>
> J
>
> Wed, 22 Nov 2023 at 20:48, Aldrin <octalene....@pm.me.invalid> wrote:
>
> > Assuming the C++ implementation, Jacek's suggestion (#3 below) is probably
> > best. Here is some extra context:
> >
> > 1. You can slice larger RecordBatches [1] (also sketched at the end of
> >    the thread).
> > 2. You can make a larger RecordBatch [2] from the columns of smaller
> >    RecordBatches [3], probably using the correct type of Builder [4] and
> >    with a bit of resistance from the various types.
> > 3. As Jacek said, you can wrap the smaller RecordBatches together as a
> >    Table [5], combine the chunks [6], and then convert back to
> >    RecordBatches using a TableBatchReader [7] if necessary (a PyArrow
> >    sketch follows at the end of the thread).
> > 4. I didn't see anything useful in the Compute API for concatenating
> >    arbitrary Arrays or RecordBatches, but you can use Selection
> >    functions [8] instead of Slicing for anything that's too big.
> >
> > # ------------------------------
> > # Aldrin
> >
> > https://github.com/drin/
> > https://gitlab.com/octalene
> > https://keybase.io/octalene
> >
> > On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka
> > <jacek.plis...@gmail.com> wrote:
> >
> > > Hi!
> > >
> > > I think some code is needed for clarity. You can concatenate tables
> > > (and combine_chunks afterwards) or arrays, then pass the concatenated
> > > one.
> > >
> > > Regards,
> > >
> > > Jacek
> > >
> > > Wed, 22 Nov 2023 at 19:54, Lee, David (PAG)
> > > <david....@blackrock.com.invalid> wrote:
> > >
> > > > I've got 36 million rows of data which ends up as 3,000 record
> > > > batches ranging from 12k to 300k rows each. I'm assuming these
> > > > batches are created by the multithreaded CSV file reader.
> > > >
> > > > Is there any way to reorganize the data into something like 36
> > > > batches consisting of 1 million rows each?
> > > >
> > > > What I'm seeing when we try to load this data using the ADBC
> > > > Snowflake driver is that each individual batch is executed as a bind
> > > > array insert in the Snowflake Go driver. 3,000 bind array inserts is
> > > > taking 3 hours.
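For concreteness, here is a minimal PyArrow sketch of option 3 from the list above, assuming the Python bindings are in use and that `batches` holds the ~3,000 small RecordBatches coming out of the CSV reader (the 1,000,000-row target is taken from the original question, not an Arrow default):

import pyarrow as pa

# `batches` is assumed to be the list of ~3,000 small RecordBatches from the CSV reader.
table = pa.Table.from_batches(batches)       # zero-copy: each column becomes a ChunkedArray
table = table.combine_chunks()               # concatenate each column into a single chunk
big_batches = table.to_batches(max_chunksize=1_000_000)  # ~36 batches of at most 1M rows

# If the consumer wants a stream rather than a list, wrap the batches in a reader.
reader = pa.RecordBatchReader.from_batches(table.schema, big_batches)

Whether this actually cuts the 3,000 bind-array inserts down depends on the ADBC Snowflake driver iterating over whole record batches, so it is worth verifying on a small sample first.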
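And a sketch of option 1, for the case where an individual RecordBatch is itself too large: RecordBatch.slice is zero-copy, so splitting into roughly 1M-row windows is cheap. `split_batch` and `rows_per_batch` are illustrative names, not Arrow API:

import pyarrow as pa

def split_batch(batch: pa.RecordBatch, rows_per_batch: int = 1_000_000):
    # Each slice is a view over the same underlying buffers; no data is copied.
    return [
        batch.slice(offset, rows_per_batch)
        for offset in range(0, batch.num_rows, rows_per_batch)
    ]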