Following up here, Dewey pointed out that the "right" way to do this would be to use Snowflake's own bulk ingestion support. I filed https://github.com/apache/arrow-adbc/issues/1327
On Wed, Nov 22, 2023, at 16:06, Lee, David (PAG) wrote:
> I got this working by re-organizing the vectors into batches of 1 million rows each.
>
> My Snowflake bulk insert now takes 3 minutes vs 3 hours. I'll open a
> ticket in ADBC to improve the interface.
>
> ADBC's adbc_ingest() function needs something similar to
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
> which supports:
> max_partitions=None, max_open_files=None, max_rows_per_file=None,
> min_rows_per_group=None, max_rows_per_group=None
>
> import pyarrow as pa
>
> # Accumulate column arrays from the incoming batches and flush them into
> # a single larger RecordBatch once roughly 1 million rows are collected.
> new_recordbatch = {column: [] for column in my_data.schema.names}
> new_batches = []
> rows = 0
>
> for batch in my_data.to_batches():
>     for column in my_data.schema.names:
>         new_recordbatch[column].append(batch[column])
>     if rows < 1000000:
>         rows += batch.num_rows
>     else:
>         print(rows)
>         new_arrays = []
>         for column in my_data.schema.names:
>             new_arrays.append(pa.concat_arrays(new_recordbatch[column]))
>         new_batches.append(
>             pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
>         new_recordbatch = {column: [] for column in my_data.schema.names}
>         rows = 0
>
> # Flush whatever is left over after the loop.
> if rows > 0:
>     new_arrays = []
>     for column in my_data.schema.names:
>         new_arrays.append(pa.concat_arrays(new_recordbatch[column]))
>     new_batches.append(
>         pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
>
> new_table = pa.Table.from_batches(new_batches)
>
> cursor = adbc_conn.cursor()
> cursor.adbc_ingest(table_name="xyz", data=new_table, mode="append")
> cursor.execute("commit")
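A side note on the loop above: PyArrow can do this re-chunking itself. Below is a minimal, untested sketch that assumes, as in the code above, that my_data is a pyarrow.Table and adbc_conn is an open ADBC connection. Note that combine_chunks() copies each column into one contiguous chunk, so it needs memory for a full copy of the table.

    import pyarrow as pa

    # Assumption (as above): my_data is a pyarrow.Table read from CSV and
    # adbc_conn is an open ADBC connection.

    # Consolidate each column into a single contiguous chunk (a full copy),
    # then slice the table back into RecordBatches of at most 1 million rows.
    combined = my_data.combine_chunks()
    batches = combined.to_batches(max_chunksize=1_000_000)

    # Rebuild a Table whose chunks are the ~1-million-row batches; the
    # Snowflake driver should then issue ~36 bind array inserts instead
    # of ~3,000.
    resized = pa.Table.from_batches(batches, schema=combined.schema)

    cursor = adbc_conn.cursor()
    cursor.adbc_ingest(table_name="xyz", data=resized, mode="append")
    cursor.execute("commit")

The slicing done by to_batches(max_chunksize=...) is zero-copy, so the only extra cost is the combine_chunks() consolidation.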
> -----Original Message-----
> From: Aldrin <octalene....@pm.me.INVALID>
> Sent: Wednesday, November 22, 2023 12:36 PM
> To: dev@arrow.apache.org
> Subject: Re: Is there anyway to resize record batches
>
> As far as I understand, that bundles the Arrays into a ChunkedArray,
> which only Table interacts with. It doesn't make a longer Array, and
> depending on what the ADBC Snowflake driver is doing, that may or may
> not help with the number of invocations that are happening.
>
> Also, it's not portable across implementations, since ChunkedArray is
> not part of the specification, though I am optimistic that if you pass
> a ChunkedArray to a different implementation, the C++ implementation
> could consolidate it as a single Array.
>
> # ------------------------------
> # Aldrin
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Wednesday, November 22nd, 2023 at 11:58, Jacek Pliszka
> <jacek.plis...@gmail.com> wrote:
>
>> Re 4: you create a ChunkedArray from Arrays.
>>
>> BR
>>
>> J
>>
>> On Wed, Nov 22, 2023 at 20:48, Aldrin <octalene....@pm.me.invalid> wrote:
>>
>>> Assuming the C++ implementation, Jacek's suggestion (#3 below) is
>>> probably best. Here is some extra context:
>>>
>>> 1. You can slice larger RecordBatches [1].
>>> 2. You can make a larger RecordBatch [2] from columns of smaller
>>>    RecordBatches [3], probably using the correct type of Builder [4]
>>>    and with a bit of resistance from the various types.
>>> 3. As Jacek said, you can wrap smaller RecordBatches together as a
>>>    Table [5], combine the chunks [6], and then convert back to
>>>    RecordBatches using a TableBatchReader [7] if necessary.
>>> 4. I didn't see anything useful in the Compute API for concatenating
>>>    arbitrary Arrays or RecordBatches, but you can use Selection
>>>    functions [8] instead of Slicing for anything that's too big.
>>>
>>> # ------------------------------
>>> # Aldrin
>>> https://github.com/drin/
>>> https://gitlab.com/octalene
>>> https://keybase.io/octalene
>>>
>>> On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka
>>> <jacek.plis...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I think some code is needed for clarity. You can concatenate tables
>>>> (and combine_chunks afterwards) or arrays. Then pass such a
>>>> concatenated one.
>>>>
>>>> Regards,
>>>>
>>>> Jacek
>>>>
>>>> On Wed, Nov 22, 2023 at 19:54, Lee, David (PAG)
>>>> <david....@blackrock.com.invalid> wrote:
>>>>
>>>>> I've got 36 million rows of data which ends up as 3,000 record
>>>>> batches ranging from 12k to 300k rows each. I'm assuming these
>>>>> batches are created by the multithreaded CSV file reader.
>>>>>
>>>>> Is there any way to reorganize the data into something like 36
>>>>> batches of 1 million rows each?
>>>>>
>>>>> What I'm seeing when we try to load this data using the ADBC
>>>>> Snowflake driver is that each individual batch is executed as a
>>>>> bind array insert in the Snowflake Go driver. 3,000 bind array
>>>>> inserts is taking 3 hours.
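For completeness, a minimal sketch of the concatenation Jacek describes above; the inputs here are made-up placeholders (a list of small Tables sharing a schema and a list of Arrays of the same type):

    import pyarrow as pa

    # Placeholder inputs for illustration only.
    tables = [pa.table({"x": [1, 2]}), pa.table({"x": [3, 4, 5]})]
    arrays = [pa.array([1, 2]), pa.array([3, 4, 5])]

    # concat_tables() keeps the original chunks; combine_chunks() then
    # copies each column into a single contiguous chunk.
    big_table = pa.concat_tables(tables).combine_chunks()

    # concat_arrays() produces one longer Array (also a copy).
    big_array = pa.concat_arrays(arrays)

    print(big_table.column("x").num_chunks)  # 1
    print(len(big_array))                    # 5

Either the consolidated table or the concatenated arrays can then be passed on in place of the many small pieces.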