Following up here, Dewey pointed out that the "right" way to do this would be to use Snowflake's own bulk ingestion support. I filed https://github.com/apache/arrow-adbc/issues/1327
On Wed, Nov 22, 2023, at 16:06, Lee, David (PAG) wrote:
> I got this working by re-organizing the vectors into batches of 1 million rows each.
>
> My Snowflake bulk insert now takes 3 minutes vs 3 hours. I'll open a
> ticket in ADBC to improve the interface.
>
> ADBC's adbc_ingest() function needs something similar to
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
> which supports:
> max_partitions=None, max_open_files=None, max_rows_per_file=None,
> min_rows_per_group=None, max_rows_per_group=None
>
> import pyarrow as pa
>
> # Accumulate column arrays from the incoming batches and flush them into
> # a single larger RecordBatch once roughly 1 million rows are collected.
> new_recordbatch = {column: [] for column in my_data.schema.names}
> new_batches = []
> rows = 0
>
> for batch in my_data.to_batches():
>     for column in my_data.schema.names:
>         new_recordbatch[column].append(batch[column])
>     if rows < 1000000:
>         rows += batch.num_rows
>     else:
>         print(rows)
>         new_arrays = []
>         for column in my_data.schema.names:
>             new_arrays.append(pa.concat_arrays(new_recordbatch[column]))
>         new_batches.append(
>             pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
>         new_recordbatch = {column: [] for column in my_data.schema.names}
>         rows = 0
>
> # Flush whatever is left over after the loop.
> if rows > 0:
>     new_arrays = []
>     for column in my_data.schema.names:
>         new_arrays.append(pa.concat_arrays(new_recordbatch[column]))
>     new_batches.append(
>         pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
>
> new_table = pa.Table.from_batches(new_batches)
>
> cursor = adbc_conn.cursor()
> cursor.adbc_ingest(table_name="xyz", data=new_table, mode="append")
> cursor.execute("commit")
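A side note on the loop above: PyArrow can do this re-chunking itself. Below is a minimal, untested sketch that assumes, as in the code above, that my_data is a pyarrow.Table and adbc_conn is an open ADBC connection. Note that combine_chunks() copies each column into one contiguous chunk, so it needs memory for a full copy of the table.

    import pyarrow as pa

    # Assumption (as above): my_data is a pyarrow.Table read from CSV and
    # adbc_conn is an open ADBC connection.

    # Consolidate each column into a single contiguous chunk (a full copy),
    # then slice the table back into RecordBatches of at most 1 million rows.
    combined = my_data.combine_chunks()
    batches = combined.to_batches(max_chunksize=1_000_000)

    # Rebuild a Table whose chunks are the ~1-million-row batches; the
    # Snowflake driver should then issue ~36 bind array inserts instead
    # of ~3,000.
    resized = pa.Table.from_batches(batches, schema=combined.schema)

    cursor = adbc_conn.cursor()
    cursor.adbc_ingest(table_name="xyz", data=resized, mode="append")
    cursor.execute("commit")

The slicing done by to_batches(max_chunksize=...) is zero-copy, so the only extra cost is the combine_chunks() consolidation.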
> -----Original Message-----
> From: Aldrin <octalene....@pm.me.INVALID>
> Sent: Wednesday, November 22, 2023 12:36 PM
> To: dev@arrow.apache.org
> Subject: Re: Is there anyway to resize record batches
>
> As far as I understand, that bundles the Arrays into a ChunkedArray,
> which only Table interacts with. It doesn't make a longer Array, and
> depending on what the ADBC Snowflake driver is doing, that may or may
> not help with the number of invocations that are happening.
>
> Also, it's not portable across implementations, since ChunkedArray is
> not part of the specification, though I am optimistic that if you pass
> a ChunkedArray to a different implementation, the C++ implementation
> could consolidate it as a single Array.
>
> # ------------------------------
> # Aldrin
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Wednesday, November 22nd, 2023 at 11:58, Jacek Pliszka
> <jacek.plis...@gmail.com> wrote:
>
>> Re 4: you create a ChunkedArray from Arrays.
>>
>> BR
>>
>> J
>>
>> On Wed, Nov 22, 2023 at 20:48, Aldrin <octalene....@pm.me.invalid> wrote:
>>
>>> Assuming the C++ implementation, Jacek's suggestion (#3 below) is
>>> probably best. Here is some extra context:
>>>
>>> 1. You can slice larger RecordBatches [1].
>>> 2. You can make a larger RecordBatch [2] from columns of smaller
>>>    RecordBatches [3], probably using the correct type of Builder [4]
>>>    and with a bit of resistance from the various types.
>>> 3. As Jacek said, you can wrap smaller RecordBatches together as a
>>>    Table [5], combine the chunks [6], and then convert back to
>>>    RecordBatches using a TableBatchReader [7] if necessary.
>>> 4. I didn't see anything useful in the Compute API for concatenating
>>>    arbitrary Arrays or RecordBatches, but you can use Selection
>>>    functions [8] instead of Slicing for anything that's too big.
>>>
>>> # ------------------------------
>>> # Aldrin
>>> https://github.com/drin/
>>> https://gitlab.com/octalene
>>> https://keybase.io/octalene
>>>
>>> On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka
>>> <jacek.plis...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I think some code is needed for clarity. You can concatenate tables
>>>> (and combine_chunks afterwards) or arrays. Then pass such a
>>>> concatenated one.
>>>>
>>>> Regards,
>>>>
>>>> Jacek
>>>>
>>>> On Wed, Nov 22, 2023 at 19:54, Lee, David (PAG)
>>>> <david....@blackrock.com.invalid> wrote:
>>>>
>>>>> I've got 36 million rows of data which ends up as 3,000 record
>>>>> batches ranging from 12k to 300k rows each. I'm assuming these
>>>>> batches are created by the multithreaded CSV file reader.
>>>>>
>>>>> Is there any way to reorganize the data into something like 36
>>>>> batches of 1 million rows each?
>>>>>
>>>>> What I'm seeing when we try to load this data using the ADBC
>>>>> Snowflake driver is that each individual batch is executed as a
>>>>> bind array insert in the Snowflake Go driver. 3,000 bind array
>>>>> inserts is taking 3 hours.
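For completeness, a minimal sketch of the concatenation Jacek describes above; the inputs here are made-up placeholders (a list of small Tables sharing a schema and a list of Arrays of the same type):

    import pyarrow as pa

    # Placeholder inputs for illustration only.
    tables = [pa.table({"x": [1, 2]}), pa.table({"x": [3, 4, 5]})]
    arrays = [pa.array([1, 2]), pa.array([3, 4, 5])]

    # concat_tables() keeps the original chunks; combine_chunks() then
    # copies each column into a single contiguous chunk.
    big_table = pa.concat_tables(tables).combine_chunks()

    # concat_arrays() produces one longer Array (also a copy).
    big_array = pa.concat_arrays(arrays)

    print(big_table.column("x").num_chunks)  # 1
    print(len(big_array))                    # 5

Either the consolidated table or the concatenated arrays can then be passed on in place of the many small pieces.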