RE: Is there any way to resize record batches

2023-11-22 Thread Lee, David (PAG)
I got this working by re-organizing the vectors into 1 million rows each. My Snowflake bulk insert now takes 3 minutes vs 3 hours. I'll open a ticket in ADBC to improve the interface. ADBC's adbc_ingest() function needs something similar to https://arrow.apache.org/docs/python/generated/pyarrow.da
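
A minimal sketch of the re-chunking approach described above, assuming pyarrow plus the ADBC DBAPI interface; the adbc_driver_snowflake import, the connection URI, the table name, and big_table are all placeholders:

```python
import pyarrow as pa
import adbc_driver_snowflake.dbapi as snowflake  # assumed driver entry point

def rechunk(table: pa.Table, rows_per_batch: int = 1_000_000) -> pa.Table:
    # combine_chunks() collapses each column into one contiguous chunk,
    # then to_batches() re-slices the table into evenly sized record batches.
    batches = table.combine_chunks().to_batches(max_chunksize=rows_per_batch)
    return pa.Table.from_batches(batches, schema=table.schema)

conn = snowflake.connect("snowflake://...")  # placeholder URI
cur = conn.cursor()
cur.adbc_ingest("my_table", rechunk(big_table), mode="create")  # big_table: the 36M-row table
conn.commit()
```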

Re: Is there any way to resize record batches

2023-11-22 Thread Aldrin
As far as I understand, that bundles the Arrays into a ChunkedArray, which only Table interacts with. It doesn't make a longer Array, and depending on what the ADBC Snowflake driver is doing, that may or may not help with the number of invocations that are happening. Also, it's not portable across
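
To illustrate the distinction, a small pyarrow sketch (toy arrays, unrelated to the ADBC driver): chunked_array() only bundles the pieces, while concat_arrays() copies them into one longer Array:

```python
import pyarrow as pa

a1 = pa.array([1, 2, 3])
a2 = pa.array([4, 5, 6])

# Bundles the Arrays without copying; still two chunks underneath.
bundled = pa.chunked_array([a1, a2])
print(bundled.num_chunks)   # 2

# Copies the data into a single contiguous Array.
longer = pa.concat_arrays([a1, a2])
print(len(longer))          # 6
```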

Re: Is there any way to resize record batches

2023-11-22 Thread Jacek Pliszka
Re 4. you create a ChunkedArray from an Array. BR J Wed, Nov 22, 2023 at 20:48 Aldrin wrote: > Assuming the C++ implementation, Jacek's suggestion (#3 below) is probably > best. Here is some extra context: > > 1. You can slice larger RecordBatches [1] > 2. You can make a larger RecordBatch [2] f

Re: Is there any way to resize record batches

2023-11-22 Thread Aldrin
Assuming the C++ implementation, Jacek's suggestion (#3 below) is probably best. Here is some extra context: 1. You can slice larger RecordBatches [1] 2. You can make a larger RecordBatch [2] from columns of smaller RecordBatches [3] probably using the correct type of Builder [4] and with a bit
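
The points above reference the C++ API; a rough pyarrow equivalent, for illustration only (the three small input batches are made up):

```python
import pyarrow as pa

small = [pa.record_batch({"x": pa.array(range(i, i + 3))}) for i in (0, 3, 6)]

# 1. Slicing a larger RecordBatch is zero-copy:
head = small[0].slice(offset=0, length=2)

# 2. One larger RecordBatch from the columns of smaller ones
#    (the Python analogue of concatenating via builders in C++):
columns = [pa.concat_arrays([b.column(i) for b in small])
           for i in range(small[0].num_columns)]
merged = pa.RecordBatch.from_arrays(columns, schema=small[0].schema)
print(merged.num_rows)  # 9
```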

Re: Is there any way to resize record batches

2023-11-22 Thread Jacek Pliszka
Hi! I think some code is needed for clarity. You can concatenate tables (and combine_chunks afterwards) or arrays, then pass the concatenated result. Regards, Jacek Wed, Nov 22, 2023 at 19:54 Lee, David (PAG) wrote: > I've got 36 million rows of data which ends up as a record batch with 300
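
A short pyarrow illustration of that suggestion (toy tables, not the actual 36-million-row data):

```python
import pyarrow as pa

t1 = pa.table({"x": [1, 2]})
t2 = pa.table({"x": [3, 4]})

# concat_tables() is zero-copy but keeps the original chunk boundaries;
# combine_chunks() then copies each column into a single contiguous chunk.
combined = pa.concat_tables([t1, t2]).combine_chunks()
print(combined.column("x").num_chunks)  # 1

# The array-level equivalent:
one_array = pa.concat_arrays([pa.array([1, 2]), pa.array([3, 4])])
```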

Is there any way to resize record batches

2023-11-22 Thread Lee, David (PAG)
I've got 36 million rows of data which ends up as 3000 record batches ranging from 12k to 300k rows each. I'm assuming these batches are created by the multithreaded CSV file reader. Is there any way to reorganize the data into something like 36 batches consisting of 1 million rows each
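
Two knobs address this in pyarrow, sketched below with a placeholder file name: the CSV reader's block_size (which is measured in bytes, hence the uneven batch sizes) and re-slicing the resulting Table by row count:

```python
import pyarrow.csv as csv

# "data.csv" is a placeholder. Batch sizes coming out of read_csv() are governed
# by ReadOptions.block_size (bytes, not rows), which is why they vary.
table = csv.read_csv(
    "data.csv",
    read_options=csv.ReadOptions(block_size=64 * 1024 * 1024),
)

# Re-slice into row-count-based batches after the read:
batches = table.combine_chunks().to_batches(max_chunksize=1_000_000)
```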

[DISC] Arrow 14.0.2 patch release

2023-11-22 Thread Raúl Cumplido
Hi, During the last couple of weekly community calls there has been a discussion around the necessity of creating a patch release for 14.0.2. Several issues have been tagged with the "backport-candidate" label [1]. From my understanding the issues are mainly fixing some possible segfaults

Re: Documentation of Breaking Changes

2023-11-22 Thread Chris Thomas
Having spent a while doing patch and update management for critical infrastructure, I think exposing it directly in the Change Log is the best possible solution. I'll make sure to let the team know that we can look out for future issues with some creative use of GitHub search but more visibility is

Re: Pyarrow minimal build for lambda layers

2023-11-22 Thread Raúl Cumplido
Hi Shara, The example Dockerfile installs the base requirements for Ubuntu, but then we use build_venv.sh (or build_conda.sh) to build the Arrow CPP library and then pyarrow [1]. From the error it seems you did not build Arrow CPP, as libarrow.so can't be found. Can you try following the recipe

Re: [ANNOUNCE] New Arrow committer: James Duong

2023-11-22 Thread Vibhatha Abeykoon
Congratulations James. With Regards, Vibhatha Abeykoon, PhD On Sat, Nov 18, 2023 at 2:46 AM Ian Cook wrote: > Congratulations James! > > On Thu, Nov 16, 2023 at 3:45 AM Sutou Kouhei wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that James Duong > > has accepted an invitation

Re: Documentation of Breaking Changes

2023-11-22 Thread Raúl Cumplido
Hi Chris, As Bryce pointed out, the current process is managed with the manual addition of the `Breaking change` label in GitHub. In general, after the release there is a review process to tag some of those that were missing. Currently you could use the GitHub issue search. For example for 13.0.0 a
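
For illustration, a query along these lines should surface them in the GitHub issue search (the exact milestone name here is an assumption): `repo:apache/arrow is:issue label:"Breaking change" milestone:13.0.0`.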