Re 4: you can create a ChunkedArray from Arrays.
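For example, a minimal sketch against the C++ API (untested; `arr1` and
`arr2` stand in for the Arrays you already have):

    #include <arrow/api.h>

    // Wrap existing Arrays as the chunks of a ChunkedArray; the underlying
    // buffers are shared, not copied.
    arrow::Result<std::shared_ptr<arrow::ChunkedArray>> WrapAsChunked(
        std::shared_ptr<arrow::Array> arr1,
        std::shared_ptr<arrow::Array> arr2) {
      return arrow::ChunkedArray::Make({std::move(arr1), std::move(arr2)});
    }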

BR

J

Wed, 22 Nov 2023 at 20:48 Aldrin <octalene....@pm.me.invalid> wrote:

> Assuming the C++ implementation, Jacek's suggestion (#3 below) is probably
> best. Here is some extra context:
>
> 1. You can slice larger RecordBatches [1]
> 2. You can make a larger RecordBatch [2] from columns of smaller
> RecordBatches [3] probably using the correct type of Builder [4] and with a
> bit of resistance from the various types
> 3. As Jacek said, you can wrap smaller RecordBatches together as a Table
> [5], combine the chunks [6], and then convert back to RecordBatches using a
> TableBatchReader [7] if necessary (see the sketch after the links below)
> 4. I didn't see anything useful in the Compute API for concatenating
> arbitrary Arrays or RecordBatches, but you can use Selection functions [8]
> instead of Slicing for anything that's too big.
>
> [1]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch5SliceE7int64_t7int64_t
> [2]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatch4MakeENSt10shared_ptrI6SchemaEE7int64_tNSt6vectorINSt10shared_ptrI5ArrayEEEE
> [3]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch6columnEi
> [4]: https://arrow.apache.org/docs/cpp/arrays.html#building-an-array
> [5]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5Table17FromRecordBatchesENSt10shared_ptrI6SchemaEERKNSt6vectorINSt10shared_ptrI11RecordBatchEEEE
> [6]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow5Table13CombineChunksEP10MemoryPool
> [7]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow16TableBatchReaderE
> [8]: https://arrow.apache.org/docs/cpp/compute.html#selections
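>
> A minimal sketch of #3 in C++ (untested; `batches` stands in for the vector
> of small RecordBatches, e.g. as produced by the CSV reader):
>
>     #include <arrow/api.h>
>
>     // Wrap the small batches in a Table (cheap, no data copied), merge the
>     // per-column chunks into contiguous arrays, then re-read the table as
>     // RecordBatches capped at `rows_per_batch` rows each.
>     arrow::Result<std::vector<std::shared_ptr<arrow::RecordBatch>>> Rebatch(
>         const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
>         int64_t rows_per_batch) {
>       ARROW_ASSIGN_OR_RAISE(auto table,
>                             arrow::Table::FromRecordBatches(batches));
>       ARROW_ASSIGN_OR_RAISE(table, table->CombineChunks());
>
>       arrow::TableBatchReader reader(*table);
>       reader.set_chunksize(rows_per_batch);
>
>       std::vector<std::shared_ptr<arrow::RecordBatch>> out;
>       std::shared_ptr<arrow::RecordBatch> batch;
>       while (true) {
>         ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
>         if (batch == nullptr) break;  // table exhausted
>         out.push_back(batch);
>       }
>       return out;
>     }
>
> With David's numbers, Rebatch(batches, 1000000) should yield roughly 36
> batches of up to 1 million rows each.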
>
> # ------------------------------
>
> # Aldrin
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
> On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka <jacek.plis...@gmail.com> wrote:
>
> > Hi!
> >
> > I think some code is needed for clarity. You can concatenate tables (and
> > combine_chunks afterwards) or arrays. Then pass the concatenated result.
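> >
> > In C++ terms, a rough sketch (untested; `t1` and `t2` stand in for tables
> > that share a schema):
> >
> >     #include <arrow/api.h>
> >
> >     // Concatenate two same-schema tables (the chunks are collected, not
> >     // copied), then merge each column's chunks into one contiguous array.
> >     arrow::Result<std::shared_ptr<arrow::Table>> ConcatAndCombine(
> >         std::shared_ptr<arrow::Table> t1,
> >         std::shared_ptr<arrow::Table> t2) {
> >       ARROW_ASSIGN_OR_RAISE(auto table,
> >                             arrow::ConcatenateTables({t1, t2}));
> >       return table->CombineChunks();
> >     }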
> >
> > Regards,
> >
> > Jacek
> >
> > Wed, 22 Nov 2023 at 19:54 Lee, David (PAG) <david....@blackrock.com.invalid> wrote:
> >
> > > I've got 36 million rows of data which ends up as 3,000 record batches
> > > ranging from 12k to 300k rows each. I'm assuming these batches are
> > > created using the multithreaded CSV file reader.
> > >
> > > Is there any way to reorganize the data into something like 36 batches
> > > consisting of 1 million rows each?
> > >
> > > What I'm seeing when we try to load this data using the ADBC Snowflake
> > > driver is that each individual batch is executed as a bind array insert
> > > in the Snowflake Go Driver. 3,000 bind array inserts are taking 3 hours.
