Hey, I'm trying to figure out how to merge multiple RecordBatches in order to
optimize overly-chunked tables.
A bit of background here: we have a process that streams table rows with a
batch size of 1 (because we want to ensure each update is written out in case
of a crash). We also have some code that reads this table back in on startup.
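
For context, the writer is roughly the sketch below (heavily simplified;
StreamRows and the file name are illustrative, not our real code):

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/writer.h>

// Open the file once, then write every update as its own one-row batch,
// which is how the table ends up with a chunk size of 1.
arrow::Status StreamRows(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& rows) {
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("updates.arrow"));
  ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeFileWriter(sink, schema));
  for (const auto& row : rows) {
    // batch size 1: each row hits disk as soon as it is produced
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*row));
  }
  return writer->Close();
}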
Our reading code has logic to access a specific row of a table, which this
startup code uses. To access a specific row you need to iterate through all
the chunks to find the right one, and we're hitting a bottleneck on this
particular file since it has a chunk size of 1.
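
Concretely, the per-row lookup is essentially this (simplified; SliceRow is an
illustrative name):

#include <arrow/api.h>

// Walk the chunks until the requested row falls inside one. With N
// one-row chunks this is O(N) per access, hence the bottleneck.
std::shared_ptr<arrow::Array> SliceRow(const arrow::ChunkedArray& column,
                                       int64_t row) {
  for (const auto& chunk : column.chunks()) {
    if (row < chunk->length()) {
      return chunk->Slice(row, 1);
    }
    row -= chunk->length();
  }
  return nullptr;  // row out of range
}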
The simplest solution for us would be to merge all the chunked data into one
chunk on startup when we read in the Arrow file. We've tried to find a way to
do this using the Arrow C++ library and its documentation but can't seem to
find a clean approach.
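
The closest we've pieced together is concatenating each column's chunks by
hand, something like the sketch below (MergeChunks is our name; we're assuming
arrow::Concatenate from arrow/array/concatenate.h is the right tool here):

#include <arrow/api.h>
#include <arrow/array/concatenate.h>

// Concatenate each column's chunks into a single Array, then rebuild
// the table so every column is a one-chunk ChunkedArray.
arrow::Result<std::shared_ptr<arrow::Table>> MergeChunks(
    const std::shared_ptr<arrow::Table>& table) {
  std::vector<std::shared_ptr<arrow::ChunkedArray>> columns;
  columns.reserve(table->num_columns());
  for (int i = 0; i < table->num_columns(); ++i) {
    ARROW_ASSIGN_OR_RAISE(
        auto merged, arrow::Concatenate(table->column(i)->chunks(),
                                        arrow::default_memory_pool()));
    columns.push_back(
        std::make_shared<arrow::ChunkedArray>(arrow::ArrayVector{merged}));
  }
  return arrow::Table::Make(table->schema(), columns, table->num_rows());
}

This works, but it feels like we're reimplementing something the library
probably already provides.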
Is there a clean way to do this? Any other suggestions?

Side note - we did notice there's a method called "RechunkArraysConsistently".
We couldn't find much info on it, but if it somehow guarantees all chunks are
the same size and lets us re-chunk the columns, then row access would be a
quick calculation (with a uniform chunk size, computing the chunk and the row
within the chunk is cheap).
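
That is, lookup would drop to two integer ops, something like this (chunk_size
being whatever we'd rechunk to):

#include <cstdint>
#include <utility>

// With a uniform chunk size, locating a row is O(1):
// returns {chunk index, offset within that chunk}.
std::pair<int64_t, int64_t> LocateRow(int64_t row, int64_t chunk_size) {
  return {row / chunk_size, row % chunk_size};
}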


Thanks
- Rob




