[ https://issues.apache.org/jira/browse/ARROW-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662176#comment-17662176 ]
Rok Mihevc commented on ARROW-5153:
-----------------------------------

This issue has been migrated to [issue #21633|https://github.com/apache/arrow/issues/21633] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Rust] [Parquet] Use IntoIter trait for write_batch/write_mini_batch
> --------------------------------------------------------------------
>
>                 Key: ARROW-5153
>                 URL: https://issues.apache.org/jira/browse/ARROW-5153
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Xavier Lange
>            Priority: Major
>
> Writing data to a Parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:
> {{struct MyData {}}
> {{    name: String,}}
> {{    address: Option<String>}}
> {{}}}
> Over the course of working with sets of this data, you'll have the bulk data in a Vec<MyData>, the names column in a Vec<&String>, and the address column in a Vec<Option<String>>. This puts extra memory pressure on the system; at a minimum, we have to allocate a Vec the same size as the bulk data even when we are only using references.
> What I'm proposing is an IntoIterator style. This maintains backward compatibility, since a slice automatically implements IntoIterator. ColumnWriterImpl#write_batch would go from {{values: &[T::T]}} to {{values: impl IntoIterator<Item = T::T>}}. Then you can do things like:
> {{write_batch(bulk.iter().map(|x| x.name), None, None)}}
> {{write_batch(bulk.iter().map(|x| x.address), Some(bulk.iter().map(|x| x.address.is_some())), None)}}
> and you can see there is no need for an intermediate Vec, so no short-term allocations are needed to write out the data.
> I am writing data with many columns, and I think this would really help speed things up.
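For illustration, here is a minimal, self-contained Rust sketch of the API shape being proposed. The write_batch below is a hypothetical simplification, not the actual parquet crate signature (the real ColumnWriterImpl#write_batch is generic over the column's physical type and also takes repetition levels); it only shows how an IntoIterator-based parameter lets callers stream values straight out of their record structs without building an intermediate Vec.

{code:rust}
// Hypothetical, simplified sketch of the proposed API change.
// `write_batch` accepts anything that implements IntoIterator, so a
// lazy iterator adapter can be passed directly and no intermediate
// Vec of column values has to be allocated.
fn write_batch<I>(values: I, _def_levels: Option<&[i16]>) -> usize
where
    I: IntoIterator<Item = String>,
{
    // A real writer would encode each value into the column chunk;
    // counting the values keeps this sketch self-contained.
    values.into_iter().count()
}

struct MyData {
    name: String,
    address: Option<String>,
}

fn main() {
    let bulk = vec![
        MyData { name: "a".into(), address: Some("x".into()) },
        MyData { name: "b".into(), address: None },
    ];

    // Required column: values stream straight out of `bulk`, so no
    // intermediate Vec<String> is materialized before writing.
    let n = write_batch(bulk.iter().map(|d| d.name.clone()), None);
    assert_eq!(n, 2);

    // Optional column: definition levels mark which rows are non-null,
    // and only the Some values are fed to the writer.
    let def_levels: Vec<i16> = bulk.iter().map(|d| d.address.is_some() as i16).collect();
    let n = write_batch(bulk.iter().filter_map(|d| d.address.clone()), Some(&def_levels));
    assert_eq!(n, 1);
}
{code}

In the real API the item type would come from the column's physical type (T::T), and keeping slice-based callers working unchanged would need a signature that also accepts borrowed items; the sketch pins Item = String only to stay small.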