[ https://issues.apache.org/jira/browse/ARROW-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662176#comment-17662176 ]
Rok Mihevc commented on ARROW-5153:
-----------------------------------

This issue has been migrated to [issue #21633|https://github.com/apache/arrow/issues/21633] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Rust] [Parquet] Use IntoIter trait for write_batch/write_mini_batch
> --------------------------------------------------------------------
>
>                 Key: ARROW-5153
>                 URL: https://issues.apache.org/jira/browse/ARROW-5153
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Xavier Lange
>            Priority: Major
>
> Writing data to a Parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:
> {{struct MyData {}}
> {{    name: String,}}
> {{    address: Option<String>}}
> {{}}}
> Over the course of working with sets of this data, you'll have the bulk data in a Vec<MyData>, the names column in a Vec<&String>, and the address column in a Vec<Option<String>>. This puts extra memory pressure on the system; at a minimum, we have to allocate a Vec the same size as the bulk data even when we are only using references.
> What I'm proposing is an IntoIterator style. This maintains backward compatibility, since a slice automatically implements IntoIterator. ColumnWriterImpl#write_batch would go from {{values: &[T::T]}} to {{values: impl IntoIterator<Item = T::T>}}. Then you can do things like:
> {{write_batch(bulk.iter().map(|x| x.name), None, None)}}
> {{write_batch(bulk.iter().map(|x| x.address), Some(bulk.iter().map(|x| x.address.is_some())), None)}}
> and you can see there is no need for an intermediate Vec, so no short-term allocations are needed to write out the data.
> I am writing data with many columns, and I think this would really help speed things up.
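For illustration, here is a minimal, self-contained Rust sketch of the API shape being proposed. The write_batch below is a hypothetical simplification, not the actual parquet crate signature (the real ColumnWriterImpl#write_batch is generic over the column's physical type and also takes repetition levels); it only shows how an IntoIterator-based parameter lets callers stream values straight out of their record structs without building an intermediate Vec.

{code:rust}
// Hypothetical, simplified sketch of the proposed API change.
// `write_batch` accepts anything that implements IntoIterator, so a
// lazy iterator adapter can be passed directly and no intermediate
// Vec of column values has to be allocated.
fn write_batch<I>(values: I, _def_levels: Option<&[i16]>) -> usize
where
    I: IntoIterator<Item = String>,
{
    // A real writer would encode each value into the column chunk;
    // counting the values keeps this sketch self-contained.
    values.into_iter().count()
}

struct MyData {
    name: String,
    address: Option<String>,
}

fn main() {
    let bulk = vec![
        MyData { name: "a".into(), address: Some("x".into()) },
        MyData { name: "b".into(), address: None },
    ];

    // Required column: values stream straight out of `bulk`, so no
    // intermediate Vec<String> is materialized before writing.
    let n = write_batch(bulk.iter().map(|d| d.name.clone()), None);
    assert_eq!(n, 2);

    // Optional column: definition levels mark which rows are non-null,
    // and only the Some values are fed to the writer.
    let def_levels: Vec<i16> = bulk.iter().map(|d| d.address.is_some() as i16).collect();
    let n = write_batch(bulk.iter().filter_map(|d| d.address.clone()), Some(&def_levels));
    assert_eq!(n, 1);
}
{code}

In the real API the item type would come from the column's physical type (T::T), and keeping slice-based callers working unchanged would need a signature that also accepts borrowed items; the sketch pins Item = String only to stay small.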