[ https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rok Mihevc updated ARROW-5123:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21608

[Rust] derive RecordWriter from struct definitions
--------------------------------------------------

Key: ARROW-5123
URL: https://issues.apache.org/jira/browse/ARROW-5123
Project: Apache Arrow
Issue Type: New Feature
Components: Rust
Reporter: Xavier Lange
Priority: Major
Labels: pull-request-available
Fix For: 2.0.0
Time Spent: 14.5h
Remaining Estimate: 0h

Migrated from a previous GitHub pull request (which saw a lot of comments, but at a rough transition time in the project):
https://github.com/sunchao/parquet-rs/pull/197

Goal
===

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row group.

How to Use
===

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
    pub a_bool: bool,
    pub a_str: &'a str,
}
```

RecordWriter trait
===

This is the new trait which `parquet_derive` will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
    fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```

How does it work?
===

The `parquet_derive` crate adds code-generating functionality to the Rust compiler. The code generation takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin, loaded by the machinery in cargo, so users don't have to do any special `build.rs` steps or anything like that; it is automatic once `parquet_derive` is included in their project.
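To make the idea concrete before digging into the mechanism, here is a hand-written, standard-library-only sketch of the kind of impl the derive produces: one column-wise pass per field, in declaration order. The `ColumnSink` type is a hypothetical stand-in for parquet's `RowGroupWriter`/`ColumnWriter` machinery (not the real API), used here only so the shape of the generated code can run on its own.

```rust
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}

// Hypothetical stand-in for a row-group writer: it just collects
// each column batch so the example is self-contained.
#[derive(Default)]
struct ColumnSink {
    batches: Vec<Vec<bool>>,
}

impl ColumnSink {
    fn write_batch(&mut self, vals: Vec<bool>) {
        self.batches.push(vals);
    }
}

trait RecordWriter {
    fn write_to_row_group(&self, sink: &mut ColumnSink);
}

// Hand-written version of what the derive would emit: each field
// becomes a `self.iter().map(|x| x.field).collect()` column batch.
impl RecordWriter for &[DumbRecord] {
    fn write_to_row_group(&self, sink: &mut ColumnSink) {
        sink.write_batch(self.iter().map(|x| x.a_bool).collect());
        sink.write_batch(self.iter().map(|x| x.a2_bool).collect());
    }
}

fn main() {
    let records = vec![
        DumbRecord { a_bool: true, a2_bool: false },
        DumbRecord { a_bool: false, a2_bool: false },
    ];
    let mut sink = ColumnSink::default();
    (&records[..]).write_to_row_group(&mut sink);
    // Two batches, one per field, each with one value per record.
    println!("{:?}", sink.batches);
}
```

Note the row-to-column transposition: the derive turns a slice of row structs into one batch per column, which is why the trait is implemented for `&[DumbRecord]` rather than for the record type itself.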
The `parquet_derive/src/Cargo.toml` has a section saying as much:

```
[lib]
proc-macro = true
```

The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string representation to an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

- the name of the struct
- the lifetime variables of the struct
- the fields of the struct

The fields of the struct are translated from the AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.

The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high level the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
    fn write_row_group(..) {
        $({
            $column_writer_snippet
        })
    }
}
```

This template is then added under the struct definition, ending up something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
    fn write_row_group(..) {
        {
            write_col_1();
        };
        {
            write_col_2();
        }
    }
}
```

and finally _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` will be regenerated. There are no intermediate values to version control or worry about.
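The templating step above can be sketched without the `quote` crate at all. The following standard-library-only example uses `format!` as a stand-in for `quote!`, and a cut-down `FieldInfo` carrying only two of the fields mentioned above (`field_name`, `column_writer_variant`); the emitted `write_<name>()` calls are placeholders, not the real column-writing snippets. It only illustrates the shape of the expansion, not the actual macro.

```rust
// Cut-down version of the macro's FieldInfo: just enough to show
// per-column snippet generation. The real struct carries more.
struct FieldInfo {
    field_name: &'static str,
    column_writer_variant: &'static str, // e.g. "BoolColumnWriter"
}

// Stand-in for the quote!-based template: emit one writing snippet
// per field, in field-definition order, inside one impl block.
fn generate_record_writer(struct_name: &str, fields: &[FieldInfo]) -> String {
    let mut body = String::new();
    for f in fields {
        body.push_str(&format!(
            "    {{ // column `{name}` via {variant}\n        write_{name}();\n    }}\n",
            name = f.field_name,
            variant = f.column_writer_variant,
        ));
    }
    format!(
        "impl RecordWriter for {} {{\n  fn write_row_group(..) {{\n{}  }}\n}}\n",
        struct_name, body
    )
}

fn main() {
    let fields = [
        FieldInfo { field_name: "a_bool", column_writer_variant: "BoolColumnWriter" },
        FieldInfo { field_name: "a2_bool", column_writer_variant: "BoolColumnWriter" },
    ];
    println!("{}", generate_record_writer("DumbRecord", &fields));
}
```

The real macro emits a token stream rather than a string, which is what lets rustc type-check the generated impl like any other code.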
Viewing the Derived Code
===

To see the generated code before it's compiled, one very useful bit is to install `cargo expand` ([more info on gh](https://github.com/dtolnay/cargo-expand)), then you can do:

```
$WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

then you can dump the contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

Now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros
===

The `parquet_derive` crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never the code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, `parquet_derive_test`.

I'm open to being wrong on any one of these finer points.
I had to bang on this for a while to get it to compile!

Potentials For Better Design
===

- [x] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code to another loop.
- [x] ~~It would be nicer if I didn't have to be so picky about data going into the `write_batch` function. Is it possible we could make a version of the function which accepts `Into<DataType>` or similar? This would greatly simplify this derivation code, as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ (not tackling in this generation of the plugin)
- [x] ~~Another idea for improving column writing: could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. Should have some significant memory advantages.~~ (not tackling in this generation of the plugin; it's a bigger parquet-rs enhancement)
- [x] ~~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~~ (moved to #203)

Status
===

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it, we can add code generation for more complex, nested structs.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)