[ https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rok Mihevc updated ARROW-5123:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21608

> [Rust] derive RecordWriter from struct definitions
> --------------------------------------------------
>
>                 Key: ARROW-5123
>                 URL: https://issues.apache.org/jira/browse/ARROW-5123
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: Xavier Lange
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 14.5h
>  Remaining Estimate: 0h
>
> Migrated from a previous GitHub issue (which saw a lot of comments but at a
> rough transition time in the project):
> https://github.com/sunchao/parquet-rs/pull/197
>  
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values into a
> struct which mirrors the schema of your file, this
> `derive(ParquetRecordWriter)` will write out all the fields, in the order in
> which they are defined, to a row_group.
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>   pub a_bool: bool,
>   pub a_str: &'a str,
> }
> ```
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
> pub trait RecordWriter<T> {
>   fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
> }
> ```
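> To see the shape of what gets derived, here is a self-contained sketch using a
> simplified stand-in for the row-group writer (a mock; the real trait takes a
> `&mut Box<RowGroupWriter>` from parquet, and the impl below is hand-written to
> mimic what the derive would emit):
> ```
> // Simplified stand-in for the parquet machinery, just to show the trait shape.
> struct MockRowGroupWriter {
>     columns: Vec<Vec<bool>>, // each closed column's values
> }
>
> trait RecordWriter<T> {
>     fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter);
> }
>
> struct DumbRecord {
>     a_bool: bool,
> }
>
> // Hand-written version of what #[derive(ParquetRecordWriter)] would emit.
> impl RecordWriter<DumbRecord> for &[DumbRecord] {
>     fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter) {
>         // Gather one column's values across all records, then "close" it.
>         let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
>         row_group_writer.columns.push(vals);
>     }
> }
>
> fn main() {
>     let records = vec![DumbRecord { a_bool: true }, DumbRecord { a_bool: false }];
>     let mut writer = MockRowGroupWriter { columns: vec![] };
>     (&records[..]).write_to_row_group(&mut writer);
>     println!("columns written: {}", writer.columns.len());
> }
> ```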
> How does it work?
> ===
> The `parquet_derive` crate adds code-generating functionality to the Rust
> compiler. The code generation takes Rust syntax and emits additional syntax.
> This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin,
> loaded by the machinery in cargo. Users don't have to do any special
> `build.rs` steps or anything like that; it's automatic by including
> `parquet_derive` in their project. The `parquet_derive/Cargo.toml` has a
> section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
> The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to 
> the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The 
> `syn` crate parses the struct from a string representation to an AST (a
> recursive enum value). The AST contains all the values I care about when
> generating a `RecordWriter` impl:
>  - the name of the struct
>  - the lifetime variables of the struct
>  - the fields of the struct
> The fields of the struct are translated from AST to a flat `FieldInfo` 
> struct. It has the bits I care about for writing a column: `field_name`, 
> `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
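> As an illustration, the flat descriptor might look roughly like this. This is
> a hedged sketch: the actual `FieldInfo` in `parquet_derive` likely holds `syn`
> AST types rather than strings, and its field names may differ.
> ```
> /// Illustrative flat descriptor for one struct field (names are assumptions).
> #[derive(Debug)]
> struct FieldInfo {
>     field_name: String,             // e.g. "a_bool"
>     field_lifetime: Option<String>, // e.g. Some("'a") for `&'a str`
>     field_type: String,             // e.g. "bool" or "&'a str"
>     is_option: bool,                // true when the type is Option<T>
>     column_writer_variant: String,  // e.g. "BoolColumnWriter"
> }
>
> fn main() {
>     // Hypothetical translation of `pub a_bool: bool` from the example struct.
>     let info = FieldInfo {
>         field_name: "a_bool".to_string(),
>         field_lifetime: None,
>         field_type: "bool".to_string(),
>         is_option: false,
>         column_writer_variant: "BoolColumnWriter".to_string(),
>     };
>     println!("{}", info.column_writer_variant);
> }
> ```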
> The code then does the equivalent of templating to build the `RecordWriter` 
> implementation. The templating functionality is provided by the `quote` 
> crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>   fn write_to_row_group(..) {
>     $({
>       $column_writer_snippet
>     })
>   } 
> }
> ```
> This template is then expanded beneath the struct definition, ending up as
> something like:
> ```
> struct MyStruct {
> }
> impl RecordWriter for MyStruct {
>   fn write_to_row_group(..) {
>     {
>        write_col_1();
>     };
>    {
>        write_col_2();
>    }
>   }
> }
> ```
> and finally _THIS_ is the code passed to rustc. It's just code now, fully
> expanded and standalone. If a user ever changes their `struct MyValue`
> definition, the `ParquetRecordWriter` will be regenerated. There are no
> intermediate values to version control or worry about.
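> To make the templating step concrete, here is a self-contained sketch that
> simulates the expansion using plain `format!` string building. Note this is
> not the real implementation: `quote!` emits a token stream, not a string, and
> `generate_impl` and its snippet strings are purely illustrative.
> ```
> /// Simulate the template: one block per column snippet, wrapped in the impl.
> fn generate_impl(struct_name: &str, column_snippets: &[&str]) -> String {
>     // Mirrors the $({ $column_writer_snippet })* repetition in the template.
>     let body: String = column_snippets
>         .iter()
>         .map(|snippet| format!("    {{ {} }};\n", snippet))
>         .collect();
>     format!(
>         "impl RecordWriter for {name} {{\n  fn write_to_row_group(..) {{\n{body}  }}\n}}",
>         name = struct_name,
>         body = body
>     )
> }
>
> fn main() {
>     let generated = generate_impl("MyStruct", &["write_col_1();", "write_col_2();"]);
>     println!("{}", generated);
> }
> ```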
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, a very useful tool is `cargo
> expand` ([more info on gh](https://github.com/dtolnay/cargo-expand)); then
> you can do:
> ```
> cd $WORK_DIR/parquet-rs/parquet_derive_test
> cargo expand --lib > ../temp.rs
> ```
> then you can dump the contents:
> ```
> struct DumbRecord {
>     pub a_bool: bool,
>     pub a2_bool: bool,
> }
> impl RecordWriter<DumbRecord> for &[DumbRecord] {
>     fn write_to_row_group(
>         &self,
>         row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
>     ) {
>         let mut row_group_writer = row_group_writer;
>         {
>             let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
>             let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
>             if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>                 column_writer
>             {
>                 typed.write_batch(&vals[..], None, None).unwrap();
>             }
>             row_group_writer.close_column(column_writer).unwrap();
>         };
>         {
>             let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
>             let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
>             if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>                 column_writer
>             {
>                 typed.write_batch(&vals[..], None, None).unwrap();
>             }
>             row_group_writer.close_column(column_writer).unwrap();
>         }
>     }
> }
> ```
> Now I need to cover all the combinations of types we support and make sure
> each one writes out data correctly.
> Procedural Macros
> ===
> The `parquet_derive` crate can ONLY export the derivation functionality. No
> traits, nothing else. The derive crate cannot host test cases. It's kind of
> like a "dummy" crate which is only used by the compiler, never at runtime.
> The parent crate cannot use the derivation functionality, which is important
> because it means test code cannot be in the parent crate. This forces us to
> have a third crate, `parquet_derive_test`.
> I'm open to being wrong on any one of these finer points. I had to bang on
> this for a while to get it to compile!
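> The resulting three-crate layout could be sketched as a Cargo workspace like
> the following (a hypothetical manifest; the actual parquet-rs repository
> layout may differ):
> ```
> # Workspace root Cargo.toml (illustrative only)
> [workspace]
> members = [
>     "parquet",              # traits such as RecordWriter and the writers
>     "parquet_derive",       # proc-macro crate: exports only the derive
>     "parquet_derive_test",  # the only crate that can test the derive
> ]
> ```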
> Potentials For Better Design
> ===
>  - [x] Recursion could be limited by generating the code as "snippets"
> instead of one big `quote!` AST generator. Or so I think. It might be nicer
> to push generating each column's writing code into a separate loop.
>  - [X] ~~It would be nicer if I didn't have to be so picky about data going
> into the `write_batch` function. Is it possible we could make a version of
> the function which accepts `Into<DataType>` or similar? This would greatly
> simplify this derivation code as it would not need to enumerate all the
> supported types. Something like `write_generic_batch(&[impl Into<DataType>])`
> would be neat.~~ (not tackling in this generation of the plugin)
>  - [X] ~~Another idea to improve column writing: could we have a write
> function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just
> write a mapping for accessing the one value, we could skip the whole
> intermediate vec for `write_batch`. Should have some significant memory
> advantages.~~ (not tackling in this generation of the plugin, it's a bigger
> parquet-rs enhancement)
>  - [X] ~~It might be worthwhile to derive a parquet schema directly from a 
> struct definition. That should stamp out opportunities for type errors.~~ 
> (moved to #203)
> Status
> ===
> I have successfully integrated this work with my own data exporter (takes 
> postgres/couchdb and outputs a single parquet file).
> I think this code is worth including in the project, with the caveat that it
> only generates simplistic `RecordWriter`s. As people start to use it, we can
> add code generation for more complex, nested structs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
