Xavier Lange created ARROW-5123:
-----------------------------------

             Summary: [Rust] derive RecordWriter from struct definitions
                 Key: ARROW-5123
                 URL: https://issues.apache.org/jira/browse/ARROW-5123
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Rust
            Reporter: Xavier Lange


Migrated from a previous GitHub issue (which saw a lot of comments, but at a rough 
transition time in the project): https://github.com/sunchao/parquet-rs/pull/197

 

Goal
===

Writing many columns to a file is a chore. If you can put your values into a 
struct which mirrors the schema of your file, this 
`derive(ParquetRecordWriter)` will write out all the fields, in the order in 
which they are defined, to a row group.

How to Use
===

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
```

RecordWriter trait
===

This is the new trait which `parquet_derive` will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```
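To make the shape of the trait concrete, here is a hand-rolled impl against a toy stand-in for `RowGroupWriter`, so the sketch is self-contained. `ToyRowGroupWriter` and its `record_column` method are hypothetical; the real parquet writer types are shown in the expanded dump further down.

```rust
// Toy stand-in for parquet's RowGroupWriter, so this sketch compiles on its own.
#[derive(Default)]
struct ToyRowGroupWriter {
    // Each entry is one column's batch of values, rendered as strings.
    columns: Vec<Vec<String>>,
}

impl ToyRowGroupWriter {
    // Hypothetical helper: records one column's batch of values.
    fn record_column<T: ToString>(&mut self, vals: &[T]) {
        self.columns.push(vals.iter().map(T::to_string).collect());
    }
}

// Same shape as the trait above, minus the parquet types.
trait RecordWriter<T> {
    fn write_to_row_group(&self, row_group_writer: &mut ToyRowGroupWriter);
}

struct ACompleteRecord<'a> {
    pub a_bool: bool,
    pub a_str: &'a str,
}

// What a hand-written (or derived) impl boils down to: one column per field,
// written in declaration order.
impl<'a> RecordWriter<ACompleteRecord<'a>> for &[ACompleteRecord<'a>] {
    fn write_to_row_group(&self, row_group_writer: &mut ToyRowGroupWriter) {
        let bools: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
        row_group_writer.record_column(&bools);
        let strs: Vec<&str> = self.iter().map(|x| x.a_str).collect();
        row_group_writer.record_column(&strs);
    }
}

fn main() {
    let records = [ACompleteRecord { a_bool: true, a_str: "hi" }];
    let mut rg = ToyRowGroupWriter::default();
    (&records[..]).write_to_row_group(&mut rg);
    assert_eq!(rg.columns.len(), 2);
}
```

The derive macro's job is to generate exactly this kind of per-field boilerplate so you never write it by hand.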

How does it work?
===

The `parquet_derive` crate adds code-generating functionality to the Rust 
compiler: the code generation takes Rust syntax and emits additional syntax. 
This macro expansion works on Rust 1.15+ stable. It is a procedural-macro 
plugin, loaded by the machinery in cargo. Users don't have to write any special 
`build.rs` steps or anything like that; it happens automatically when they 
include `parquet_derive` in their project. The `parquet_derive/Cargo.toml` has 
a section saying as much:

```
[lib]
proc-macro = true
```

The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the 
`parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` 
crate parses the struct from a string representation into an AST (a recursive 
enum value). The AST contains all the values I care about when generating a 
`RecordWriter` impl:

 - the name of the struct
 - the lifetime variables of the struct
 - the fields of the struct

The fields of the struct are translated from AST to a flat `FieldInfo` struct. 
It has the bits I care about for writing a column: `field_name`, 
`field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
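A minimal sketch of what such a flat struct might look like is below. The field types are simplified to plain strings here; in the real derive code these would hold `syn`/`quote` token types, and the example values for `ACompleteRecord`'s `a_str` field are illustrative.

```rust
// Simplified sketch of the flat per-field record described above.
// Real derive code would store syn AST / quote token types, not strings.
#[derive(Debug, PartialEq)]
struct FieldInfo {
    field_name: String,
    field_lifetime: Option<String>,
    field_type: String,
    is_option: bool,
    column_writer_variant: String,
}

fn main() {
    // The a_str field of ACompleteRecord<'a> might flatten to roughly:
    let info = FieldInfo {
        field_name: "a_str".to_string(),
        field_lifetime: Some("'a".to_string()),
        field_type: "&'a str".to_string(),
        is_option: false,
        column_writer_variant: "ByteArrayColumnWriter".to_string(),
    };
    println!("{} writes via {}", info.field_name, info.column_writer_variant);
}
```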

The code then does the equivalent of templating to build the `RecordWriter` 
implementation. The templating functionality is provided by the `quote` crate. 
At a high-level the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
  fn write_to_row_group(..) {
    $({
      $column_writer_snippet
    })*
  }
}
```

This template is then emitted beneath the struct definition, ending up 
something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_to_row_group(..) {
    {
      write_col_1();
    };
    {
      write_col_2();
    }
  }
}
```

and finally _THIS_ is the code passed to rustc. It's just code now, fully 
expanded and standalone. If a user ever changes their `struct MyStruct` 
definition, the `ParquetRecordWriter` impl will be regenerated. There are no 
intermediate files to version control or worry about.
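The expansion step can be approximated in plain Rust with string templating. This is purely illustrative (`render_record_writer` is a hypothetical helper); the real macro splices token streams with `quote!`, not strings.

```rust
// Build the impl by splicing one pre-generated snippet per column into a
// fixed template, mirroring the $( ... ) repetition in the quote! template.
fn render_record_writer(struct_name: &str, column_snippets: &[String]) -> String {
    let body: String = column_snippets
        .iter()
        .map(|s| format!("    {{ {} }}\n", s))
        .collect();
    format!(
        "impl RecordWriter for {} {{\n  fn write_to_row_group(..) {{\n{}  }}\n}}",
        struct_name, body
    )
}

fn main() {
    let snippets = vec!["write_col_1();".to_string(), "write_col_2();".to_string()];
    let code = render_record_writer("MyStruct", &snippets);
    assert!(code.contains("impl RecordWriter for MyStruct"));
    println!("{}", code);
}
```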

Viewing the Derived Code
===

To see the generated code before it's compiled, it is very useful to install 
`cargo-expand` ([more info on 
gh](https://github.com/dtolnay/cargo-expand)); then you can do:

```
cd $WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

then you can dump the contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

Next I need to write out all the combinations of types we support and make 
sure the derived code writes out data correctly.

Procedural Macros
===

The `parquet_derive` crate can ONLY export the derivation functionality: no 
traits, nothing else. The derive crate cannot host test cases either. It's kind 
of like a "dummy" crate which is only used by the compiler, never by the 
running code.

The parent crate cannot use the derivation functionality, which is important 
because it means test code cannot live in the parent crate. This forces us to 
have a third crate, `parquet_derive_test`.
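The resulting layout might be wired up like this in `parquet_derive_test`'s `Cargo.toml` (a sketch; the crate names match the text, but the version and relative paths are illustrative):

```toml
[package]
name = "parquet_derive_test"
version = "0.1.0"

[dependencies]
# The derived code references the parquet crate's types, so the test crate
# needs both the runtime crate and the derive crate (paths are hypothetical).
parquet = { path = "../parquet" }
parquet_derive = { path = "../parquet_derive" }
```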

I'm open to being wrong on any one of these finer points. I had to bang on this 
for a while to get it to compile!

Potentials For Better Design
===

 - [x] Recursion could be limited by generating the code as "snippets" instead 
of one big `quote!` AST generator. Or so I think. It might be nicer to push 
generating each column's writing code into a separate loop.
 - [X] ~~It would be nicer if I didn't have to be so picky about the data going 
into the `write_batch` function. Is it possible we could make a version of the 
function which accepts `Into<DataType>` or similar? This would greatly simplify 
the derivation code, as it would not need to enumerate all the supported types. 
Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ 
(not tackling in this generation of the plugin)
 - [X] ~~Another idea for improving column writing: could we have a write 
function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just 
write a mapping for accessing the one value, we could skip the whole 
intermediate vec for `write_batch`. That should have some significant memory 
advantages.~~ (not tackling in this generation of the plugin; it's a bigger 
parquet-rs enhancement)
 - [X] ~~It might be worthwhile to derive a parquet schema directly from a 
struct definition. That should stamp out opportunities for type errors.~~ 
(moved to #203)

Status
===

I have successfully integrated this work with my own data exporter (takes 
postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it 
only generates simplistic `RecordWriter`s. As people start to use it, we can 
add code generation for more complex, nested structs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
