jdarais opened a new pull request, #139:
URL: https://github.com/apache/avro-rs/pull/139

   Hi, I started playing with avro-rs to build some Rust-based Avro tools, and 
noticed some things I could take a stab at improving in the avro-rs library 
itself.  This PR has a lot of changes in it, but it mostly centers around a new 
`DirectSerializer`, whose behavior is described below.  Feel free to consider 
this PR as a whole, or take bits and pieces from it.  There are some things in 
it that could be useful even independently of the addition of 
`DirectSerializer`.
   
   This PR proposes a direct serialization of objects when serializing the 
"Serde Way", which provides roughly a 5x performance improvement, and fixes 
issues caused by lost schema information when first converting to a `Value` 
before serializing to the Write stream (#70).
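
   As a rough illustration of what "direct" means here (a toy sketch, not code 
from this PR and not the avro-rs API), the writer below walks a schema and a 
struct together and emits Avro binary straight into a byte buffer, with no 
intermediate value tree.  `Schema`, `User`, `write_long`, `write_str`, and 
`write_user` are all hypothetical names made up for this example; only the 
encoding rules (zigzag varint longs, length-prefixed strings, record fields 
written in schema order) come from the Avro specification.

```rust
// Hypothetical toy types for illustration only; none of these are avro-rs APIs.
#[derive(Debug)]
enum Schema {
    Long,
    String,
    Record(Vec<(&'static str, Schema)>),
}

struct User {
    id: i64,
    name: String,
}

// Avro encodes a long as a zigzag value written as a variable-length integer.
fn write_long(n: i64, out: &mut Vec<u8>) {
    let mut z = ((n << 1) ^ (n >> 63)) as u64;
    while z >= 0x80 {
        out.push(((z as u8) & 0x7f) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

// Avro encodes a string as its byte length (a long) followed by the UTF-8 bytes.
fn write_str(s: &str, out: &mut Vec<u8>) {
    write_long(s.len() as i64, out);
    out.extend_from_slice(s.as_bytes());
}

// The schema drives the write: fields are emitted in schema order, and a
// mismatch between schema and data is reported instead of being guessed at.
fn write_user(user: &User, schema: &Schema, out: &mut Vec<u8>) -> Result<(), String> {
    let fields = match schema {
        Schema::Record(fields) => fields,
        other => return Err(format!("expected a record schema, got {:?}", other)),
    };
    for (name, field_schema) in fields {
        match (*name, field_schema) {
            ("id", Schema::Long) => write_long(user.id, out),
            ("name", Schema::String) => write_str(&user.name, out),
            _ => return Err(format!("no matching struct field for {}", name)),
        }
    }
    Ok(())
}

fn main() {
    let schema = Schema::Record(vec![("id", Schema::Long), ("name", Schema::String)]);
    let user = User { id: 3, name: "ab".to_string() };
    let mut out = Vec::new();
    write_user(&user, &schema, &mut out).unwrap();
    println!("{:?}", out);
}
```

   A real serde-based serializer would receive the fields through serde's 
`Serializer` callbacks rather than a hand-written `match`, but the shape is the 
same: the schema is consulted at each step and bytes are written immediately.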
   
   Some notes on these changes:
   - A new Serializer implementation, `DirectSerializer`, provides direct 
serialization of a type using the "Serde Way".  The `DirectSerializer` is 
initialized with the schema to be used for writing data, and uses that schema 
as a guide to know how to serialize the serde types that it encounters.
   - The existing serde benchmark tests seemed to only test serialization of 
`Value` types, which is only part of the process in the "Serde Way", so some 
additional benchmark tests were added to measure the end-to-end performance of 
"Serde Way" serialization.
   - The deserializer appears to be missing some "Avro union / Rust enum" 
(de)serialization capability.  In the existing implementation, it appears that 
Rust enums can be (de)serialized to/from Avro union types only if the 
`#[serde(untagged)]` attribute is added.  That attribute serializes an enum 
value as the variant type itself rather than as a union type, leaving it up to 
the (de)serializer to detect that the schema calls for a union, search the 
union schemas for a match, and use the first schema that matches the data.  
This PR adds a `UnionDeserializer` to `de.rs`, which, in combination with 
`DirectSerializer`, allows "Avro union / Rust enum" (de)serialization without 
requiring the `#[serde(untagged)]` attribute.
   - I was able to replicate almost all behavior of the existing implementation 
using `Value` as an intermediate step, but one thing I was not able to 
replicate is serialization of enums that do have the `#[serde(untagged)]` 
attribute.  The closest I got was making the `DirectSerializer` able to search 
a union schema for the first schema _type_ that matches the input data, but in 
the case of "record" types, there's no easy way to ensure that the field names 
are also a match.  This feature of schema matching on write does seem a bit odd 
to me, since the writer should know the structure of the data being written, 
and schema matching should only really be a concern for the reader, where 
schema resolution is happening between a reader and writer schema.  Still, it 
is worth noting that this difference in behavior is not backwards-compatible.
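
   To make the untagged limitation concrete, here is a small self-contained 
sketch (hypothetical types, not code from this PR or from avro-rs).  It assumes 
a union schema reduced to a list of branches, each with a kind and a name.  
When the writer only knows the kind of the incoming data, as in the untagged 
case, the first branch of that kind wins, so two record branches cannot be told 
apart.  When the enum variant itself is visible, the branch can be picked 
unambiguously; matching by variant name is just one possible strategy used here 
for illustration.

```rust
// Hypothetical toy model of a union schema; not the avro-rs `Schema` type.
#[derive(Debug, PartialEq)]
enum Kind {
    Null,
    Long,
    Record,
}

struct Branch {
    kind: Kind,
    name: &'static str,
}

// Untagged case: only the kind of the data is known, so the search stops at
// the first branch of that kind, whether or not its fields match the data.
fn first_branch_of_kind(union: &[Branch], kind: &Kind) -> Option<usize> {
    union.iter().position(|b| &b.kind == kind)
}

// Tagged case (assumed strategy): the enum variant identifies the branch.
fn branch_by_name(union: &[Branch], name: &str) -> Option<usize> {
    union.iter().position(|b| b.name == name)
}

fn main() {
    let union = [
        Branch { kind: Kind::Null, name: "null" },
        Branch { kind: Kind::Long, name: "long" },
        Branch { kind: Kind::Record, name: "Cat" },
        Branch { kind: Kind::Record, name: "Dog" },
    ];

    // A `Dog` record written through the untagged path would land on the
    // `Cat` branch, because only "this is a record" is known.
    println!("{:?}", first_branch_of_kind(&union, &Kind::Record));

    // With the variant visible, the `Dog` branch is selected directly.
    println!("{:?}", branch_by_name(&union, "Dog"));
}
```

   Since an Avro union is written as the branch index followed by the value, 
picking the wrong record branch in the first case would produce data that 
decodes against the wrong schema.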
   
   All unit tests are passing.  You can take a look at the changes in the unit 
tests to see where there are differences in behavior.


