jdarais opened a new pull request, #139: URL: https://github.com/apache/avro-rs/pull/139
Hi, I started playing with avro-rs to build some Rust-based Avro tools, and noticed some things I could take a stab at improving in the avro-rs library itself. This PR has a lot of changes in it, but it mostly centers around a new `DirectSerializer`, whose behavior is described below. Feel free to consider this PR as a whole, or take bits and pieces from it. There are some things in it that could be useful even independently of the addition of `DirectSerializer`.

This PR proposes a direct serialization of objects when serializing the "Serde Way", which provides roughly a 5x performance improvement and fixes issues caused by lost schema information when first converting to a `Value` before serializing to the Write stream (#70).

Some notes on these changes:

- A new Serializer implementation, `DirectSerializer`, provides direct serialization of a type using the "Serde Way". The `DirectSerializer` is initialized with the schema to be used for writing data, and uses that schema as a guide to know how to serialize the serde types that it encounters.
- The existing serde benchmark tests seemed to only test serialization of `Value` types, which is only part of the process in the "Serde Way", so some additional benchmark tests were added to measure the end-to-end performance of "Serde Way" serialization.
- The deserializer appears to be missing some "Avro union / Rust enum" (de)serialization capability. In the existing implementation, it appears that Rust enums can be (de)serialized to/from Avro union types, but only if the `#[serde(untagged)]` attribute is added. That attribute serializes an enum value as the variant type itself rather than as a union type, so it's up to the (de)serializer to detect that the schema calls for a union, search the union schemas for a match, and use the first schema found that matches the data. This PR adds a `UnionDeserializer` to `de.rs`, which, in combination with `DirectSerializer`, allows "Avro union / Rust enum" (de)serialization without requiring the `#[serde(untagged)]` attribute.
- I was able to replicate almost all behavior of the existing implementation, which uses `Value` as an intermediate step, but one thing I was not able to replicate is serialization of enums that do have the `#[serde(untagged)]` attribute. The closest I got was making the `DirectSerializer` able to search a union schema for the first schema _type_ that matches the input data, but in the case of "record" types, there's no easy way to ensure that the field names are also a match. This feature of schema matching on write does seem a bit odd to me, since the writer should know the structure of the data being written, and schema matching should only really be a concern for the reader, where schema resolution is happening between a reader and writer schema. Still, it is worth noting that this difference in behavior is not backwards-compatible.

All unit tests are passing. You can take a look at the changes in the unit tests to see where there are differences in behavior.
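To make the type-only union matching described above more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the avro-rs API or the code in this PR: `Schema`, `Datum`, `first_matching_branch`, `encode_long`, and `encode_union` are hypothetical, simplified stand-ins. The only part taken from the Avro specification is the wire format, where a union is written as the zero-based branch index encoded as a long (zigzag, variable-length), followed by the value itself.

```rust
// Hypothetical, simplified stand-ins; these are NOT the avro-rs types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Long,
    String,
    Record { name: &'static str, fields: Vec<&'static str> },
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Datum {
    Null,
    Long(i64),
    String(String),
    Record(Vec<(&'static str, Datum)>),
}

/// Returns the index of the first union branch whose *type* matches the datum.
/// Record field names are deliberately not compared, which reproduces the
/// ambiguity described for untagged enums.
fn first_matching_branch(union: &[Schema], datum: &Datum) -> Option<usize> {
    union.iter().position(|schema| {
        matches!(
            (schema, datum),
            (Schema::Null, Datum::Null)
                | (Schema::Long, Datum::Long(_))
                | (Schema::String, Datum::String(_))
                | (Schema::Record { .. }, Datum::Record(_))
        )
    })
}

/// Avro long encoding: zigzag, then 7 bits per byte with a continuation bit.
fn encode_long(n: i64, out: &mut Vec<u8>) {
    let mut z = ((n << 1) ^ (n >> 63)) as u64;
    while z >= 0x80 {
        out.push(((z & 0x7f) as u8) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Writes a union value: the branch index as a long, then the value.
/// Record payloads are left out of this sketch and yield `None`.
fn encode_union(union: &[Schema], datum: &Datum) -> Option<Vec<u8>> {
    let index = first_matching_branch(union, datum)?;
    let mut out = Vec::new();
    encode_long(index as i64, &mut out);
    match datum {
        Datum::Null => {} // null is encoded as zero bytes
        Datum::Long(n) => encode_long(*n, &mut out),
        Datum::String(s) => {
            // length in bytes as a long, then the UTF-8 bytes
            encode_long(s.len() as i64, &mut out);
            out.extend_from_slice(s.as_bytes());
        }
        Datum::Record(_) => return None,
    }
    Some(out)
}
```

With a union of two record schemas that differ only in their field names, `first_matching_branch` returns index 0 for any record datum, even when the datum's fields only fit the second schema. That is the case where matching on type alone cannot reproduce the `Value`-based behavior for `#[serde(untagged)]` enums.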