Ten created AVRO-3714:
-------------------------
Summary: Zero-copy (de)serialization - (de)serialization rewrite
takeover?
Key: AVRO-3714
URL: https://issues.apache.org/jira/browse/AVRO-3714
Project: Apache Avro
Issue Type: Improvement
Components: rust
Reporter: Ten
Soo... I ended up taking up [this
invitation|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649163&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649163].
:)
I was using this library and found myself severely CPU-bound on deserialization
performance (a feat I had never been able to achieve with Rust before), so I
had a go at fixing [https://github.com/flavray/avro-rs/issues/195].
I initially started writing in the same repository. However, as I went through
the existing code, I noticed many opportunities for improvement and wanted to
try a significantly different design. Since my code was completely separate
and independent from the existing code, I ended up splitting it out into a
separate repository, and somewhat accidentally ended up with a full-featured
deserialization library. (For context, I've been a professional Rust developer
for years and I've regularly worked with Serde's internals.)
This deserialization code achieves >10x performance gains, seems simpler to use
while still being as flexible as necessary, and passes all the relevant tests I
could find (besides
[AVRO-3240|https://github.com/apache/avro/pull/1379#issuecomment-1420386540],
intentionally, for the reasons explained there).
It uses this apache-avro library as a dependency for initial schema parsing.
It could probably be extended to serialization fairly easily, using the same
pattern, which would fix [the currently pending serialization
issue|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649103].
There are a few major design differences:
* {{Value}} is removed. Deserialization is a one-step process, which is fully
serde-integrated, and leverages its zero-copy features. The output struct can
now borrow from the source slice.
** Having an intermediate {{Value}} representation appears to be unnecessary
in Rust, as the two use-cases for {{Value}} would seem to be:
*** Somewhat-known structure of data but still some amount of dynamic
processing -> You can deserialize to somewhat-dynamic Rust types, e.g.
{{HashMap}}, {{Vec}}...
*** Transcoding to a different serialization format (e.g. JSON) with basically
zero structural information -> This can still be achieved in a much more
performant and idiomatic manner using
[serde_transcode|https://crates.io/crates/serde-transcode].
** The {{Value}} representation hurts performance compared to deserializing
right away to the correct struct (especially when said representation involves
as many allocations as this one does).
* Reader schema concept is removed. It appeared to be largely unnecessary in
Rust, as it is a fully statically typed language, and the [deserialization
hints|https://serde.rs/impl-deserializer.html] provided by the struct through
the Serde framework combined with the writer schema information give all that
is necessary to construct the correct types directly, without the need for a
separate schema.
** I expect that any code that currently uses a reader schema would work out
of the box with this new deserializer without the need to specify a reader
schema at all.
** If you need to convert Avro byte streams from one schema to another, this
could likely be achieved simply by plugging the deserializer into the serializer
through [serde_transcode|https://crates.io/crates/serde-transcode], as such a
serializer would ([unlike the current
one|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649103])
combine the types provided from the original struct (or in this case,
deserializer) with the schema variant to remap the values in a correct way,
while preserving zero-alloc.
* Schema representation is reworked to be a pre-computed self-referential DAG
structure.
** This is what allows for maximum performance when traversing it during
(de)serialization operations.
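To make the first and last points more concrete, here is a minimal, std-only sketch. It is not code from serde_avro_fast or apache-avro: {{Node}}, {{read_long}}, {{read_str}}, {{collect_strs}} and {{chain_schema}} are hypothetical names, and the schema is stored as an index-linked {{Vec}} rather than whatever self-referential structure the library actually uses. It assumes only the Avro binary encoding (zigzag varint longs, length-prefixed UTF-8 strings, union branch index before the value).

```rust
// Hypothetical, std-only sketch; none of these names come from
// serde_avro_fast or apache-avro.

/// Schema nodes stored in one Vec and linked by index, so named references
/// are resolved once when the schema is built, not on every read.
enum Node {
    Null,
    Str,
    /// Record whose fields are node indices, in writer order.
    Record(Vec<usize>),
    /// Union whose branches are node indices.
    Union(Vec<usize>),
}

/// Reads one Avro `long` (zigzag varint) and advances `input` past it.
fn read_long(input: &mut &[u8]) -> Option<i64> {
    let mut acc: u64 = 0;
    for shift in (0..64).step_by(7) {
        let data = *input;
        let (&byte, rest) = data.split_first()?;
        *input = rest;
        acc |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            // Undo the zigzag mapping: 0, 1, 2, 3 -> 0, -1, 1, -2.
            return Some((acc >> 1) as i64 ^ -((acc & 1) as i64));
        }
    }
    None // more than 10 continuation bytes: not a valid long
}

/// Reads an Avro `string` as a slice borrowed from the input buffer:
/// the bytes are validated as UTF-8 but never copied.
fn read_str<'a>(input: &mut &'a [u8]) -> Option<&'a str> {
    let len = usize::try_from(read_long(input)?).ok()?;
    let data = *input;
    if len > data.len() {
        return None;
    }
    let (bytes, rest) = data.split_at(len);
    *input = rest;
    std::str::from_utf8(bytes).ok()
}

/// Walks the schema graph by index and collects every string it meets,
/// each one borrowed straight from `input`.
fn collect_strs<'a>(
    schema: &[Node],
    at: usize,
    input: &mut &'a [u8],
    out: &mut Vec<&'a str>,
) -> Option<()> {
    match &schema[at] {
        Node::Null => {}
        Node::Str => out.push(read_str(input)?),
        Node::Record(fields) => {
            for &field in fields {
                collect_strs(schema, field, input, out)?;
            }
        }
        Node::Union(branches) => {
            let index = usize::try_from(read_long(input)?).ok()?;
            collect_strs(schema, *branches.get(index)?, input, out)?;
        }
    }
    Some(())
}

/// record Chain { label: string, next: ["null", Chain] }
/// The union's second branch points back at node 0.
fn chain_schema() -> Vec<Node> {
    vec![
        Node::Record(vec![1, 2]),
        Node::Str,
        Node::Union(vec![3, 0]),
        Node::Null,
    ]
}
```

Calling {{collect_strs(&chain_schema(), 0, &mut input, &mut out)}} on an encoded chain fills {{out}} with {{&str}} slices that point into the original buffer, and the recursive {{Chain}} reference is followed with a plain index lookup.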
The library supports any deserialization target type I could think of (besides [some
TODO|https://github.com/Ten0/serde_avro_fast/blob/10e6ad00fd5b0770a60ca09b4487aa00e8868313/src/de/deserializer/mod.rs#L282]
left in the code), including advanced union usage with (or without) enums, as
well as proper Option support. I would encourage you to test any exotic
use-case you have in mind and see if that works.
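As an illustration of what Option support over unions involves, here is a small std-only sketch. It is an assumption about the general technique, not the library's code: {{Branch}}, {{read_long}} and {{read_optional_long}} are hypothetical names. The target type ({{Option<i64>}}) plays the role of the Serde hint, and the writer schema tells the decoder which branch index means null, wherever that branch sits in the union.

```rust
// Hypothetical, std-only sketch; these names are not from serde_avro_fast.

/// The branch types a writer-schema union may contain in this sketch.
#[derive(Clone, Copy)]
enum Branch {
    Null,
    Long,
}

/// Reads one Avro `long` (zigzag varint) and advances `input` past it.
fn read_long(input: &mut &[u8]) -> Result<i64, &'static str> {
    let mut acc: u64 = 0;
    for shift in (0..64).step_by(7) {
        let data = *input;
        let (&byte, rest) = data.split_first().ok_or("unexpected end of input")?;
        *input = rest;
        acc |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Ok((acc >> 1) as i64 ^ -((acc & 1) as i64));
        }
    }
    Err("varint too long")
}

/// Decodes a union written with `branches` into an `Option<i64>`.
/// Avro writes the branch index first, then the value of that branch.
fn read_optional_long(
    branches: &[Branch],
    input: &mut &[u8],
) -> Result<Option<i64>, &'static str> {
    let index = usize::try_from(read_long(input)?).map_err(|_| "negative union index")?;
    match branches.get(index) {
        Some(Branch::Null) => Ok(None),
        Some(Branch::Long) => read_long(input).map(Some),
        None => Err("union index out of range"),
    }
}
```

With branches {{[Null, Long]}}, the bytes {{0x02 0x54}} select the second branch and decode to {{Some(42)}}, while a single {{0x00}} byte decodes to {{None}}.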
The library is extensively documented (I'm hosting documentation
[here|https://ten0.github.io/serde_avro_fast_doc/serde_avro_fast/] temporarily
so you can browse it easily, while full source code is
[here|https://github.com/Ten0/serde_avro_fast]).
So now my questions mainly are:
* Does it look like this should/could be taken over by apache-avro, replacing
the implementation originally written by flavray in avro-rs?
* Or should I release it as a separate crate?
* Can you think of common use-cases that would be prevented by the design
choice of completely removing the avro {{Value}} and reader schema concepts
from a Rust (de)serialization library?
* How is [the per-language-releases
project|https://lists.apache.org/thread/2rfnszd4dk36jxynpj382b1717gbyv1y]
going? ^^ (Wouldn't like it to take months to get a new feature out if I were
to add one ;) )
Thanks,
--
This message was sent by Atlassian Jira
(v8.20.10#820010)