Ten created AVRO-3714:
-------------------------

             Summary: Zero-copy (de)serialization - (de)serialization rewrite 
takeover?
                 Key: AVRO-3714
                 URL: https://issues.apache.org/jira/browse/AVRO-3714
             Project: Apache Avro
          Issue Type: Improvement
          Components: rust
            Reporter: Ten


Soo... I ended up taking up [this 
invitation|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649163&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649163].
 :)

While using this library I found myself severely CPU-bound on deserialization 
performance (a feat I had never managed to achieve with Rust before), so I had 
a go at fixing [https://github.com/flavray/avro-rs/issues/195].

I initially started writing in the same repository. However, as I went through 
the existing code I noticed many opportunities for improvement and wanted to 
try a significantly different design. Since my code was completely separate 
and independent from the existing code, I ended up splitting it out into a 
separate repository, and somewhat accidentally ended up with a full-featured 
deserialization library. (For context, I've been a professional Rust developer 
for years and I've regularly worked with Serde's internals.)

This deserialization code achieves >10x performance gains, seems simpler to use 
while still being as flexible as necessary, and passes all the relevant tests I 
could find (besides 
[AVRO-3240|https://github.com/apache/avro/pull/1379#issuecomment-1420386540], 
intentionally, for the reasons explained there).

It uses this apache-avro library as a dependency for initial schema parsing.

It could probably be extended to serialization fairly easily using the same 
pattern, which would fix [the currently pending serialization 
issue|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649103].
 

There are a few major design differences:
 * {{Value}} is removed. Deserialization is a one-step process, which is fully 
serde-integrated, and leverages its zero-copy features. The output struct can 
now borrow from the source slice.
 ** Having an intermediate {{Value}} representation appears to be unnecessary 
in Rust, as the two use-cases for {{Value}} would seem to be:
 *** Somewhat-known structure of data but still some amount of dynamic 
processing -> You can deserialize to somewhat-dynamic Rust types, e.g. 
{{HashMap}}, {{Vec}}... 
 *** Transcoding to a different serialization format (e.g. JSON) with basically 
zero structural information -> This can still be achieved in a much more 
performant and idiomatic manner using 
[serde_transcode|https://crates.io/crates/serde-transcode].
 ** The {{Value}} representation hurts performance compared to deserializing 
right away to the correct struct (especially when said representation involves 
as many allocations as this one does).
 * Reader schema concept is removed. It appeared to be largely unnecessary in 
Rust, as it is a fully statically typed language, and the [deserialization 
hints|https://serde.rs/impl-deserializer.html] provided by the struct through 
the Serde framework combined with the writer schema information give all that 
is necessary to construct the correct types directly, without the need for a 
separate schema.
 ** I expect that any code that currently uses a reader schema would work out 
of the box with this new deserializer without the need to specify a reader 
schema at all.
 ** If you need to convert Avro byte streams from one schema to another, this 
could likely be achieved simply by plugging the deserializer into the serializer 
through [serde_transcode|https://crates.io/crates/serde-transcode], as such a 
serializer would ([unlike the current 
one|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649103])
 combine the types provided by the original struct (or in this case, the 
deserializer) with the schema variant to remap the values correctly, while 
remaining allocation-free.
 * Schema representation is reworked to be a pre-computed self-referential DAG 
structure.
 ** This is what allows for maximum performance when traversing it during 
(de)serialization operations.
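
To make the first and last points a bit more concrete, here is a minimal, 
std-only sketch of the general idea. It is not the actual serde_avro_fast 
code: {{Node}}, {{collect_strs}} and {{linked_list_schema}} are hypothetical 
names, and the index-based arena is just one assumed way to pre-compute a 
self-referential schema graph. The only facts it relies on are from the Avro 
binary encoding: a {{long}} is a zigzag varint, a {{string}} is a {{long}} 
length followed by UTF-8 bytes, and a union is a {{long}} branch index 
followed by the value.

```rust
/// A schema node stored in a flat arena (`Vec<Node>`). Children are arena
/// indices, so a recursive type is a node that points back to an earlier index.
/// (Hypothetical representation, for illustration only.)
enum Node {
    Null,
    Str,
    Record(Vec<usize>), // field nodes, in declaration order
    Union(Vec<usize>),  // branch nodes, in declaration order
}

/// Reads one Avro `long`: a zigzag-encoded variable-length integer.
fn read_long(input: &mut &[u8]) -> Option<i64> {
    let mut acc: u64 = 0;
    let mut shift: u32 = 0;
    loop {
        let data = *input;
        let (&byte, rest) = data.split_first()?;
        *input = rest;
        if shift >= 64 {
            return None; // more than 10 bytes: not a valid long
        }
        acc |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    Some(((acc >> 1) as i64) ^ -((acc & 1) as i64))
}

/// Reads one Avro `string` as a `&str` borrowed from `input` (no copy).
fn read_str<'a>(input: &mut &'a [u8]) -> Option<&'a str> {
    let n = read_long(input)?;
    if n < 0 {
        return None;
    }
    let len = n as usize;
    let data: &'a [u8] = *input;
    if len > data.len() {
        return None;
    }
    let (bytes, rest) = data.split_at(len);
    *input = rest;
    std::str::from_utf8(bytes).ok()
}

/// Walks the schema from node `at`, consuming `input` and collecting every
/// string as a slice borrowed from `input`. Returns `None` on malformed input.
fn collect_strs<'a>(
    schema: &[Node],
    at: usize,
    input: &mut &'a [u8],
    out: &mut Vec<&'a str>,
) -> Option<()> {
    match schema.get(at)? {
        Node::Null => {}
        Node::Str => out.push(read_str(input)?),
        Node::Record(fields) => {
            for &field in fields {
                collect_strs(schema, field, input, out)?;
            }
        }
        Node::Union(branches) => {
            let idx = read_long(input)?;
            if idx < 0 {
                return None;
            }
            let branch = *branches.get(idx as usize)?;
            collect_strs(schema, branch, input, out)?;
        }
    }
    Some(())
}

/// Hypothetical recursive schema:
/// record Item { name: string, next: union { null, Item } }
fn linked_list_schema() -> Vec<Node> {
    vec![
        Node::Record(vec![1, 2]), // 0: the record itself
        Node::Str,                // 1: field `name`
        Node::Union(vec![3, 0]),  // 2: field `next`; branch 1 points back to node 0
        Node::Null,               // 3: the `null` branch
    ]
}
```

Because every named reference is already resolved to an index, the traversal 
never does a name lookup, and the collected {{&str}} values point directly 
into the input buffer.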

It supports any deserialization target type I could think of (besides [some 
TODO|https://github.com/Ten0/serde_avro_fast/blob/10e6ad00fd5b0770a60ca09b4487aa00e8868313/src/de/deserializer/mod.rs#L282]
 left in the code), including advanced union usage with (or without) enums, as 
well as proper Option support. I would encourage you to test any exotic 
use-case you have in mind and see if that works.
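
As a rough illustration of what "union to enum" and "union to Option" mean at 
the byte level, here is a small std-only sketch. It is not how serde_avro_fast 
is implemented (that goes through Serde); {{LongOrStr}}, 
{{read_nullable_long}} and {{read_long_or_str}} are hypothetical names, and 
the two union schemas ({{["null", "long"]}} and {{["long", "string"]}}) are 
assumed for the example. An Avro union is encoded as a {{long}} branch index 
followed by the value of that branch.

```rust
/// Hypothetical target type for the union ["long", "string"].
/// The string branch borrows from the input buffer.
#[derive(Debug, PartialEq)]
enum LongOrStr<'a> {
    Long(i64),
    Str(&'a str),
}

/// Reads one Avro `long`: a zigzag-encoded variable-length integer.
fn read_long(input: &mut &[u8]) -> Option<i64> {
    let mut acc: u64 = 0;
    let mut shift: u32 = 0;
    loop {
        let data = *input;
        let (&byte, rest) = data.split_first()?;
        *input = rest;
        if shift >= 64 {
            return None; // more than 10 bytes: not a valid long
        }
        acc |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    Some(((acc >> 1) as i64) ^ -((acc & 1) as i64))
}

/// Decodes the union ["null", "long"] into `Option<i64>`.
/// The outer `Option` is `None` when the input is malformed.
fn read_nullable_long(input: &mut &[u8]) -> Option<Option<i64>> {
    match read_long(input)? {
        0 => Some(None),
        1 => Some(Some(read_long(input)?)),
        _ => None, // branch index out of range
    }
}

/// Decodes the union ["long", "string"] into `LongOrStr`.
fn read_long_or_str<'a>(input: &mut &'a [u8]) -> Option<LongOrStr<'a>> {
    match read_long(input)? {
        0 => Some(LongOrStr::Long(read_long(input)?)),
        1 => {
            let n = read_long(input)?;
            if n < 0 {
                return None;
            }
            let len = n as usize;
            let data: &'a [u8] = *input;
            if len > data.len() {
                return None;
            }
            let (bytes, rest) = data.split_at(len);
            *input = rest;
            Some(LongOrStr::Str(std::str::from_utf8(bytes).ok()?))
        }
        _ => None, // branch index out of range
    }
}
```

With Serde, the same mapping would be driven by the target type ({{Option<T>}} 
or an enum) rather than by hand-written functions like these.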

It is extensively documented (I'm hosting documentation 
[here|https://ten0.github.io/serde_avro_fast_doc/serde_avro_fast/] temporarily 
so you can browse it easily, while full source code is 
[here|https://github.com/Ten0/serde_avro_fast]).


So now my questions mainly are:
 * Does it look like this should/could be taken over by apache-avro, replacing 
the implementation originally written by flavray in avro-rs?
 * Or should I release it as a separate crate?
 * Can you think of common use-cases that would be prevented by the design 
choice of completely removing the avro {{Value}} and reader schema concepts 
from a Rust (de)serialization library?
 * How is [the per-language-releases 
project|https://lists.apache.org/thread/2rfnszd4dk36jxynpj382b1717gbyv1y] 
going? ^^ (Wouldn't like it to take months to get a new feature out if I were 
to add one ;) )

Thanks,


