[
https://issues.apache.org/jira/browse/AVRO-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703867#comment-16703867
]
ASF GitHub Bot commented on AVRO-2247:
--------------------------------------
rstata commented on issue #391: AVRO-2247 - improved java reading performance
with new reader
URL: https://github.com/apache/avro/pull/391#issuecomment-443005833
On the one hand, the performance results I posted a few days ago certainly
demonstrate there are some performance improvements to be had for
GenericDatumReader.
On the other hand, this change introduces 2800 lines of new code that looks
like it'd be tedious to maintain. Also, the comparison here isn't apples to
apples, because the old code is more aggressive about reusing objects, and it
attempts to apply conversions, which is pure overhead for the performance tests
we're using but isn't in other cases. Finally, looking more closely at
GenericDatumReader, it has built into it a BUNCH of "customization" points --
methods and objects that can be replaced to customize the reading process, all
of which add overhead in the inner-most loop. It's not clear how much
of the performance gain comes from the pre-computation of actions versus simply
getting rid of all these customization points.
I'm tempted to extend the AVRO-2275 work so that the Action-tree generated
by Resolver is a complete mirror of the reader's schema (right now, it stops at
DoNothing nodes, which for Unions in particular could be pretty high up in the
schema's tree). Then one could write a FastGenericDatumReader class that
simply walks that tree to decode the object. I suspect the resulting code
would be on the order of 100 lines and would capture almost all the speed found
in this fast-avro patch. (And one could decorate the Action objects with any
Conversions for LogicalTypes found in the reader's schema, making it quick and
easy to apply conversions while doing the walk.)
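To make the tree-walking idea concrete, here is a minimal sketch of what such a
walk could look like. Everything in it is a hypothetical stand-in: the Action
interface and the ReadPrimitive, ReadArray and ReadRecord classes are invented
for illustration and are not Avro's Resolver classes, and an Iterator of
already-decoded values stands in for a real Decoder. The point is only that a
tree which fully mirrors the reader's schema can be walked with very little
code, and that a conversion can be attached to a node when the tree is built.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: these types are NOT Avro's Resolver.Action classes.
final class ActionTreeWalkSketch {

    /** One node of a precomputed tree that mirrors the reader's schema. */
    interface Action {
        // The Iterator is an assumed stand-in for a Decoder.
        Object read(Iterator<Object> in);
    }

    /** Leaf node: consumes one value and applies an optional conversion. */
    static final class ReadPrimitive implements Action {
        private final Function<Object, Object> conversion; // null means "none"

        ReadPrimitive(Function<Object, Object> conversion) {
            this.conversion = conversion;
        }

        @Override
        public Object read(Iterator<Object> in) {
            Object raw = in.next();
            return conversion == null ? raw : conversion.apply(raw);
        }
    }

    /** Array node: reads an item count, then that many elements. */
    static final class ReadArray implements Action {
        private final Action element;

        ReadArray(Action element) {
            this.element = element;
        }

        @Override
        public Object read(Iterator<Object> in) {
            int count = (Integer) in.next();
            List<Object> items = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                items.add(element.read(in));
            }
            return items;
        }
    }

    /** Record node: reads each field in order by delegating to its child. */
    static final class ReadRecord implements Action {
        private final String[] names;
        private final Action[] fields;

        ReadRecord(String[] names, Action[] fields) {
            this.names = names;
            this.fields = fields;
        }

        @Override
        public Object read(Iterator<Object> in) {
            Map<String, Object> record = new LinkedHashMap<>();
            for (int i = 0; i < fields.length; i++) {
                record.put(names[i], fields[i].read(in));
            }
            return record;
        }
    }

    /** Builds the tree once; the conversion is attached here, not per read. */
    static Action examplePlan() {
        Action plain = new ReadPrimitive(null);
        Action timestamp = new ReadPrimitive(raw -> Instant.ofEpochMilli((Long) raw));
        return new ReadRecord(
            new String[] {"id", "created", "tags"},
            new Action[] {plain, timestamp, new ReadArray(plain)});
    }

    public static void main(String[] args) {
        Iterator<Object> input =
            Arrays.<Object>asList(7L, 1000L, 2, "a", "b").iterator();
        System.out.println(examplePlan().read(input));
    }
}
```

The read path contains no schema lookups and no overridable hooks; all
decisions are made when examplePlan() builds the tree.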
> Improve Java reading performance with a new reader
> --------------------------------------------------
>
> Key: AVRO-2247
> URL: https://issues.apache.org/jira/browse/AVRO-2247
> Project: Apache Avro
> Issue Type: Improvement
> Components: java
> Reporter: Martin Jubelgas
> Priority: Major
> Fix For: 1.9.0
>
> Attachments: Perf-Comparison.md
>
>
> Complementary to AVRO-2090, I have been working on decoding of Avro objects
> in Java and am suggesting a new implementation of a DatumReader that improves
> read performance for both generic and specific records by approximately 20%
> (and even more in cases of nested objects with defaults, a case I encounter a
> lot in practical use).
> The key concept is to create a detailed execution plan once, in the DatumReader. This
> execution plan contains all required defaulting/lookup values so they need
> not be looked up during object traversal while reading.
> The reader implementation can be enabled and disabled per GenericData
> instance. The system default is set via the system property
> "org.apache.avro.fastread" (defaults to "false").
> Attached a performance comparison of the existing implementation with the
> proposed one. Will open a pull request with respective code in a bit (not
> including interoperability with the optimizations of AVRO-2090 yet). Please
> let me know your opinion of whether this is worth pursuing further.
>
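The execution-plan idea in the quoted description can be illustrated with a
small sketch. It is not the code from the pull request: ReaderField, Step,
plan() and read() are hypothetical names, and plain lists and arrays stand in
for schemas and encoded data. It only shows the general technique of resolving
field positions and default values once, so that the per-record loop does no
lookups.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a precomputed read plan; not the patch's classes.
final class ReadPlanSketch {

    /** Assumed stand-in for a reader-schema field with its default value. */
    static final class ReaderField {
        final String name;
        final Object defaultValue;

        ReaderField(String name, Object defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }
    }

    /** One precomputed step: where to find the value, or what to default to. */
    static final class Step {
        final String name;
        final int writerIndex; // -1 when the writer did not supply the field
        final Object defaultValue;

        Step(String name, int writerIndex, Object defaultValue) {
            this.name = name;
            this.writerIndex = writerIndex;
            this.defaultValue = defaultValue;
        }
    }

    /** Done once per reader: all name lookups happen here. */
    static Step[] plan(List<String> writerFields, List<ReaderField> readerFields) {
        Step[] steps = new Step[readerFields.size()];
        for (int i = 0; i < steps.length; i++) {
            ReaderField f = readerFields.get(i);
            steps[i] = new Step(f.name, writerFields.indexOf(f.name), f.defaultValue);
        }
        return steps;
    }

    /** Done per record: only array indexing, no lookups by name. */
    static Map<String, Object> read(Step[] plan, Object[] writerValues) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Step s : plan) {
            record.put(s.name,
                s.writerIndex >= 0 ? writerValues[s.writerIndex] : s.defaultValue);
        }
        return record;
    }

    public static void main(String[] args) {
        Step[] plan = plan(
            Arrays.asList("id", "name"),
            Arrays.asList(
                new ReaderField("name", "?"),
                new ReaderField("id", 0L),
                new ReaderField("region", "eu")));
        System.out.println(read(plan, new Object[] {5L, "x"}));
    }
}
```

In this example the writer supplies no "region" field, so the plan carries the
reader's default for it and read() fills it in without consulting any schema.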