[
https://issues.apache.org/jira/browse/AVRO-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681607#comment-16681607
]
ASF GitHub Bot commented on AVRO-2247:
--------------------------------------
rstata commented on issue #354: AVRO-2247 - improved java reading performance
with new reader
URL: https://github.com/apache/avro/pull/354#issuecomment-437398604
I've been meaning to comment on this for a while. Looking at your code
quickly, I wasn't convinced that it worked for recursive records (and maybe not
even for nested records). Also, the solution as posted re-implements schema
resolution. The schema-resolution code is subjected to a large number of
regression tests that came about because the resolution logic is subtle in
places. A re-implementation of that logic should subject itself to that test
suite, which yours does not.
Inspired by both your JIRA (AVRO-2247) and my own thoughts about further
improving performance of reading with resolution, I have refactored the
schema-resolution logic away from the resolving-grammar generation logic. I
have published this in the branch
[`'refactor-resolving-2018-11-09`](https://github.com/rstata-projects/avro/tree/refactor-resolving-2018-11-09)
of my Avro fork. This code is "bug for bug" compatible with Avro's existing
schema resolution (e.g., it implements the funky "best match" algorithm
currently used for unions), and it passes the full schema-resolution regression
suite.
You might want to look to see if this would be a good foundation for
implementing your improvements. Start at the new
[Resolver](https://github.com/rstata-projects/avro/blob/refactor-resolving-2018-11-09/lang/java/avro/src/main/java/org/apache/avro/Resolver.java)
class, and also look to see how ResolvingGrammarGenerator uses the output of
Resolver. However, be warned that I intend to "re-write history" on this code
pretty severely before proposing it as an actual improvement, so you might want
to wait about a week before actually depending upon this code.
Over the last few weeks I've been working on the performance-testing suite.
What I've found is that the variance between runs of this suite varies
_significantly:_ in places, over 30%! Across the board, I see variance of over
5% between runs for over 40% of the test cases. With this much variance, it's
impossible to say if a proposed performance improvement is really an
improvement (and impossible to tell whether or not an attempt to improve one
set of performance cases has degraded performance elsewhere).
By the end of next week I should have a proposed set of changes to the
performance benchmark, plus a "cookbook" for using it (on AWS), which minimizes
variance between runs of the suite. With that in place, I will return to the
`refactor-resolving` work and submit it along with testing that shows it
doesn't degrade performance (and, in fact, improves it in places).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Improve Java reading performance with a new reader
> --------------------------------------------------
>
> Key: AVRO-2247
> URL: https://issues.apache.org/jira/browse/AVRO-2247
> Project: Apache Avro
> Issue Type: Improvement
> Components: java
> Reporter: Martin Jubelgas
> Priority: Major
> Fix For: 1.9.0
>
> Attachments: Perf-Comparison.md
>
>
> Complementary to AVRO-2090, I have been working on decoding of Avro objects
> in Java and am suggesting a new implementation of a DatumReader that improves
> read performance for both generic and specific records by approximately 20%
> (and even more in cases of nested objects with defaults, a case I encounter a
> lot in practical use).
> Key concept is to create a detailed execution plan once at DatumReader. This
> execution plan contains all required defaulting/lookup values so they need
> not be looked up during object traversal while reading.
> The reader implementation can be enabled and disabled per GenericData
> instance. The system default is set via the system variable
> "org.apache.avro.fastread" (defaults to "false").
> Attached a performance comparison of the existing implementation with the
> proposed one. Will open a pull request with respective code in a bit (not
> including interoperability with the optimizations of AVRO-2090 yet). Please
> let me know your opinion of whether this is worth pursuing further.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)