[ https://issues.apache.org/jira/browse/HUDI-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519769#comment-17519769 ]
Alexey Kudinkin commented on HUDI-3834:
---------------------------------------

Original runtime of the query reading the whole CSI from MT (takes about 60s):

!Screen Shot 2022-04-07 at 7.48.04 PM.png|width=610,height=763!

Taking a basic CPU sampling profile reveals that most of the time (*~55%*) is spent in `SpecificData.getForSchema`:

!Screen Shot 2022-04-07 at 7.46.01 PM.png|width=1321,height=771!

> Evaluate MT Column Stats Performance
> ------------------------------------
>
>                 Key: HUDI-3834
>                 URL: https://issues.apache.org/jira/browse/HUDI-3834
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.11.0
>
>         Attachments: Screen Shot 2022-04-07 at 7.46.01 PM.png, Screen Shot 2022-04-07 at 7.48.04 PM.png
>
>
> h3. *UPDATE*
> --------
> *TL;DR* After identifying the bottleneck as an Avro 1.10 regression in
> generated Builder classes, which rely on a `SpecificData.getForSchema` call
> that loads the corresponding model's class using reflection, we were able to
> bring down, in a similar setting, the runtime of reading the Column Stats
> Index from MT containing 800k records from *60s* to *3s* (20x).
>
> Follow the comments for updates on that.
> --------
> Previously, while evaluating Data Skipping runtime in an EMR setting, it was
> measured that reading a Column Stats Index of about 800k records takes
> about *60s* (~0.075ms / record).
>
> Given that the total size of the log files involved is only ~60MB, it seems
> there are performance bottlenecks that we should investigate before the 0.11
> release.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)