[ https://issues.apache.org/jira/browse/HUDI-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Kudinkin updated HUDI-3834:
----------------------------------
    Description:
h3. *UPDATE*

--------

*TL;DR* After identifying the bottleneck as an Avro 1.10 regression in generated Builder classes, which rely on a `SpecificData.getForSchema` call that loads the corresponding model's class using reflection, we were able to bring down, in a similar setting, the runtime of reading the Column Stats Index from a Metadata Table (MT) containing 800k records from *60s* to *3s* (20x).

Follow the comments for updates on that.

--------

Previously, while evaluating Data Skipping runtime in an EMR setting, it was measured that reading a Column Stats Index of ~800k records takes about *60s* (~0.075ms / record).

Given that the total size of the log files involved is ~60MB, it seems there are some performance bottlenecks that we should investigate before the 0.11 release.

  was:
Previously, while evaluating Data Skipping runtime in an EMR setting, it was measured that reading a Column Stats Index of ~800k records takes about *60s* (~0.075ms / record).

Given that the total size of the log files involved is ~60MB, it seems there are some performance bottlenecks that we should investigate before the 0.11 release.

> Evaluate MT Column Stats Performance
> -------------------------------------
>
>                 Key: HUDI-3834
>                 URL: https://issues.apache.org/jira/browse/HUDI-3834
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> h3. *UPDATE*
> --------
> *TL;DR* After identifying the bottleneck as an Avro 1.10 regression in
> generated Builder classes, which rely on a `SpecificData.getForSchema` call
> that loads the corresponding model's class using reflection, we were able to
> bring down, in a similar setting, the runtime of reading the Column Stats
> Index from a Metadata Table (MT) containing 800k records from *60s* to *3s*
> (20x).
>
> Follow the comments for updates on that.
> --------
> Previously, while evaluating Data Skipping runtime in an EMR setting, it was
> measured that reading a Column Stats Index of ~800k records takes about
> *60s* (~0.075ms / record).
>
> Given that the total size of the log files involved is ~60MB, it seems there
> are some performance bottlenecks that we should investigate before the 0.11
> release.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
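
To illustrate the kind of bottleneck the TL;DR describes, here is a minimal Java sketch of the general cost pattern: resolving a class by name through reflection once per record versus resolving it once and reusing the result. This is not Avro's or Hudi's code and does not reproduce the actual fix (see the issue comments for that). The class name `ClassLookupSketch`, its methods, and the use of `java.util.ArrayList` as a stand-in for a generated model class are all hypothetical; only the record count of 800k comes from the issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-record reflective class lookup vs. a cached lookup.
final class ClassLookupSketch {
    // Cache keyed by fully-qualified class name (an assumption for illustration).
    private static final Map<String, Class<?>> CACHE = new ConcurrentHashMap<>();

    // Resolves the class through reflection on every call.
    public static Class<?> lookupUncached(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown class: " + className, e);
        }
    }

    // Resolves the class on first use, then serves later calls from the cache.
    public static Class<?> lookupCached(String className) {
        return CACHE.computeIfAbsent(className, ClassLookupSketch::lookupUncached);
    }

    // Runs the task once and returns the elapsed wall-clock time in nanoseconds.
    private static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String modelClass = "java.util.ArrayList"; // stand-in for a generated model class
        int records = 800_000;                     // record count mentioned in the issue

        long uncached = timeNanos(() -> {
            for (int i = 0; i < records; i++) {
                lookupUncached(modelClass);
            }
        });
        long cached = timeNanos(() -> {
            for (int i = 0; i < records; i++) {
                lookupCached(modelClass);
            }
        });

        // Timings depend on the JVM and machine; no particular ratio is implied.
        System.out.printf("uncached: %d ms, cached: %d ms%n",
                uncached / 1_000_000, cached / 1_000_000);
    }
}
```

Both methods return the same `Class` object; the difference is only in how much work each call does. Any work repeated for each of 800k records is multiplied accordingly, which is why a small per-record overhead can dominate the read time.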