[ https://issues.apache.org/jira/browse/HUDI-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-3896:
-----------------------------
    Sprint: Hudi-Sprint-Apr-12, Hudi-Sprint-Apr-19  (was: Hudi-Sprint-Apr-12)

> Support Spark optimizations for `HadoopFsRelation`
> --------------------------------------------------
>
>                 Key: HUDI-3896
>                 URL: https://issues.apache.org/jira/browse/HUDI-3896
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.12.0
>
>         Attachments: Screen Shot 2022-04-16 at 1.46.50 PM.png
>
>
> After migrating to Hudi's own Relation implementations, we unfortunately lost
> some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
>
> While these optimizations could just as well be implemented for any
> `FileRelation`, Spark predicates them on the use of `HadoopFsRelation`,
> which makes them inapplicable to any of Hudi's relations.
> A proper long-term solution would be to fix this in Spark, by either:
> # Generalizing such optimizations to any `FileRelation`
> # Making `HadoopFsRelation` extensible (making it a non-case class)
>
> One example of this is Spark's `SchemaPruning` optimization rule
> (HUDI-3891): Spark 3.2.x is able to effectively reduce the amount of data read
> via schema pruning (projecting the data read) even for nested structs; however,
> this optimization is predicated on the use of `HadoopFsRelation`:
> !Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
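To illustrate the problem described in the issue, here is a toy model in Python of a pruning rule keyed on one concrete relation class versus one keyed on the general file-relation type. It is only a sketch, not Spark's or Hudi's actual code: the classes below merely borrow the names `FileRelation` and `HadoopFsRelation`, and `HudiRelation`, `prune_exact` and `prune_general` are hypothetical stand-ins.

```python
class FileRelation:
    """Toy stand-in for a file-backed relation (not Spark's actual trait)."""

    def __init__(self, schema):
        self.schema = list(schema)


class HadoopFsRelation(FileRelation):
    """Toy stand-in for Spark's built-in relation."""


class HudiRelation(FileRelation):
    """Hypothetical stand-in for one of Hudi's own relations."""


def prune_exact(relation, required):
    """Rule predicated on one concrete class: other relations stay unpruned."""
    if type(relation) is HadoopFsRelation:
        return [col for col in relation.schema if col in required]
    return relation.schema


def prune_general(relation, required):
    """Generalized rule: any FileRelation gets its schema pruned."""
    if isinstance(relation, FileRelation):
        return [col for col in relation.schema if col in required]
    return relation.schema
```

In this model, `prune_exact` returns the full column list for a `HudiRelation`, while `prune_general` narrows it to the required columns, which corresponds to the first of the two long-term fixes listed in the issue.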