alexeykudinkin commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r846678629
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########

@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //            CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //            nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
> given that CSI does not have stats for non-top-level columns, if a predicate references both top-level and non-top-level columns, are we going to skip leveraging CSI? Since, for the non-top-level column, we have to visit all data files anyway?

It depends on the predicate, but we will at least try to leverage it to filter on the top-level columns.

> So, when we are looking to apply data skipping on the query side, should we check for these configs and decide whether a particular column is indexed by CSI or not?
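To make the gating logic in the hunk above concrete, here is a minimal, self-contained Scala sketch of the idea behind `collectReferencedColumns` and the "is any referenced column indexed?" check. The expression ADT (`Attr`, `NestedField`, `Gt`, `Lit`) is a toy stand-in for Spark's Catalyst `Expression` tree, and all names here are illustrative, not Hudi's actual API:

```scala
// Toy stand-ins for Catalyst expressions (illustrative only, not Hudi's API).
sealed trait Expr
case class Attr(name: String) extends Expr                       // top-level column, e.g. "a"
case class NestedField(root: String, path: String) extends Expr  // e.g. "struct.field"
case class Gt(left: Expr, right: Expr) extends Expr
case class Lit(value: Int) extends Expr

object SkippingGate {
  // Collect the top-level columns a set of filters references; a nested-field
  // reference resolves to its root column, since CSI only indexes top-level columns.
  def referencedColumns(filters: Seq[Expr]): Set[String] = {
    def walk(e: Expr): Set[String] = e match {
      case Attr(n)           => Set(n)
      case NestedField(r, _) => Set(r)
      case Gt(l, r)          => walk(l) ++ walk(r)
      case Lit(_)            => Set.empty
    }
    filters.flatMap(walk).toSet
  }

  // Data skipping is only worth attempting when at least one referenced
  // column is actually present in the index.
  def canAttemptSkipping(filters: Seq[Expr], indexedColumns: Set[String]): Boolean =
    referencedColumns(filters).exists(indexedColumns.contains)
}
```

Under this sketch, a filter like `Gt(NestedField("struct", "field"), Lit(0))` resolves to the root column `"struct"`, matching the NOTE in the hunk: stats exist only for the top-level column, so the nested predicate itself cannot be evaluated against CSI.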
We can't do that; we have to play by what's actually in the index. This is handled when we execute the filter against the lookup table: if the index doesn't contain the column the filter references, the filter will simply match all of the files.

Review Comment:
However, your question made me realize that we're currently deriving the index schema incorrectly. Let me address that.
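The "match all of the files" fallback described above can be sketched in a few lines of Scala. This is not Hudi's actual implementation; `ColumnStats`, `FileStats`, `GreaterThan`, and `candidateFiles` are hypothetical names, and only `max` is consulted because a `col > value` predicate can only match a file whose maximum exceeds the value:

```scala
// Hypothetical sketch of pruning against a column-stats lookup table.
object DataSkippingSketch {
  // Per-file stats the index holds for one top-level column.
  case class ColumnStats(min: Int, max: Int)
  case class FileStats(fileName: String, stats: Map[String, ColumnStats])

  // A simplified "col > value" predicate.
  case class GreaterThan(column: String, value: Int)

  /** Returns candidate file names. When the filter's column is absent from the
    * index, Option.forall is vacuously true, so every file stays a candidate:
    * the filter "matches all of the files" rather than pruning incorrectly. */
  def candidateFiles(index: Seq[FileStats], filter: GreaterThan): Set[String] =
    index.collect {
      case f if f.stats.get(filter.column).forall(_.max > filter.value) => f.fileName
    }.toSet
}
```

With two files indexed on column `"a"` (ranges 0..10 and 20..30), the filter `a > 15` prunes the first file, while `b > 15` returns both files because `"b"` is not in the index, which is the safe, non-pruning behavior the comment describes.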