[ https://issues.apache.org/jira/browse/HUDI-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Y Ethan Guo updated HUDI-9164: ------------------------------ Description: In MDT, we employ a custom way of determining the file group of a metadata record to write to in MDT, by hashing the record key ("HoodieTableMetadataUtil#mapRecordKeyToFileGroupIndex"). This allows hash-based join and lookup based on keys by reading specific file group(s) in the MDT partition only, such as record-level index. However, as the secondary index uses <secondary_column_value$record_key_value> as the metadata payload key, when looking up using "secondary_column_value" only, we cannot use the hash-based join and lookup, as we need to know the full key for determining the file group to read in MDT. Thus we have to scan all file groups in MDT for lookups which is inefficient if the secondary index is huge. This can become a performance bottleneck in looking up a larger number of keys on a large secondary index. To solve the problem, we should use secondary_column_value only for determining the file group while still keeping <secondary_column_value$record_key_value> as the metadata payload record key. By employing this, during lookup, with the secondary_column_value we can determine the file group to read based on the secondary_column_value solely to identify the file group(s) to read, thus the hash-based join and lookup on secondary index. was: h3. MDT sec idx file layout does not favor lookup/join efficiently Regarding the MDT join with an incoming pruning set RDD[Internal Row], the existing secondary index data layout does not favor batch prefix look up. h4. MDT index layout MDT secondary index are using key value pair, where the key uses scheme <data column value><separator><record key value> and value is the file group id. So you can see all records comes with the prefix of the column value. It adopts {*}hash based partitioning{*}, which means it takes Full key <data col value><record key value>, hash it and decide which file group the partition belongs to. h3. The query pattern it serves In a nutshell, the data layout is hash partitioning while the query pattern is prefix lookup, this 2 does not match at all. 2 types of query pattern against the index: * point look up given only a secondary index column value, meaning only {{<data column value>}} is given and we need to look up all file group ids associated * Join with a large amount of column value: this is how secondary index join would work. When joining tableGeneratingPruningSet and tableWithIdxTobePruned, the tableGeneratingPruningSet generates a RDD of values for data column C1, we use this RDD joining with MDT C1 secondary index to figure out file group ids of interest. Here we are looking at join between this RDD and MDT at a large scale. Because we only knew the {{<data column value>}} from the input, which is only the prefix of the secondary index key, so we don't know which bucket the potential MDT records belongs to. As a result, even for point look up we need to load the full MDT and the complexity is O(n). This is not scalable . Needs a improvements on the partition scheme to handle prefix based search at a large scale. > Improve file group sharing strategy of secondary index in MDT > ------------------------------------------------------------- > > Key: HUDI-9164 > URL: https://issues.apache.org/jira/browse/HUDI-9164 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Davis Zhang > Priority: Blocker > Fix For: 1.1.0 > > > In MDT, we employ a custom way of determining the file group of a metadata > record to write to in MDT, by hashing the record key > ("HoodieTableMetadataUtil#mapRecordKeyToFileGroupIndex"). This allows > hash-based join and lookup based on keys by reading specific file group(s) in > the MDT partition only, such as record-level index. > However, as the secondary index uses > <secondary_column_value$record_key_value> as the metadata payload key, when > looking up using "secondary_column_value" only, we cannot use the hash-based > join and lookup, as we need to know the full key for determining the file > group to read in MDT. Thus we have to scan all file groups in MDT for > lookups which is inefficient if the secondary index is huge. This can become > a performance bottleneck in looking up a larger number of keys on a large > secondary index. > To solve the problem, we should use secondary_column_value only for > determining the file group while still keeping > <secondary_column_value$record_key_value> as the metadata payload record key. > By employing this, during lookup, with the secondary_column_value we can > determine the file group to read based on the secondary_column_value solely > to identify the file group(s) to read, thus the hash-based join and lookup on > secondary index. -- This message was sent by Atlassian Jira (v8.20.10#820010)