[ 
https://issues.apache.org/jira/browse/HUDI-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-9164:
------------------------------
    Description: 
In MDT, we employ a custom way of determining the file group of a metadata 
record to write to in MDT, by hashing the record key 
("HoodieTableMetadataUtil#mapRecordKeyToFileGroupIndex").  This allows 
hash-based join and lookup based on keys by reading specific file group(s) in 
the MDT partition only, such as record-level index.

However, as the secondary index uses <secondary_column_value$record_key_value> 
as the metadata payload key, when looking up using "secondary_column_value" 
only, we cannot use the hash-based join and lookup, as we need to know the full 
key for determining the file group to read in MDT.  Thus we have to scan all 
file groups in MDT for lookups which is inefficient if the secondary index is 
huge.  This can become a performance bottleneck in looking up a larger number 
of keys on a large secondary index.

To solve the problem, we should use secondary_column_value only for determining 
the file group while still keeping <secondary_column_value$record_key_value> as 
the metadata payload record key.  By employing this, during lookup, with the 
secondary_column_value we can determine the file group to read based on the 
secondary_column_value solely to identify the file group(s) to read, thus the 
hash-based join and lookup on secondary index.

  was:
h3. MDT sec idx file layout does not favor lookup/join efficiently
Regarding the MDT join with an incoming pruning set RDD[Internal Row], the 
existing secondary index data layout does not favor batch prefix look up.
 
h4. MDT index layout
MDT secondary index are using key value pair, where the key uses scheme
<data column value><separator><record key value>
and value is the file group id.
 
So you can see all records comes with the prefix of the column value.
 
It adopts {*}hash based partitioning{*}, which means it takes Full key <data 
col value><record key value>, hash it and decide which file group the partition 
belongs to.
 
h3. The query pattern it serves
 
In a nutshell, the data layout is hash partitioning while the query pattern is 
prefix lookup, this 2 does not match at all.
 
2 types of query pattern against the index: * point look up given only a 
secondary index column value, meaning only {{<data column value>}} is given and 
we need to look up all file group ids associated
 * Join with a large amount of column value: this is how secondary index join 
would work. When joining tableGeneratingPruningSet and tableWithIdxTobePruned, 
the tableGeneratingPruningSet generates a RDD of values for data column C1, we 
use this RDD joining with MDT C1 secondary index to figure out file group ids 
of interest. Here we are looking at join between this RDD and MDT at a large 
scale.

 
Because we only knew the {{<data column value>}} from the input, which is only 
the prefix of the secondary index key, so we don't know which bucket the 
potential MDT records belongs to. As a result, even for point look up we need 
to load the full MDT and the complexity is O(n).
 
This is not scalable .
 
Needs a improvements on the partition scheme to handle prefix based search at a 
large scale.


> Improve file group sharing strategy of secondary index in MDT
> -------------------------------------------------------------
>
>                 Key: HUDI-9164
>                 URL: https://issues.apache.org/jira/browse/HUDI-9164
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Davis Zhang
>            Priority: Blocker
>             Fix For: 1.1.0
>
>
> In MDT, we employ a custom way of determining the file group of a metadata 
> record to write to in MDT, by hashing the record key 
> ("HoodieTableMetadataUtil#mapRecordKeyToFileGroupIndex").  This allows 
> hash-based join and lookup based on keys by reading specific file group(s) in 
> the MDT partition only, such as record-level index.
> However, as the secondary index uses 
> <secondary_column_value$record_key_value> as the metadata payload key, when 
> looking up using "secondary_column_value" only, we cannot use the hash-based 
> join and lookup, as we need to know the full key for determining the file 
> group to read in MDT.  Thus we have to scan all file groups in MDT for 
> lookups which is inefficient if the secondary index is huge.  This can become 
> a performance bottleneck in looking up a larger number of keys on a large 
> secondary index.
> To solve the problem, we should use secondary_column_value only for 
> determining the file group while still keeping 
> <secondary_column_value$record_key_value> as the metadata payload record key. 
>  By employing this, during lookup, with the secondary_column_value we can 
> determine the file group to read based on the secondary_column_value solely 
> to identify the file group(s) to read, thus the hash-based join and lookup on 
> secondary index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to