[ 
https://issues.apache.org/jira/browse/HUDI-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang updated HUDI-9648:
------------------------------
    Description: 
h2. Functional requirement
In {{org.apache.hudi.metadata.HoodieBackedTableMetadata}}, we should provide two 
APIs for partitioned RLI lookup.
 

class PartitionedRecordLevelIndexPrefixKey implements RawKey \{ private String 
recordKey; ... }

readPartitionedRecordIndexLocation( 
HoodieData<PartitionedRecordLevelIndexPrefixKey> recordKeys)

 
so that it works for partitioned RLI. For each record key to look up, it 
probes all partitioned RLI file groups and performs the lookup in each of them.
 

class PartitionedRecordLevelIndexKey implements RawKey \{ private String 
recordKey; private String partitionKey; ... } 
readPartitionedRecordIndexLocation( HoodieData<PartitionedRecordLevelIndexKey> 
recordKeys)

 
It takes pairs of record key and partition key as input and looks only into the 
one file group that the key belongs to.
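
The difference between the two lookup modes above can be illustrated with a small, self-contained sketch. It does not use any Hudi classes: the class name {{PartitionedRliLookupSketch}}, the in-memory maps, the fixed number of file groups per partition, and the hash-based routing rule are all assumptions made for illustration, not the actual Hudi implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionedRliLookupSketch {
    // Hypothetical layout: each data partition owns a fixed number of index file groups.
    static final int FILE_GROUPS_PER_PARTITION = 2;

    // File group id -> (record key -> record location). In-memory stand-in for index storage.
    final Map<String, Map<String, String>> fileGroups = new LinkedHashMap<>();

    // Assumed routing rule (not Hudi's actual hashing): hash of the record key
    // modulo the file group count, scoped to the given partition.
    static String fileGroupId(String partitionKey, String recordKey) {
        int index = Math.floorMod(recordKey.hashCode(), FILE_GROUPS_PER_PARTITION);
        return partitionKey + "#" + index;
    }

    void put(String partitionKey, String recordKey, String location) {
        fileGroups.computeIfAbsent(fileGroupId(partitionKey, recordKey), id -> new LinkedHashMap<>())
                  .put(recordKey, location);
    }

    // Record key plus partition key: read only the one file group the key belongs to.
    List<String> lookup(String recordKey, String partitionKey) {
        List<String> locations = new ArrayList<>();
        Map<String, String> group = fileGroups.get(fileGroupId(partitionKey, recordKey));
        if (group != null && group.containsKey(recordKey)) {
            locations.add(group.get(recordKey));
        }
        return locations;
    }

    // Record key only: probe every file group of every partition.
    List<String> lookup(String recordKey) {
        List<String> locations = new ArrayList<>();
        for (Map<String, String> group : fileGroups.values()) {
            String location = group.get(recordKey);
            if (location != null) {
                locations.add(location);
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        PartitionedRliLookupSketch index = new PartitionedRliLookupSketch();
        index.put("2024/01/01", "key-1", "file-a");
        index.put("2024/01/02", "key-1", "file-b");
        index.put("2024/01/02", "key-2", "file-c");
        System.out.println(index.lookup("key-1", "2024/01/02"));
        System.out.println(index.lookup("key-1"));
    }
}
```

In this sketch the record-key-only lookup touches every file group, while the lookup with a partition key touches exactly one, which is where the expected saving comes from.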
 
h3. Perf requirement
The lookup path should also use the dynamic parallelism algorithm implemented 
by `org.apache.hudi.common.engine.HoodieEngineContext#mapGroupsByKey`.
 
It should follow a flow similar to that of the global RLI lookup.
 
No hacky implementations that collect large objects on the driver or executors.

  was:
For partitioned RLI, or partitioned anything, we should be able to take a hint 
about which partition to look into.
For queries like
select a from t1 join t2 on t1.recKey = t2.c1 and t1.partitionCol=t2.c2

 
the query engine knows what the partition could be, and today Spark already does 
dynamic partition pruning on top of that; the query engine has this info handy.
But today, even for an index join, the way we combine partition pruning and 
index pruning is inefficient: each pruning path prunes files separately, and the 
results are then intersected to figure out what to read. There would be room for 
improvement if we allowed deep integration between partition pruning and 
partitioned RLI by simply telling RLI which partition to focus on.
I also suggest making this partition hint a general hint, since in the future 
other indexes might also be able to use this info.
If this is worth a retro, let's create a CU to track that.


> Partitioned RLI takes partition column value as a hint
> ------------------------------------------------------
>
>                 Key: HUDI-9648
>                 URL: https://issues.apache.org/jira/browse/HUDI-9648
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Davis Zhang
>            Priority: Major
>             Fix For: 1.2.0
>
>
> h2. Functional requirement
> In {{org.apache.hudi.metadata.HoodieBackedTableMetadata}}, we should provide 
> two APIs for partitioned RLI lookup.
>  
> class PartitionedRecordLevelIndexPrefixKey implements RawKey \{ private 
> String recordKey; ... }
> readPartitionedRecordIndexLocation( 
> HoodieData<PartitionedRecordLevelIndexPrefixKey> recordKeys)
>  
> so that it works for partitioned RLI. For each record key to look up, it 
> probes all partitioned RLI file groups and performs the lookup in each of them.
>  
> class PartitionedRecordLevelIndexKey implements RawKey \{ private String 
> recordKey; private String partitionKey; ... } 
> readPartitionedRecordIndexLocation( 
> HoodieData<PartitionedRecordLevelIndexKey> recordKeys)
>  
> It takes pairs of record key and partition key as input and looks only into 
> the one file group that the key belongs to.
>  
> h3. Perf requirement
> The lookup path should also use the dynamic parallelism algorithm implemented 
> by `org.apache.hudi.common.engine.HoodieEngineContext#mapGroupsByKey`.
>  
> It should follow a flow similar to that of the global RLI lookup.
>  
> No hacky implementations that collect large objects on the driver or executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
