[
https://issues.apache.org/jira/browse/HUDI-9566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9566:
------------------------------
Description:
h1. Action items
For SI, we only allow the following data types to be involved, on both the read and write paths:
* String
* Integer types, including Int, BigInt, Long, and Short Int
* Timestamp
* Char
* Boolean: due to its low cardinality, we should not support Boolean

This means:
* [Blocker AI] We only allow SI creation on columns with the above data types.
* [Blocker AI] On the index lookup path, we allow the index lookup only when the lookup set shares the same data type (integer types may differ from each other; Int vs. BigInt is fine). Otherwise we can do any of the following:
** [Most ideal] Fall back to no index lookup.
** [Backup plan] Error out the query.
** [Nice to have, non-blocker] Best-effort type casting where viable; on cast failure, fall back.

This blocker AI applies to every code path that leads to an index lookup, which is captured by the "Read path" section.
* [Blocker AI] To ensure that columns with SI get consistent toString behavior, we will ban schema evolution for those columns.

*[Warning]* We only fix the Spark code path; other query engines need separate plans and owners.
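As a rough illustration of the creation-time check, here is a minimal sketch in plain Java; the enum, set, and method names are hypothetical for illustration, not the actual Hudi API:
{code:java}
import java.util.EnumSet;
import java.util.Set;

public class SecondaryIndexTypeCheck {
  // Hypothetical column-type enum, for illustration only.
  enum ColumnType { STRING, INT, BIGINT, LONG, SHORT, TIMESTAMP, CHAR, BOOLEAN, FLOAT, DOUBLE }

  // Boolean is deliberately excluded because of its low cardinality.
  static final Set<ColumnType> SI_SUPPORTED_TYPES = EnumSet.of(
      ColumnType.STRING, ColumnType.INT, ColumnType.BIGINT, ColumnType.LONG,
      ColumnType.SHORT, ColumnType.TIMESTAMP, ColumnType.CHAR);

  // Reject SI creation on any column whose type is outside the allowed set.
  static void validateColumnForSecondaryIndex(String column, ColumnType type) {
    if (!SI_SUPPORTED_TYPES.contains(type)) {
      throw new IllegalArgumentException(
          "Secondary index not supported on column '" + column + "' of type " + type);
    }
  }

  public static void main(String[] args) {
    validateColumnForSecondaryIndex("name", ColumnType.STRING); // passes
    try {
      validateColumnForSecondaryIndex("price", ColumnType.FLOAT); // rejected
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}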
h1. Description
When generating a secondary index record from a data column, which can be of any type, we convert that value to a string via toString. RLI probably does something similar.
h2. Read path
For an SI lookup, the lookup key has to be converted to the matching string, even though the lookup key provider may supply keys of various data types. This can cause surprises.
For example, if an SI is built from a float column, the SI will use Float#toString and index the string literal "10.0". Yet if the index lookup provides a lookup set of long/int type, its toString will generate strings like "10". Even though 10.0 == 10 is true, "10.0".equals("10") is false. So even when values are numerically the same, we can still fail to look up the value, causing correctness issues.
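The mismatch can be reproduced with a few lines of plain Java, with no Hudi classes involved:
{code:java}
public class StringifiedLookupMismatch {
  public static void main(String[] args) {
    float indexedValue = 10.0f; // value the SI was built from
    long lookupValue = 10L;     // value supplied by the lookup set

    System.out.println(indexedValue == lookupValue);  // true: numerically equal
    System.out.println(String.valueOf(indexedValue)); // "10.0"
    System.out.println(String.valueOf(lookupValue));  // "10"
    // The index compares string encodings, so the lookup misses:
    System.out.println(String.valueOf(indexedValue).equals(String.valueOf(lookupValue))); // false
  }
}
{code}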
In the case of Spark SQL, the lookup set can be generated by queries such as:
{code:sql}
select * from tbl where secKey in (xxxx) -- xxxx can be static values or, in the future, a subquery
select * from tbl where secKey = xxx
{code}
This is purely controlled by the lookup set provider, which is Spark, based on how the query is written. So the data type is beyond our control without an explicit type cast.
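Since the lookup type is beyond our control, the [Blocker AI] compatibility rule from the action items could be sketched as follows; the type enum and method are hypothetical, not actual Hudi or Spark APIs:
{code:java}
import java.util.EnumSet;
import java.util.Set;

public class IndexLookupTypeGate {
  // Hypothetical SQL-type enum for illustration.
  enum SqlType { STRING, INT, BIGINT, LONG, SHORT, TIMESTAMP, CHAR, FLOAT }

  // Integer types are interchangeable with each other; everything else must match exactly.
  static final Set<SqlType> INTEGER_TYPES =
      EnumSet.of(SqlType.INT, SqlType.BIGINT, SqlType.LONG, SqlType.SHORT);

  static boolean canUseIndexLookup(SqlType indexedType, SqlType lookupType) {
    if (indexedType == lookupType) {
      return true;
    }
    return INTEGER_TYPES.contains(indexedType) && INTEGER_TYPES.contains(lookupType);
  }

  public static void main(String[] args) {
    System.out.println(canUseIndexLookup(SqlType.INT, SqlType.BIGINT)); // true
    System.out.println(canUseIndexLookup(SqlType.FLOAT, SqlType.INT));  // false
  }
}
{code}
On a false result the caller would take the [Most ideal] fallback path (no index lookup) rather than risk a wrong answer.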
h2. Write path
As of today, an SI update always generates records from the data column, and the data column has a consistent data type unless schema evolution occurs. As long as no schema evolution happens, this should be fine.
With schema evolution changing the data type (for example, int -> float), the toString method starts to behave differently. If the SI previously tracked int 10 as the string literal "10", after the promotion to float it will track "10.0". Thus, even for a single record, we might end up with two SI records.
Also, for timestamps, depending on the timezone, two values for the same UTC instant (one in PST, the other in ET) will produce different strings and cause a mismatch. The list keeps going as we evaluate all possible data types.
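The timestamp case can be demonstrated with java.sql.Timestamp, whose toString renders the instant in the JVM's default timezone:
{code:java}
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampToStringMismatch {
  public static void main(String[] args) {
    long epochMillis = 0L; // one fixed UTC instant

    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
    String pst = new Timestamp(epochMillis).toString();

    TimeZone.setDefault(TimeZone.getTimeZone("America/New_York"));
    String et = new Timestamp(epochMillis).toString();

    // Same UTC instant, different string encodings -> the SI keys diverge.
    System.out.println(pst);
    System.out.println(et);
    System.out.println(pst.equals(et)); // false
  }
}
{code}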
{code:java}
/**
 * Constructs an iterator with a pair of the record key and the secondary index
 * value for each record in the file slice.
 */
private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
    HoodieReaderContext<T> readerContext,
    HoodieTableMetaClient metaClient,
    FileSlice fileSlice,
    Schema tableSchema,
    HoodieIndexDefinition indexDefinition,
    String instantTime,
    TypedProperties props,
    boolean allowInflightInstants) throws IOException {
  // ... inside the returned iterator's advance logic:
  while (recordIterator.hasNext()) {
    T record = recordIterator.next();
    Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
    if (secondaryKey != null) {
      // The secondary key is stringified via toString regardless of its actual type.
      nextValidRecord = Pair.of(
          readerContext.getRecordKey(record, tableSchema),
          secondaryKey.toString());
      return true;
    }
  } {code}
> Secondary index convert everything to string
> --------------------------------------------
>
> Key: HUDI-9566
> URL: https://issues.apache.org/jira/browse/HUDI-9566
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Davis Zhang
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)