[ https://issues.apache.org/jira/browse/HUDI-9566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davis Zhang updated HUDI-9566:
------------------------------
    Description: 
When generating a secondary index (SI) record from a data column, which can be
of any type, we convert the value to a string via "toString". The record level
index (RLI) is probably doing something similar.
h2. Read path

For an SI lookup, the lookup key therefore has to produce the matching string,
even though the lookup key provider may supply keys of various data types. This
can cause surprises.

 

For example, if an SI is built on a float column, the SI uses Float.toString
and indexes the string literal "10.0". If the index lookup then supplies a
lookup set of long/int type, toString generates strings like "10".

10.0 == 10 is true, but "10.0".equals("10") is false.

This means that even if the values are numerically equal, the lookup can still
miss the value and thus cause correctness issues.
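
A minimal, self-contained sketch of the mismatch described above. It does not
use any Hudi API: the class SiKeyMismatchDemo and the methods indexKey and
lookup are hypothetical stand-ins, and the "index" is just a set of strings
matched exactly.

```java
import java.util.HashSet;
import java.util.Set;

class SiKeyMismatchDemo {
  // Mimics the index side: the column value is stringified via toString().
  static String indexKey(Object columnValue) {
    return columnValue.toString();
  }

  // Hypothetical stand-in for an SI lookup: exact string match on indexed keys.
  static boolean lookup(Set<String> indexedKeys, Object lookupValue) {
    return indexedKeys.contains(lookupValue.toString());
  }

  public static void main(String[] args) {
    Set<String> indexedKeys = new HashSet<>();
    indexedKeys.add(indexKey(10.0f)); // float column value, stored as "10.0"

    // Numeric comparison: the long is widened to float, so the values match.
    System.out.println(10.0f == 10L);
    // String comparison: "10" is not in the index, so the lookup misses.
    System.out.println(lookup(indexedKeys, 10L));
    // A lookup key of the same type as the column still hits.
    System.out.println(lookup(indexedKeys, 10.0f));
  }
}
```

The lookup only succeeds when the lookup key has the same Java type as the
indexed column value, which is the surprise described above.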

 
h2. Write path

As of today, an SI update always generates records from the data column, and
the data column has a consistent data type unless there is a schema evolution.
As long as no schema evolution happens, this should be fine.

 

With schema evolution changing the data type (for example, int -> float), the
toString method starts to produce different output. If the SI previously
tracked the int 10 as the string literal "10", after promotion to float it will
track it as "10.0". Thus, even for a single record, we might have 2 SI records.
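
A small sketch of how the same logical value can yield two different SI
entries across a type change. The class SiSchemaEvolutionDemo, the method
names, and the "secondaryKey->recordKey" entry layout are assumptions made for
illustration only, not Hudi's actual SI record format.

```java
import java.util.ArrayList;
import java.util.List;

class SiSchemaEvolutionDemo {
  // Hypothetical SI entry layout: "<secondaryKey>-><recordKey>".
  static String siEntry(Object secondaryKey, String recordKey) {
    return secondaryKey.toString() + "->" + recordKey;
  }

  // Builds the SI entry for one record before and after an int -> float change.
  static List<String> entriesAcrossEvolution(String recordKey) {
    List<String> entries = new ArrayList<>();
    int before = 10;
    entries.add(siEntry(before, recordKey)); // stringified as "10"
    float after = before;                    // same value, column widened to float
    entries.add(siEntry(after, recordKey));  // stringified as "10.0"
    return entries;
  }

  public static void main(String[] args) {
    // The record itself did not change, yet the two entries differ.
    for (String entry : entriesAcrossEvolution("rk1")) {
      System.out.println(entry);
    }
  }
}
```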

 

 
{code:java}
/**
 * Constructs an iterator with a pair of the record key and the secondary index
 * value for each record in the file slice.
 */
private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
    HoodieReaderContext<T> readerContext,
    HoodieTableMetaClient metaClient,
    FileSlice fileSlice,
    Schema tableSchema,
    HoodieIndexDefinition indexDefinition,
    String instantTime,
    TypedProperties props,
    boolean allowInflightInstants) throws IOException {

while (recordIterator.hasNext()) {
  T record = recordIterator.next();
  Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
  if (secondaryKey != null) {
    nextValidRecord = Pair.of(
        readerContext.getRecordKey(record, tableSchema),
        secondaryKey.toString()
    );
    return true;
  }
}
{code}

  was:
When generating a secondary index (SI) record from a data column, which can be
of any type, we convert the value to a string via "toString". The record level
index (RLI) is probably doing something similar.
h2. Read path

For an SI lookup, the lookup key therefore has to produce the matching string,
even though the lookup key provider may supply keys of various data types. This
can cause surprises.

 

For example, if an SI is built on a float column, the SI uses Float.toString
and indexes the string literal "10.0". If the index lookup then supplies a
lookup set of long/int type, toString generates strings like "10".

10.0 == 10 is true, but "10.0".equals("10") is false.

This means that even if the values are numerically equal, the lookup can still
miss the value and thus cause correctness issues.

 
h3. Write path

As of today, an SI update always generates records from the data column, and
the data column has a consistent data type unless there is a schema evolution.
As long as no schema evolution happens, this should be fine.

 
{code:java}
/**
 * Constructs an iterator with a pair of the record key and the secondary index
 * value for each record in the file slice.
 */
private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
    HoodieReaderContext<T> readerContext,
    HoodieTableMetaClient metaClient,
    FileSlice fileSlice,
    Schema tableSchema,
    HoodieIndexDefinition indexDefinition,
    String instantTime,
    TypedProperties props,
    boolean allowInflightInstants) throws IOException {

while (recordIterator.hasNext()) {
  T record = recordIterator.next();
  Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
  if (secondaryKey != null) {
    nextValidRecord = Pair.of(
        readerContext.getRecordKey(record, tableSchema),
        secondaryKey.toString()
    );
    return true;
  }
}
{code}


> Secondary index convert everything to string
> --------------------------------------------
>
>                 Key: HUDI-9566
>                 URL: https://issues.apache.org/jira/browse/HUDI-9566
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Davis Zhang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
