[
https://issues.apache.org/jira/browse/HUDI-9566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9566:
------------------------------
Description:
h1. Action items
For SI, we only allow the following data types to be involved, on both the read and write paths:
* String
* Integer types, including Int, BigInt, Long, and Short Int
* Timestamp
* Char
* Boolean: due to its low cardinality, we should not support Boolean

This means:
* [Blocker AI] We only allow SI creation on columns with the above data types.
* [Blocker AI] On the index lookup path, we allow the index lookup only when the lookup set shares the same data type (integer types may differ from each other; Int vs. BigInt is fine). Otherwise we can do any of the following:
** [Most ideal] Fall back to no index lookup.
** [Backup plan] Error out the query.
** [Nice to have, non-blocker] Best-effort type casting where viable; on cast failure, fall back.

This blocker AI applies to every code path that leads to an index lookup, which is captured by the "Read path" section.
* [Blocker AI] To ensure that columns with SI get consistent toString behavior, we will ban schema evolution for those columns.

*[Warning]* We only fix the Spark code path; other query engines need separate plans and owners.
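As a rough illustration of the creation-time check, here is a minimal sketch in plain Java; the enum, set, and method names are hypothetical for illustration, not the actual Hudi API:
{code:java}
import java.util.EnumSet;
import java.util.Set;

public class SecondaryIndexTypeCheck {
  // Hypothetical column-type enum, for illustration only.
  enum ColumnType { STRING, INT, BIGINT, LONG, SHORT, TIMESTAMP, CHAR, BOOLEAN, FLOAT, DOUBLE }

  // Boolean is deliberately excluded because of its low cardinality.
  static final Set<ColumnType> SI_SUPPORTED_TYPES = EnumSet.of(
      ColumnType.STRING, ColumnType.INT, ColumnType.BIGINT, ColumnType.LONG,
      ColumnType.SHORT, ColumnType.TIMESTAMP, ColumnType.CHAR);

  // Reject SI creation on any column whose type is outside the allowed set.
  static void validateColumnForSecondaryIndex(String column, ColumnType type) {
    if (!SI_SUPPORTED_TYPES.contains(type)) {
      throw new IllegalArgumentException(
          "Secondary index not supported on column '" + column + "' of type " + type);
    }
  }

  public static void main(String[] args) {
    validateColumnForSecondaryIndex("name", ColumnType.STRING); // passes
    try {
      validateColumnForSecondaryIndex("price", ColumnType.FLOAT); // rejected
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}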
h1. Description
When generating a secondary index record from a data column, which can be of any type, we convert that value to a string via toString. RLI probably does something similar.
h2. Read path
For an SI lookup, the lookup key has to be converted to the matching string, even though the lookup key provider may supply keys of various data types. This can cause surprises.
For example, if an SI is built from a float column, the SI will use Float#toString and index the string literal "10.0". Yet if the index lookup provides a lookup set of long/int type, its toString will generate strings like "10". Even though 10.0 == 10 is true, "10.0".equals("10") is false. So even when values are numerically the same, we can still fail to look up the value, causing correctness issues.
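The mismatch can be reproduced with a few lines of plain Java, with no Hudi classes involved:
{code:java}
public class StringifiedLookupMismatch {
  public static void main(String[] args) {
    float indexedValue = 10.0f; // value the SI was built from
    long lookupValue = 10L;     // value supplied by the lookup set

    System.out.println(indexedValue == lookupValue);  // true: numerically equal
    System.out.println(String.valueOf(indexedValue)); // "10.0"
    System.out.println(String.valueOf(lookupValue));  // "10"
    // The index compares string encodings, so the lookup misses:
    System.out.println(String.valueOf(indexedValue).equals(String.valueOf(lookupValue))); // false
  }
}
{code}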
In the case of Spark SQL, the lookup set can be generated by queries such as:
{code:sql}
select * from tbl where secKey in (xxxx) -- xxxx can be static values or, in the future, a subquery
select * from tbl where secKey = xxx
{code}
This is purely controlled by the lookup set provider, which is Spark, based on how the query is written. So the data type is beyond our control without an explicit type cast.
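Since the lookup type is beyond our control, the [Blocker AI] compatibility rule from the action items could be sketched as follows; the type enum and method are hypothetical, not actual Hudi or Spark APIs:
{code:java}
import java.util.EnumSet;
import java.util.Set;

public class IndexLookupTypeGate {
  // Hypothetical SQL-type enum for illustration.
  enum SqlType { STRING, INT, BIGINT, LONG, SHORT, TIMESTAMP, CHAR, FLOAT }

  // Integer types are interchangeable with each other; everything else must match exactly.
  static final Set<SqlType> INTEGER_TYPES =
      EnumSet.of(SqlType.INT, SqlType.BIGINT, SqlType.LONG, SqlType.SHORT);

  static boolean canUseIndexLookup(SqlType indexedType, SqlType lookupType) {
    if (indexedType == lookupType) {
      return true;
    }
    return INTEGER_TYPES.contains(indexedType) && INTEGER_TYPES.contains(lookupType);
  }

  public static void main(String[] args) {
    System.out.println(canUseIndexLookup(SqlType.INT, SqlType.BIGINT)); // true
    System.out.println(canUseIndexLookup(SqlType.FLOAT, SqlType.INT));  // false
  }
}
{code}
On a false result the caller would take the [Most ideal] fallback path (no index lookup) rather than risk a wrong answer.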
h2. Write path
As of today, an SI update always generates records from the data column, and the data column has a consistent data type unless schema evolution occurs. As long as no schema evolution happens, this should be fine.
With schema evolution changing the data type (for example, int -> float), the toString method starts to behave differently. If the SI previously tracked int 10 as the string literal "10", after the promotion to float it will track "10.0". Thus, even for a single record, we might end up with two SI records.
Also, for timestamps, depending on the timezone, two values for the same UTC instant (one in PST, the other in ET) will produce different strings and cause a mismatch. The list keeps going as we evaluate all possible data types.
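The timestamp case can be demonstrated with java.sql.Timestamp, whose toString renders the instant in the JVM's default timezone:
{code:java}
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampToStringMismatch {
  public static void main(String[] args) {
    long epochMillis = 0L; // one fixed UTC instant

    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
    String pst = new Timestamp(epochMillis).toString();

    TimeZone.setDefault(TimeZone.getTimeZone("America/New_York"));
    String et = new Timestamp(epochMillis).toString();

    // Same UTC instant, different string encodings -> the SI keys diverge.
    System.out.println(pst);
    System.out.println(et);
    System.out.println(pst.equals(et)); // false
  }
}
{code}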
{code:java}
/**
 * Constructs an iterator with a pair of the record key and the secondary index
 * value for each record in the file slice.
 */
private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
    HoodieReaderContext<T> readerContext,
    HoodieTableMetaClient metaClient,
    FileSlice fileSlice,
    Schema tableSchema,
    HoodieIndexDefinition indexDefinition,
    String instantTime,
    TypedProperties props,
    boolean allowInflightInstants) throws IOException {
  // ... inside the returned iterator's advance logic:
  while (recordIterator.hasNext()) {
    T record = recordIterator.next();
    Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
    if (secondaryKey != null) {
      // The secondary key is stringified via toString regardless of its actual type.
      nextValidRecord = Pair.of(
          readerContext.getRecordKey(record, tableSchema),
          secondaryKey.toString());
      return true;
    }
  } {code}
> Secondary index convert everything to string
> --------------------------------------------
>
> Key: HUDI-9566
> URL: https://issues.apache.org/jira/browse/HUDI-9566
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Davis Zhang
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)