[
https://issues.apache.org/jira/browse/HUDI-9566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang closed HUDI-9566.
-----------------------------
Resolution: Fixed
The code is merged.
> Secondary index convert everything to string
> --------------------------------------------
>
> Key: HUDI-9566
> URL: https://issues.apache.org/jira/browse/HUDI-9566
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Davis Zhang
> Priority: Major
> Labels: pull-request-available
>
> h1. Action items
> For SI, only the following data types are allowed, on both the read and
> write paths:
> * String
> * Integer types, including Int, BigInt, Long, and Short
> * Timestamp
> * Char
> Boolean is excluded: due to its low cardinality we should not support it.
> It means:
> * [Blocker AI] We only allow SI creation on columns with the above data
> types.
> * [Blocker AI] On the index lookup path, we allow index lookup only when the
> lookup set shares the same data type as the indexed column (integer types
> may differ; int vs. bigint is fine). Otherwise we can do any of the
> following:
> ** [Most ideal] Fall back to no index lookup.
> ** [Backup plan] Fail the query with an error.
> ** [Nice to have, non-blocker] Best-effort type casting if viable. On cast
> failure we fall back.
> This blocker AI applies to all code paths that lead to index lookup, which
> are covered in the "Read path" section.
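The lookup rule above can be sketched roughly as follows. This is only an illustration under assumptions: the class, enum, and method names (SecondaryIndexTypeCheck, TypeFamily, familyOf, canUseIndex) are hypothetical and are not Hudi APIs, and Java runtime classes stand in for the engine's column types. It implements the "most ideal" option: when any lookup key falls outside the indexed column's type family, the caller skips the index.

```java
import java.time.Instant;
import java.util.List;

final class SecondaryIndexTypeCheck {
    // Hypothetical grouping of the data types the ticket allows for SI.
    enum TypeFamily { STRING, INTEGRAL, TIMESTAMP, CHAR, UNSUPPORTED }

    // Maps a lookup key to its type family; all integer widths share one family.
    static TypeFamily familyOf(Object value) {
        if (value instanceof String) return TypeFamily.STRING;
        if (value instanceof Short || value instanceof Integer || value instanceof Long) {
            return TypeFamily.INTEGRAL;
        }
        if (value instanceof Instant) return TypeFamily.TIMESTAMP;
        if (value instanceof Character) return TypeFamily.CHAR;
        return TypeFamily.UNSUPPORTED; // float, double, boolean, null, ...
    }

    // True only if every lookup key belongs to the indexed column's family.
    // A false result means "fall back to no index lookup".
    static boolean canUseIndex(TypeFamily indexedColumn, Iterable<?> lookupKeys) {
        if (indexedColumn == TypeFamily.UNSUPPORTED) {
            return false;
        }
        for (Object key : lookupKeys) {
            if (familyOf(key) != indexedColumn) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // int and long keys against an integer-typed column: index is usable.
        System.out.println(canUseIndex(TypeFamily.INTEGRAL, List.<Object>of(10, 20L)));
        // a float key against an integer-typed column: fall back.
        System.out.println(canUseIndex(TypeFamily.INTEGRAL, List.<Object>of(10.0f)));
    }
}
```

A best-effort cast (the "nice to have" option) would slot in before the final comparison, with a cast failure still returning false.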
>
> [Blocker AI] To ensure that columns with SI get a consistent toString
> behavior, we will ban schema evolution for those columns.
>
> *[Warning]* We only fix the Spark code path. Other query engines need a
> separate plan and owners.
> h1. Description
> When generating a secondary index record from a data column, which could be
> of any type, we convert the value to a string via "toString". RLI probably
> does something similar.
> h2. Read path
> For an SI lookup, the lookup key has to produce the matching string, even
> though the lookup key provider may supply keys in various data types. This
> can cause surprises.
>
> For example, if an SI is built on a float column, the SI will use the float
> toString and index the string literal "10.0". Yet when we do an index lookup
> with a lookup set of long/int type, toString will generate strings like
> "10".
> 10.0 == 10 is true, but "10.0".equals("10") is false.
> This means that even if the values are numerically equal, the lookup can
> still miss the value and thus cause correctness issues.
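The mismatch is easy to reproduce in plain Java. The snippet below is a standalone illustration, not Hudi code; indexKey is a hypothetical helper that mimics the toString conversion described above.

```java
final class ToStringMismatchDemo {
    // Hypothetical stand-in for how a key is stringified on both paths.
    static String indexKey(Object value) {
        return value.toString();
    }

    public static void main(String[] args) {
        Object indexedValue = 10.0f; // value from a float column at index time
        Object lookupValue = 10L;    // key supplied by the query at lookup time

        System.out.println(10.0f == 10L);              // true: numerically equal
        System.out.println(indexKey(indexedValue));    // 10.0
        System.out.println(indexKey(lookupValue));     // 10
        System.out.println(indexKey(indexedValue)
            .equals(indexKey(lookupValue)));           // false: the lookup misses
    }
}
```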
>
> In the case of Spark SQL, the lookup set can be generated by queries such as:
>
> select * from tbl where secKey in (xxx), where xxx can be static values or,
> in the future, a subquery.
> select * from tbl where secKey = xxx
>
> This is controlled purely by the lookup set provider, which is Spark, based
> on how the query is written. So the data type is beyond our control without
> an explicit type cast.
>
> h2. Write path
> As of today, an SI update always generates records from the data column, and
> the data column has a single consistent data type unless there is a schema
> evolution. As long as no schema evolution happens, it should be fine.
>
> With schema evolution changing the data type (for example, int -> float), the
> toString method starts to behave differently. If the SI previously tracked
> int 10 as the string literal "10", after the change to float it will track
> it as "10.0". Thus, even for a single record, we might have 2 SI records.
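A small model of that effect, under stated assumptions: the index is represented as a plain set of strings, and the entry layout "<secondaryKey>$<recordKey>" as well as the names SchemaEvolutionDemo, entry, and indexAfterEvolution are hypothetical, chosen only for illustration. It also assumes the entry written before the evolution is not cleaned up.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class SchemaEvolutionDemo {
    // Hypothetical index entry layout, for illustration only.
    static String entry(Object secondaryKey, String recordKey) {
        return secondaryKey.toString() + "$" + recordKey;
    }

    // Indexes the same record once as int and once after widening to float.
    static Set<String> indexAfterEvolution() {
        Set<String> index = new LinkedHashSet<>();

        int before = 10;                     // column type before evolution
        index.add(entry(before, "rec-1"));   // stored under "10"

        float after = before;                // same value after int -> float
        index.add(entry(after, "rec-1"));    // stored under "10.0"

        return index;
    }

    public static void main(String[] args) {
        // One logical record, two distinct index entries.
        System.out.println(indexAfterEvolution());
    }
}
```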
>
>
> Timestamps have the same problem: two values for the same UTC instant, one
> rendered in PST and the other in ET, produce different strings and cause a
> mismatch. The list keeps growing as we evaluate all possible data types.
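The timezone case can be shown with java.time. This is a sketch only: TimestampToStringDemo, render, and renderNormalized are hypothetical names, and normalizing to a UTC Instant before stringifying is one possible mitigation, not something the ticket prescribes.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

final class TimestampToStringDemo {
    // Renders an instant as a zoned timestamp string in the given zone.
    static String render(Instant instant, String zoneId) {
        return ZonedDateTime.ofInstant(instant, ZoneId.of(zoneId)).toString();
    }

    // Assumed mitigation: parse back and stringify the zone-independent Instant.
    static String renderNormalized(String zonedText) {
        return ZonedDateTime.parse(zonedText).toInstant().toString();
    }

    public static void main(String[] args) {
        Instant sameMoment = Instant.parse("2025-01-15T18:00:00Z");

        String pacific = render(sameMoment, "America/Los_Angeles");
        String eastern = render(sameMoment, "America/New_York");

        // Same instant, different strings: a string-keyed lookup would miss.
        System.out.println(pacific.equals(eastern));

        // After normalizing to UTC the two strings agree.
        System.out.println(renderNormalized(pacific).equals(renderNormalized(eastern)));
    }
}
```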
>
> {code:java}
> /**
>  * Constructs an iterator with a pair of the record key and the secondary
>  * index value for each record in the file slice.
>  */
> private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
>     HoodieReaderContext<T> readerContext,
>     HoodieTableMetaClient metaClient,
>     FileSlice fileSlice,
>     Schema tableSchema,
>     HoodieIndexDefinition indexDefinition,
>     String instantTime,
>     TypedProperties props,
>     boolean allowInflightInstants) throws IOException {
>   while (recordIterator.hasNext()) {
>     T record = recordIterator.next();
>     Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
>     if (secondaryKey != null) {
>       nextValidRecord = Pair.of(
>           readerContext.getRecordKey(record, tableSchema),
>           secondaryKey.toString()
>       );
>       return true;
>     }
>   }
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)