xiarixiaoyao commented on pull request #4910: URL: https://github.com/apache/hudi/pull/4910#issuecomment-1080120016
Now, let's discuss the questions one by one @bvaradar @YannByron

**question1**: do the SerDeHelper.LATESTSCHEMA attribute of a commit file and the SAVE_SCHEMA_ACTION file save the same thing, or can they be converted into each other? If hoodie.schema.evolution.enable is enabled, will every commit persist SerDeHelper.LATESTSCHEMA in the meta file?
answer: 1) Not every commit; rollback/clean will not persist it. 2) No. We save latest_schema in the meta file just like the avro schema, and we also save the history schema in the .schema folder for query engines to use and to trace schema changes. We could choose not to persist latest_schema in the meta file, and that change would not affect anything; but if the schema changes frequently and the schema is large, the historical schema files grow large, and parsing them directly is less efficient than reading the meta file directly. That is why this design was chosen. If it is inappropriate, I can delete this logic.

**question2**: When will the SAVE_SCHEMA_ACTION file be committed? As soon as the schema is changed?
answer: When we finish a write operation and perform a DDL operation. This has been discussed in https://github.com/apache/hudi/pull/4910#discussion_r834251881

**question3**: How do we make a Hudi table written by an old version like 0.10 compatible with this? If hoodie.schema.evolution.enable is enabled on an existing old-version Hudi table, what will happen?
answer: 1) This has nothing to do with the Hudi table version. We use the internal schema to control all evolution logic; a historical table has no internal schema, so it falls back directly to the original logic. 2) We can also use this feature on an old Hudi table; we just need to enable schema evolution. 3) If we use an old Hudi version to query a new-version evolved Hudi table, compatibility is not guaranteed.
But if your evolution is consistent with the evolution supported by old versions of Hudi, an old Hudi version can still query the new-version evolved table.

**question4**: can this pr work when hoodie.metadata.enable is on?
answer: Of course; hoodie.metadata.enable is enabled by default.

**question5**: why do we need to separate Spark 3.1 and Spark 3.2? There is a lot of repeated code, so try to optimize it if we really do need to handle Spark 3.1 and Spark 3.2 separately.
answer: This is very difficult. 1) We do not support DataSource V2 on Spark 3.1; in other words, Hudi currently does not support all DDL operations on Spark 3.1, and we would have to inject rules to support that. However, Hudi supports DSv2 on Spark 3.2.1. 2) The DSv2 DDL operation interfaces changed drastically between Spark 3.1 and Spark 3.2.1. This change should stabilize in the future; at least Spark 3.2.1 and Spark 3.3 are consistent. 3) We need a new parquet reader to support schema evolution, but that interface also changed between Spark 3.1 and Spark 3.2.1; again, at least Spark 3.2.1 and Spark 3.3 are consistent. 4) Unlike Iceberg, Hudi does not implement its own parquet file reader. Iceberg builds the capability of schema evolution directly into its file reader, so Hudi must adapt to the different Spark versions for now.
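The version split argued for in question 5 amounts to selecting a version-specific adapter at runtime, since the DSv2 DDL and parquet-reader interfaces differ between Spark 3.1 and Spark 3.2.1+. A minimal sketch of that selection (class and function names here are hypothetical, not Hudi's actual adapter classes):

```python
class Spark31Adapter:
    """Spark 3.1: no DSv2 DDL support, so DDL rules must be injected."""
    supports_dsv2_ddl = False

class Spark32PlusAdapter:
    """Spark 3.2.1+ (and 3.3): DSv2 DDL interfaces are available."""
    supports_dsv2_ddl = True

def pick_adapter(spark_version: str):
    # Compare only (major, minor); 3.2 and above share one adapter
    # because their interfaces are consistent.
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return Spark32PlusAdapter() if (major, minor) >= (3, 2) else Spark31Adapter()
```

The point of the sketch is that the repeated code cannot be fully merged: each adapter compiles against a different Spark interface, so only the selection logic is shared.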
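The meta-file fast path described in question 1 can be illustrated with a small sketch (all names are hypothetical; this is not Hudi's actual API): the latest schema is read from the newest commit that embeds one, and the potentially large history file in the .schema folder is parsed only as a fallback.

```python
import json

def resolve_latest_schema(commits, history_json):
    """commits: oldest-to-newest list of commit metadata dicts."""
    # Fast path: newest-first scan for an embedded latest schema
    # (rollback/clean entries may not carry one).
    for commit in reversed(commits):
        schema = commit.get("latest_schema")
        if schema is not None:
            return schema
    # Fallback: parse the whole schema-history file and take the
    # entry with the highest version id.
    history = json.loads(history_json)
    return max(history, key=lambda s: s["version_id"])
```

This mirrors the trade-off in the answer: keeping the schema in the meta file costs a little duplication but avoids replaying a history file that grows with every schema change.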
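The old-table fallback in question 3 rests on id-based field resolution: with an internal schema, an evolved reader matches columns by field id rather than by name, so a renamed column still resolves against old parquet files, while tables without an internal schema keep the original name-based path. A conceptual sketch of id-based resolution (hypothetical, not Hudi's implementation):

```python
def project_row(row, file_schema, query_schema):
    """Resolve a row written under file_schema against the evolved
    query_schema, matching columns by field id instead of by name.
    Schemas are {field_id: column_name} maps; ids absent from the
    file (newly added columns) read as None."""
    values_by_id = {fid: row[name] for fid, name in file_schema.items()}
    return {name: values_by_id.get(fid) for fid, name in query_schema.items()}
```

For example, if the file was written with columns `a` and `b`, and the query schema later renamed `b` to `c` and added `d`, the renamed column still finds its data by id while the new column reads as null.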
