xiarixiaoyao commented on pull request #4910: URL: https://github.com/apache/hudi/pull/4910#issuecomment-1080120016
Now, let's discuss the questions one by one @bvaradar @YannByron

**question1**: do the SerDeHelper.LATESTSCHEMA attribute of a commit file and the SAVE_SCHEMA_ACTION file save the same thing, or can they be converted into each other? If hoodie.schema.evolution.enable is enabled, will every commit persist SerDeHelper.LATESTSCHEMA in the meta file?
answer: 1) Not every commit; rollback/clean will not persist it. 2) No. We save latest_schema in the meta file just like the avro schema, and we also save the history schema in the .schema folder for query engines to use and to trace schema changes. We could choose not to persist latest_schema in the meta file, and that change would not affect anything; but if the schema changes frequently and the schema is large, the historical schema files grow large, and parsing them directly is less efficient than reading the meta file directly. That is why this design was chosen. If it is inappropriate, I can delete this logic.

**question2**: When will the SAVE_SCHEMA_ACTION file be committed? As soon as the schema is changed?
answer: When we finish a write operation and perform a DDL operation. This has been discussed in https://github.com/apache/hudi/pull/4910#discussion_r834251881

**question3**: How do we make a Hudi table written by an old version like 0.10 compatible with this? If hoodie.schema.evolution.enable is enabled on an existing old-version Hudi table, what will happen?
answer: 1) This has nothing to do with the Hudi table version. We use the internal schema to control all evolution logic; a historical table has no internal schema, so it falls back directly to the original logic. 2) We can also use this feature on an old Hudi table; we just need to enable schema evolution. 3) If we use an old Hudi version to query a new-version evolved Hudi table, compatibility is not guaranteed.
But if your evolution is consistent with the evolution supported by old versions of Hudi, an old Hudi version can still query the new-version evolved table.

**question4**: can this pr work when hoodie.metadata.enable is on?
answer: Of course; hoodie.metadata.enable is enabled by default.

**question5**: why do we need to separate Spark 3.1 and Spark 3.2? There is a lot of repeated code, so try to optimize it if we really do need to handle Spark 3.1 and Spark 3.2 separately.
answer: This is very difficult. 1) We do not support DataSource V2 on Spark 3.1; in other words, Hudi currently does not support all DDL operations on Spark 3.1, and we would have to inject rules to support that. However, Hudi supports DSv2 on Spark 3.2.1. 2) The DSv2 DDL operation interfaces changed drastically between Spark 3.1 and Spark 3.2.1. This change should stabilize in the future; at least Spark 3.2.1 and Spark 3.3 are consistent. 3) We need a new parquet reader to support schema evolution, but that interface also changed between Spark 3.1 and Spark 3.2.1; again, at least Spark 3.2.1 and Spark 3.3 are consistent. 4) Unlike Iceberg, Hudi does not implement its own parquet file reader. Iceberg builds the capability of schema evolution directly into its file reader, so Hudi must adapt to the different Spark versions for now.
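The version split argued for in question 5 amounts to selecting a version-specific adapter at runtime, since the DSv2 DDL and parquet-reader interfaces differ between Spark 3.1 and Spark 3.2.1+. A minimal sketch of that selection (class and function names here are hypothetical, not Hudi's actual adapter classes):

```python
class Spark31Adapter:
    """Spark 3.1: no DSv2 DDL support, so DDL rules must be injected."""
    supports_dsv2_ddl = False

class Spark32PlusAdapter:
    """Spark 3.2.1+ (and 3.3): DSv2 DDL interfaces are available."""
    supports_dsv2_ddl = True

def pick_adapter(spark_version: str):
    # Compare only (major, minor); 3.2 and above share one adapter
    # because their interfaces are consistent.
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return Spark32PlusAdapter() if (major, minor) >= (3, 2) else Spark31Adapter()
```

The point of the sketch is that the repeated code cannot be fully merged: each adapter compiles against a different Spark interface, so only the selection logic is shared.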
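The meta-file fast path described in question 1 can be illustrated with a small sketch (all names are hypothetical; this is not Hudi's actual API): the latest schema is read from the newest commit that embeds one, and the potentially large history file in the .schema folder is parsed only as a fallback.

```python
import json

def resolve_latest_schema(commits, history_json):
    """commits: oldest-to-newest list of commit metadata dicts."""
    # Fast path: newest-first scan for an embedded latest schema
    # (rollback/clean entries may not carry one).
    for commit in reversed(commits):
        schema = commit.get("latest_schema")
        if schema is not None:
            return schema
    # Fallback: parse the whole schema-history file and take the
    # entry with the highest version id.
    history = json.loads(history_json)
    return max(history, key=lambda s: s["version_id"])
```

This mirrors the trade-off in the answer: keeping the schema in the meta file costs a little duplication but avoids replaying a history file that grows with every schema change.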
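The old-table fallback in question 3 rests on id-based field resolution: with an internal schema, an evolved reader matches columns by field id rather than by name, so a renamed column still resolves against old parquet files, while tables without an internal schema keep the original name-based path. A conceptual sketch of id-based resolution (hypothetical, not Hudi's implementation):

```python
def project_row(row, file_schema, query_schema):
    """Resolve a row written under file_schema against the evolved
    query_schema, matching columns by field id instead of by name.
    Schemas are {field_id: column_name} maps; ids absent from the
    file (newly added columns) read as None."""
    values_by_id = {fid: row[name] for fid, name in file_schema.items()}
    return {name: values_by_id.get(fid) for fid, name in query_schema.items()}
```

For example, if the file was written with columns `a` and `b`, and the query schema later renamed `b` to `c` and added `d`, the renamed column still finds its data by id while the new column reads as null.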
