watermelon12138 commented on pull request #4925: URL: https://github.com/apache/hudi/pull/4925#issuecomment-1055454800
> One high-level question: are there any limitations or prerequisites, in terms of different schemas from different source topics, that users should be aware of when using this new feature? What happens when one source schema has a few fields that are not present in a second source schema, and so on? Does that lead to data loss? How are we handling that?

@pratyakshsharma Good question! This new feature is designed to support the scenario in which multiple sources are ingested into one sink table. When the schemas of the sources are inconsistent, we must configure an independent schema and transformer for each source, so that each source's data is converted to the schema of the sink table before it is written.

For example, we can set hoodie.deltastreamer.schemaprovider.source.schema.file or hoodie.deltastreamer.source.schemaProviderClassName to specify the schema of each source, and then set hoodie.deltastreamer.source.transformerClassNames or hoodie.deltastreamer.transformer.sql to convert the source schema to the sink table schema. I highly recommend hoodie.deltastreamer.transformer.sql, which can also combine source data with Hive tables, for example through a join.

This approach helps resolve schema inconsistency issues. I look forward to hearing from you.
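As a rough illustration of the configuration described above, the properties for one source might look like the sketch below. The file path and the column names (id, ts, name, region) are hypothetical, and how these keys are scoped to one source among several is defined by the PR and is not shown here. The sketch assumes the SQL transformer's `<SRC>` placeholder, which stands for the incoming source batch.

```properties
# Sketch only: the path and column names are hypothetical.

# Schema of this source, read from an Avro schema file.
hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/source_a.avsc

# Project the source onto the sink table's columns.
# <SRC> is replaced by the SQL transformer with the incoming batch.
# This source is assumed to lack the sink column "region", so it is filled with NULL.
hoodie.deltastreamer.transformer.sql=SELECT id, ts, name, CAST(NULL AS STRING) AS region FROM <SRC>
```

A second source whose fields differ would get its own schema file and its own SELECT, so that both sources produce rows with the sink table's columns.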