> Maybe you can simplify the pipeline by synchronizing data from MariaDB to
> Paimon directly.
> Paimon sink is already available in FlinkCDC, so we can directly write
> data from MariaDB to Paimon.

Yes! I have accomplished this and will explore it more. However, having
Kafka in the middle has some advantages:

- A distributed log buffer: MariaDB read and downstream (e.g. Paimon table)
write jobs are decoupled.
- CDC data in Kafka can be reused: we can sink it to Paimon or Iceberg
tables, and we could use a Kafka + debezium-json source to implement other
streaming jobs.
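
The second point could be sketched in Flink SQL. This is only an
illustration: the topic, server address, and columns below are made up for
the example, not taken from our actual setup.

```sql
-- Hypothetical table over the CDC topic written by the flink-cdc Kafka sink.
-- Topic, bootstrap server, and columns are placeholders for this sketch.
CREATE TABLE revision_changes (
  rev_id BIGINT,
  rev_timestamp STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'mediawiki.revision',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'revision-metrics',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- Any number of independent streaming jobs can then consume the same topic,
-- e.g. a simple continuous count:
SELECT COUNT(*) AS revisions FROM revision_changes;
```

Each such job reads the topic independently, without adding load on MariaDB.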

I added this to the JIRA. Thank you!





On Mon, Oct 28, 2024 at 6:52 AM Yanquan Lv <decq12y...@gmail.com> wrote:

> Hi, Andrew.
> Yeah, currently the output from the Kafka pipeline doesn't contain schema
> info, so in the Paimon action it will be considered as a String type.
> Your suggestion is very meaningful. I plan to support this feature in the
> next version of FlinkCDC (FlinkCDC 3.3), which may be enabled through a
> parameter[1].
>
> And the Paimon sink is already available in FlinkCDC, so we can directly
> write data from MariaDB to Paimon to reduce the number of components and
> links that need to be maintained. We will also follow up on any issues
> encountered in the Paimon pipeline sink.
>
> [1] https://issues.apache.org/jira/browse/FLINK-36611
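>
> For that direct route, a pipeline YAML might look roughly like the sketch
> below. Hostnames, credentials, table names, and the warehouse path are all
> placeholders, not a tested configuration.
>
> ```yaml
> # Sketch of a Flink CDC pipeline writing straight to Paimon.
> # All connection values below are placeholders.
> source:
>   type: mysql            # the mysql connector also reads from MariaDB
>   hostname: localhost
>   port: 3306
>   username: flinkcdc
>   password: "******"
>   tables: mediawiki.revision
>
> sink:
>   type: paimon
>   catalog.properties.metastore: filesystem
>   catalog.properties.warehouse: /path/to/warehouse
>
> pipeline:
>   name: MariaDB to Paimon
>   parallelism: 1
> ```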
>
>
> On Oct 26, 2024, at 11:32, Andrew Otto <o...@wikimedia.org> wrote:
>
> Hi!
>
> I really like Flink CDC's pipeline connectors
> <https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/pipeline-connectors/overview/>!
> So simple!
> I also like Paimon's CDC ingestion action CLI
> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/overview/>.
>
> I like these because I don't need to specify the schemas; they are
> inferred from the source.  I also like the schema evolution support!
>
> Paimon's recent Iceberg Compatibility
> <https://paimon.apache.org/docs/master/migration/iceberg-compatibility/>
> mode looks cool!
>
> I'd like to accomplish the following:
>
>    - MariaDB CDC -> Kafka
>    - Kafka -> Paimon
>    - Query with Spark+Iceberg
>
> I can do MariaDB CDC -> Kafka with the flink-cdc pipeline connector
> <https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/pipeline-connectors/kafka/>
> .
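>
> A minimal sketch of such a pipeline YAML is below. Hostnames, credentials,
> and table names are placeholders, and I am assuming debezium-json is the
> Kafka sink's value format.
>
> ```yaml
> # Sketch of a Flink CDC pipeline from MariaDB to Kafka.
> # All connection values below are placeholders.
> source:
>   type: mysql            # the mysql connector also reads from MariaDB
>   hostname: localhost
>   port: 3306
>   username: flinkcdc
>   password: "******"
>   tables: mediawiki.revision
>
> sink:
>   type: kafka
>   properties.bootstrap.servers: localhost:9092
>   value.format: debezium-json   # stated explicitly; I believe this is the default
>
> pipeline:
>   name: MariaDB to Kafka
>   parallelism: 1
> ```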
> However, I'm stuck on Kafka -> Paimon.
>
> Paimon's Kafka CDC action docs
> <https://paimon.apache.org/docs/0.9/flink/cdc-ingestion/kafka-cdc/> say:
> > Usually, debezium-json contains ‘schema’ field, from which Paimon will
> retrieve data types. Make sure your debezium json has this field, or Paimon
> will use ‘STRING’ type.
>
> However, the messages generated by flink-cdc's pipeline connector do not
> have a schema field.  They look like:
>
> {
>   "before": null,
>   "after": {
>     "rev_id": 37,
>      ...
>   },
>   "op": "c"
> }
>
> The only reference I can find to a schema field is in the debezium-json
> format documentation
> <https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/connectors/table/formats/debezium/#debezium-json-ddl>
> .
> > users may setup the Debezium Kafka Connect with the Kafka configuration
> 'value.converter.schemas.enable' enabled to include schema in the message
>
> This leads me to believe that the schema field Paimon's doc is referring
> to is added not by Debezium or Flink CDC, but by Kafka Connect when it is
> used with Debezium proper to write CDC messages to Kafka.
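>
> For comparison, a message with that Kafka Connect envelope would look
> roughly like this (a sketch based on my reading of the docs, with the
> per-field schema metadata heavily abbreviated):
>
> ```json
> {
>   "schema": {
>     "type": "struct",
>     "fields": [
>       {"field": "before", "type": "struct", "optional": true},
>       {"field": "after", "type": "struct", "optional": true},
>       {"field": "op", "type": "string", "optional": false}
>     ],
>     "optional": false
>   },
>   "payload": {
>     "before": null,
>     "after": {"rev_id": 37},
>     "op": "c"
>   }
> }
> ```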
>
>    - Does this mean that the CDC messages generated by the flink-cdc
>    kafka pipeline connector are not compatible with Paimon's
>    kafka_sync_database action?
>    - Or, is there a way to cause flink-cdc pipeline connectors to include
>    schema in the message?
>
>
> I could be misunderstanding something. Is Kafka Connect+Debezium used by
> Flink to support debezium-json formatted messages? I tried passing
> properties.value.converter.schemas.enable: true to the flink-cdc pipeline
> Kafka sink, but that did not work (as expected).
>
> Thank you!
>
> -Andrew Otto
>  Wikimedia Foundation
>
> P.S. Context for what we are trying to do is here: T373144 [SPIKE] Learn
> and document how to use Flink-CDC from MediaWiki MariaDB locally
> <https://phabricator.wikimedia.org/T373144>
>