Hi Martijn,

It is not reasonable to merge the Flink and OpenLogReplicator code. A simple network connection is enough, and the projects can remain separate. You could have client/SDK code for OpenLogReplicator (like StreamClient.cpp in the repository) licensed under Apache 2.0; that can be a separate project.
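
Roughly, such a client could look like the sketch below. Everything in it is made up for illustration (the port, the request message, the framing); the actual wire protocol is whatever StreamClient.cpp and the network writer implement:

    // client_sketch.cpp -- illustrative only, NOT the real OpenLogReplicator protocol.
    // Connects over plain TCP, asks to stream from a starting SCN, prints payloads.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return 1;
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9000);                     // example port only
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr); // example address only
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
            std::cerr << "connect failed\n";
            return 1;
        }

        // Hypothetical request: "stream from this SCN", 0 = oldest available.
        uint64_t startScn = 0;
        if (write(fd, &startScn, sizeof(startScn)) != static_cast<ssize_t>(sizeof(startScn)))
            return 1;

        // Hypothetical framing: 4-byte length prefix, then one JSON payload.
        for (;;) {
            uint32_t len = 0;
            if (read(fd, &len, sizeof(len)) != static_cast<ssize_t>(sizeof(len)))
                break;                                   // connection closed or broken
            std::vector<char> buf(len);
            size_t got = 0;
            while (got < len) {
                ssize_t n = read(fd, buf.data() + got, len - got);
                if (n <= 0) { close(fd); return 1; }
                got += static_cast<size_t>(n);
            }
            std::cout << std::string(buf.begin(), buf.end()) << '\n';
        }
        close(fd);
        return 0;
    }

Kept that thin, the client knows nothing about redo logs, only about the payload format it asked for, so it could live in its own Apache 2.0 repository without touching any GPL code.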

Regards,
Adam

> On 9 Jan 2023, at 09:31, Martijn Visser <martijnvis...@apache.org> wrote:
>
> Hi Adam,
>
> Just a side note: since your code is licensed under GPL, I believe it can't be included with Flink, because GPL licenses are considered category X and can't be included with Apache-licensed projects [1].
>
> Best regards,
>
> Martijn
>
> [1] https://www.apache.org/legal/resolved.html#category-x
>
> On Mon, Jan 9, 2023 at 12:47 AM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>
>> Hi Gunnar,
>>
>> Thank you very much for your help. I really appreciate it.
>>
>> I believe you might be right. But Flink still has its own connector for Oracle which does not require Debezium. I'm not an expert and don't have a wide view of how most customer sites work.
>>
>> My intention was just to show clearly what the current state of development is.
>>
>> I'm getting close to the point of deciding whether OpenLogReplicator should also have a commercial version with some enterprise features, or whether I should reduce the time I spend on the project and just see what happens.
>>
>> Regards,
>> Adam
>>
>>> On 6 Jan 2023, at 13:18, Gunnar Morling <gunnar.morl...@googlemail.com.INVALID> wrote:
>>>
>>> Hey Adam, all,
>>>
>>> Just came across this thread, still remembering the good conversations we had around this while I was working on Debezium full-time :)
>>>
>>> Personally, I still believe the best way forward with this would be to add support to the Debezium connector for Oracle so it can ingest changes from a remote OpenLogReplicator instance via that server you've built. That way, you don't need to deal with any Kafka specifics, and users would inherit the existing functionality for backfilling, integration with Debezium Server (i.e. for non-Kafka scenarios like Apache Pulsar, Kinesis, etc.), and the Debezium engine (which is what Flink CDC is based on). The Debezium connector for Oracle is already built in a way that supports multiple stream ingestion adapters (currently LogMiner and XStream), so adding another one for OLR would be rather simple. This approach would simplify things a lot from a Flink (CDC) perspective.
>>>
>>> I've just pinged the folks over in the Debezium community on this; it would be great to see progress in this matter.
>>>
>>> Best,
>>>
>>> --Gunnar
>>>
>>>
>>> On Thu, 5 Jan 2023 at 20:55, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>
>>>> Thanks Leonard, Jark,
>>>>
>>>> I will just reply on the dev list for this topic, as it is more related to development. Sorry, I sent this to two lists; I don't want to add more chaos here.
>>>>
>>>> The answer to your question is not straightforward, so I will start with a broader picture.
>>>>
>>>> Maybe first I will describe some assumptions I chose while designing OpenLogReplicator. The project aims to be minimalistic: it should only contain the code necessary to parse Oracle redo logs. Nothing more; it should not be a fully functional replicator. So the targets are limited to middleware (like Kafka, Flink, some MQ). The number of dependencies is kept to a minimum.
>>>>
>>>> The second assumption is to make the project stateless wherever possible. The goal is to run in HA (Kubernetes) and store state in Redis (not yet implemented). But generally OpenLogReplicator should not, if possible, keep track of the position of data confirmed by the receiver.
>>>> This allows the receiver to choose how to handle failures (data duplicated on restart, idempotent messages).
>>>>
>>>> The third topic is the initial data load. There is plenty of software available for that, and there is absolutely no need to duplicate it in this project. No ETL, no selects, etc. My goal is just to track changes.
>>>>
>>>> The fourth assumption is to write the code in C++, so that it is fast and I have full control over memory. The code can fully reuse memory and also work on machines with little memory. This allows easy compilation on Linux, and maybe in the future also on Solaris, AIX, HP-UX, or even Windows (if there is demand for that). I think Java is good for some solutions, but not for a binary parser which works heavily with memory and in most cases uses a zero-copy approach.
>>>>
>>>> The amount of data in the output is actually defined by the source database (how much data is logged: the full schema or just the changed columns). I don't care; the user defines what the database logs. If it is just the primary key and the changed columns, I can send just the changed data. If someone wants the full schema in every payload, that is fine too. If the schema changes, no problem: I can provide the DDL commands and process further payloads with the new schema.
>>>>
>>>> The format of the data is defined by the receiver. My first choice was JSON. Then the Debezium guys asked me to support Protobuf, so I spent a lot of time and extended the architecture to make the code modular and allow choosing the payload format. The writer module can directly produce JSON or Protobuf payloads, and this can be extended to any other format if there is demand for it. The JSON output also allows many formatting options. I generally don't test the Protobuf code; I would treat it as a prototype, because I know of nobody who wants to use it. It was written for a Debezium request, but so far nobody cares.
>>>>
>>>> Integration with other systems and languages is an open case; I am agnostic here. The data produced for output is stored in a buffer and can be sent to any target. This is done by the Writer module (you can look at the code), and there are writers for Kafka, ZeroMQ, and even a plain TCP/IP network connection. I don't quite understand the question about adapting this better: if I get a specification, I can extend it. Say what you need.
>>>>
>>>> When we have a bidirectional connection (unlike with Kafka), the receiver can define the starting position (SCN) of the stream it wants to receive.
>>>>
>>>> You can look at the prototype code to see how this communication would work: StreamClient.cpp. Please treat it as a working prototype. It is a client which connects to OpenLogReplicator over the network, defines the starting SCN, and then just receives payloads.
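>>>>
>>>> To sketch what that contract could look like from the receiver's side (all names here are invented for illustration; the real handshake is whatever StreamClient.cpp actually implements):
>>>>
>>>>     // Illustrative only: the receiver owns the replication position,
>>>>     // which is what lets OpenLogReplicator itself stay stateless.
>>>>     #include <cstdint>
>>>>
>>>>     struct ReceiverPosition {
>>>>         uint64_t lastConfirmedScn = 0; // persisted by the receiver, e.g. in its checkpoint
>>>>
>>>>         // On first connect and on every reconnect, ask to stream from here.
>>>>         // Anything between this SCN and a crash may arrive again (at-least-once).
>>>>         uint64_t resumeFrom() const { return lastConfirmedScn; }
>>>>
>>>>         // Called after a payload has been durably processed; once confirmed,
>>>>         // OpenLogReplicator may drop the transaction from its cache.
>>>>         void confirm(uint64_t scn) { lastConfirmedScn = scn; }
>>>>     };
>>>>
>>>> The failure cases below all reduce to this: whichever side restarts, the receiver restates its last confirmed SCN and the stream continues from there.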
>>>>
>>>> In case of failure:
>>>> - The connection is broken: the client reconnects, states the last confirmed position, and asks for the following transactions.
>>>> - OpenLogReplicator crashes: after the restart, the client states the last confirmed position and asks for the following transactions.
>>>> - The client crashes: the client recovers itself and asks for the transactions after the position it has confirmed.
>>>>
>>>> I assume that once the client confirms some SCN as processed, OpenLogReplicator can remove it from its cache, and it is not possible that after a reconnect the client demands data it previously declared as confirmed.
>>>>
>>>> Well, this is what is currently done. Some code was driven by requests from the Debezium team towards a future integration, like support for Protobuf or putting extra data in the payload, but it was never used. We have opened a ticket in their Jira for the integration:
>>>> https://issues.redhat.com/projects/DBZ/issues/DBZ-2543?filter=allopenissues
>>>> But there is no progress, and no feedback on whether they want the integration or not. I have made some effort to allow easier integration, but I'm not going to write Kafka Connect code for OpenLogReplicator; I just don't have the resources for that. I think they are focused on their own approach with LogMiner, waiting for OpenLogReplicator to become more mature before any integration is done. If you make the Flink integration depend on the Debezium integration, it may never happen.
>>>>
>>>> I was recently focused mostly on making the code stable and releasing version 1.0, and I have reached that point. I am not aware of any problems with the code that is currently working. The code aims to be modular and allow easy integration, but as you mentioned, there is no SDK. Actually, this is the topic I would like to talk about. Is there a reason for an SDK? Would someone find it useful? Maybe plain Kafka is enough. Maybe it would be best if someone took the code and rewrote it in Java? But definitely not me; I would find that nonsense, and the Java code would suffer.
>>>>
>>>> What kind of interface would be best for Flink? OpenLogReplicator produces payloads in Protobuf or JSON. If you want to use, for example, XML, it would be a waste to write a converter; I would definitely prefer to add another writer module that produces XML directly. If you need a certain format, that is no problem.
>>>>
>>>> But if you want a full initial data load (snapshot), that can't be done, because this project is not for that. You have your own good code for it.
>>>>
>>>> In practice, I think there would be just a few projects which could be receivers of data from OpenLogReplicator, and there is no reason to write a generic SDK for everybody.
>>>>
>>>> My goal was just to start a conversation and discuss whether such an integration really makes sense or not. I really prefer a simple architecture, with as few data conversions as necessary; not a setup where I produce some format and you convert it anyway. This way replication from Oracle can be really fast.
>>>>
>>>> I'm just about to begin writing tutorials for OpenLogReplicator, and the documentation is out of date. I have a regular day job which I need to pay the rent, and a family, and I work on this project only in my free time, so progress is slow. I don't expect that to change in the future.
>>>> But in spite of that, I know of companies who already use the code in production, and it works fast and stably. From clients' perspective it works 10 times faster than LogMiner, but that depends on the actual case; you would need to run a benchmark and test it yourself.
>>>>
>>>> Regards,
>>>> Adam
>>>>
>>>>> On 5 Jan 2023, at 09:41, Leonard Xu <xbjt...@gmail.com> wrote:
>>>>>
>>>>> Hi Adam & Márton,
>>>>>
>>>>> Thanks for bringing the discussion here.
>>>>>
>>>>> The Flink CDC project provides the Oracle CDC Connector, which can be used to capture historical and transaction log data from the Oracle database and ingest it into Flink. In the latest version, 2.3, the Oracle CDC Connector already supports the parallel incremental snapshot algorithm, which enables parallel reading of historical data and lock-free switching from historical reading to transaction log reading. In the phase of capturing transaction log data, the connector uses Debezium as the library, which supports the LogMiner and XStream APIs to capture change data. IIUC, OpenLogReplicator could be used as a third way.
>>>>>
>>>>> For integrating OpenLogReplicator, there are several interesting points that we can discuss further:
>>>>> (1) None of the Flink CDC connectors rely on Kafka or other message queue storage; captured data is processed directly. I think the network stream way of OpenLogReplicator needs to be adapted better.
>>>>> (2) The Flink CDC project, like Flink itself, is mainly developed in Java. Does OpenLogReplicator provide a Java SDK for easy integration?
>>>>> (3) If OpenLogReplicator plans to be integrated into the Debezium project first, the Flink CDC project can easily integrate OpenLogReplicator by bumping the Debezium version.
>>>>>
>>>>> Best,
>>>>> Leonard
>>>>
>>>>> On 5 Jan 2023, at 04:15, Jark Wu <imj...@gmail.com> wrote:
>>>>>
>>>>> Hi Adam,
>>>>>
>>>>> Thanks for sharing this interesting project. I think it is definitely valuable for users who want better speed.
>>>>>
>>>>> I am one of the maintainers of the flink-cdc-connectors project. The project offers an “oracle-cdc” connector which uses Debezium (which depends on LogMiner) as the CDC library. From the perspective of the “oracle-cdc” connector, I have some questions about OpenLogReplicator:
>>>>>
>>>>> 1) Can OpenLogReplicator provide a Java SDK to allow Flink to communicate with the Oracle server directly, without deploying any other service?
>>>>> 2) How much overhead does it put on Oracle compared to the LogMiner approach?
>>>>> 3) Did you discuss this with the Debezium community? I think Debezium might be interested in this project as well.
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>>> On 5 Jan 2023, at 07:32, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>>>
>>>>>> Hi Márton,
>>>>>>
>>>>>> Thank you very much for your answer.
>>>>>>
>>>>>> The point about Kafka makes sense. It offers a huge bag of potential connectors that could be used. But… not everybody wants or needs Kafka. It brings additional architectural complication and delays, which might not be acceptable to everybody. That's why you do have your own connectors anyway.
>>>>>>
>>>>>> The Flink connector which reads from Oracle uses the LogMiner technology, which is not acceptable for every user: it has big limitations regarding speed.
>>>>>> You can overcome that only with a binary reader of the database redo log (roughly 10 times faster, with latency even down to 50-100 ms).
>>>>>>
>>>>>> The reason I am asking is not just to create some additional connector for fun. My main concern is whether there is actual demand from users for getting changes from the source database faster, or with lower latency. You can find a lot of information on the net about the differences between a binary log reader and one using the LogMiner technology.
>>>>>>
>>>>>> I think that would be enough for a start. Please tell me what you think about it. Would anyone consider using such a connector?
>>>>>>
>>>>>> Regards,
>>>>>> Adam Leszczyński
>>>>>>
>>>>>>> On 4 Jan 2023, at 12:07, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>>>
>>>>>>> (cc Leonard)
>>>>>>>
>>>>>>> Hi Adam,
>>>>>>>
>>>>>>> From an architectural perspective, if you land the records in Kafka or another message broker, Flink will be able to process them; at this point I do not see much merit in trying to circumvent that step. There is a related project in the Flink space called CDC Connectors [1]. I highly encourage you to check it out for context, and I have cc'd Leonard, one of its primary maintainers.
>>>>>>>
>>>>>>> [1] https://github.com/ververica/flink-cdc-connectors/
>>>>>>>
>>>>>>> On Tue, Jan 3, 2023 at 8:40 PM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>>>>
>>>>>>>> Hi Flink Team,
>>>>>>>>
>>>>>>>> I'm the author of OpenLogReplicator, an open-source parser of Oracle redo logs which can send transactions to a message bus. Currently the implemented sinks are just a text file and a Kafka topic. Transactions can also be sent over a plain TCP connection or through a message queue like ZeroMQ. The code is GPL, and all Oracle versions from 11.2 are supported. No LogMiner needed.
>>>>>>>>
>>>>>>>> Transactions can be sent in JSON or Protobuf format. The code has reached GA and is actually used in production.
>>>>>>>>
>>>>>>>> The architecture is modular and makes it very easy to add other sinks, for example for Apache Flink. Actually, I'm moving towards an approach where OpenLogReplicator could use Kubernetes and work in HA.
>>>>>>>>
>>>>>>>> Well… that is the general direction. Do you think there could be some application of this software with Apache Flink? For example, there could very easily be a client which connects to OpenLogReplicator over a TCP connection, gets the transactions, and just sends them to Apache Flink. An example of such a client is also present in the GitHub repo:
>>>>>>>> https://github.com/bersler/OpenLogReplicator
>>>>>>>>
>>>>>>>> Is there any rationale for such an integration? Or is it just a waste of time, because nobody would use it anyway?
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Adam Leszczyński