Hi Adam,

Just a side note: since your code is licensed under the GPL, I believe it can't be included with Flink, because GPL licenses are considered Category X and can't be included in Apache-licensed projects [1].
Best regards,

Martijn

[1] https://www.apache.org/legal/resolved.html#category-x

On Mon, Jan 9, 2023 at 12:47 AM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> Hi Gunnar,
>
> Thank you very much for your help. I really appreciate it.
>
> I believe you might be right. But Flink still has its own connector to Oracle which does not require Debezium. I'm not an expert and don't have a wider view of how most customer sites work.
>
> My intention was just to clearly show what the current state of development is.
>
> I'm getting close to the point of deciding whether OpenLogReplicator should also have a commercial version with some enterprise features or not, or whether I should maybe reduce the time I spend on the project and just see what happens.
>
> Regards,
> Adam
>
>
> On 6 Jan 2023, at 13:18, Gunnar Morling <gunnar.morl...@googlemail.com.INVALID> wrote:
> >
> > Hey Adam, all,
> >
> > Just came across this thread, still remembering the good conversations we had around this while I was working on Debezium full-time :)
> >
> > Personally, I still believe the best way forward with this would be to add support to the Debezium connector for Oracle so it can ingest changes from a remote OpenLogReplicator instance via that server you've built. That way, you don't need to deal with any Kafka specifics, and users would inherit the existing functionality for backfilling, integration with Debezium Server (i.e. for non-Kafka scenarios like Apache Pulsar, Kinesis, etc.), and the Debezium engine (which is what Flink CDC is based on). The Debezium connector for Oracle is already built in a way that supports multiple stream ingestion adapters (currently LogMiner and XStream), so adding another one for OLR would be rather simple. This approach would simplify things a lot from a Flink (CDC) perspective.
> >
> > I've just pinged the folks over in the Debezium community on this; it would be great to see progress on this matter.
> >
> > Best,
> >
> > --Gunnar
> >
> >
> > On Thu, 5 Jan 2023 at 20:55, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> >>
> >> Thanks Leonard, Jark,
> >>
> >> I will just reply on the dev list for this topic, as it is more related to development. Sorry, I sent to two lists; I don't want to add more chaos here.
> >>
> >> The answer to your question is not straightforward, so I will start with a broader picture.
> >>
> >> Maybe first I will describe some assumptions that I made while designing OpenLogReplicator. The project aims to be minimalistic. It should only contain the code that is necessary to parse Oracle redo logs, nothing more; it should not be a fully functional replicator. So the targets are limited to middleware (like Kafka, Flink, some MQ). The number of dependencies is kept to a minimum.
> >>
> >> The second assumption is to make the project stateless wherever possible. The goal is to run it in HA (Kubernetes) and store state in Redis (not yet implemented). But generally OpenLogReplicator should not (if possible) keep the information about the position of the data confirmed by the receiver. This allows the receiver to choose how to handle failures (data duplicated on restart, idempotent messages).
> >>
> >> The third topic is initial data load. There is plenty of software available for that, and there is absolutely no need to duplicate it in this project. No ETL, selects, etc. My goal is just to track changes.
> >>
> >> The fourth assumption is to write the code in C++, so that it is fast and I have full control over memory. The code can fully reuse memory and also work on machines with little memory. This allows easy compilation on Linux, and maybe in the future also on Solaris, AIX, HP-UX, or even Windows (if there is demand for that). I think Java is good for some solutions, but not for a binary parser which works heavily with memory and in most cases uses a zero-copy approach.
> >>
> >> The amount of data in the output is actually defined by the source database (how much data is logged: the full schema or just the changed columns). I don't care; the user defines what is logged by the database. If it is just the primary key and the changed columns, I can send just the changed data. If someone wants the full schema in every payload, that is fine too. If the schema changes, no problem: I can provide just the DDL commands and process further payloads with the new schema.
> >>
> >> The format of the data is actually defined by the receiver. My first choice was JSON. Next, the Debezium guys asked me to support Protobuf. OK, I spent a lot of time and extended the architecture to make the code modular and allow choosing the format of the payload. The writer module can directly produce a JSON or Protobuf payload, and that can be extended to any other format if there is demand for it. The JSON format also allows many formatting options. I generally don't test the Protobuf code; I would treat it as a prototype, because I know nobody who would like to use it. This code was planned for the Debezium request, but so far nobody cares.
> >>
> >> Integration with other systems and languages is an open case; I am agnostic here. The data produced for output is stored in a buffer and can be sent to any target. This is done by the Writer module (you can look at the code), and there are writers for Kafka, ZeroMQ, and even a plain TCP/IP network connection. I don't understand the question about adapting that better. If I have a specification, I can extend it. Say what you need.
> >>
> >> In the case where we have a bidirectional connection (unlike with Kafka), the receiver can define the starting position (SCN) of the stream he/she wants to receive.
> >>
> >> You can look at the prototype code to see how this communication would look: StreamClient.cpp, but please treat it as a working prototype. This is a client which just connects to OpenLogReplicator over the network, defines the starting SCN, and then just receives the payload.
> >>
> >> In case:
> >> - The connection is broken: the client reconnects, reports the last confirmed data, and just asks for the following transactions.
> >> - OpenLogReplicator crashes: after restart, the client reports the last confirmed data and asks for the following transactions.
> >> - The client crashes: the client needs to recover itself and ask for the transactions that come after the data it has already confirmed.
> >>
> >> I assume that if the client confirms some SCN as processed, OpenLogReplicator can remove it from the cache, and it is not possible that after a reconnect the client would demand data that it previously declared as confirmed.
> >>
> >> Well, this is what is currently done. Some code was driven by requests from the Debezium team towards future integration, like support for Protobuf or putting some data into the payload, but it was never used.
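> >>
> >> To make the network stream idea more concrete: a receiver along the lines of StreamClient.cpp could look roughly like the sketch below on the Java side. This is only an illustration; the host, port, message framing, and field names here are made up, and the real wire format is whatever the OpenLogReplicator network writer implements.
> >>
> >> import java.io.BufferedReader;
> >> import java.io.InputStreamReader;
> >> import java.io.PrintWriter;
> >> import java.net.Socket;
> >> import java.nio.charset.StandardCharsets;
> >>
> >> public class OlrStreamClientSketch {
> >>     public static void main(String[] args) throws Exception {
> >>         // Recovered from the receiver's own state; OpenLogReplicator itself stays stateless.
> >>         long lastConfirmedScn = 0L;
> >>         // Hypothetical host and port, just for illustration.
> >>         try (Socket socket = new Socket("olr-host", 9000);
> >>              PrintWriter out = new PrintWriter(socket.getOutputStream(), true, StandardCharsets.UTF_8);
> >>              BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
> >>
> >>             // Ask for everything after the last SCN this receiver has already confirmed
> >>             // (the request shape is invented here; the real protocol is defined by the writer module).
> >>             out.println("{\"start-scn\": " + lastConfirmedScn + "}");
> >>
> >>             String payload;
> >>             while ((payload = in.readLine()) != null) {
> >>                 // Each payload is one change event (JSON in this sketch); hand it to Flink, Kafka, etc.
> >>                 handle(payload);
> >>                 // Once the event is durably handled, remember its SCN so a reconnect
> >>                 // can resume from here; unconfirmed events may simply be re-delivered.
> >>                 lastConfirmedScn = extractScn(payload);
> >>             }
> >>         }
> >>     }
> >>
> >>     private static void handle(String json) { System.out.println(json); }
> >>     private static long extractScn(String json) { return 0L; /* parse the "scn" field in a real client */ }
> >> }
> >>
> >> On a reconnect after any of the failures listed above, the same client would just send its last confirmed SCN again and continue from there, which matches the duplicated-on-restart / idempotent-message model I described.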
> >>
> >> We have opened a ticket in their Jira for the integration: https://issues.redhat.com/projects/DBZ/issues/DBZ-2543?filter=allopenissues
> >> But there is no progress and no feedback on whether they want to do the integration or not. I have made some effort to allow easier integration, but I'm not going to write Kafka Connect code for OpenLogReplicator; I just don't have the resources for that. I think they have focused on their own approach with LogMiner, waiting for OpenLogReplicator to become more mature before any integration is done. If you want to make the Flink integration depend on the integration with Debezium, that may never happen.
> >>
> >> Recently I was focused mostly on making the code stable and releasing version 1.0, and I have reached that point. I am not aware of any problems with the code that is currently working. The code is meant to be modular and allow easy integration, but as you mentioned there is no SDK. Actually, this is the topic that I would like to talk about: is there a reason for an SDK? Would someone find it useful? Maybe just plain Kafka is enough. Maybe it would be best if someone took the code and rewrote it in Java? But definitely not me; I would find that nonsense, and the Java code would suffer.
> >>
> >> What kind of interface would be best for Flink? OpenLogReplicator produces the payload in Protobuf or JSON. If you want to use, for example, XML, it would be a waste to write a converter; I would definitely prefer to add another writer module that just produces XML instead. If you need a certain format, that is no problem.
> >>
> >> But if you want a full initial data load (snapshot), that can't be done, because this project is not for that. You have your own good code.
> >>
> >> In practice I think there would be just a few projects which could be receivers of data from OpenLogReplicator, and there is no reason to write a generic SDK for everybody.
> >>
> >> My goal was just to start a conversation and discuss whether such an integration really makes sense or not. I really prefer a simple architecture, with as few data conversions as necessary; not a setup where I produce some format and you convert it anyway. This way replication from Oracle can be really fast.
> >>
> >> I'm just about to begin writing tutorials for OpenLogReplicator, and the documentation is out of date. I have a regular daily job which I need to pay the rent, and a family, and I work on this project just in my free time, so progress is slow. I don't expect that to change in the future. But in spite of that, I know companies who already use the code in production, and it works fast and stably. The clients' perspective is that it works 10 times faster than LogMiner, but this depends on the actual case; you would need to run a benchmark and test it yourself.
> >>
> >> Regards,
> >> Adam
> >>
> >>
> >>> On 5 Jan 2023, at 09:41, Leonard Xu <xbjt...@gmail.com> wrote:
> >>>
> >>> Hi Adam & Márton,
> >>>
> >>> Thanks for bringing the discussion here.
> >>>
> >>> The Flink CDC project provides the Oracle CDC Connector, which can be used to capture historical and transaction log data from the Oracle database and ingest it into Flink. In the latest version, 2.3, the Oracle CDC Connector already supports the parallel incremental snapshot algorithm, which enables parallel reading of historical data and lock-free switching from historical reading to transaction log reading.
> >>> In the phase of capturing transaction log data, the connector uses Debezium as the library, which supports the LogMiner and XStream APIs to capture change data. IIUC, OpenLogReplicator could be used as a third way.
> >>>
> >>> For integrating OpenLogReplicator, there are several interesting points that we can discuss further:
> >>> (1) The Flink CDC connectors do not rely on Kafka or other message queue storage; the captured data is processed directly. I think the network stream approach of OpenLogReplicator would need to be adapted to fit this better.
> >>> (2) The Flink CDC project is mainly developed in Java, as is Flink. Does OpenLogReplicator provide a Java SDK for easy integration?
> >>> (3) If OpenLogReplicator plans to be integrated into the Debezium project first, the Flink CDC project could easily integrate OpenLogReplicator by bumping the Debezium version.
> >>>
> >>> Best,
> >>> Leonard
> >>
> >>> On 5 Jan 2023, at 04:15, Jark Wu <imj...@gmail.com> wrote:
> >>>
> >>> Hi Adam,
> >>>
> >>> Thanks for sharing this interesting project. I think it is definitely valuable for users who need better speed.
> >>>
> >>> I am one of the maintainers of the flink-cdc-connectors project. The project offers an "oracle-cdc" connector which uses Debezium (which depends on LogMiner) as the CDC library. From the perspective of the "oracle-cdc" connector, I have some questions about OpenLogReplicator:
> >>>
> >>> 1) Can OpenLogReplicator provide a Java SDK that allows Flink to communicate with the Oracle server directly, without deploying any other service?
> >>> 2) How much overhead does it put on Oracle compared to the LogMiner approach?
> >>> 3) Did you discuss this with the Debezium community? I think Debezium might be interested in this project as well.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>>> On 5 Jan 2023, at 07:32, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> >>>>
> >>>> Hi Márton,
> >>>>
> >>>> Thank you very much for your answer.
> >>>>
> >>>> The point about Kafka makes sense. It offers a huge bag of potential connectors that could be used. But … not everybody wants or needs Kafka. This brings additional architectural complication and delays, which might not be acceptable to everybody. That's why you have your own connectors anyway.
> >>>>
> >>>> The Flink connector which reads from Oracle uses LogMiner technology, which is not acceptable for every user. It has a big limitation regarding speed. You can overcome that only with a binary reader of the database redo logs (roughly 10 times faster, with delays even down to 50-100 ms).
> >>>>
> >>>> The reason I am asking is not just to create some additional connector for fun. My main concern is whether there is actual demand from users for getting changes from the source database faster, or with lower delay. You can find a lot of information on the net about the differences between a binary redo log reader and one which uses LogMiner technology.
> >>>>
> >>>> I think that would be enough for a start. Please tell me what you think about it. Would anyone consider using such a connector?
> >>>>
> >>>> Regards,
> >>>> Adam Leszczyński
> >>>>
> >>>>
> >>>>> On 4 Jan 2023, at 12:07, Márton Balassi <balassi.mar...@gmail.com> wrote:
> >>>>>
> >>>>> (cc Leonard)
> >>>>>
> >>>>> Hi Adam,
> >>>>>
> >>>>> From an architectural perspective, if you land the records in Kafka or another message broker, Flink will be able to process them; at this point I do not see much merit in trying to circumvent this step.
> >>>>> There is a related project in the Flink space called CDC Connectors [1]; I highly encourage you to check that out for context, and I have cc'd Leonard, one of its primary maintainers.
> >>>>>
> >>>>> [1] https://github.com/ververica/flink-cdc-connectors/
> >>>>>
> >>>>> On Tue, Jan 3, 2023 at 8:40 PM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> >>>>>
> >>>>>> Hi Flink Team,
> >>>>>>
> >>>>>> I'm the author of OpenLogReplicator, an open source parser of Oracle redo logs which allows sending transactions to a message bus. Currently the implemented sinks are just a text file and a Kafka topic. Transactions can also be sent over a plain TCP connection or via a message queue like ZeroMQ. The code is GPL, and all Oracle versions from 11.2 are supported. No LogMiner needed.
> >>>>>>
> >>>>>> Transactions can be sent in JSON or Protobuf format. The code has now reached GA and is actually used in production.
> >>>>>>
> >>>>>> The architecture is modular and makes it very easy to add other sinks, for example for Apache Flink. Actually, I'm moving towards an approach where OpenLogReplicator could use Kubernetes and work in HA.
> >>>>>>
> >>>>>> Well… that is the general direction. Do you think there could be some application of this software with Apache Flink? For example, there could very easily be a client which connects to OpenLogReplicator over a TCP connection, gets the transactions, and just sends them to Apache Flink. An example of such a client is also present in the GitHub repo:
> >>>>>> https://github.com/bersler/OpenLogReplicator
> >>>>>>
> >>>>>> Is there any rationale for such an integration? Or is it just a waste of time because nobody would use it anyway?
> >>>>>>
> >>>>>> Kind regards,
> >>>>>> Adam Leszczyński