Hey Adam, all,

Just came across this thread - I still remember the good conversations
we had around this while I was working on Debezium full-time :)

Personally, I still believe the best way forward with this would be to
add support to the Debezium connector for Oracle so it can ingest
changes from a remote OpenLogReplicator instance via that server
you've built. That way, you wouldn't need to deal with any Kafka
specifics, and users would inherit the existing functionality for
backfilling, integration with Debezium Server (i.e. for non-Kafka
scenarios like Apache Pulsar, Kinesis, etc.), and the Debezium engine
(which is what Flink CDC is based on). The Debezium connector for
Oracle is already built in a way that supports multiple stream
ingestion adapters (currently LogMiner and XStream), so adding another
one for OLR would be rather simple. This approach would greatly
simplify things from a Flink (CDC) perspective.
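
Just to sketch what I mean by another ingestion adapter (the names
below are purely hypothetical; the connector's actual adapter SPI
differs in its details), the shape would roughly be:

    // Illustrative only: the Debezium connector for Oracle dispatches to a
    // pluggable streaming adapter; an OLR-based one would read change events
    // from a remote OpenLogReplicator instance instead of the database itself.
    interface OracleStreamAdapter {
        void stream(long fromScn, java.util.function.Consumer<ChangeEvent> sink)
                throws InterruptedException;
    }

    // Placeholder for the connector's internal change event representation.
    record ChangeEvent(long scn, String payload) {}

    class OpenLogReplicatorAdapter implements OracleStreamAdapter {
        private final String host;
        private final int port;

        OpenLogReplicatorAdapter(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public void stream(long fromScn, java.util.function.Consumer<ChangeEvent> sink)
                throws InterruptedException {
            // connect to host:port, request changes starting at fromScn, and
            // hand each decoded payload to the sink feeding the event pipeline
        }
    }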

I've just pinged the folks over in the Debezium community on this; it
would be great to see progress on this matter.

Best,

--Gunnar


On Thu, 5 Jan 2023 at 20:55, Adam Leszczyński
<aleszczyn...@bersler.com> wrote:
>
> Thanks Leonard, Jark,
>
> I will just reply on the dev list for this topic, as this is more related to
> development. Sorry, I sent it to 2 lists - I don’t want to add more chaos
> here.
>
> The answer to your question is not straightforward, so I will start from a
> broader picture.
>
> Maybe first I will describe some assumptions that I chose while
> designing OpenLogReplicator. The project aims to be minimalistic. It
> should only contain the code that is necessary to parse Oracle redo
> logs. Nothing more; it should not be a fully functional replicator. So, the
> targets are limited to middleware (like Kafka, Flink, some MQ). The number
> of dependencies is kept to a minimum.
>
> The second assumption is to make the project stateless wherever possible.
> The goal is to run it in HA (on Kubernetes) and store state in Redis (not yet
> implemented). But generally, OpenLogReplicator should not keep track
> (if possible) of the position of data confirmed by the
> receiver. This allows the receiver to choose its way of handling failures
> (data duplicated on restart, idempotent messages).
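>
> For example, the receiver side could handle that with a few lines like this
> (a Java sketch just to illustrate the idea; this is not an official SDK):
>
>     // Sketch of an idempotent receiver: OpenLogReplicator may resend data
>     // after a restart, so the receiver skips everything at or below the
>     // position (SCN) it has already confirmed and persisted itself.
>     class IdempotentReceiver {
>         private long lastConfirmedScn;      // restored from the receiver's own store
>
>         IdempotentReceiver(long restoredScn) { this.lastConfirmedScn = restoredScn; }
>
>         long lastConfirmedScn() { return lastConfirmedScn; }
>
>         void onMessage(long scn, String payload) {
>             if (scn <= lastConfirmedScn) {
>                 return;                     // duplicate after restart: ignore
>             }
>             process(payload);               // hand over to the application
>             lastConfirmedScn = scn;         // persist, then confirm back to OLR
>         }
>
>         private void process(String payload) { /* application logic */ }
>     }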
>
> The third topic is the initial data load. There is plenty of software
> available for that. There is absolutely no need to duplicate it in this
> project. No ETL, selects, etc. My goal is just to track changes.
>
> The fourth assumption is to write the code in C++ so that it is fast and I
> have full control over memory. The code can fully reuse memory and also work
> on machines with little memory. This allows easy compilation on Linux, but
> maybe in the future also on Solaris, AIX, HP-UX, or even Windows (if there is
> demand for that). I think Java is good for some solutions, but not for a
> binary parser which works heavily with memory and in most cases uses a
> zero-copy approach.
>
> The amount of data in the output is actually defined by the source database
> (how much data is logged: the full schema or just changed columns). I don’t
> care; the user defines what is logged by the db. If it is just the primary
> key and changed columns, I can send just the changed data. If someone wants
> the full schema in every payload, that is fine too. If the schema changes,
> no problem: I can provide just the DDL commands and process further payloads
> with the new schema.
>
> Format of data - this is actually defined by the receiver. My first choice
> was JSON. Next, the Debezium guys asked me to support Protobuf. OK, I
> spent a lot of time and extended the architecture to actually make the code
> modular and allow choosing the format of the payload. The writer module can
> directly produce a JSON or protobuf payload. That can actually be extended to
> any other format if there is demand for it. The JSON format also allows
> many formatting options. I generally don’t test the protobuf format code;
> I would treat it as a prototype because I know nobody who would like to use
> it. This code was planned at Debezium's request, but so far nobody cares.
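>
> To give an idea, a consumer of the JSON output needs nothing more than a
> JSON parser. A Java sketch (the field names below are illustrative of the
> general shape of the output; the documented format is authoritative):
>
>     import com.fasterxml.jackson.databind.JsonNode;
>     import com.fasterxml.jackson.databind.ObjectMapper;
>
>     // Minimal sketch of decoding one JSON message from OpenLogReplicator.
>     // Field names ("scn", "payload", "op") are assumptions for illustration;
>     // check the documentation for the exact output format and its options.
>     class JsonPayloadReader {
>         private static final ObjectMapper MAPPER = new ObjectMapper();
>
>         static void handle(String message) throws Exception {
>             JsonNode root = MAPPER.readTree(message);
>             long scn = root.path("scn").asLong();
>             for (JsonNode change : root.path("payload")) {
>                 String op = change.path("op").asText(); // kind of change
>                 System.out.println(scn + " " + op + " " + change);
>             }
>         }
>     }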
>
> For integration with other systems and languages - this is an open case;
> I am agnostic here. The data that is produced for output is stored
> in a buffer and can be sent to any target. This is done by the Writer module
> (you can look at the code), and there are writers for Kafka, ZeroMQ, and even
> a plain TCP/IP network connection. I don’t fully understand the question
> about adapting that better. If I have a specification, I can extend it. Say
> what you need.
>
> In the case when we have a bidirectional connection (unlike with Kafka), the
> receiver can define the starting position (SCN) of the stream it
> wants to receive.
>
> You can look at the prototype code to see what this communication would look
> like: StreamClient.cpp - but please treat it just as a working prototype. It
> is a client which just connects to OpenLogReplicator over the network,
> defines the starting SCN, and then just receives the payload.
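>
> In Java, such a client could look roughly like this (a sketch only: the
> message framing and the request encoding below are assumptions made for
> illustration; the real wire protocol is what StreamClient.cpp implements):
>
>     import java.io.DataInputStream;
>     import java.io.DataOutputStream;
>     import java.net.Socket;
>
>     // Sketch of a network client in the spirit of StreamClient.cpp: connect,
>     // send the starting SCN, then keep reading payload messages. The 4-byte
>     // length-prefix framing below is an assumption, not the actual protocol.
>     class OlrStreamClient implements AutoCloseable {
>         private final Socket socket;
>         private final DataInputStream in;
>         private final DataOutputStream out;
>
>         OlrStreamClient(String host, int port) throws Exception {
>             socket = new Socket(host, port);
>             in = new DataInputStream(socket.getInputStream());
>             out = new DataOutputStream(socket.getOutputStream());
>         }
>
>         void start(long startScn) throws Exception {
>             out.writeLong(startScn);        // request: stream from this SCN
>             out.flush();
>         }
>
>         byte[] next() throws Exception {
>             int length = in.readInt();      // assumed length-prefixed framing
>             byte[] message = new byte[length];
>             in.readFully(message);
>             return message;                 // JSON or protobuf payload bytes
>         }
>
>         @Override
>         public void close() throws Exception { socket.close(); }
>     }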
>
> In case of failure:
> - If the connection is broken: the client reconnects, reports the last
> confirmed position, and just asks for the following transactions.
> - If OpenLogReplicator crashes: after a restart, the client reports the last
> confirmed position and asks for the following transactions.
> - If the client crashes: the client needs to recover itself and ask for
> the transactions after the data it has confirmed.
>
> I assume that once the client confirms that some SCN has been processed,
> OpenLogReplicator can remove it from the cache, and it is not possible that
> after a reconnect the client would demand data that it previously declared
> as confirmed.
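>
> Put together with the sketches above, the recovery logic on the receiver
> side is just a loop (again, an illustration only):
>
>     import com.fasterxml.jackson.databind.ObjectMapper;
>     import java.nio.charset.StandardCharsets;
>
>     // Sketch of the reconnect behaviour described above: after any failure
>     // the client reconnects and resumes from the last confirmed SCN, so no
>     // data confirmed earlier ever needs to be replayed.
>     class ResumingConsumer {
>         private static final ObjectMapper MAPPER = new ObjectMapper();
>
>         void run(String host, int port, IdempotentReceiver receiver) {
>             while (true) {                              // reconnect forever
>                 try (OlrStreamClient client = new OlrStreamClient(host, port)) {
>                     client.start(receiver.lastConfirmedScn() + 1);
>                     while (true) {
>                         String json = new String(client.next(), StandardCharsets.UTF_8);
>                         long scn = MAPPER.readTree(json).path("scn").asLong();
>                         receiver.onMessage(scn, json);  // skips duplicates, confirms
>                     }
>                 } catch (Exception e) {
>                     try { Thread.sleep(1000); }         // back off, then retry
>                     catch (InterruptedException ie) { return; }
>                 }
>             }
>         }
>     }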
>
> Well,
> this is what is currently done. Some code was driven by requests from the
> Debezium team towards future integration, like support for Protobuf or
> putting extra data into the payload. But it has never been used.
> We have opened a ticket in their Jira for integration:
> https://issues.redhat.com/projects/DBZ/issues/DBZ-2543?filter=allopenissues
> But there is no progress and no feedback on whether they want to do the
> integration or not. I have made some effort to allow easier integration, but
> I’m not going to write Kafka Connect code for OpenLogReplicator. I just don’t
> have the resources for that. I think they are focused on their own approach
> with LogMiner, waiting for OpenLogReplicator to become more mature before any
> integration is done. If you want to make the Flink integration depend on the
> integration with Debezium, that may never happen.
>
> I was recently focused mostly on making the code stable and releasing version
> 1.0, and I have achieved that point. I am not aware of any problems with the
> code that is currently working. The code aims to be modular and to allow easy
> integration, but as you mentioned, there is no SDK. Actually, this is the
> topic that I would like to talk about. Is there a reason for some SDK? Would
> someone find it useful? Maybe just plain Kafka is enough. Maybe it would be
> best if someone took the code and rewrote it in Java? But definitely not me -
> I would find that nonsense. The Java code would suffer.
>
> What kind of interface would be best for Flink?
> OpenLogReplicator produces the payload in protobuf or JSON. If you want to
> use, for example, XML, it would be a waste to write a converter; I would
> definitely prefer to add another writer module that would just produce XML
> instead. If you need a certain format - this is no problem.
>
> But if you want a full initial data load (snapshot) - this can’t be
> done, because this project is not for that. You have your own good code.
>
> In practice, I think there would be just a few projects which could be
> receivers of data from OpenLogReplicator, so there is no reason to write a
> generic SDK for everybody.
>
> My goal was just to start a conversation - to discuss whether such an
> integration really makes sense or not. I really prefer a simple architecture,
> with as few conversions of data as necessary. Not that I would produce some
> format and you would convert it anyway. This way, replication from Oracle can
> be really fast.
>
> I’m just about to begin writing tutorials for OpenLogReplicator, and the
> documentation is out of date. I have a regular day job, which I need to pay
> the rent, and a family, and I work on this project just in my free time, so
> the progress is slow. I don’t expect that to change in the future. But in
> spite of that, I know companies who already use the code in production, and
> it works fast and stably. From the clients' perspective, it works 10 times
> faster than LogMiner - but this would depend on the actual case. You would
> need to run a benchmark and test it yourself.
>
>
> Regards,
> Adam
>
>
>
>
> > On 5 Jan 2023, at 09:41, Leonard Xu <xbjt...@gmail.com> wrote:
> >
> > Hi, Adam & Márton
> >
> > Thanks for bringing the discussion here.
> >
> > The Flink CDC project provides the Oracle CDC Connector, which can be used
> > to capture historical and transaction log data from the Oracle database and
> > ingest it into Flink. In the latest version, 2.3, the Oracle CDC Connector
> > already supports the parallel incremental snapshot algorithm,
> > which enables parallel reading of historical data and lock-free switching
> > from historical reading to transaction log reading. In the phase of
> > capturing transaction log data, the connector uses Debezium as the library,
> > which supports the LogMiner and XStream APIs to capture change data. IIUC,
> > OpenLogReplicator could be used as a third way.
> >
> > For integrating OpenLogReplicator, there are several interesting points
> > that we can discuss further:
> > (1) None of the Flink CDC connectors rely on Kafka or other message queue
> > storage; data is processed directly after capture. I think the
> > network stream way of OpenLogReplicator needs to be adapted better.
> > (2) The Flink CDC project is mainly developed in Java, as is Flink.
> > Does OpenLogReplicator provide a Java SDK for easy integration?
> > (3) If OpenLogReplicator has a plan to be integrated into the Debezium
> > project first, the Flink CDC project can easily integrate
> > OpenLogReplicator by bumping the Debezium version.
> >
> > Best,
> > Leonard
>
> > On 5 Jan 2023, at 04:15, Jark Wu <imj...@gmail.com> wrote:
> >
> > Hi Adam,
> >
> > Thanks for sharing this interesting project. I think it definitely is
> > valuable for users who need better speed.
> >
> > I am one of the maintainers of the flink-cdc-connectors project. The
> > project offers an “oracle-cdc” connector which uses Debezium (which depends
> > on LogMiner) as the CDC library. From the perspective of the “oracle-cdc”
> > connector, I have some questions about OpenLogReplicator:
> >
> > 1) Can OpenLogReplicator provide a Java SDK to allow Flink to communicate
> > with the Oracle server directly without deploying any other service?
> > 2) How much overhead does it put on Oracle compared to the LogMiner
> > approach?
> > 3) Did you discuss this with the Debezium community? I think Debezium might
> > be interested in this project as well.
> >
> >
> > Best,
> > Jark
> >
> >> On 5 Jan 2023, at 07:32, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> >>
> >> Hi Márton,
> >>
> >> Thank you very much for your answer.
> >>
> >> The point with Kafka makes sense. It offers a huge bag of potential
> >> connectors that could be used.
> >> But … not everybody wants or needs Kafka. It brings additional
> >> architectural
> >> complication and delays, which might not be acceptable to everybody.
> >> That’s why you have your own connectors anyway.
> >>
> >> The Flink connector which reads from Oracle utilizes the LogMiner
> >> technology, which
> >> is not acceptable for every user. It has big limitations regarding speed.
> >> You can overcome that only with a binary reader of the database redo log
> >> (like 10 times
> >> faster, with a delay even down to 50-100 ms).
> >>
> >> The reason I am asking is not just to create some additional connector
> >> just for fun.
> >> My main concern is whether there is actual demand from users for getting
> >> changes from the source database faster, or with lower delay.
> >> You can find a lot of information on the net about the differences between
> >> a log-based reader and
> >> one which uses the LogMiner technology.
> >>
> >> I think that would be enough for a start. Please tell me what you think
> >> about it.
> >> Would anyone consider using such a connector?
> >>
> >> Regards,
> >> Adam Leszczyński
> >>
> >>
> >>> On 4 Jan 2023, at 12:07, Márton Balassi <balassi.mar...@gmail.com> wrote:
> >>>
> >>> (cc Leonard)
> >>>
> >>> Hi Adam,
> >>>
> >>> From an architectural perspective, if you land the records in Kafka or
> >>> another
> >>> message broker, Flink will be able to process them; at this point I do
> >>> not see much merit in trying to circumvent this step.
> >>> There is a related project in the Flink space called CDC Connectors [1];
> >>> I highly encourage you to check that out for context, and I have cc'd
> >>> Leonard, one of its primary maintainers.
> >>>
> >>> [1] https://github.com/ververica/flink-cdc-connectors/
> >>>
> >>> On Tue, Jan 3, 2023 at 8:40 PM Adam Leszczyński <aleszczyn...@bersler.com>
> >>> wrote:
> >>>
> >>>> Hi Flink Team,
> >>>>
> >>>> I’m the author of OpenLogReplicator - an open source parser of Oracle
> >>>> redo logs which allows sending transactions
> >>>> to some message bus. Currently the implemented sinks are just a text
> >>>> file or a Kafka topic.
> >>>> Transactions can also be sent using a plain TCP connection or some
> >>>> message queue like ZeroMQ.
> >>>> The code is GPL, and all versions from 11.2 are supported. No LogMiner
> >>>> needed.
> >>>>
> >>>> Transactions can be sent in JSON or protobuf format. The code
> >>>> has now reached GA and is actually used in production.
> >>>>
> >>>> The architecture is modular and makes it very easy to add other sinks,
> >>>> like,
> >>>> for example, Apache Flink.
> >>>> Actually, I’m moving towards an approach where OpenLogReplicator could
> >>>> use Kubernetes and work in HA.
> >>>>
> >>>> Well… that is the general direction. Do you think there could be some
> >>>> application of this software with Apache Flink?
> >>>> For example, there could very easily be some client which connects to
> >>>> OpenLogReplicator using a TCP connection,
> >>>> gets transactions, and just sends them to Apache Flink. An example of
> >>>> such a client is also present in the GitHub repo:
> >>>> https://github.com/bersler/OpenLogReplicator
> >>>>
> >>>> Is there any rationale for such an integration? Or is it just a waste of
> >>>> time, because nobody would use it anyway?
> >>>>
> >>>> Kind regards,
> >>>> Adam Leszczyński
> >>>>
> >>>>
> >>
> >
>
