Hi Adam,

Just a side note: since your code is licensed under the GPL, I believe it can't be included with Flink, because GPL licenses are considered Category X and can't be included in Apache-licensed projects [1].
Best regards,

Martijn

[1] https://www.apache.org/legal/resolved.html#category-x

On Mon, Jan 9, 2023 at 12:47 AM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> Hi Gunnar,
>
> Thank you very much for your help. I really appreciate it.
>
> I believe you might be right. But Flink still has its own connector to Oracle which does not require Debezium. I'm not an expert and don't have a wider view of how most customer sites work.
>
> My intention was just to clearly show what the current state of development is.
>
> I'm getting close to the point of deciding whether OpenLogReplicator should also have a commercial version with some enterprise features or not, or whether I should maybe reduce the time I spend on the project and just see what happens.
>
> Regards,
> Adam
>
>
> On 6 Jan 2023, at 13:18, Gunnar Morling <gunnar.morl...@googlemail.com.INVALID> wrote:
> >
> > Hey Adam, all,
> >
> > Just came across this thread, still remembering the good conversations we had around this while I was working on Debezium full-time :)
> >
> > Personally, I still believe the best way forward with this would be to add support to the Debezium connector for Oracle so it can ingest changes from a remote OpenLogReplicator instance via that server you've built. That way, you don't need to deal with any Kafka specifics, and users would inherit the existing functionality for backfilling, integration with Debezium Server (i.e. for non-Kafka scenarios like Apache Pulsar, Kinesis, etc.), and the Debezium engine (which is what Flink CDC is based on). The Debezium connector for Oracle is already built in a way that supports multiple stream ingestion adapters (currently LogMiner and XStream), so adding another one for OLR would be rather simple. This approach would simplify things a lot from a Flink (CDC) perspective.
> >
> > I've just pinged the folks over in the Debezium community on this; it would be great to see progress on this matter.
> >
> > Best,
> >
> > --Gunnar
> >
> >
> > On Thu, 5 Jan 2023 at 20:55, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> >>
> >> Thanks Leonard, Jark,
> >>
> >> I will just reply on the dev list for this topic, as it is more related to development. Sorry, I sent to two lists; I don't want to add more chaos here.
> >>
> >> The answer to your question is not straightforward, so I will start with a broader picture.
> >>
> >> Maybe first I will describe some assumptions that I made while designing OpenLogReplicator. The project aims to be minimalistic. It should only contain the code that is necessary to parse Oracle redo logs, nothing more; it should not be a fully functional replicator. So the targets are limited to middleware (like Kafka, Flink, some MQ). The number of dependencies is kept to a minimum.
> >>
> >> The second assumption is to make the project stateless wherever possible. The goal is to run it in HA (Kubernetes) and store state in Redis (not yet implemented). But generally OpenLogReplicator should not (if possible) keep the information about the position of the data confirmed by the receiver. This allows the receiver to choose how to handle failures (data duplicated on restart, idempotent messages).
> >>
> >> The third topic is initial data load. There is plenty of software available for that, and there is absolutely no need to duplicate it in this project. No ETL, selects, etc. My goal is just to track changes.
> >>
> >> The fourth assumption is to write the code in C++, so that it is fast and I have full control over memory. The code can fully reuse memory and also work on machines with little memory. This allows easy compilation on Linux, and maybe in the future also on Solaris, AIX, HP-UX, or even Windows (if there is demand for that). I think Java is good for some solutions, but not for a binary parser which works heavily with memory and in most cases uses a zero-copy approach.
> >>
> >> The amount of data in the output is actually defined by the source database (how much data is logged: the full schema or just the changed columns). I don't care; the user defines what is logged by the database. If it is just the primary key and the changed columns, I can send just the changed data. If someone wants the full schema in every payload, that is fine too. If the schema changes, no problem: I can provide just the DDL commands and process further payloads with the new schema.
> >>
> >> The format of the data is actually defined by the receiver. My first choice was JSON. Next, the Debezium guys asked me to support Protobuf. OK, I spent a lot of time and extended the architecture to make the code modular and allow choosing the format of the payload. The writer module can directly produce a JSON or Protobuf payload, and that can be extended to any other format if there is demand for it. The JSON format also allows many formatting options. I generally don't test the Protobuf code; I would treat it as a prototype, because I know nobody who would like to use it. This code was planned for the Debezium request, but so far nobody cares.
> >>
> >> Integration with other systems and languages is an open case; I am agnostic here. The data produced for output is stored in a buffer and can be sent to any target. This is done by the Writer module (you can look at the code), and there are writers for Kafka, ZeroMQ, and even a plain TCP/IP network connection. I don't understand the question about adapting that better. If I have a specification, I can extend it. Say what you need.
> >>
> >> In the case where we have a bidirectional connection (unlike with Kafka), the receiver can define the starting position (SCN) of the stream he/she wants to receive.
> >>
> >> You can look at the prototype code to see how this communication would look: StreamClient.cpp, but please treat it as a working prototype. This is a client which just connects to OpenLogReplicator over the network, defines the starting SCN, and then just receives the payload.
> >>
> >> In case:
> >> - The connection is broken: the client reconnects, reports the last confirmed data, and just asks for the following transactions.
> >> - OpenLogReplicator crashes: after restart, the client reports the last confirmed data and asks for the following transactions.
> >> - The client crashes: the client needs to recover itself and ask for the transactions that come after the data it has already confirmed.
> >>
> >> I assume that if the client confirms some SCN as processed, OpenLogReplicator can remove it from the cache, and it is not possible that after a reconnect the client would demand data that it previously declared as confirmed.
> >>
> >> Well, this is what is currently done. Some code was driven by requests from the Debezium team towards future integration, like support for Protobuf or putting some data into the payload, but it was never used.
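> >>
> >> To make the network stream idea more concrete: a receiver along the lines of StreamClient.cpp could look roughly like the sketch below on the Java side. This is only an illustration; the host, port, message framing, and field names here are made up, and the real wire format is whatever the OpenLogReplicator network writer implements.
> >>
> >> import java.io.BufferedReader;
> >> import java.io.InputStreamReader;
> >> import java.io.PrintWriter;
> >> import java.net.Socket;
> >> import java.nio.charset.StandardCharsets;
> >>
> >> public class OlrStreamClientSketch {
> >>     public static void main(String[] args) throws Exception {
> >>         // Recovered from the receiver's own state; OpenLogReplicator itself stays stateless.
> >>         long lastConfirmedScn = 0L;
> >>         // Hypothetical host and port, just for illustration.
> >>         try (Socket socket = new Socket("olr-host", 9000);
> >>              PrintWriter out = new PrintWriter(socket.getOutputStream(), true, StandardCharsets.UTF_8);
> >>              BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
> >>
> >>             // Ask for everything after the last SCN this receiver has already confirmed
> >>             // (the request shape is invented here; the real protocol is defined by the writer module).
> >>             out.println("{\"start-scn\": " + lastConfirmedScn + "}");
> >>
> >>             String payload;
> >>             while ((payload = in.readLine()) != null) {
> >>                 // Each payload is one change event (JSON in this sketch); hand it to Flink, Kafka, etc.
> >>                 handle(payload);
> >>                 // Once the event is durably handled, remember its SCN so a reconnect
> >>                 // can resume from here; unconfirmed events may simply be re-delivered.
> >>                 lastConfirmedScn = extractScn(payload);
> >>             }
> >>         }
> >>     }
> >>
> >>     private static void handle(String json) { System.out.println(json); }
> >>     private static long extractScn(String json) { return 0L; /* parse the "scn" field in a real client */ }
> >> }
> >>
> >> On a reconnect after any of the failures listed above, the same client would just send its last confirmed SCN again and continue from there, which matches the duplicated-on-restart / idempotent-message model I described.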
> >>
> >> We have opened a ticket in their Jira for the integration: https://issues.redhat.com/projects/DBZ/issues/DBZ-2543?filter=allopenissues
> >> But there is no progress and no feedback on whether they want to do the integration or not. I have made some effort to allow easier integration, but I'm not going to write Kafka Connect code for OpenLogReplicator; I just don't have the resources for that. I think they have focused on their own approach with LogMiner, waiting for OpenLogReplicator to become more mature before any integration is done. If you want to make the Flink integration depend on the integration with Debezium, that may never happen.
> >>
> >> Recently I was focused mostly on making the code stable and releasing version 1.0, and I have reached that point. I am not aware of any problems with the code that is currently working. The code is meant to be modular and allow easy integration, but as you mentioned there is no SDK. Actually, this is the topic that I would like to talk about: is there a reason for an SDK? Would someone find it useful? Maybe just plain Kafka is enough. Maybe it would be best if someone took the code and rewrote it in Java? But definitely not me; I would find that nonsense, and the Java code would suffer.
> >>
> >> What kind of interface would be best for Flink? OpenLogReplicator produces the payload in Protobuf or JSON. If you want to use, for example, XML, it would be a waste to write a converter; I would definitely prefer to add another writer module that just produces XML instead. If you need a certain format, that is no problem.
> >>
> >> But if you want a full initial data load (snapshot), that can't be done, because this project is not for that. You have your own good code.
> >>
> >> In practice I think there would be just a few projects which could be receivers of data from OpenLogReplicator, and there is no reason to write a generic SDK for everybody.
> >>
> >> My goal was just to start a conversation and discuss whether such an integration really makes sense or not. I really prefer a simple architecture, with as few data conversions as necessary; not a setup where I produce some format and you convert it anyway. This way replication from Oracle can be really fast.
> >>
> >> I'm just about to begin writing tutorials for OpenLogReplicator, and the documentation is out of date. I have a regular daily job which I need to pay the rent, and a family, and I work on this project just in my free time, so progress is slow. I don't expect that to change in the future. But in spite of that, I know companies who already use the code in production, and it works fast and stably. The clients' perspective is that it works 10 times faster than LogMiner, but this depends on the actual case; you would need to run a benchmark and test it yourself.
> >>
> >> Regards,
> >> Adam
> >>
> >>
> >>> On 5 Jan 2023, at 09:41, Leonard Xu <xbjt...@gmail.com> wrote:
> >>>
> >>> Hi Adam & Márton,
> >>>
> >>> Thanks for bringing the discussion here.
> >>>
> >>> The Flink CDC project provides the Oracle CDC Connector, which can be used to capture historical and transaction log data from the Oracle database and ingest it into Flink. In the latest version, 2.3, the Oracle CDC Connector already supports the parallel incremental snapshot algorithm, which enables parallel reading of historical data and lock-free switching from historical reading to transaction log reading.
> >>> In the phase of capturing transaction log data, the connector uses Debezium as the library, which supports the LogMiner and XStream APIs to capture change data. IIUC, OpenLogReplicator could be used as a third way.
> >>>
> >>> For integrating OpenLogReplicator, there are several interesting points that we can discuss further:
> >>> (1) The Flink CDC connectors do not rely on Kafka or other message queue storage; the captured data is processed directly. I think the network stream approach of OpenLogReplicator would need to be adapted to fit this better.
> >>> (2) The Flink CDC project is mainly developed in Java, as is Flink. Does OpenLogReplicator provide a Java SDK for easy integration?
> >>> (3) If OpenLogReplicator plans to be integrated into the Debezium project first, the Flink CDC project could easily integrate OpenLogReplicator by bumping the Debezium version.
> >>>
> >>> Best,
> >>> Leonard
> >>
> >>> On 5 Jan 2023, at 04:15, Jark Wu <imj...@gmail.com> wrote:
> >>>
> >>> Hi Adam,
> >>>
> >>> Thanks for sharing this interesting project. I think it is definitely valuable for users who need better speed.
> >>>
> >>> I am one of the maintainers of the flink-cdc-connectors project. The project offers an "oracle-cdc" connector which uses Debezium (which depends on LogMiner) as the CDC library. From the perspective of the "oracle-cdc" connector, I have some questions about OpenLogReplicator:
> >>>
> >>> 1) Can OpenLogReplicator provide a Java SDK that allows Flink to communicate with the Oracle server directly, without deploying any other service?
> >>> 2) How much overhead does it put on Oracle compared to the LogMiner approach?
> >>> 3) Did you discuss this with the Debezium community? I think Debezium might be interested in this project as well.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>>> On 5 Jan 2023, at 07:32, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> >>>>
> >>>> Hi Márton,
> >>>>
> >>>> Thank you very much for your answer.
> >>>>
> >>>> The point about Kafka makes sense. It offers a huge bag of potential connectors that could be used. But … not everybody wants or needs Kafka. This brings additional architectural complication and delays, which might not be acceptable to everybody. That's why you have your own connectors anyway.
> >>>>
> >>>> The Flink connector which reads from Oracle uses LogMiner technology, which is not acceptable for every user. It has a big limitation regarding speed. You can overcome that only with a binary reader of the database redo logs (roughly 10 times faster, with delays even down to 50-100 ms).
> >>>>
> >>>> The reason I am asking is not just to create some additional connector for fun. My main concern is whether there is actual demand from users for getting changes from the source database faster, or with lower delay. You can find a lot of information on the net about the differences between a binary redo log reader and one which uses LogMiner technology.
> >>>>
> >>>> I think that would be enough for a start. Please tell me what you think about it. Would anyone consider using such a connector?
> >>>>
> >>>> Regards,
> >>>> Adam Leszczyński
> >>>>
> >>>>
> >>>>> On 4 Jan 2023, at 12:07, Márton Balassi <balassi.mar...@gmail.com> wrote:
> >>>>>
> >>>>> (cc Leonard)
> >>>>>
> >>>>> Hi Adam,
> >>>>>
> >>>>> From an architectural perspective, if you land the records in Kafka or another message broker, Flink will be able to process them; at this point I do not see much merit in trying to circumvent this step.
> >>>>> There is a related project in the Flink space called CDC Connectors [1]; I highly encourage you to check that out for context, and I have cc'd Leonard, one of its primary maintainers.
> >>>>>
> >>>>> [1] https://github.com/ververica/flink-cdc-connectors/
> >>>>>
> >>>>> On Tue, Jan 3, 2023 at 8:40 PM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
> >>>>>
> >>>>>> Hi Flink Team,
> >>>>>>
> >>>>>> I'm the author of OpenLogReplicator, an open source parser of Oracle redo logs which allows sending transactions to a message bus. Currently the implemented sinks are just a text file and a Kafka topic. Transactions can also be sent over a plain TCP connection or via a message queue like ZeroMQ. The code is GPL, and all Oracle versions from 11.2 are supported. No LogMiner needed.
> >>>>>>
> >>>>>> Transactions can be sent in JSON or Protobuf format. The code has now reached GA and is actually used in production.
> >>>>>>
> >>>>>> The architecture is modular and makes it very easy to add other sinks, for example for Apache Flink. Actually, I'm moving towards an approach where OpenLogReplicator could use Kubernetes and work in HA.
> >>>>>>
> >>>>>> Well… that is the general direction. Do you think there could be some application of this software with Apache Flink? For example, there could very easily be a client which connects to OpenLogReplicator over a TCP connection, gets the transactions, and just sends them to Apache Flink. An example of such a client is also present in the GitHub repo:
> >>>>>> https://github.com/bersler/OpenLogReplicator
> >>>>>>
> >>>>>> Is there any rationale for such an integration? Or is it just a waste of time because nobody would use it anyway?
> >>>>>>
> >>>>>> Kind regards,
> >>>>>> Adam Leszczyński