Hi Martijn,

It is not reasonable to merge the Flink and OpenLogReplicator code. A simple network connection is enough, and the projects can remain separate. You could have client/SDK code for OpenLogReplicator (like StreamClient.cpp in the repository) licensed under Apache 2.0; that can be a separate project.
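
Roughly, such a client could look like the sketch below. Everything in it is made up for illustration (the port, the request message, the framing); the actual wire protocol is whatever StreamClient.cpp and the network writer implement:

    // client_sketch.cpp -- illustrative only, NOT the real OpenLogReplicator protocol.
    // Connects over plain TCP, asks to stream from a starting SCN, prints payloads.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return 1;
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9000);                     // example port only
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr); // example address only
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
            std::cerr << "connect failed\n";
            return 1;
        }

        // Hypothetical request: "stream from this SCN", 0 = oldest available.
        uint64_t startScn = 0;
        if (write(fd, &startScn, sizeof(startScn)) != static_cast<ssize_t>(sizeof(startScn)))
            return 1;

        // Hypothetical framing: 4-byte length prefix, then one JSON payload.
        for (;;) {
            uint32_t len = 0;
            if (read(fd, &len, sizeof(len)) != static_cast<ssize_t>(sizeof(len)))
                break;                                   // connection closed or broken
            std::vector<char> buf(len);
            size_t got = 0;
            while (got < len) {
                ssize_t n = read(fd, buf.data() + got, len - got);
                if (n <= 0) { close(fd); return 1; }
                got += static_cast<size_t>(n);
            }
            std::cout << std::string(buf.begin(), buf.end()) << '\n';
        }
        close(fd);
        return 0;
    }

Kept that thin, the client knows nothing about redo logs, only about the payload format it asked for, so it could live in its own Apache 2.0 repository without touching any GPL code.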

Regards,
Adam

> On 9 Jan 2023, at 09:31, Martijn Visser <martijnvis...@apache.org> wrote:
>
> Hi Adam,
>
> Just a side note: since your code is licensed under GPL, I believe it can't be included with Flink, because GPL licenses are considered category X and can't be included with Apache-licensed projects [1].
>
> Best regards,
>
> Martijn
>
> [1] https://www.apache.org/legal/resolved.html#category-x
>
> On Mon, Jan 9, 2023 at 12:47 AM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>
>> Hi Gunnar,
>>
>> Thank you very much for your help. I really appreciate it.
>>
>> I believe you might be right. But Flink still has its own connector for Oracle which does not require Debezium. I'm not an expert and don't have a wide view of how most customer sites work.
>>
>> My intention was just to show clearly what the current state of development is.
>>
>> I'm getting close to the point of deciding whether OpenLogReplicator should also have a commercial version with some enterprise features, or whether I should reduce the time I spend on the project and just see what happens.
>>
>> Regards,
>> Adam
>>
>>> On 6 Jan 2023, at 13:18, Gunnar Morling <gunnar.morl...@googlemail.com.INVALID> wrote:
>>>
>>> Hey Adam, all,
>>>
>>> Just came across this thread, still remembering the good conversations we had around this while I was working on Debezium full-time :)
>>>
>>> Personally, I still believe the best way forward with this would be to add support to the Debezium connector for Oracle so it can ingest changes from a remote OpenLogReplicator instance via that server you've built. That way, you don't need to deal with any Kafka specifics, and users would inherit the existing functionality for backfilling, integration with Debezium Server (i.e. for non-Kafka scenarios like Apache Pulsar, Kinesis, etc.), and the Debezium engine (which is what Flink CDC is based on). The Debezium connector for Oracle is already built in a way that supports multiple stream ingestion adapters (currently LogMiner and XStream), so adding another one for OLR would be rather simple. This approach would simplify things a lot from a Flink (CDC) perspective.
>>>
>>> I've just pinged the folks over in the Debezium community on this; it would be great to see progress in this matter.
>>>
>>> Best,
>>>
>>> --Gunnar
>>>
>>>
>>> On Thu, 5 Jan 2023 at 20:55, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>
>>>> Thanks Leonard, Jark,
>>>>
>>>> I will just reply on the dev list for this topic, as it is more related to development. Sorry, I sent this to two lists; I don't want to add more chaos here.
>>>>
>>>> The answer to your question is not straightforward, so I will start with a broader picture.
>>>>
>>>> Maybe first I will describe some assumptions I chose while designing OpenLogReplicator. The project aims to be minimalistic: it should only contain the code necessary to parse Oracle redo logs. Nothing more; it should not be a fully functional replicator. So the targets are limited to middleware (like Kafka, Flink, some MQ). The number of dependencies is kept to a minimum.
>>>>
>>>> The second assumption is to make the project stateless wherever possible. The goal is to run in HA (Kubernetes) and store state in Redis (not yet implemented). But generally OpenLogReplicator should not, if possible, keep track of the position of data confirmed by the receiver.
>>>> This allows the receiver to choose how to handle failures (data duplicated on restart, idempotent messages).
>>>>
>>>> The third topic is the initial data load. There is plenty of software available for that, and there is absolutely no need to duplicate it in this project. No ETL, no selects, etc. My goal is just to track changes.
>>>>
>>>> The fourth assumption is to write the code in C++, so that it is fast and I have full control over memory. The code can fully reuse memory and also work on machines with little memory. This allows easy compilation on Linux, and maybe in the future also on Solaris, AIX, HP-UX, or even Windows (if there is demand for that). I think Java is good for some solutions, but not for a binary parser which works heavily with memory and in most cases uses a zero-copy approach.
>>>>
>>>> The amount of data in the output is actually defined by the source database (how much data is logged: the full schema or just the changed columns). I don't care; the user defines what the database logs. If it is just the primary key and the changed columns, I can send just the changed data. If someone wants the full schema in every payload, that is fine too. If the schema changes, no problem: I can provide the DDL commands and process further payloads with the new schema.
>>>>
>>>> The format of the data is defined by the receiver. My first choice was JSON. Then the Debezium guys asked me to support Protobuf, so I spent a lot of time and extended the architecture to make the code modular and allow choosing the payload format. The writer module can directly produce JSON or Protobuf payloads, and this can be extended to any other format if there is demand for it. The JSON output also allows many formatting options. I generally don't test the Protobuf code; I would treat it as a prototype, because I know of nobody who wants to use it. It was written for a Debezium request, but so far nobody cares.
>>>>
>>>> Integration with other systems and languages is an open case; I am agnostic here. The data produced for output is stored in a buffer and can be sent to any target. This is done by the Writer module (you can look at the code), and there are writers for Kafka, ZeroMQ, and even a plain TCP/IP network connection. I don't quite understand the question about adapting this better: if I get a specification, I can extend it. Say what you need.
>>>>
>>>> When we have a bidirectional connection (unlike with Kafka), the receiver can define the starting position (SCN) of the stream it wants to receive.
>>>>
>>>> You can look at the prototype code to see how this communication would work: StreamClient.cpp. Please treat it as a working prototype. It is a client which connects to OpenLogReplicator over the network, defines the starting SCN, and then just receives payloads.
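>>>>
>>>> To sketch what that contract could look like from the receiver's side (all names here are invented for illustration; the real handshake is whatever StreamClient.cpp actually implements):
>>>>
>>>>     // Illustrative only: the receiver owns the replication position,
>>>>     // which is what lets OpenLogReplicator itself stay stateless.
>>>>     #include <cstdint>
>>>>
>>>>     struct ReceiverPosition {
>>>>         uint64_t lastConfirmedScn = 0; // persisted by the receiver, e.g. in its checkpoint
>>>>
>>>>         // On first connect and on every reconnect, ask to stream from here.
>>>>         // Anything between this SCN and a crash may arrive again (at-least-once).
>>>>         uint64_t resumeFrom() const { return lastConfirmedScn; }
>>>>
>>>>         // Called after a payload has been durably processed; once confirmed,
>>>>         // OpenLogReplicator may drop the transaction from its cache.
>>>>         void confirm(uint64_t scn) { lastConfirmedScn = scn; }
>>>>     };
>>>>
>>>> The failure cases below all reduce to this: whichever side restarts, the receiver restates its last confirmed SCN and the stream continues from there.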
>>>>
>>>> In case of failure:
>>>> - The connection is broken: the client reconnects, states the last confirmed position, and asks for the following transactions.
>>>> - OpenLogReplicator crashes: after the restart, the client states the last confirmed position and asks for the following transactions.
>>>> - The client crashes: the client recovers itself and asks for the transactions after the position it has confirmed.
>>>>
>>>> I assume that once the client confirms some SCN as processed, OpenLogReplicator can remove it from its cache, and it is not possible that after a reconnect the client demands data it previously declared as confirmed.
>>>>
>>>> Well, this is what is currently done. Some code was driven by requests from the Debezium team towards a future integration, like support for Protobuf or putting extra data in the payload, but it was never used. We have opened a ticket in their Jira for the integration:
>>>> https://issues.redhat.com/projects/DBZ/issues/DBZ-2543?filter=allopenissues
>>>> But there is no progress, and no feedback on whether they want the integration or not. I have made some effort to allow easier integration, but I'm not going to write Kafka Connect code for OpenLogReplicator; I just don't have the resources for that. I think they are focused on their own approach with LogMiner, waiting for OpenLogReplicator to become more mature before any integration is done. If you make the Flink integration depend on the Debezium integration, it may never happen.
>>>>
>>>> I was recently focused mostly on making the code stable and releasing version 1.0, and I have reached that point. I am not aware of any problems with the code that is currently working. The code aims to be modular and allow easy integration, but as you mentioned, there is no SDK. Actually, this is the topic I would like to talk about. Is there a reason for an SDK? Would someone find it useful? Maybe plain Kafka is enough. Maybe it would be best if someone took the code and rewrote it in Java? But definitely not me; I would find that nonsense, and the Java code would suffer.
>>>>
>>>> What kind of interface would be best for Flink? OpenLogReplicator produces payloads in Protobuf or JSON. If you want to use, for example, XML, it would be a waste to write a converter; I would definitely prefer to add another writer module that produces XML directly. If you need a certain format, that is no problem.
>>>>
>>>> But if you want a full initial data load (snapshot), that can't be done, because this project is not for that. You have your own good code for it.
>>>>
>>>> In practice, I think there would be just a few projects which could be receivers of data from OpenLogReplicator, and there is no reason to write a generic SDK for everybody.
>>>>
>>>> My goal was just to start a conversation and discuss whether such an integration really makes sense or not. I really prefer a simple architecture, with as few data conversions as necessary; not a setup where I produce some format and you convert it anyway. This way replication from Oracle can be really fast.
>>>>
>>>> I'm just about to begin writing tutorials for OpenLogReplicator, and the documentation is out of date. I have a regular day job which I need to pay the rent, and a family, and I work on this project only in my free time, so progress is slow. I don't expect that to change in the future.
>>>> But in spite of that, I know of companies who already use the code in production, and it works fast and stably. From clients' perspective it works 10 times faster than LogMiner, but that depends on the actual case; you would need to run a benchmark and test it yourself.
>>>>
>>>> Regards,
>>>> Adam
>>>>
>>>>> On 5 Jan 2023, at 09:41, Leonard Xu <xbjt...@gmail.com> wrote:
>>>>>
>>>>> Hi Adam & Márton,
>>>>>
>>>>> Thanks for bringing the discussion here.
>>>>>
>>>>> The Flink CDC project provides the Oracle CDC Connector, which can be used to capture historical and transaction log data from the Oracle database and ingest it into Flink. In the latest version, 2.3, the Oracle CDC Connector already supports the parallel incremental snapshot algorithm, which enables parallel reading of historical data and lock-free switching from historical reading to transaction log reading. In the phase of capturing transaction log data, the connector uses Debezium as the library, which supports the LogMiner and XStream APIs to capture change data. IIUC, OpenLogReplicator could be used as a third way.
>>>>>
>>>>> For integrating OpenLogReplicator, there are several interesting points that we can discuss further:
>>>>> (1) None of the Flink CDC connectors rely on Kafka or other message queue storage; captured data is processed directly. I think the network stream way of OpenLogReplicator needs to be adapted better.
>>>>> (2) The Flink CDC project, like Flink itself, is mainly developed in Java. Does OpenLogReplicator provide a Java SDK for easy integration?
>>>>> (3) If OpenLogReplicator plans to be integrated into the Debezium project first, the Flink CDC project can easily integrate OpenLogReplicator by bumping the Debezium version.
>>>>>
>>>>> Best,
>>>>> Leonard
>>>>
>>>>> On 5 Jan 2023, at 04:15, Jark Wu <imj...@gmail.com> wrote:
>>>>>
>>>>> Hi Adam,
>>>>>
>>>>> Thanks for sharing this interesting project. I think it is definitely valuable for users who want better speed.
>>>>>
>>>>> I am one of the maintainers of the flink-cdc-connectors project. The project offers an “oracle-cdc” connector which uses Debezium (which depends on LogMiner) as the CDC library. From the perspective of the “oracle-cdc” connector, I have some questions about OpenLogReplicator:
>>>>>
>>>>> 1) Can OpenLogReplicator provide a Java SDK to allow Flink to communicate with the Oracle server directly, without deploying any other service?
>>>>> 2) How much overhead does it put on Oracle compared to the LogMiner approach?
>>>>> 3) Did you discuss this with the Debezium community? I think Debezium might be interested in this project as well.
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>>> On 5 Jan 2023, at 07:32, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>>>
>>>>>> Hi Márton,
>>>>>>
>>>>>> Thank you very much for your answer.
>>>>>>
>>>>>> The point about Kafka makes sense. It offers a huge bag of potential connectors that could be used. But… not everybody wants or needs Kafka. It brings additional architectural complication and delays, which might not be acceptable to everybody. That's why you do have your own connectors anyway.
>>>>>>
>>>>>> The Flink connector which reads from Oracle uses the LogMiner technology, which is not acceptable for every user: it has big limitations regarding speed.
>>>>>> You can overcome that only with a binary reader of the database redo log (roughly 10 times faster, with latency even down to 50-100 ms).
>>>>>>
>>>>>> The reason I am asking is not just to create some additional connector for fun. My main concern is whether there is actual demand from users for getting changes from the source database faster, or with lower latency. You can find a lot of information on the net about the differences between a binary log reader and one using the LogMiner technology.
>>>>>>
>>>>>> I think that would be enough for a start. Please tell me what you think about it. Would anyone consider using such a connector?
>>>>>>
>>>>>> Regards,
>>>>>> Adam Leszczyński
>>>>>>
>>>>>>> On 4 Jan 2023, at 12:07, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>>>
>>>>>>> (cc Leonard)
>>>>>>>
>>>>>>> Hi Adam,
>>>>>>>
>>>>>>> From an architectural perspective, if you land the records in Kafka or another message broker, Flink will be able to process them; at this point I do not see much merit in trying to circumvent that step. There is a related project in the Flink space called CDC Connectors [1]. I highly encourage you to check it out for context, and I have cc'd Leonard, one of its primary maintainers.
>>>>>>>
>>>>>>> [1] https://github.com/ververica/flink-cdc-connectors/
>>>>>>>
>>>>>>> On Tue, Jan 3, 2023 at 8:40 PM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>>>>
>>>>>>>> Hi Flink Team,
>>>>>>>>
>>>>>>>> I'm the author of OpenLogReplicator, an open-source parser of Oracle redo logs which can send transactions to a message bus. Currently the implemented sinks are just a text file and a Kafka topic. Transactions can also be sent over a plain TCP connection or through a message queue like ZeroMQ. The code is GPL, and all Oracle versions from 11.2 are supported. No LogMiner needed.
>>>>>>>>
>>>>>>>> Transactions can be sent in JSON or Protobuf format. The code has reached GA and is actually used in production.
>>>>>>>>
>>>>>>>> The architecture is modular and makes it very easy to add other sinks, for example for Apache Flink. Actually, I'm moving towards an approach where OpenLogReplicator could use Kubernetes and work in HA.
>>>>>>>>
>>>>>>>> Well… that is the general direction. Do you think there could be some application of this software with Apache Flink? For example, there could very easily be a client which connects to OpenLogReplicator over a TCP connection, gets the transactions, and just sends them to Apache Flink. An example of such a client is also present in the GitHub repo:
>>>>>>>> https://github.com/bersler/OpenLogReplicator
>>>>>>>>
>>>>>>>> Is there any rationale for such an integration? Or is it just a waste of time, because nobody would use it anyway?
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Adam Leszczyński