Hello, Alexey.

Sorry for the long answer.

>   - The interface exposes WALRecord which is a private API

Now it's fixed.
The CDC consumer should use a public API to get notifications about data changes.
This API can be found in the IEP [1] and the PR [2]:

```
@IgniteExperimental
public interface DataChangeListener<K, V> {
    /** @return Unique id of the consumer. */
    String id();

    /** Invoked when CDC starts the consumer. */
    void start(IgniteConfiguration configuration, IgniteLogger log);

    /** @return {@code True} if keys and values should be kept in binary form. */
    boolean keepBinary();

    /** Invoked with each batch of data change events. */
    boolean onChange(Iterable<EntryEvent<K, V>> events);

    /** Invoked when CDC stops the consumer. */
    void stop();
}

@IgniteExperimental
public interface EntryEvent<K, V> {
    K key();
    V value();

    /** @return {@code True} if the event was observed on the primary copy of the entry. */
    boolean primary();

    EntryEventType operation();
    long cacheId();
    long expireTime();
}
```
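
For illustration, a consumer of this API could look like the sketch below. The class 
name and behavior are mine, and I assume that returning true from `onChange` allows CDC 
to advance the committed offset; the real contract is described in the IEP [1]:

```
import org.apache.ignite.IgniteLogger;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical consumer: logs every change it receives.
public class LoggingDataChangeListener implements DataChangeListener<Integer, String> {
    private IgniteLogger log;

    @Override public String id() {
        return "logging-listener";
    }

    @Override public void start(IgniteConfiguration cfg, IgniteLogger log) {
        this.log = log;
    }

    @Override public boolean keepBinary() {
        return false; // Receive deserialized keys and values.
    }

    @Override public boolean onChange(Iterable<EntryEvent<Integer, String>> events) {
        for (EntryEvent<Integer, String> evt : events)
            log.info("op=" + evt.operation() + ", key=" + evt.key() +
                ", val=" + evt.value() + ", primary=" + evt.primary());

        // Assumption: returning true lets CDC advance the committed offset.
        return true;
    }

    @Override public void stop() {
        // No resources to release in this sketch.
    }
}
```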

> There is no way to start capturing changes from a certain point
> If a CDC agent is restarted, it will have to start from scratch.

There is a way :).

CDC stores the processed offset in a special file.
In case of a CDC restart, changes will be captured starting from the last committed 
offset.
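
Roughly, the offset handling could look like this sketch (the file name and binary 
layout here are illustrative, not the actual implementation):

```
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical offset store: persists (WAL segment index, offset in segment).
public class CdcOffsetStore {
    private final Path stateFile;

    public CdcOffsetStore(Path cdcDir) {
        stateFile = cdcDir.resolve("cdc-offset.bin"); // Illustrative file name.
    }

    /** Commits the last fully processed position after the consumer handled a batch. */
    public void commit(long segmentIdx, int offset) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Integer.BYTES);
        buf.putLong(segmentIdx).putInt(offset).flip();

        try (FileChannel ch = FileChannel.open(stateFile, StandardOpenOption.CREATE,
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(buf);
            ch.force(true); // fsync, so a restart resumes from this position.
        }
    }

    /** Reads the committed position, or returns null on the very first start. */
    public long[] read() throws IOException {
        if (!Files.exists(stateFile))
            return null;

        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(stateFile));

        return new long[] {buf.getLong(), buf.getInt()};
    }
}
```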

> Users can configure a large size for WAL archive to sustain long node 
> downtime for historical rebalance

To address this issue, I propose to introduce a timeout that forces WAL segment 
archiving.
And yes, the event time gap and large WAL segments are trade-offs for real-world 
deployments.
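
As a sketch of what the configuration could look like once such a timeout exists (the 
property name below is illustrative of the proposal, not an existing setting):

```
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CdcWalConfigExample {
    public static IgniteConfiguration config() {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
            .setWalSegmentSize(64 * 1024 * 1024)
            // Proposed (hypothetical) property: roll over and archive the current WAL
            // segment after 60 seconds even if it is not full, bounding the CDC event lag.
            .setWalForceArchiveTimeout(60_000);

        return new IgniteConfiguration().setDataStorageConfiguration(dsCfg);
    }
}
```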

> - If a CDC reader does not keep up with the WAL write rate (e.g. there is a 
> short-term write burst and WAL archive is small), the Ignite node
> will delete WAL segments while the consumer is still reading it.

This is fixed now.
I implemented Pavel's proposal: on WAL rollover, if CDC is enabled, a hard link to 
the archived segment is created in a special folder.
So CDC can process the segment independently from the main Ignite process and delete 
the link when it is finished.
Note that the segment data will be removed from disk only after both CDC and 
Ignite remove their hard links to it.
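
In code terms, the idea is roughly the following (paths and method names are 
illustrative):

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CdcSegmentLinks {
    /** Node side: on WAL rollover (CDC enabled), link the archived segment into the CDC folder. */
    public static void linkForCdc(Path archivedSegment, Path cdcDir) throws IOException {
        Files.createDirectories(cdcDir);

        // A hard link: both directory entries point to the same data blocks on disk.
        Files.createLink(cdcDir.resolve(archivedSegment.getFileName()), archivedSegment);
    }

    /** CDC side: after the segment is fully consumed, drop the CDC link. */
    public static void release(Path cdcSegmentLink) throws IOException {
        // The segment data is freed only after Ignite deletes its own link as well.
        Files.delete(cdcSegmentLink);
    }
}
```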

> If Ignite node crashes, gets restarted and initiates full rebalance, the 
> consumer will lose some updates

I expect that consumers will be started on each cluster node.
So, there is no event loss here.

>  Usually, it makes sense for the CDC consumer to read updates only on  
> primary nodes

Makes sense.
Thanks.
I've added a `primary` flag to the DataEntry WAL record.
Take a look at PR [3].

>  the consumer would need to process backup records anyway because it is 
> unknown whether the primary consumer is alive. 

If CDC on some node is down, it will deliver the updates after restart.

I want to restrict the CDC scope to "deliver local WAL events to the consumer".
CDC itself is not responsible for the distributed consumer state.
It's up to the consumer to implement some kind of failover scenario to keep all 
CDC instances up and running.

And yes, it's expected that if CDC is down, the event lag grows.
To prevent this, the user can process all events, not only the primary ones.

> In other words, how would an end-user organize the CDC failover minimizing 
> the duplicate work?

1. To recover from a CDC application failure, a simple restart will work.
2. For now, the user can distinguish between primary and backup DataEntry records. 
This allows the user to avoid duplicate work and to recover from an Ignite node 
failure while the OS and the CDC application are still up (see the sketch after this 
list).
3. If it's required to keep a small event gap in case of a whole-server failure (OS, 
Ignite node, and CDC application are all down), then changes have to be processed on 
backup nodes as well, accepting some duplicate processing.
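
As a sketch of item 2, a consumer can simply skip backup events and handle only the 
primary ones (EntryEvent is the proposed interface from the beginning of this mail):

```
import java.util.ArrayList;
import java.util.List;

public class PrimaryFilter {
    /** Keeps only events observed on primary copies, dropping backup duplicates. */
    public static <K, V> List<EntryEvent<K, V>> primaryOnly(Iterable<EntryEvent<K, V>> events) {
        List<EntryEvent<K, V>> res = new ArrayList<>();

        for (EntryEvent<K, V> evt : events) {
            if (evt.primary())
                res.add(evt);
        }

        return res;
    }
}
```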

[1] 
https://cwiki.apache.org/confluence/display/IGNITE/IEP-59+CDC+-+Capture+Data+Change
[2] https://github.com/apache/ignite/pull/8360
[3] https://github.com/apache/ignite/pull/8377


> On 16 Oct 2020, at 14:19, Pavel Kovalenko <jokse...@gmail.com> wrote:
> 
> Alexey,
> 
>>> If a CDC agent is restarted, it will have to start from scratch
>>> If a CDC reader does not keep up with the WAL write rate (e.g. there
>   is a short-term write burst and WAL archive is small), the Ignite node
> will
>   delete WAL segments while the consumer is still reading it.
> 
> I think these cases can be resolved with the following approach:
> PostgreSQL can be configured to execute a shell command after WAL segment
> is archived. The same thing we can do for Ignite as well.
> A command can create a hardlink for such WAL segment to a specified
> directory to not lose it after deletion by Ignite and notify a CDC (or
> another kind of process) about this segment.
> That will be a filesystem queue, and after a restart CDC needs to process only
> segments located in this directory, so there is no need to start from scratch.
> When WAL segment is processed by CDC a hardlink from queue directory is
> deleted.
> 
> 
> 
> Fri, 16 Oct 2020 at 13:42, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> 
>> Hello Nikolay,
>> 
>> Thanks for the suggestion, it definitely may be a good feature, however, I
>> do not see any significant value that it currently adds to the already
>> existing WAL Iterator. I think the following issues should be addressed,
>> otherwise, no regular user will be able to use the CDC reliably:
>> 
>>   - The interface exposes WALRecord which is a private API
>>   - There is no way to start capturing changes from a certain point (a
>>   watermark for already processed data). Users can configure a large size
>> for
>>   WAL archive to sustain long node downtime for historical rebalance. If a
>>   CDC agent is restarted, it will have to start from scratch. I see that
>> it
>>   is present in the IEP as a design choice, but I think this is a major
>>   usability issue
>>   - If a CDC reader does not keep up with the WAL write rate (e.g. there
>>   is a short-term write burst and WAL archive is small), the Ignite node
>> will
>>   delete WAL segments while the consumer is still reading it. Since the
>>   consumer is running out-of-process, we need to specify some sort of
>>   synchronization protocol between the node and the consumer
>>   - If Ignite node crashes, gets restarted and initiates full rebalance,
>>   the consumer will lose some updates
>>   - Usually, it makes sense for the CDC consumer to read updates only on
>>   primary nodes (otherwise, multiple agents will be doing duplicate
>> work). In
>>   the current design, the consumer will not be able to differentiate
>>   primary/backup updates. Moreover, even if we wrote such flags to WAL,
>> the
>>   consumer would need to process backup records anyway because it is
>> unknown
>>   whether the primary consumer is alive. In other words, how would an end
>>   user organize the CDC failover minimizing the duplicate work?
>> 
>> 
>> Wed, 14 Oct 2020 at 14:21, Nikolay Izhikov <nizhi...@apache.org>:
>> 
>>> Hello, Igniters.
>>> 
>>> I want to start a discussion of the new feature [1]
>>> 
>>> CDC - capture data change. The feature allows the consumer to receive
>>> online notifications about data record changes.
>>> 
>>> It can be used in the following scenarios:
>>>        * Export data into some warehouse, full-text search, or
>>> distributed log system.
>>>        * Online statistics and analytics.
>>>        * Wait and respond to some specific events or data changes.
>>> 
>>> Propose to implement new IgniteCDC application as follows:
>>>        * Run on the server node host.
>>>        * Watches for the appearance of the WAL archive segments.
>>>        * Iterates it using existing WALIterator and notifies consumer of
>>> each record from the segment.
>>> 
>>> IgniteCDC features:
>>>        * Independence from the server node process (JVM) - issues and
>>> failures of the consumer will not lead to server node instability.
>>>        * Notification guarantees and failover - i.e. CDC track and save
>>> the pointer to the last consumed record. Continue notification from this
>>> pointer in case of restart.
>>>        * Resilience for the consumer - it's not an issue when a consumer
>>> temporarily consumes slower than data appear.
>>> 
>>> WDYT?
>>> 
>>> [1]
>>> 
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-59+CDC+-+Capture+Data+Change
>> 
