Re: [DISCUSSION] Scan mode in Table Store for Flink Stream and Batch job

Jingsong Li Fri, 09 Dec 2022 01:33:17 -0800

Thanks Shammon!

Your summary was very good and very detailed.


I thought about it again.

## Solution 1
Actually, according to what you said, there should be so many modes in theory.
- Runtime-mode: streaming or batch.
- Range: full or incremental.
- Position: Latest, timestamp, snapshot-id, compacted.

Advantages: The disassembly is very detailed, and every action is very clear.
Disadvantages: There are many combinations from orthogonality. In
combination with runtime-mode stream or batch, we can say that there
are 16 modes from orthogonality, many of which are meaningless. As you
said, default behavior is also a problem.

## Solution 2
Currently [1]:
- The environment determines the runtime-mode whether it is streaming
or a batch.
- The `scan.mode` determines position.
- No specific option determines `range`, but it is determined by
runtime-mode. However, it is not completely determined by runtime
mode, such as `full` and `compacted`, which are also read in
full-range under the stream.

Advantages: Simple. The default values of options are what we want for
streaming and batch.
Disadvantages:
1. The semantics of from timestamp are different in the case of
streaming and batch.
2. `full` and `compacted` are special.

## Solution 3

I understand that the core problem of solution2 may be more problem 2:
`full` and `compacted` are special.
How about:
- the runtime mode determines whether to read incremental only or full data.
- `scan.mode` contains: Latest, timestamp, snapshot-id.
The default is full in batch mode and incremental in stream mode.

However, we have two other choices for `scan.mode`: `latest-full`,
`compacted-full`. Regardless of the runtime-mode, the two choices
force full range to read.

I think solution 3 is a compromise solution. It can also ensure the
availability of default values. Conceptually, it can at least explain
the current options.

What do you think?

[1] 
https://nightlies.apache.org/flink/flink-table-store-docs-master/docs/development/configuration/

Best,
Jingsong

On Thu, Dec 8, 2022 at 4:11 PM Shammon FY <zjur...@gmail.com> wrote:
>
> Hi devs:
>
> I'm an engineer from ByteDance, and here I'd link to discuss "Scan mode in 
> Table Store for Flink Stream and Batch job".
>
> Users can execute Flink Steam and Batch jobs on Table Store. In Table Store 
> 0.2 there're two items which determine how the Stream and Batch jobs' sources 
> read data: StartupMode and config in Options.
> 1. StartupMode
>   a) DEFAULT. Determines actual startup mode according to other table 
> properties.  If \"scan.timestamp-millis\" is set, the actual startup mode 
> will be \"from-timestamp\" mode. Otherwise, the actual startup mode will be 
> \"full\".
>   b) FULL. For streaming sources, read the latest snapshot on the table upon 
> first startup, and continue to read the latest changes. For batch sources, 
> just consume the latest snapshot but do not read new changes.
>   c) LATEST. For streaming sources, continuously reads the latest changes 
> without reading a snapshot at the beginning.  For batch sources, behaves the 
> same as the \"full\" startup mode.
>   d) FROM_TIMESTAMP. For streaming sources, continuously reads changes 
> starting from timestamp specified by \"scan.timestamp-millis\", without 
> reading a snapshot at the beginning. For batch sources, read a snapshot at 
> timestamp specified by \"scan.timestamp-millis\" but do not read new changes.
> 2. Config in Options
>   a) scan.timestamp-millis, log.scan.timestamp-millis. Optional timestamp 
> used in case of \"from-timestamp\" scan mode.
>   b) read.compacted. Read the latest compact snapshots only.
>
> After discussing with @Jingsong Li and @Caizhi wen, we found that the config 
> in Options and StartupMode are not orthogonal. For example, read.compacted 
> and FROM_TIMESTAMP mode and its behavior in Stream and Batch sources. We want 
> to improve StartupMode to unify the data reading mode of Stream and Batch 
> jobs, and add the following StartupMode item:
> COMPACTED: For streaming sources, read a snapshot after the latest compaction 
> on the table upon first startup, and continue to read the latest changes. For 
> batch sources, just read a snapshot after the latest compaction but do not 
> read new changes.
> The advantage is that for Stream and Batch jobs, we only need to determine 
> their behavior through StartupMode, but we also found two main problems:
> 1. The behaviors of some StartupModes in Stream and Batch jobs are 
> inconsistent, which may cause user misunderstanding, such as FROM_ TIMESTAMP: 
> streaming job reads incremental data, while batch job reads full data
> 2. StartupMode does not define all data reading modes. For example, streaming 
> jobs read snapshots according to timestamp, and then read incremental data.
>
> To support all data reading modes in Table Store such as time travel, we try 
> to divide data reading into two orthogonal dimensions: data reading range and 
> startup position.
> 1. Data reading range
>   a) Incremental, read delta data
>   b) Full, read a snapshot first, then read the incremental data according to 
> different job types (Streaming job).
> 2. Startup position
>   a) Latest, read the latest snapshot.
>   b) From-timestamp，read a snapshot according to the timestamp.
>   c) From-snapshot-id，read a snapshot according to the snapshot id.
>   d) Compacted，read the latest compacted snapshot.
>
> Then there're two questions:
>
> Q1: Does it need to divide into two configuration items, or combine and unify 
> them in StartupMode?
> In StartupMode, it can be unified into eight modes. The advantage is that 
> through StartupMode, we can clearly explain the behavior of stream and batch 
> jobs, but there are many defined items, some of which are meaningless
> There can be two configurations, users may need to understand the different 
> meanings of the two configurations. After combination, stream and batch jobs 
> will have different processing logic.
>
> Q2: In Table Store 0.2, there are some conflicting definitions in Stream and 
> Batch. Does Table Store 0.3 need to be fully compatible or directly 
> implemented according to the new definitions? E.g. behavior in FROM_TIMESTAMP 
> mode.
> Fully compatible with the definition of Table Store 0.2, users can directly 
> execute queries in 0.2 on 0.3, but semantic conflicts in the original 
> definition will always exist in the next versions.
> Only limited compatibility is available, and the default behavior is 
> implemented according to the new definition, which may result in the query 
> error of 0.3 when the user queries on 0.2 successfully.
>
> Look forward to your feedback, thanks
>
> Best,
> Shammon
>
>
>

Re: [DISCUSSION] Scan mode in Table Store for Flink Stream and Batch job

Reply via email to