This is an automated email from the ASF dual-hosted git repository.

wenweihuang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/inlong-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 03748f668c [INLONG-951][Doc] Update agent overview (#952)
03748f668c is described below

commit 03748f668c748c8508943c466ecf67bc0cd5c428
Author: justinwwhuang <hww_jus...@163.com>
AuthorDate: Sat May 11 18:29:47 2024 +0800

    [INLONG-951][Doc] Update agent overview (#952)
    
    * [INLONG-951][Doc] Update agent overview
    
    * [INLONG-951][Doc] Update agent overview
---
 docs/modules/agent/img/agent_overview_10.png       | Bin 0 -> 131156 bytes
 docs/modules/agent/img/agent_overview_11.png       | Bin 0 -> 143145 bytes
 docs/modules/agent/img/agent_overview_12.png       | Bin 0 -> 39589 bytes
 docs/modules/agent/img/agent_overview_2.png        | Bin 0 -> 97291 bytes
 docs/modules/agent/img/agent_overview_3.png        | Bin 0 -> 50232 bytes
 docs/modules/agent/img/agent_overview_4.png        | Bin 0 -> 109340 bytes
 docs/modules/agent/img/agent_overview_5.png        | Bin 0 -> 74638 bytes
 docs/modules/agent/img/agent_overview_6.png        | Bin 0 -> 120877 bytes
 docs/modules/agent/img/agent_overview_7.png        | Bin 0 -> 47835 bytes
 docs/modules/agent/img/agent_overview_8.png        | Bin 0 -> 84505 bytes
 docs/modules/agent/img/agent_overview_9.png        | Bin 0 -> 77947 bytes
 docs/modules/agent/img/architecture.png            | Bin 43613 -> 0 bytes
 docs/modules/agent/overview.md                     | 149 +++++++++++++-------
 .../modules/agent/img/agent_overview_10.png        | Bin 0 -> 131156 bytes
 .../modules/agent/img/agent_overview_11.png        | Bin 0 -> 143145 bytes
 .../modules/agent/img/agent_overview_12.png        | Bin 0 -> 39589 bytes
 .../current/modules/agent/img/agent_overview_2.png | Bin 0 -> 97291 bytes
 .../current/modules/agent/img/agent_overview_3.png | Bin 0 -> 50232 bytes
 .../current/modules/agent/img/agent_overview_4.png | Bin 0 -> 109340 bytes
 .../current/modules/agent/img/agent_overview_5.png | Bin 0 -> 74638 bytes
 .../current/modules/agent/img/agent_overview_6.png | Bin 0 -> 120877 bytes
 .../current/modules/agent/img/agent_overview_7.png | Bin 0 -> 47835 bytes
 .../current/modules/agent/img/agent_overview_8.png | Bin 0 -> 84505 bytes
 .../current/modules/agent/img/agent_overview_9.png | Bin 0 -> 77947 bytes
 .../current/modules/agent/img/architecture.png     | Bin 43613 -> 0 bytes
 .../current/modules/agent/overview.md              | 150 ++++++++++++++-------
 26 files changed, 204 insertions(+), 95 deletions(-)

diff --git a/docs/modules/agent/img/agent_overview_10.png 
b/docs/modules/agent/img/agent_overview_10.png
new file mode 100644
index 0000000000..20ea3f88f4
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_10.png differ
diff --git a/docs/modules/agent/img/agent_overview_11.png 
b/docs/modules/agent/img/agent_overview_11.png
new file mode 100644
index 0000000000..88bb51c27f
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_11.png differ
diff --git a/docs/modules/agent/img/agent_overview_12.png 
b/docs/modules/agent/img/agent_overview_12.png
new file mode 100644
index 0000000000..c7eeb55cbe
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_12.png differ
diff --git a/docs/modules/agent/img/agent_overview_2.png 
b/docs/modules/agent/img/agent_overview_2.png
new file mode 100644
index 0000000000..f31a92febd
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_2.png differ
diff --git a/docs/modules/agent/img/agent_overview_3.png 
b/docs/modules/agent/img/agent_overview_3.png
new file mode 100644
index 0000000000..a21793a190
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_3.png differ
diff --git a/docs/modules/agent/img/agent_overview_4.png 
b/docs/modules/agent/img/agent_overview_4.png
new file mode 100644
index 0000000000..1a40dd7e6c
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_4.png differ
diff --git a/docs/modules/agent/img/agent_overview_5.png 
b/docs/modules/agent/img/agent_overview_5.png
new file mode 100644
index 0000000000..cedd98c5b8
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_5.png differ
diff --git a/docs/modules/agent/img/agent_overview_6.png 
b/docs/modules/agent/img/agent_overview_6.png
new file mode 100644
index 0000000000..d02bf5fef4
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_6.png differ
diff --git a/docs/modules/agent/img/agent_overview_7.png 
b/docs/modules/agent/img/agent_overview_7.png
new file mode 100644
index 0000000000..642c59b1a2
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_7.png differ
diff --git a/docs/modules/agent/img/agent_overview_8.png 
b/docs/modules/agent/img/agent_overview_8.png
new file mode 100644
index 0000000000..c45187ea92
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_8.png differ
diff --git a/docs/modules/agent/img/agent_overview_9.png 
b/docs/modules/agent/img/agent_overview_9.png
new file mode 100644
index 0000000000..d3472f2b15
Binary files /dev/null and b/docs/modules/agent/img/agent_overview_9.png differ
diff --git a/docs/modules/agent/img/architecture.png 
b/docs/modules/agent/img/architecture.png
deleted file mode 100644
index 1138fe1b62..0000000000
Binary files a/docs/modules/agent/img/architecture.png and /dev/null differ
diff --git a/docs/modules/agent/overview.md b/docs/modules/agent/overview.md
index 7bc65e59e5..97f482ded0 100644
--- a/docs/modules/agent/overview.md
+++ b/docs/modules/agent/overview.md
@@ -3,50 +3,105 @@ title: Overview
 sidebar_position: 1
 ---
 
-InLong Agent is a collection tool that supports multiple types of data 
sources, and is committed to achieving stable and efficient data collection 
functions between multiple heterogeneous data sources including File, SQL, 
Binlog, Metrics, etc.
-
-## Design Concept
-In order to solve the problem of data source diversity, InLong-agent abstracts 
multiple data sources into a unified source concept, and abstracts sinks to 
write data. When you need to access a new data source, you only need to 
configure the format and reading parameters of the data source to achieve 
efficient reading.
-
-## InLong-Agent Architecture
-![](img/architecture.png)
-
-The InLong Agent task is used as a data acquisition framework, constructed 
with a channel + plug-in architecture. Read and write the data source into a 
reader/writer plug-in, and then into the entire framework.
-
-- Reader: Reader is the data collection module, responsible for collecting 
data from the data source and sending the data to the channel.
-- Writer: Writer is a data writing module, which reuses data continuously to 
the channel and writes the data to the destination.
-- Channel: The channel used to connect the reader and writer, and as the data 
transmission channel of the connection, which realizes the function of data 
reading and monitoring
-
-## Different kinds of agent
-### File
-File collection includes the following functions:
-
-User-configured path monitoring, able to monitor the created file information
-Directory regular filtering, support YYYYMMDD+regular expression path 
configuration
-Breakpoint retransmission, when InLong-Agent restarts, it can automatically 
re-read from the last read position to ensure no reread or missed reading.
-
-#### File options
-| Parameter                     | Required | Default value | Type   | 
Description                                                  |
-| ----------------------------- | -------- | ------------- | ------ | 
------------------------------------------------------------ |
-| pattern                       | required | (none)        | String | File 
pattern. For example: /root/[*].log      |
-| timeOffset                    | optional | (none)        | String | File 
name includes time, for example: *** YYYYMMDDHH *** YYYY represents the year, 
MM represents the month, DD represents the day, and HH represents the hour, *** 
is any character. '1m' means one minute after, '-1m' means one minute before. 
'1h' means one hour after, '-1h' means one hour before. '1d' means one day 
after, '-1d' means one day before.|
-| collectType                   | optional |  FULL         | String | FULL is 
all file. INCREMENT is the newly created file after start task.                 
     |
-| lineEndPattern                | optional | '\n'          | String | Line of 
file end pattern. |
-| contentCollectType            | optional |  FULL         | String | Collect 
data of file content. |
-| envList                       | optional | (none)        | String | File 
needs to collect environment information, for example: kubernetes.            |
-| dataContentStyle              | optional | (none)        | String | Type of 
data result for column separator. Json format, set this parameter to json. CSV 
format, set this parameter to a custom separator: `,` &#124; `:`           |
-| dataSeparator                 | optional | (none)        | String | Column 
separator of data source.            |
-| monitorStatus                 | optional | (none)        | Integer| Monitor 
switch, 1 true and 0 false. Batch data is 0,real time data is 1. |
-| monitorInterval               | optional | (none)        | Long   | Monitor 
interval for file. |
-| monitorExpire                 | optional | (none)        | Long   | Monitor 
expire time and the time in milliseconds. |
-
-### SQL
-This type of data refers to the way it is executed through SQL
-SQL regular decomposition, converted into multiple SQL statements
-Execute SQL separately, pull the data set, the pull process needs to pay 
attention to the impact on mysql itself
-The execution cycle, which is generally executed regularly
-
-### Binlog
-This type of collection reads binlog and restores data by configuring mysql 
slave
-Need to pay attention to multi-threaded parsing when binlog is read, and 
multi-threaded parsing data needs to be labeled in order
-The code is based on the old version of dbsync, the main modification is to 
change the sending of tdbus-sender to push to agent-channel for integration
\ No newline at end of file
+The InLong Agent belongs to the collection layer of the InLong data link. It is a collection tool that supports multiple types of data sources and is dedicated to stable, efficient data collection from various heterogeneous data sources, including File, MySQL, Pulsar, Metrics, etc.
+
+### Architecture
+
+![](img/agent_overview_2.png)
+
+The InLong Agent itself serves as a data collection framework. To make data sources easy to extend, each data source is abstracted into a Source plugin and integrated into the framework.
+- Source: the data collection module, responsible for collecting data from the data source
+- Manager Fetcher: the Agent configuration synchronization thread, which pulls the collection configurations from the Manager
+- Instance: takes data from the Source and writes it to the DataProxy Sink
+
+### Design concept
+In order to address the issue of data source diversity, InLong Agent abstracts 
multiple data sources into a unified Source concept and abstracts a unified 
DataProxy Sink to write data into the InLong link. When a new data source needs 
to be connected, simply configure the format and reading parameters of the data 
source to achieve efficient reading.
+
+### Basic concepts
+
+![](img/agent_overview_3.png)
+
+#### Task and Instance
+- Task
+Represents a collection task configured by the user
+
+- Instance
+An instantiation of a collection task; generated by a Task and responsible for carrying out the actual collection
+
+Taking file collection as an example, suppose the Manager holds a collection task configuration `127.0.0.1 -> /data/log/YYYYMMDDhh.log._[0-9]+`, meaning the user wants to collect, on the machine `127.0.0.1`, the data of files matching the path rule `/data/log/YYYYMMDDhh.log._[0-9]+`; that is one Task. Assuming three files match this rule, /data/log/2024040221.log.0, /data/log/2024040221.log.1, and /data/log/2024040221.log.3, the Task will generate three Instances to collect these three files respectively.
+
+#### Source and Sink
+Source and Sink are concepts one level below an Instance. Each Instance has one Source and one Sink: the Source reads data from the data source, and the Sink writes the data to the target storage. In the InLong system, data collected by the Agent is written uniformly to the DataProxy service, so the only Sink type is the DataProxy Sink.
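
The per-instance Source/Sink split can be sketched as follows. This is a simplified, hypothetical shape to illustrate the concept, not the actual InLong Agent interfaces:

```java
// Hypothetical, simplified interfaces illustrating the per-instance
// Source/Sink split; the real InLong Agent APIs differ in detail.
public class InstanceSketch {
    interface Source { String read(); }            // returns null when drained
    interface Sink { void write(String message); } // delivers to DataProxy

    private final Source source;
    private final Sink sink;

    InstanceSketch(Source source, Sink sink) {
        this.source = source;
        this.sink = sink;
    }

    // An instance's core loop: move data from its Source to its Sink.
    void run() {
        String msg;
        while ((msg = source.read()) != null) {
            sink.write(msg);
        }
    }
}
```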
+
+## Implementation principle of InLong Agent
+### Life cycle
+
+![](img/agent_overview_4.png)
+
+Agent data collection involves configuration pulling, Task/Instance generation, and Task/Instance execution. Taking file collection as an example, the whole process of a collection task is:
+- Step 1: The Agent configuration synchronization thread, the Manager Fetcher, pulls the collection configurations from the Manager, e.g. Config 1 and Config 2
+- Step 2: The synchronization thread submits the configurations to the Task Manager
+- Step 3.1/3.2: The Task Manager generates Task 1 and Task 2 from Config 1 and Config 2
+- Step 4: Task 1 scans for files matching the rules of Config 1, e.g. File 1 and File 2, and submits their information to the Instance Manager (the Instance Manager is a member variable of the Task)
+- Step 5.1/5.2: The Instance Manager generates the corresponding Instances from the file information of File 1 and File 2 and runs them
+- Step 6.1/6.2: The Source of each Instance collects the file data and sends it out through the Sink. Once the file has been fully collected and sent, the Instance sends a completion signal to the Instance Manager, triggering the Instance Manager to release the Instance
+- Step 7: After the Task detects through the Instance Manager that all Instances have completed, it sends a Task-completion signal to the Task Manager, triggering the Task Manager to release the Task
+
+### State saving
+Agent data collection is stateful. To guarantee the continuity of collection, the collected state must be saved so that tasks can recover after the Agent stops unexpectedly. The Agent divides state into three categories: Task, Instance, and Offset, corresponding to the Task status, the Instance status, and the read position during collection. These three types of state data are saved through RocksDB, in three different DB directories.
+
+![](img/agent_overview_5.png)
+
+The InstanceDB record stores the configured Source and Sink class names, because class names may change after an Agent upgrade, e.g. the Source class changing from LogFileSourceV1 to NewLogFileSourceV1. Also, one Task corresponds to multiple Instances, so Tasks and Instances are kept in different DBs to prevent changes to one Instance from affecting another. The Offset is placed in an independent DB so that the offset information of the old version can still be used when the Agent is upgraded.
+
+![](img/agent_overview_6.png)
+
+### Data consistency
+#### Offset refresh mechanism
+We adopt an algorithm similar to a "sliding window": the Agent may send multiple pieces of data before stopping to wait for acknowledgment, rather than stopping after every piece. This ensures that the offset is updated only after the ack succeeds, while keeping the sending speed high. Taking the collection of 4 pieces of data as an example:
+- First, the Source reads the 4 pieces of data from the data source in order
+
+![](img/agent_overview_7.png)
+ 
+- Second, the 4 pieces of data taken from the Source are `sent in order` to the Sink. When the Sink receives a piece of data, it `first records its offset in the OffsetList and marks it as not sent`.
+
+![](img/agent_overview_8.png)
+
+Then the Sink sends the 4 pieces of data through the SDK, but only pieces 1, 2, and 4 return success. A successful return `sets the corresponding flag in the OffsetList to true`.
+
+![](img/agent_overview_9.png)
+
+The offset update thread traverses the OffsetList, finds that Offset 3 has not been acked, and therefore flushes Offset 2, the closest offset before Offset 3, to storage. This guarantees that `the offset is refreshed only after the data has been successfully sent downstream`.
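
The bookkeeping above can be sketched as follows. Class and method names here are illustrative, not the actual InLong Agent code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the OffsetList mechanism described above: offsets are
// recorded in send order as un-acked, acks flip a flag, and the flush thread
// persists only the highest offset whose predecessors are all acked.
public class OffsetList {
    // LinkedHashMap preserves insertion (send) order for the traversal below.
    private final LinkedHashMap<Long, Boolean> pending = new LinkedHashMap<>();

    // Sink records the offset before sending, marked as not yet acked.
    public synchronized void record(long offset) {
        pending.put(offset, false);
    }

    // SDK success callback: set the corresponding flag to true.
    public synchronized void ack(long offset) {
        pending.computeIfPresent(offset, (k, v) -> true);
    }

    // Flush thread: walk in send order, stop at the first un-acked offset,
    // and return the last offset that is safe to persist (-1 if none).
    public synchronized long flushableOffset() {
        long safe = -1;
        for (Map.Entry<Long, Boolean> e : pending.entrySet()) {
            if (!e.getValue()) break;
            safe = e.getKey();
        }
        return safe;
    }
}
```

With the 4-piece example from the text, acking 1, 2, and 4 but not 3 makes `flushableOffset()` return 2, exactly the value flushed to storage above.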
+
+#### Restart recovery mechanism
+
+![](img/agent_overview_10.png)
+
+As mentioned above, the state of Tasks, Instances, and Offsets is stored through RocksDB, and the offset is refreshed only after the data has been successfully sent downstream. Restart recovery of collection tasks likewise relies on the saved state; the whole process is:
+- Step 1: On startup, the Task Manager reads the TaskDB
+- Step 2: The Task Manager generates Task 1 and Task 2 from the TaskDB configurations
+- Step 3: The Instance Manager reads the InstanceDB
+- Step 4: The Instance Manager generates Instances from the InstanceDB records
+- Step 5: Each Instance reads the OffsetDB
+- Step 6: Each Instance initializes its Source from the OffsetDB configuration and restores the Offset
+- Step 7: Tasks are then updated periodically according to the Manager configuration
+
+## InLong Agent file collection mechanism
+### Folder scanning
+Scan all files under the configured path and match them against the rule; any match counts as discovered. With a large number of files, however, one full scan takes a long time and consumes significant resources: if the scanning interval is too short, resource consumption is too high; if it is too long, the response is too slow.
+
+### Folder listening
+The problem above can be solved by listening to folders. We simply register the folder with a listener and then query the listener's interface for events. Listenable event types include creation, deletion, modification, and so on. Usually listening for file creation is enough: modification events are easily too numerous, and file deletions can be detected actively while reading the files. But because listening events are trigger-based, consistency problems can easily arise.
+
+### Combining folder scanning and listening
+In practice we combine folder scanning and listening: for a folder we run both "scheduled scanning" and "listening" at the same time, which ensures consistency while keeping the response fast. The process is:
+- First, query the file listener for newly created files; for each one, check whether it is already cached, and if not, put it into the to-be-collected queue
+- Second, when the scanning interval has elapsed, scan the files; for each file found, check whether it is already cached, and if not, put it into the to-be-collected queue
+- Finally, process the file information in the to-be-collected queue, i.e. submit it to the Instance Manager
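
The dedup-and-queue logic above can be sketched as follows. Class and method names are illustrative, not the actual InLong Agent code:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch of merging the two discovery paths: both the folder
// listener's events and the periodic scan feed the same cache, so a file
// reaches the to-be-collected queue exactly once regardless of which path
// found it first.
public class FileDiscovery {
    private final Set<String> cached = new HashSet<>();
    private final Queue<String> toCollect = new ArrayDeque<>();

    // Called for paths reported by the listener or found by a scan;
    // returns true only if the file was new and got queued.
    public synchronized boolean offer(String path) {
        if (!cached.add(path)) {
            return false;            // already cached: skip the duplicate
        }
        toCollect.add(path);
        return true;
    }

    // Drain the queue, e.g. submitting each entry to the Instance Manager.
    public synchronized String poll() {
        return toCollect.poll();
    }
}
```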
+
+![](img/agent_overview_11.png)
+
+### File reading
+We use the `RandomAccessFile` class to read files; an instance of `RandomAccessFile` supports both reading and writing a random-access file. A random-access file behaves like a large byte array stored in the file system. A cursor, or index into that implied array, called the file pointer, marks where reading starts; bytes are read from the file pointer onward, and the pointer advances as bytes are read. For example, if a file is 13 bytes long and we need to read 3 bytes starting at offset 4, we simply point the file pointer at offset 4 and read 3 bytes.
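
A minimal example of this read pattern using the JDK's `RandomAccessFile` (the helper class here is illustrative, not InLong Agent code):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;

public class ReadAtOffset {
    // Read `length` bytes starting at `offset`, mirroring the cursor
    // semantics above: seek() moves the file pointer, reads advance it.
    static byte[] readAt(File file, long offset, int length) throws IOException {
        byte[] buf = new byte[length];
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);     // point the file pointer at the offset
            raf.readFully(buf);   // read bytes, advancing the pointer
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".log");
        f.deleteOnExit();
        Files.write(f.toPath(), "Hello, world!".getBytes()); // 13 bytes
        // Read 3 bytes starting at offset 4.
        System.out.println(new String(readAt(f, 4, 3)));     // prints "o, "
    }
}
```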
+
+![](img/agent_overview_12.png)
\ No newline at end of file
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_10.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_10.png
new file mode 100644
index 0000000000..20ea3f88f4
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_10.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_11.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_11.png
new file mode 100644
index 0000000000..88bb51c27f
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_11.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_12.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_12.png
new file mode 100644
index 0000000000..c7eeb55cbe
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_12.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_2.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_2.png
new file mode 100644
index 0000000000..f31a92febd
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_2.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_3.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_3.png
new file mode 100644
index 0000000000..a21793a190
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_3.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_4.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_4.png
new file mode 100644
index 0000000000..1a40dd7e6c
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_4.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_5.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_5.png
new file mode 100644
index 0000000000..cedd98c5b8
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_5.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_6.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_6.png
new file mode 100644
index 0000000000..d02bf5fef4
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_6.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_7.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_7.png
new file mode 100644
index 0000000000..642c59b1a2
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_7.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_8.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_8.png
new file mode 100644
index 0000000000..c45187ea92
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_8.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_9.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_9.png
new file mode 100644
index 0000000000..d3472f2b15
Binary files /dev/null and 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/agent_overview_9.png
 differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/architecture.png
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/architecture.png
deleted file mode 100644
index 1138fe1b62..0000000000
Binary files 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/img/architecture.png
 and /dev/null differ
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/overview.md 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/overview.md
index 6cd8fb5431..bab4dfef6d 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/overview.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/modules/agent/overview.md
@@ -3,51 +3,105 @@ title: 总览
 sidebar_position: 1
 ---
 
-InLong Agent 是一个支持多种数据源类型的收集工具,致力于实现包括 File、Sql、Binlog、Metrics 
等多种异构数据源之间稳定高效的数据采集功能。
-
-## 设计理念
-为了解决数据源多样性问题,InLong-agent 
将多种数据源抽象成统一的source概念,并抽象出sink来对数据进行写入。当需要接入一个新的数据源的时候,只需要配置好数据源的格式与读取参数便能跟做到高效读取。
-
-## InLong-Agent 架构介绍
-![](img/architecture.png)
-
-InLong Agent本身作为数据采集框架,采用channel + 
plugin架构构建。将数据源读取和写入抽象成为Reader/Writer插件,纳入到整个框架中。
-
-- Reader:Reader为数据采集模块,负责采集数据源的数据,将数据发送给channel。
-- Writer: Writer为数据写入模块,负责不断向channel取数据,并将数据写入到目的端。
-- Channel:Channel用于连接reader和writer,作为两者的数据传输通道,并起到了数据的写入读取监控作用
-
-
-## InLong-Agent 采集分类
-### 文件
-文件采集包含如下功能:
-用户配置的路径监听,能够监听出创建的文件信息
-目录正则过滤,支持YYYYMMDD+正则表达式的路径配置
-断点重传,InLong-Agent重启时,能够支持自动从上次读取位置重新读取,保证不重读不漏读。\
-
-#### 文件采集参数
-| 参数                           | 是否必须  | 默认值         | 类型    | 描述              
                                    |
-| ----------------------------- | -------- | ------------- | ------ | 
------------------------------------------------------------ |
-| pattern                       | required | (none)        | String | 
文件正则匹配,例如: /root/[*].log      |
-| timeOffset                    | optional | (none)        | String | 
文件偏移匹配针对文件文件名称为: *** YYYYMMDDHH *** YYYY 表示年, MM 表示月, DD 表示天,  HH 表示小时, *** 
表示任意的字符;'1m' 表示一分钟以后, '-1m' 表示一分钟以前, '1h' 一小时以后, '-1h' 一小时以前, '1d' 一天以后, '-1d' 
一天以前。|
-| collectType                   | optional |  FULL         | String | "FULL" 
目录下所有匹配的文件, "INCREMENT" 任务启动后匹配新增的文件。                      |
-| lineEndPattern                | optional | '\n'          | String | 
文件行结束正则匹配。 |
-| contentCollectType            | optional |  FULL         | String | 
文件内容采集方式全量"FULL"、增量"INCREMENT" 。|
-| envList                       | optional | (none)        | String | 
文件采集携带环境信息,例如在容器环境下: kubernetes 。           |
-| dataContentStyle              | optional | (none)        | String | 
采集后数据输出方式, Json 格式设置为 json ; CSV 格式设置分割类型: `,` &#124; `:`            |
-| dataSeparator                 | optional | (none)        | String | 
文件数据原始列分割方式。           |
-| monitorStatus                 | optional | (none)        | Integer| 文件监控开关 1 
开启 、 0 关闭。场景:在批量采集是设置为 0,实时数据采集时 1。 |
-| monitorInterval               | optional | (none)        | Long   | 
文件监控探测频率,毫秒/单位 |
-| monitorExpire                 | optional | (none)        | Long   | 
文件监控探测过期时间,毫秒/单位 |
-
-
-### Sql
-这类数据是指通过SQL执行的方式
-SQL正则分解,转化成多条SQL语句
-分别执行SQL,拉取数据集,拉取过程需要注意对mysql本身的影响
-执行周期,这种一般是定时执行
-
-### Binlog
-这类采集通过配置mysql slave的方式,读取binlog,并还原数据
-需要注意binlog读取的时候多线程解析,多线程解析的数据需要打上顺序标签
-代码基于老版本的dbsync,主要的修改是将tdbus-sender的发送改为推送到agent-channel的方式做融合
\ No newline at end of file
+InLong Agent 属于 InLong 数据链路的采集层,是一个支持多种数据源类型的收集工具,致力于实现包括 
File、MySQL、Pulsar、Metrics 等多种异构数据源稳定高效的数据采集功能。
+
+### 整体架构
+
+![](img/agent_overview_2.png)
+
+InLong Agent 本身作为数据采集框架,为了方便扩展数据源,将数据源抽象成为 Source 插件,纳入到整个框架中。
+- Source:Source 为数据采集模块,负责采集数据源的数据。
+- Agent 配置同步线程 Manager Fetcher 从 Manager 拉到采集配置
+- Instance:Instance 用于将数据从 Source 取出并写入到 DataProxy Sink
+
+### 设计理念
+为了解决数据源多样性问题,InLong Agent 将多种数据源抽象成统一的 Source 概念,并抽象统一的 DataProxy Sink 将数据写入 
InLong 链路。当需要接入一个新的数据源的时候,只需要配置好数据源的格式与读取参数便能做到高效读取。
+
+### 基本概念
+
+![](img/agent_overview_3.png)
+
+#### Task 和 Instance
+- Task
+  代表用户配置的采集任务
+
+- Instance
+  采集任务的实例化,由 Task 生成,负责具体执行采集任务
+
+以文件采集为例,Manager 上有个采集任务的配置: `127.0.0.1 -> 
/data/log/YYYYMMDDhh.log._[0-9]+`,表示用户需要在 `127.0.0.1` 这台机器上采集符合 
`/data/log/YYYYMMDDhh.log._[0-9]+` 这个路径规则的数据,`这就是一个 Task`。假设满足这个路径规则的文件有 3 
个:/data/log/2024040221.log.0,/data/log/2024040221.log.1,/data/log/2024040221.log.3,
 Task 会生成 3 个 Instance 分别采集这 3 个文件。
+
+#### Source 和 Sink
+Source 和 Sink 属于 Instance 下一级的概念,每个 Instance 都有一个 Source 和一个 Sink,Source 
从数据源读取数据,Sink 把数据写入目标存储。在 InLong 体系中,数据经过 Agent 采集会统一写入 DataProxy 服务,即只有  
DataProxy Sink 类型。
+
+## InLong Agent 实现原理
+### 生命周期
+
+![](img/agent_overview_4.png)
+
+Agent 数据采集任务包括配置拉取、Task/Instance 生成、Task/Instance 执行等过程。以文件采集为例,采集任务的整个过程包括:
+- Step 1: Agent 配置同步线程 Manager Fetcher 从 Manager 拉到采集配置,比如 Config 1、Config 2
+- Step 2: 同步线程将配置提交到任务管理器 Task Manager
+- Step 3.1/3.2: Task Manager 会根据 Config1 和 Config 2 生成 Task 1、Task 2
+- Step 4: Task 1 根据 Config 1,扫描到符合 Config 1 规则的文件,比如 File 1、File 2,将 File 
1、File 2 的信息提交到实例管理器 Instance Manger(Instance Manager 是  Task 的成员变量)
+- Step 5.1/5.2: Instance Manager 根据  File 1、File 2 的文件信息生成对应的 Instance,并运行
+- Step 6.1/6.2: Instance 各自的 Source 将会根据文件信息去采集文件数据,并将采集到的数据通过 Sink 
发送出去。文件采集、发送完成后将向 Instance Manager 发送 Instance 完成的信号,触发 Instance Manager 释放 
Instance
+- Step 7: Task 通过 Instance Manager 检测到所有的 Instance 执行完成后将向 Task Manager 发送 
Task 完成的信号,触发 Task Manager 释放 Task
+
+### 状态保存
+Agent 数据采集有状态,为了保证采集数据的连续性,需要对采集的状态进行保存,防止 Agent 意外停止后任务无法恢复。Agent 
将状态分成三大类:Task、Instance 和 Offset,分别对应 Task 任务状态、Instance 
实例状态及采集过程中的位点状态。这三类状态数据通过 RocksDB 保存,存在在三个不同的 DB 目录。
+
+![](img/agent_overview_5.png)
+
+InstanceDB 记录里保存着指定的 Source、Sink 类名,这是由于 Agent 升级后类名有可能发生变化,比如 Source class 由 
LogFileSourceV1 变成 NewLogFileSourceV1。同时,一个 Task 会对应多个 Instance,为了避免不同的 
Instance 之间的变更互相影响,将 Task 和 Instance 也放到了不同的 DB。将 Offset 放到独立 DB,为了解决 Agent 
进行版本升级时能使用老版本的位点信息。
+
+![](img/agent_overview_6.png)
+
+### 数据一致性
+#### Offset 刷新机制
+我们采取的是类似 “滑动窗口” 算法:Agent 在停止并等待确认前可以发送多条数据,不必每发一条数据就停下来等待确认,既确保了 ”ack 成功才更新 
Offset“ 又能保持较快的发送速度。下面以采集 4 条数据为例:
+- 首先,Source 有序从数据源读到 4 条数据
+  
+![](img/agent_overview_7.png)
+ 
+- 其次,从 Source 取了 4 条数据!!#ff0000 有序发往!! Sink,Sink 在接到数据时!!#ff0000 首先将数据的 Offset 
记录到 OffsetList,并标识为未发送!!
+
+![](img/agent_overview_8.png)
+
+  然后 Sink 将 4 条数据通过 SDK 发送,但是只有 1、2、4 三条数据返回成功,返回成功会!!#ff0000 将 OffsetList 
中对应的标识置为 true!!
+
+  ![](img/agent_overview_9.png)
+
+  Offset 更新线程则会遍历 OffsetList 发现 Offset 3 未 ack,于是就将 Offset 3 之前最近的 Offset 2 
刷新到存储,这就保证了`数据一定是成功发送到下游之后才做 Offset 刷新`。
+
+#### 重启恢复机制
+
+![](img/agent_overview_10.png)
+
+如上所述,Task、Instance 和 Offset 的状态信息通过 RocksDB 存储,并且能保证数据一定成功发送到下游后才做 Offset 
刷新。采集任务的重启恢复,也是依赖保存的状态,整个过程如下:
+- Step 1: 启动时 Task Manager 读取 TaskDB
+- Step 2: Task Manager 根据 TaskDB 的配置生成 Task 1、Task 2
+- Step 3: Instance Manager 读取 InstanceDB
+- Step 4: Instance Manager 根据 InstanceDB 的记录生成 Instance
+- Step 5: Instance 读取 OffsetDB
+- Step 6: Instance 根据 OffsetDB 的配置对 Source 进行初始化,恢复 Offset
+- Step 7: 定时根据 Manager 配置更新任务
+
+## InLong Agent 文件采集机制
+### 文件夹扫描
+把相应路径下的文件都扫一遍,然后匹配一下规则,匹配上就算是找到了。但是文件数量较多的情况下,扫描一遍需要较长的时间、也比较耗资源,扫描周期太小则资源消耗过大;扫描周期太大则响应速度过慢。
+
+### 文件夹监听
+上面的问题可以通过文件夹的监听来解决。我们只需要把文件夹注册到监听器,然后就可以通过这个监听器的接口查询是否有事件发生。监听的事件类型有增加、删除、修改等。一般我们监听文件的增加就可以,修改的很容易过多,而文件的删除事件则可以在文件读文件的过程中主动发现。但因为监听事件是触发式的,容易出现一致性问题。
+
+### 文件夹扫描和监听结合
+在实际应用中我们采取的是文件夹扫描和监听结合的模式,简单说就是对于一个文件夹我们同时做了 “定时扫描” 和 
“监听”,这样既确保了一致性又能有较快的响应速度,具体过程如下:
+- 首先,从文件监听器查询是否有新建文件,有则再查询是否已经缓存,没有缓存则放入待采集队列
+- 其次,如果扫描时间间隔满足,则开始扫描文件,如果扫描到文件则再查询文件是否已经缓存,没有则放入待采集队列
+- 最后,再处理待采集队列里的文件信息,也就是将其提交到 Instance Manager
+
+![](img/agent_overview_11.png)
+
+### 文件读取
+我们使用了 `RandomAccessFile` 类来读取文件, `RandomAccessFile` 
的实例支持对随机访问文件的读取和写入。随机访问文件的行为类似存储在文件系统中的一个大型 byte 
数组。存在指向该隐含数组的光标或索引,称为文件指针;从文件指针开始读取字节,并随着对字节的读取而前移此文件指针。举个例子:文件共有 13 字节,我们需要从 
Offset 为 4 的地方开始读取 3 个字节。我们只需要把文件指针指向 Offset 为 4 的地方,然后读取 3 个字节即可。
+
+![](img/agent_overview_12.png)
\ No newline at end of file

