davidzollo opened a new issue, #10666: URL: https://github.com/apache/seatunnel/issues/10666
# GitHub Issue Draft

## Repo

`apache/seatunnel`

## Proposed Title

`[Feature][Zeta] Support lightweight edge collector clients for remote data collection`

## Proposed Body

```md
### Search before asking

I searched existing feature requests and did not find a similar proposal for a lightweight edge collector / remote agent model for SeaTunnel Zeta.

### Description

I would like to start a discussion about whether SeaTunnel Zeta should support a lightweight edge collector client that runs on remote hosts, collects local data, and sends the data stream into a central Zeta cluster for transform and sink processing.

The goal is not to replace the current SeaTunnel job client. Instead, this would introduce an edge-side collection model for scenarios where the data source is reachable only from remote hosts, or where users want a very small local process for collection while keeping scheduling, transformation, checkpoint coordination, and sink execution centralized in Zeta.

Typical examples:

* collecting local files or logs from remote machines
* collecting application events or metrics from private-network hosts
* collecting data through custom local SDKs or internal protocols that should not run directly inside the Zeta cluster

Today, SeaTunnel already has:

* an engine client for job submission
* source connectors that run inside worker tasks
* a socket connector that demonstrates basic network ingestion

However, there is no first-class model for a lightweight remote collector that focuses only on collection + buffering + transport. Such a model could be useful if SeaTunnel wants to support an "edge collection, central processing" architecture.

### Usage Scenario

One example is a company with many business hosts in isolated network zones. Those hosts can access local files, local applications, or internal services, but the central SeaTunnel Zeta cluster cannot reach those sources directly.
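To make the "collection + buffering + transport" role concrete, here is a minimal at-least-once sketch in Java. Every class and method name here is hypothetical (none of this is an existing SeaTunnel API), and the transport to the Zeta-side ingress endpoint is stubbed behind an interface:

~~~java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch only: none of these names exist in SeaTunnel today.
public class EdgeCollectorSketch {

    /** Stand-in for a transport to a Zeta-side ingress endpoint; true means acked. */
    interface IngressClient {
        boolean sendBatch(List<String> batch);
    }

    private final Queue<String> buffer = new ArrayDeque<>();
    private final IngressClient client;
    private final int batchSize;
    private final int maxRetries;

    EdgeCollectorSketch(IngressClient client, int batchSize, int maxRetries) {
        this.client = client;
        this.batchSize = batchSize;
        this.maxRetries = maxRetries;
    }

    /** Local collection step: records are buffered until a batch is shipped. */
    void collect(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /**
     * At-least-once shipping: records leave the local buffer only after the
     * cluster acknowledges the batch, so an unacked batch is retried and may
     * be delivered more than once.
     */
    void flush() {
        while (!buffer.isEmpty()) {
            // Copy (without removing) up to batchSize records from the head.
            List<String> batch = new ArrayList<>();
            Iterator<String> it = buffer.iterator();
            while (it.hasNext() && batch.size() < batchSize) {
                batch.add(it.next());
            }
            boolean acked = false;
            for (int attempt = 0; attempt < maxRetries && !acked; attempt++) {
                acked = client.sendBatch(batch);
            }
            if (!acked) {
                return; // keep everything buffered locally; retry on the next flush
            }
            for (int i = 0; i < batch.size(); i++) {
                buffer.poll(); // drop only what was acknowledged
            }
        }
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        int[] calls = {0};
        // Flaky in-memory ingress: the first attempt fails, later attempts ack.
        IngressClient flaky = batch -> {
            if (calls[0]++ == 0) {
                return false;
            }
            received.addAll(batch);
            return true;
        };
        EdgeCollectorSketch collector = new EdgeCollectorSketch(flaky, 2, 3);
        collector.collect("log-1");
        collector.collect("log-2"); // reaches batchSize, triggers a flush
        collector.collect("log-3");
        collector.flush();          // ships the partial tail batch
        System.out.println(received); // [log-1, log-2, log-3]
    }
}
~~~

A real collector would also persist the buffer to local disk and speak whatever ingress protocol the community settles on; the point of the sketch is only that records leave the local buffer after an acknowledgement, which yields at-least-once delivery and frames the exactly-once question raised below.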
In that case, a lightweight collector could:

* run as a small daemon or sidecar on the remote host
* collect data locally
* buffer and retry locally
* securely send batches to a Zeta-side ingress endpoint

The Zeta cluster would still be responsible for:

* pipeline execution
* transform logic
* checkpoint and recovery coordination
* downstream sink delivery

This would be especially helpful for:

* edge log collection
* remote file ingestion
* custom event collection
* environments with strict network isolation

### Related issues

I found the existing socket-related issue below, but it does not seem to cover this broader feature proposal:

* #10528

### Additional discussion points

If the community thinks this direction makes sense, the discussion should focus on:

* whether this should be a new `agent-source` / ingress model rather than an extension of the current job client
* which delivery guarantee the MVP should provide: at-least-once or exactly-once
* whether the first version should target logs/files/custom event sources rather than CDC/database scenarios
* how to keep the design compatible with the current Zeta source/checkpoint model

I am opening this issue mainly for design discussion first.
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
