Re: [PR] [pip] PIP-459: Batch Status Summary and Filtered Listing for Pulsar Functions [pulsar]

via GitHub Thu, 12 Mar 2026 05:43:36 -0700


shibd commented on code in PR #25299:
URL: https://github.com/apache/pulsar/pull/25299#discussion_r2924252941



##########
pip/pip-459.md:
##########
@@ -0,0 +1,331 @@
+# PIP-459: Batch Status Summary and Filtered Listing for Pulsar Functions
+
+# Background knowledge
+
+Pulsar Functions are managed by the Functions worker and exposed through the 
Pulsar Admin API and `pulsar-admin` CLI. Today, listing functions in a 
namespace returns only function names. To understand runtime health, operators 
must fetch function status for each function separately. That creates an N+1 
request pattern: one request to list function names and one additional request 
per function to fetch status. In practice, this is slow, noisy, and hard to use 
in scripts or daily operations.
+
+Function runtime status is already represented by `FunctionStatus`, which 
contains aggregate fields such as the total configured instances and the number 
of running instances. This proposal introduces a lightweight namespace-level 
summary model built from those counts. The summary is intentionally smaller 
than the full status payload: it is designed for listing and filtering, not for 
per-instance inspection.
+
+The proposal also has to account for mixed-version deployments. A new client 
may talk to an older worker that does not expose the new summary endpoint yet. 
In that case, the client must degrade safely instead of failing outright. That 
compatibility requirement affects both the public admin interface and the 
client-side implementation strategy.
+
+# Motivation
+
+`pulsar-admin functions list` currently cannot answer a basic operational 
question: which functions in this namespace are actually running. Operators 
must either inspect each function one by one or write shell loops that issue 
many admin calls. The problem becomes more visible in namespaces containing 
dozens or hundreds of functions, where a simple health check becomes expensive 
and slow.
+
+The initial feature request was to add state-aware listing to the CLI, but the 
implementation uncovered three broader issues that should be addressed together:
+
+1. The current user experience requires N+1 admin calls for namespace-level 
health inspection.
+2. A new summary endpoint must not break new clients when they talk to older 
workers during rolling upgrades.
+3. Namespace-level summary generation can become slow if the worker builds 
results strictly serially.
+
+This PIP proposes a batch status summary API for Pulsar Functions and 
integrates it into the admin client, CLI, and worker implementation in a 
backward-compatible way.
+
+# Goals
+
+## In Scope
+
+- Add a namespace-level batch status summary API for Pulsar Functions.
+- Add a lightweight public data model that returns function name, derived 
state, instance counts, and failure classification.
+- Add admin client support for the new summary API, including fallback to 
legacy workers that do not implement it.
+- Add CLI support for long-format listing and state-based filtering using the 
batch summary API.
+- Add pagination support for namespace-level function summaries.
+- Improve worker-side summary generation latency by using controlled 
parallelism.
+- Add a worker configuration knob to cap summary-query parallelism.
+- Add a worker metric to observe summary-query execution time.
+
+## Out of Scope
+
+- Changing the existing `functions list` endpoint that returns only function 
names.
+- Returning full per-instance function status from the new namespace-level 
endpoint.
+- Adding equivalent summary endpoints for sources or sinks in this PIP.
+- Adding server-side filtering by state in the REST endpoint.
+- Reworking the underlying function runtime status model or scheduler behavior.
+
+
+# High Level Design
+
+This proposal adds a new REST endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint returns a list of `FunctionStatusSummary` objects. Each object 
contains:
+
+- `name`
+- `state`: `RUNNING`, `STOPPED`, `PARTIAL`,  `UNKNOWN`
+- `numInstances`
+- `numRunning`
+- `error`
+- `errorType`
+
+The server remains a generic summary provider. It does not apply state 
filtering. The CLI consumes the summary list and performs presentation concerns 
locally, such as `--state` filtering and `--long` formatting. This separation 
keeps the endpoint reusable for other clients and prevents coupling the REST 
contract to one CLI presentation format.
+
+For compatibility, the admin client first tries the new summary endpoint. If 
the server responds with `404 Not Found` or `405 Method Not Allowed`, the 
client falls back to the legacy flow: fetch the function names, apply 
name-based pagination, and then query each function status individually to 
build summaries client-side. This allows a new client to work against older 
workers during mixed-version upgrades.
+
+On the worker side, summary generation is executed only for the requested page 
and uses a bounded thread pool. A new worker configuration, 
`functionsStatusSummaryMaxParallelism`, limits how many function status lookups 
may run concurrently for a single summary request.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+### Data model
+
+A new public model, `FunctionStatusSummary`, is added under the admin API data 
package. It intentionally returns only aggregate listing information:
+
+```java
+public class FunctionStatusSummary {
+    public enum SummaryState {
+        RUNNING,
+        STOPPED,
+        PARTIAL,
+        UNKNOWN
+    }
+
+    public enum ErrorType {
+        AUTHENTICATION_FAILED,
+        FUNCTION_NOT_FOUND,
+        NETWORK_ERROR,
+        INTERNAL_ERROR
+    }
+
+    private String name;
+    private SummaryState state;
+    private int numInstances;
+    private int numRunning;
+    private String error;
+    private ErrorType errorType;

Review Comment:
   Hi @onceMisery, thanks for the PIP. Have we considered including a bit more 
information in this response? For example:
   
   ```
   - `receivedTotal`
   - `processedSuccessfullyTotal`
   - `systemExceptionsTotal`
   - `userExceptionsTotal`
   - `avgProcessLatency`
   - `userMetrics`
   ```
   
   These values are already aggregate, and they could give operators a more 
direct view of a function's actual health instead of relying only on `RUNNING / 
STOPPED / PARTIAL / UNKNOWN`.
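
To make that limitation concrete, here is a minimal sketch of the count-based derivation as the PIP text describes it. The class name `SummaryStateSketch` and the `statusQueryFailed` flag are assumptions made for illustration, and treating out-of-range counts as `UNKNOWN` is only one reading of "not meaningful". None of the traffic or exception counters enter the result.

```java
// Illustrative only: this is not the PIP's implementation.
class SummaryStateSketch {

    enum SummaryState { RUNNING, STOPPED, PARTIAL, UNKNOWN }

    // Derives the summary state from aggregate instance counts.
    // statusQueryFailed is a hypothetical flag standing in for "the status query failed".
    static SummaryState derive(int numInstances, int numRunning, boolean statusQueryFailed) {
        // Assumption: zero, negative, or inconsistent counts are "not meaningful".
        if (statusQueryFailed || numInstances <= 0
                || numRunning < 0 || numRunning > numInstances) {
            return SummaryState.UNKNOWN;
        }
        if (numRunning == numInstances) {
            return SummaryState.RUNNING;
        }
        if (numRunning == 0) {
            return SummaryState.STOPPED;
        }
        return SummaryState.PARTIAL; // 0 < numRunning < numInstances
    }

    public static void main(String[] args) {
        System.out.println(derive(3, 3, false));
        System.out.println(derive(3, 1, false));
        System.out.println(derive(3, 0, false));
        System.out.println(derive(0, 0, false));
    }
}
```

Under these rules, a function whose instances are all up but failing on every message would still report `RUNNING`.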
   
   Another option would be to keep the default response lightweight, but add a 
query parameter to control the level of detail returned by the REST API. That 
would let us preserve the current "summary" use case while still supporting a 
more diagnostic view when needed.
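
As a rough sketch of the second option, the response builder could add the extra counters only when a detail level is requested. Everything named here (`Detail`, `Aggregates`, `toResponse`, and the idea of a `DIAGNOSTIC` value) is hypothetical and not part of the PIP; `userMetrics` is left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: Detail, Aggregates, and toResponse are hypothetical names,
// not part of the PIP or the Pulsar admin API.
class SummaryDetailSketch {

    // Hypothetical values for a detail-level query parameter.
    enum Detail { SUMMARY, DIAGNOSTIC }

    // Stand-in for the aggregate counters listed in the comment above.
    static final class Aggregates {
        final long receivedTotal;
        final long processedSuccessfullyTotal;
        final long systemExceptionsTotal;
        final long userExceptionsTotal;
        final double avgProcessLatency;

        Aggregates(long receivedTotal, long processedSuccessfullyTotal,
                   long systemExceptionsTotal, long userExceptionsTotal,
                   double avgProcessLatency) {
            this.receivedTotal = receivedTotal;
            this.processedSuccessfullyTotal = processedSuccessfullyTotal;
            this.systemExceptionsTotal = systemExceptionsTotal;
            this.userExceptionsTotal = userExceptionsTotal;
            this.avgProcessLatency = avgProcessLatency;
        }
    }

    // Builds one function's response body; the lightweight fields are always present,
    // the counters only when the caller asked for the diagnostic view.
    static Map<String, Object> toResponse(String name, int numInstances, int numRunning,
                                          Aggregates agg, Detail detail) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("name", name);
        body.put("numInstances", numInstances);
        body.put("numRunning", numRunning);
        if (detail == Detail.DIAGNOSTIC && agg != null) {
            body.put("receivedTotal", agg.receivedTotal);
            body.put("processedSuccessfullyTotal", agg.processedSuccessfullyTotal);
            body.put("systemExceptionsTotal", agg.systemExceptionsTotal);
            body.put("userExceptionsTotal", agg.userExceptionsTotal);
            body.put("avgProcessLatency", agg.avgProcessLatency);
        }
        return body;
    }

    public static void main(String[] args) {
        Aggregates agg = new Aggregates(100, 90, 4, 6, 12.5);
        System.out.println(toResponse("my-fn", 2, 2, agg, Detail.SUMMARY));
        System.out.println(toResponse("my-fn", 2, 2, agg, Detail.DIAGNOSTIC));
    }
}
```

With this shape the default payload stays the same size as the current proposal, and older clients that never send the parameter would see no change.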
   
   



##########
pip/pip-459.md:
##########
@@ -0,0 +1,329 @@
+# PIP-459: Batch Status Summary and Filtered Listing for Pulsar Functions
+
+# Background knowledge
+
+Pulsar Functions are managed by the Functions worker and exposed through the 
Pulsar Admin API and `pulsar-admin` CLI. Today, listing functions in a 
namespace returns only function names. To understand runtime health, operators 
must fetch function status for each function separately. That creates an N+1 
request pattern: one request to list function names and one additional request 
per function to fetch status. In practice, this is slow, noisy, and hard to use 
in scripts or daily operations.
+
+Function runtime status is already represented by `FunctionStatus`, which 
contains aggregate fields such as the total configured instances and the number 
of running instances. This proposal introduces a lightweight namespace-level 
summary model built from those counts. The summary is intentionally smaller 
than the full status payload: it is designed for listing and filtering, not for 
per-instance inspection.
+
+The proposal also has to account for mixed-version deployments. A new client 
may talk to an older worker that does not expose the new summary endpoint yet. 
In that case, the client must degrade safely instead of failing outright. That 
compatibility requirement affects both the public admin interface and the 
client-side implementation strategy.
+
+# Motivation
+
+`pulsar-admin functions list` currently cannot answer a basic operational 
question: which functions in this namespace are actually running. Operators 
must either inspect each function one by one or write shell loops that issue 
many admin calls. The problem becomes more visible in namespaces containing 
dozens or hundreds of functions, where a simple health check becomes expensive 
and slow.
+
+The initial feature request was to add state-aware listing to the CLI, but the 
implementation uncovered three broader issues that should be addressed together:
+
+1. The current user experience requires N+1 admin calls for namespace-level 
health inspection.
+2. A new summary endpoint must not break new clients when they talk to older 
workers during rolling upgrades.
+3. Namespace-level summary generation can become slow if the worker builds 
results strictly serially.
+
+This PIP proposes a batch status summary API for Pulsar Functions and 
integrates it into the admin client, CLI, and worker implementation in a 
backward-compatible way.
+
+# Goals
+
+## In Scope
+
+- Add a namespace-level batch status summary API for Pulsar Functions.
+- Add a lightweight public data model that returns function name, derived 
state, instance counts, and failure classification.
+- Add admin client support for the new summary API, including fallback to 
legacy workers that do not implement it.
+- Add CLI support for long-format listing and state-based filtering using the 
batch summary API.
+- Add pagination support for namespace-level function summaries.
+- Improve worker-side summary generation latency by using controlled 
parallelism.
+- Add a worker configuration knob to cap summary-query parallelism.
+- Add a worker metric to observe summary-query execution time.
+
+## Out of Scope
+
+- Changing the existing `functions list` endpoint that returns only function 
names.
+- Returning full per-instance function status from the new namespace-level 
endpoint.
+- Adding equivalent summary endpoints for sources or sinks in this PIP.
+- Adding server-side filtering by state in the REST endpoint.
+- Reworking the underlying function runtime status model or scheduler behavior.
+
+
+# High Level Design
+
+This proposal adds a new REST endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint returns a list of `FunctionStatusSummary` objects. Each object 
contains:
+
+- `name`
+- `state`: `RUNNING`, `STOPPED`, `PARTIAL`,  `UNKNOWN`
+- `numInstances`
+- `numRunning`
+- `error`
+- `errorType`
+
+The server remains a generic summary provider. It does not apply state 
filtering. The CLI consumes the summary list and performs presentation concerns 
locally, such as `--state` filtering and `--long` formatting. This separation 
keeps the endpoint reusable for other clients and prevents coupling the REST 
contract to one CLI presentation format.
+
+For compatibility, the admin client first tries the new summary endpoint. If 
the server responds with `404 Not Found` or `405 Method Not Allowed`, the 
client falls back to the legacy flow: fetch the function names, apply 
name-based pagination, and then query each function status individually to 
build summaries client-side. This allows a new client to work against older 
workers during mixed-version upgrades.
+
+On the worker side, summary generation is executed only for the requested page 
and uses a bounded thread pool. A new worker configuration, 
`functionsStatusSummaryMaxParallelism`, limits how many function status lookups 
may run concurrently for a single summary request.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+### Data model
+
+A new public model, `FunctionStatusSummary`, is added under the admin API data 
package. It intentionally returns only aggregate listing information:
+
+```java
+public class FunctionStatusSummary {
+    public enum SummaryState {
+        RUNNING,
+        STOPPED,
+        PARTIAL,
+        UNKNOWN
+    }
+
+    public enum ErrorType {
+        AUTHENTICATION_FAILED,
+        FUNCTION_NOT_FOUND,
+        NETWORK_ERROR,
+        INTERNAL_ERROR
+    }
+
+    private String name;
+    private SummaryState state;
+    private int numInstances;
+    private int numRunning;
+    private String error;
+    private ErrorType errorType;
+}
+```
+
+`state` is derived from aggregate instance counts:
+
+- `RUNNING`: `numRunning == numInstances` and `numInstances > 0`
+- `STOPPED`: `numRunning == 0` and `numInstances > 0`
+- `PARTIAL`: `0 < numRunning < numInstances`
+- `UNKNOWN`: the status query failed or the instance counts are not meaningful
+
+### Admin API interface compatibility
+
+The public `Functions` admin interface gains namespace-level summary methods:
+
+- `getFunctionsWithStatus(String tenant, String namespace)`
+- `getFunctionsWithStatusAsync(String tenant, String namespace)`
+- `getFunctionsWithStatus(String tenant, String namespace, Integer limit, 
String continuationToken)`
+- `getFunctionsWithStatusAsync(String tenant, String namespace, Integer limit, 
String continuationToken)`
+
+These methods are introduced as `default` methods. This is important because 
`Functions` is a public interface and can be implemented outside the Pulsar 
repository. Adding abstract methods would break source or binary compatibility 
for custom implementations. Using `default` methods preserves compatibility 
while still exposing the new capability.
+
+The default implementation also provides a compatibility fallback path by 
using the legacy list-plus-status flow if a server-side implementation is 
unavailable.
+
+### REST endpoint
+
+The worker exposes a new endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint accepts two optional query parameters:
+
+- `limit`: maximum number of functions to return; must be greater than `0` 
when present
+- `continuationToken`: exclusive cursor based on function name in 
lexicographical order

Review Comment:
   Hi @onceMisery,
   https://github.com/apache/pulsar/pull/25299#issuecomment-4024172832
   
   This design makes sense to me. It can reduce the number of requests needed 
to fetch function stats. The only trade-off is that each worker must first 
list the functions under the namespace and sort them lexicographically.
   
   1. Should we rename `continuationToken` to something more explicit, such as 
`startAfterFunctionName`?
   
   2. We should also clearly document the pagination behavior in the API, 
similar to how you described it in your comment.
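
For the documentation, the pagination behavior could be illustrated with something like the sketch below. It assumes the semantics stated in the PIP text (names sorted lexicographically, an exclusive cursor, `limit` greater than `0` when present) and uses `String`'s natural ordering as the sort order, which is an assumption. `NamePaginationSketch` and `pageOf` are made-up names, and `startAfter` stands in for whichever parameter name is chosen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative only: not the worker's implementation.
class NamePaginationSketch {

    // Returns one page of function names, starting strictly after the cursor.
    static List<String> pageOf(Collection<String> names, Integer limit, String startAfter) {
        if (limit != null && limit <= 0) {
            throw new IllegalArgumentException("limit must be greater than 0 when present");
        }
        // Sort by String's natural (lexicographic) order.
        NavigableSet<String> sorted = new TreeSet<>(names);
        // Exclusive cursor: the cursor name itself is never returned again,
        // and a cursor that no longer exists still yields the names after it.
        NavigableSet<String> remaining =
                (startAfter == null) ? sorted : sorted.tailSet(startAfter, false);
        List<String> result = new ArrayList<>();
        for (String name : remaining) {
            if (limit != null && result.size() >= limit) {
                break;
            }
            result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("delta", "alpha", "charlie", "bravo");
        String cursor = null;
        List<String> page;
        // Walk the namespace two names at a time, using the last name as the next cursor.
        while (!(page = pageOf(names, 2, cursor)).isEmpty()) {
            System.out.println(page);
            cursor = page.get(page.size() - 1);
        }
    }
}
```

A client would pass the last name of each page as the next cursor and stop when a page comes back empty.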



