shibd commented on code in PR #25299:
URL: https://github.com/apache/pulsar/pull/25299#discussion_r2924252941
##########
pip/pip-459.md:
##########
@@ -0,0 +1,331 @@
+# PIP-459: Batch Status Summary and Filtered Listing for Pulsar Functions
+
+# Background knowledge
+
+Pulsar Functions are managed by the Functions worker and exposed through the
Pulsar Admin API and `pulsar-admin` CLI. Today, listing functions in a
namespace returns only function names. To understand runtime health, operators
must fetch function status for each function separately. That creates an N+1
request pattern: one request to list function names and one additional request
per function to fetch status. In practice, this is slow, noisy, and hard to use
in scripts or daily operations.
+
+Function runtime status is already represented by `FunctionStatus`, which
contains aggregate fields such as the total configured instances and the number
of running instances. This proposal introduces a lightweight namespace-level
summary model built from those counts. The summary is intentionally smaller
than the full status payload: it is designed for listing and filtering, not for
per-instance inspection.
+
+The proposal also has to account for mixed-version deployments. A new client
may talk to an older worker that does not expose the new summary endpoint yet.
In that case, the client must degrade safely instead of failing outright. That
compatibility requirement affects both the public admin interface and the
client-side implementation strategy.
+
+# Motivation
+
+`pulsar-admin functions list` currently cannot answer a basic operational
question: which functions in this namespace are actually running. Operators
must either inspect each function one by one or write shell loops that issue
many admin calls. The problem becomes more visible in namespaces containing
dozens or hundreds of functions, where a simple health check becomes expensive
and slow.
+
+The initial feature request was to add state-aware listing to the CLI, but the
implementation uncovered three broader issues that should be addressed together:
+
+1. The current user experience requires N+1 admin calls for namespace-level
health inspection.
+2. A new summary endpoint must not break new clients when they talk to older
workers during rolling upgrades.
+3. Namespace-level summary generation can become slow if the worker builds
results strictly serially.
+
+This PIP proposes a batch status summary API for Pulsar Functions and
integrates it into the admin client, CLI, and worker implementation in a
backward-compatible way.
+
+# Goals
+
+## In Scope
+
+- Add a namespace-level batch status summary API for Pulsar Functions.
+- Add a lightweight public data model that returns function name, derived
state, instance counts, and failure classification.
+- Add admin client support for the new summary API, including fallback to
legacy workers that do not implement it.
+- Add CLI support for long-format listing and state-based filtering using the
batch summary API.
+- Add pagination support for namespace-level function summaries.
+- Improve worker-side summary generation latency by using controlled
parallelism.
+- Add a worker configuration knob to cap summary-query parallelism.
+- Add a worker metric to observe summary-query execution time.
+
+## Out of Scope
+
+- Changing the existing `functions list` endpoint that returns only function
names.
+- Returning full per-instance function status from the new namespace-level
endpoint.
+- Adding equivalent summary endpoints for sources or sinks in this PIP.
+- Adding server-side filtering by state in the REST endpoint.
+- Reworking the underlying function runtime status model or scheduler behavior.
+
+
+# High Level Design
+
+This proposal adds a new REST endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint returns a list of `FunctionStatusSummary` objects. Each object
contains:
+
+- `name`
+- `state`: `RUNNING`, `STOPPED`, `PARTIAL`, `UNKNOWN`
+- `numInstances`
+- `numRunning`
+- `error`
+- `errorType`
+
+The server remains a generic summary provider. It does not apply state
filtering. The CLI consumes the summary list and performs presentation concerns
locally, such as `--state` filtering and `--long` formatting. This separation
keeps the endpoint reusable for other clients and prevents coupling the REST
contract to one CLI presentation format.
+
+For compatibility, the admin client first tries the new summary endpoint. If
the server responds with `404 Not Found` or `405 Method Not Allowed`, the
client falls back to the legacy flow: fetch the function names, apply
name-based pagination, and then query each function status individually to
build summaries client-side. This allows a new client to work against older
workers during mixed-version upgrades.
+
+On the worker side, summary generation is executed only for the requested page
and uses a bounded thread pool. A new worker configuration,
`functionsStatusSummaryMaxParallelism`, limits how many function status lookups
may run concurrently for a single summary request.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+### Data model
+
+A new public model, `FunctionStatusSummary`, is added under the admin API data
package. It intentionally returns only aggregate listing information:
+
+```java
+public class FunctionStatusSummary {
+ public enum SummaryState {
+ RUNNING,
+ STOPPED,
+ PARTIAL,
+ UNKNOWN
+ }
+
+ public enum ErrorType {
+ AUTHENTICATION_FAILED,
+ FUNCTION_NOT_FOUND,
+ NETWORK_ERROR,
+ INTERNAL_ERROR
+ }
+
+ private String name;
+ private SummaryState state;
+ private int numInstances;
+ private int numRunning;
+ private String error;
+ private ErrorType errorType;
Review Comment:
Hi @onceMisery, thanks for the PIP. Have we considered including a bit more
information in this response? For example:
```
- `receivedTotal`
- `processedSuccessfullyTotal`
- `systemExceptionsTotal`
- `userExceptionsTotal`
- `avgProcessLatency`
- `userMetrics`
```
These values are already aggregated, and they could give operators a more
direct view of a function's actual health than relying only on `RUNNING` /
`STOPPED` / `PARTIAL` / `UNKNOWN`.
Another option would be to keep the default response lightweight, but add a
query parameter to control the level of detail returned by the REST API. That
would let us preserve the current "summary" use case while still supporting a
more diagnostic view when needed.
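A minimal sketch of the second option, to make the idea concrete. Everything here is an assumption for illustration: the `detail` query parameter, the `Detail` enum, and the `toEntry` helper are hypothetical names and are not part of the PIP or of Pulsar. Only two of the counters listed above are shown to keep it short.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names exist in the PIP or in Pulsar.
final class SummaryDetailSketch {

    // Hypothetical detail levels selected by a `detail` query parameter.
    enum Detail { BASIC, METRICS }

    // A missing or unrecognized value falls back to the lightweight default.
    static Detail parseDetail(String raw) {
        if (raw == null) {
            return Detail.BASIC;
        }
        try {
            return Detail.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Detail.BASIC;
        }
    }

    // Builds one response entry; the counters are added only for METRICS.
    static Map<String, Object> toEntry(String name, String state,
            int numInstances, int numRunning,
            long receivedTotal, long processedSuccessfullyTotal,
            Detail detail) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("name", name);
        entry.put("state", state);
        entry.put("numInstances", numInstances);
        entry.put("numRunning", numRunning);
        if (detail == Detail.METRICS) {
            entry.put("receivedTotal", receivedTotal);
            entry.put("processedSuccessfullyTotal", processedSuccessfullyTotal);
        }
        return entry;
    }
}
```

Whether an unrecognized value should fall back to the default, as in this sketch, or be rejected with a `400` is a design choice the PIP would need to make.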
##########
pip/pip-459.md:
##########
@@ -0,0 +1,329 @@
+# PIP-459: Batch Status Summary and Filtered Listing for Pulsar Functions
+
+# Background knowledge
+
+Pulsar Functions are managed by the Functions worker and exposed through the
Pulsar Admin API and `pulsar-admin` CLI. Today, listing functions in a
namespace returns only function names. To understand runtime health, operators
must fetch function status for each function separately. That creates an N+1
request pattern: one request to list function names and one additional request
per function to fetch status. In practice, this is slow, noisy, and hard to use
in scripts or daily operations.
+
+Function runtime status is already represented by `FunctionStatus`, which
contains aggregate fields such as the total configured instances and the number
of running instances. This proposal introduces a lightweight namespace-level
summary model built from those counts. The summary is intentionally smaller
than the full status payload: it is designed for listing and filtering, not for
per-instance inspection.
+
+The proposal also has to account for mixed-version deployments. A new client
may talk to an older worker that does not expose the new summary endpoint yet.
In that case, the client must degrade safely instead of failing outright. That
compatibility requirement affects both the public admin interface and the
client-side implementation strategy.
+
+# Motivation
+
+`pulsar-admin functions list` currently cannot answer a basic operational
question: which functions in this namespace are actually running. Operators
must either inspect each function one by one or write shell loops that issue
many admin calls. The problem becomes more visible in namespaces containing
dozens or hundreds of functions, where a simple health check becomes expensive
and slow.
+
+The initial feature request was to add state-aware listing to the CLI, but the
implementation uncovered three broader issues that should be addressed together:
+
+1. The current user experience requires N+1 admin calls for namespace-level
health inspection.
+2. A new summary endpoint must not break new clients when they talk to older
workers during rolling upgrades.
+3. Namespace-level summary generation can become slow if the worker builds
results strictly serially.
+
+This PIP proposes a batch status summary API for Pulsar Functions and
integrates it into the admin client, CLI, and worker implementation in a
backward-compatible way.
+
+# Goals
+
+## In Scope
+
+- Add a namespace-level batch status summary API for Pulsar Functions.
+- Add a lightweight public data model that returns function name, derived
state, instance counts, and failure classification.
+- Add admin client support for the new summary API, including fallback to
legacy workers that do not implement it.
+- Add CLI support for long-format listing and state-based filtering using the
batch summary API.
+- Add pagination support for namespace-level function summaries.
+- Improve worker-side summary generation latency by using controlled
parallelism.
+- Add a worker configuration knob to cap summary-query parallelism.
+- Add a worker metric to observe summary-query execution time.
+
+## Out of Scope
+
+- Changing the existing `functions list` endpoint that returns only function
names.
+- Returning full per-instance function status from the new namespace-level
endpoint.
+- Adding equivalent summary endpoints for sources or sinks in this PIP.
+- Adding server-side filtering by state in the REST endpoint.
+- Reworking the underlying function runtime status model or scheduler behavior.
+
+
+# High Level Design
+
+This proposal adds a new REST endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint returns a list of `FunctionStatusSummary` objects. Each object
contains:
+
+- `name`
+- `state`: `RUNNING`, `STOPPED`, `PARTIAL`, `UNKNOWN`
+- `numInstances`
+- `numRunning`
+- `error`
+- `errorType`
+
+The server remains a generic summary provider. It does not apply state
filtering. The CLI consumes the summary list and performs presentation concerns
locally, such as `--state` filtering and `--long` formatting. This separation
keeps the endpoint reusable for other clients and prevents coupling the REST
contract to one CLI presentation format.
+
+For compatibility, the admin client first tries the new summary endpoint. If
the server responds with `404 Not Found` or `405 Method Not Allowed`, the
client falls back to the legacy flow: fetch the function names, apply
name-based pagination, and then query each function status individually to
build summaries client-side. This allows a new client to work against older
workers during mixed-version upgrades.
+
+On the worker side, summary generation is executed only for the requested page
and uses a bounded thread pool. A new worker configuration,
`functionsStatusSummaryMaxParallelism`, limits how many function status lookups
may run concurrently for a single summary request.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+### Data model
+
+A new public model, `FunctionStatusSummary`, is added under the admin API data
package. It intentionally returns only aggregate listing information:
+
+```java
+public class FunctionStatusSummary {
+ public enum SummaryState {
+ RUNNING,
+ STOPPED,
+ PARTIAL,
+ UNKNOWN
+ }
+
+ public enum ErrorType {
+ AUTHENTICATION_FAILED,
+ FUNCTION_NOT_FOUND,
+ NETWORK_ERROR,
+ INTERNAL_ERROR
+ }
+
+ private String name;
+ private SummaryState state;
+ private int numInstances;
+ private int numRunning;
+ private String error;
+ private ErrorType errorType;
+}
+```
+
+`state` is derived from aggregate instance counts:
+
+- `RUNNING`: `numRunning == numInstances` and `numInstances > 0`
+- `STOPPED`: `numRunning == 0` and `numInstances > 0`
+- `PARTIAL`: `0 < numRunning < numInstances`
+- `UNKNOWN`: the status query failed or the instance counts are not meaningful
+
+### Admin API interface compatibility
+
+The public `Functions` admin interface gains namespace-level summary methods:
+
+- `getFunctionsWithStatus(String tenant, String namespace)`
+- `getFunctionsWithStatusAsync(String tenant, String namespace)`
+- `getFunctionsWithStatus(String tenant, String namespace, Integer limit,
String continuationToken)`
+- `getFunctionsWithStatusAsync(String tenant, String namespace, Integer limit,
String continuationToken)`
+
+These methods are introduced as `default` methods. This is important because
`Functions` is a public interface and can be implemented outside the Pulsar
repository. Adding abstract methods would break source or binary compatibility
for custom implementations. Using `default` methods preserves compatibility
while still exposing the new capability.
+
+The default implementation also provides a compatibility fallback path by
using the legacy list-plus-status flow if a server-side implementation is
unavailable.
+
+### REST endpoint
+
+The worker exposes a new endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint accepts two optional query parameters:
+
+- `limit`: maximum number of functions to return; must be greater than `0`
when present
+- `continuationToken`: exclusive cursor based on function name in
lexicographical order
Review Comment:
Hi @onceMisery,
https://github.com/apache/pulsar/pull/25299#issuecomment-4024172832
This design makes sense to me. It can reduce the number of requests needed
to fetch function stats. The only trade-off is that each worker needs to first
list the functions under the namespace and sort them lexicographically.
 1. Should we rename this parameter to something more explicit, such as
 `startAfterFunctionName`?
2. We should also clearly document the pagination behavior in the API,
similar to how you described it in your comment.
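As a starting point for that documentation, here is a rough sketch of how I read the pagination semantics: sort the names lexicographically, skip everything up to and including the cursor, then return at most `limit` names. The class and method names are hypothetical, and the handling of a cursor that does not match any existing function (it still acts as a lower bound) is my assumption, not something the PIP states.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch of exclusive, name-based cursor pagination.
final class NamePaginationSketch {

    // Returns one page of function names in lexicographical order.
    // startAfter is exclusive; null means "start from the beginning".
    // limit is optional; when present it must be greater than 0.
    static List<String> page(Collection<String> names, String startAfter, Integer limit) {
        if (limit != null && limit <= 0) {
            throw new IllegalArgumentException("limit must be greater than 0");
        }
        // TreeSet orders strings by their natural (lexicographical) order.
        TreeSet<String> sorted = new TreeSet<>(names);
        // tailSet(cursor, false) keeps only names strictly greater than the cursor.
        NavigableSet<String> remaining =
                startAfter == null ? sorted : sorted.tailSet(startAfter, false);
        List<String> result = new ArrayList<>();
        for (String name : remaining) {
            if (limit != null && result.size() >= limit) {
                break;
            }
            result.add(name);
        }
        return result;
    }
}
```

With names `a`, `b`, `c`, `d` and `limit=2`, the first call returns `a`, `b`; passing `b` as the cursor returns `c`, `d`; passing `d` returns an empty page.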