Hi Tim,

Yes, I think this feature can be implemented relatively quickly. If it is blocking you at the moment, I can prepare a branch for you to experiment with in the next few days.
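In the meantime, to make the idea concrete, here is a rough sketch of what a batch-aware handler could look like in the Python SDK. This is only an illustration: the `bind_batch` decorator and its signature are hypothetical and not part of the current SDK, and the model below is a trivial stand-in.

    import typing

    import pandas as pd
    from statefun import StatefulFunctions, Message

    functions = StatefulFunctions()

    def score(df: pd.DataFrame) -> pd.Series:
        # Trivial stand-in for the data scientist's actual pandas model.
        return df["payload"].str.len()

    # Hypothetical decorator: today the SDK only exposes @functions.bind,
    # which delivers one message at a time. A batch-aware variant could
    # hand the function the whole batch the runtime already accumulated.
    @functions.bind_batch(typename="example/ml")
    def ml(context, messages: typing.List[Message]):
        # Build one dataframe for the whole batch instead of one per
        # event, amortizing the constant per-dataframe overhead.
        # (as_string() assumes string payloads, purely for illustration)
        df = pd.DataFrame({"payload": [m.as_string() for m in messages]})
        predictions = score(df)
        # ... route predictions to an egress, update state, etc.

The exact shape of the decorator is of course up for discussion; the point is only that the function receives the already-accumulated batch in a single call.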
Regarding the OpenTracing integration: I think the community can benefit a lot from this, and contributions are definitely welcome! @Konstantin Knauf <kna...@apache.org> would you like to understand Tim's use case with OpenTracing in more depth? To make that discussion concrete, I have appended a rough sketch of B3 header handling on the function side below the quoted thread.

Thanks,
Igal.

On Tue, Apr 20, 2021 at 8:10 PM Timothy Bess <tdbga...@gmail.com> wrote:

> Hi Igal,
>
> Yes! That's exactly what I was thinking. The batching will naturally
> happen as the model applies backpressure. We're using pandas, and it's
> pretty costly to create a dataframe and everything else needed to process
> a single event. Internally the SDK has access to the batch and is calling
> my function, which creates a dataframe for each individual event. This
> causes a ton of overhead, since we basically get destroyed by the constant
> factors around creating and operating on dataframes.
>
> Knowing how the SDK works, it seems like it'd be easy to do something
> like your example and maybe have a different decorator for "batch
> functions" where the SDK just passes in everything at once.
>
> Also, just out of curiosity: are there plans to build out more
> introspection into StateFun's Flink state? I was thinking it would be
> super useful to add either queryable state or a control topic that
> StateFun listens to, which would allow me to send events that introspect
> or modify Flink state.
>
> For example:
>
> // control topic request
> {"type": "FunctionIdsReq", "namespace": "foo", "type": "bar"}
> // response
> {"type": "FunctionIdsResp", "ids": ["1", "2", "3", ...]}
>
> Or:
>
> {"type": "SetState", "namespace": "foo", "type": "bar", "id": "1",
>  "value": "base64bytes"}
> {"type": "DeleteState", "namespace": "foo", "type": "bar", "id": "1"}
>
> Also, having an OpenTracing integration where StateFun passes B3 headers
> with each request, so we can trace a message's route through StateFun,
> would be _super_ useful. We'd literally be able to see the entire path of
> an event from ingress to egress and the time spent in each function. Not
> sure if there are any plans around that, but since we're live with a
> StateFun project now, it's possible we could contribute some of it if you
> guys are open to it.
>
> Thanks,
>
> Tim
>
> On Tue, Apr 20, 2021 at 9:25 AM Igal Shilman <i...@ververica.com> wrote:
>
>> Hi Tim!
>>
>> Indeed, the StateFun SDK / StateFun runtime has an internal concept of
>> batching that kicks in in the presence of a slow/congested remote
>> function. Keep in mind that under normal circumstances batching does
>> not happen (effectively, a batch of size 1 will be sent). [1]
>> This batch is not currently exposed via the SDKs (both Java and
>> Python), as it is an implementation detail (see [2]).
>>
>> The way I understand your message (please correct me if I'm wrong) is
>> that evaluating the ML model is costly, and it would benefit from some
>> sort of batching (as pandas does, I assume?) instead of being applied
>> to every event individually. If this is the case, perhaps exposing this
>> batch could be a useful feature to add.
>>
>> For example:
>>
>>     @functions.bind_tim(..)
>>     def ml(context, messages: typing.List[Message]):
>>         ...
>>
>> Let me know what you think,
>> Igal.
>>
>> [1]
>> https://github.com/apache/flink-statefun/blob/master/statefun-sdk-protos/src/main/protobuf/sdk/request-reply.proto#L80
>> [2]
>> https://github.com/apache/flink-statefun/blob/master/statefun-sdk-python/statefun/request_reply_v3.py#L219
>>
>> On Fri, Apr 16, 2021 at 11:48 PM Timothy Bess <tdbga...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> Is there a good way to access the batch of leads that StateFun sends
>>> to the Python SDK, rather than processing events one by one? We're
>>> trying to run our data scientist's machine learning model through the
>>> SDK, but the code is very slow when we process single events, and we
>>> don't get many of the benefits of pandas, etc.
>>>
>>> Thanks,
>>>
>>> Tim
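P.S. As promised above, here is a minimal sketch of what consuming B3 headers could look like on the function side, assuming the StateFun runtime were extended to forward X-B3-* headers with each request/reply invocation. Neither that forwarding nor any of this wiring exists in StateFun today. The Flask serving setup mirrors the SDK's documented pattern; in practice you would replace the default no-op global tracer with a B3-capable one (e.g. a Jaeger client configured for B3 propagation).

    import opentracing
    from opentracing.propagation import Format
    from flask import Flask, request, make_response
    from statefun import StatefulFunctions, RequestReplyHandler

    functions = StatefulFunctions()
    # ... functions registered via @functions.bind(...) go here ...
    handler = RequestReplyHandler(functions)

    # No-op by default; a real deployment would install a B3-capable
    # tracer via opentracing.set_global_tracer(...).
    tracer = opentracing.global_tracer()

    app = Flask(__name__)

    @app.route("/statefun", methods=["POST"])
    def invoke():
        # Join the upstream trace via the (hypothetically forwarded)
        # X-B3-* headers on the incoming request.
        span_ctx = tracer.extract(Format.HTTP_HEADERS, dict(request.headers))
        with tracer.start_active_span("statefun-invocation", child_of=span_ctx):
            response_data = handler.handle_sync(request.data)
        response = make_response(response_data)
        response.headers.set("Content-Type", "application/octet-stream")
        return response

With something like this in place, each function invocation would appear as a child span of the ingress-to-egress trace, which is exactly the per-function timing Tim describes.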