Thanks for the feedback here and in the comments. I made a few updates and added another alternative that discusses making the environment initializer a resource hint. Looking forward to continuing the discussion.

On Tue, Dec 17, 2024 at 7:36 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> On Tue, Dec 17, 2024 at 9:31 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > So is it just a documentation / examples / getting the knowledge out there problem?
>
> Possibly.
>
> > Incidentally I'm not a fan of modules that "do" things when you import them, nor am I a fan of the "try it as a module then a class" sort of fallback stuff vs just choosing the type you expect and sticking with it, giving very clear error messages. Also "ImportError" is going to be misinterpreted 99% of the time. Having something that calls a named function seems like it'll be a better experience all around.
>
> This was initially introduced to register things like filesystems, as Python doesn't have the service provider interface stuff that Java has, so we need to "run some code on startup" to register it. I agree a named function would be better, just thinking it might be preferable to avoid two distinct ways of doing almost the same thing.

This option wouldn't work for single-file pipelines, but I checked, and it can be used for pipelines that are structured as a package. It is a bit awkward to use, since the initialization has to be defined in the top-level module code.

> > Kenn
> >
> > On Fri, Dec 13, 2024 at 4:38 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> >>
> >> We already have
> >> https://github.com/apache/beam/blob/release-2.40.0/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L141
> >> that allows arbitrary code to be imported and executed on worker startup. (Perhaps we could generalize to let it also reference a function to be called rather than just a module.)
> >>
> >> On Fri, Dec 13, 2024 at 12:52 PM Danny McCormick via dev <dev@beam.apache.org> wrote:
> >> >
> >> > Thanks - I actually was thinking about this today and was annoyed that we don't have this ability. I'm +1 to the proposed approach.
> >> >
> >> > I dropped a comment, but also upleveling in case there is broader interest; it would be nice to have a similar capability for expansion service containers as well.
> >> >
> >> > On Fri, Dec 13, 2024 at 3:23 PM Valentyn Tymofieiev via dev <dev@beam.apache.org> wrote:
> >> >>
> >> >> Hi everyone,
> >> >>
> >> >> Currently we don't have a straightforward and documented way to do simple initialization steps on every Beam Python SDK worker before data processing starts. It is a rough edge that I've encountered on several occasions myself and in conversations with Beam users
> >> >>
> >> >> I put together some thoughts on how we could provide that capability in https://s.apache.org/python_sdk_worker_initialization . Looking forward to your ideas and other feedback on this topic.
> >> >>
> >> >> Thanks,
> >> >> Valentyn
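
P.S. To make the "call a named function rather than import a module for its side effects" idea from the thread concrete, here is a minimal, standard-library-only sketch. It is an illustration under stated assumptions, not existing Beam code: the helper name `run_initializer` and the `module:function` spec format are hypothetical, and nothing here reflects how the SDK worker is actually wired.

```python
import importlib
import sys
import types


def run_initializer(spec):
    """Resolve a hypothetical 'package.module:function' spec and call it once.

    Nothing is expected to happen as an import side effect; the user names
    the function explicitly, and each failure mode gets its own message
    instead of surfacing as a bare ImportError.
    """
    module_name, sep, func_name = spec.partition(":")
    if not sep or not module_name or not func_name:
        raise ValueError(f"Expected 'module:function', got {spec!r}")
    try:
        module = importlib.import_module(module_name)
    except ImportError as e:
        raise RuntimeError(
            f"Could not import initializer module {module_name!r}: {e}"
        ) from e
    func = getattr(module, func_name, None)
    if not callable(func):
        raise RuntimeError(
            f"Module {module_name!r} has no callable named {func_name!r}"
        )
    return func()


if __name__ == "__main__":
    # Stand-in for a module shipped inside a packaged pipeline
    # (hypothetical name, registered in sys.modules for the demo).
    demo = types.ModuleType("my_pipeline_init")
    demo.calls = []
    demo.setup_worker = lambda: demo.calls.append("initialized")
    sys.modules["my_pipeline_init"] = demo

    run_initializer("my_pipeline_init:setup_worker")
    print(demo.calls)
```

As with the import-based option, the module still has to be importable on the worker, so this shape does not by itself help single-file pipelines.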