Great read! Thanks for sharing. Highly recommend.

On Fri, Jan 31, 2025 at 11:57 AM Danny McCormick via dev <dev@beam.apache.org> wrote:
> Late last year, I added support for vLLM in RunInference. I ended up being
> able to go from prototyping to checked in code quickly enough that I didn't
> put together/share a full design, but in retrospect I thought it might be
> helpful to have a record of what I did since others might want to do
> similar things with other serving systems (e.g. NIM, Triton, etc...). In
> the process of writing that document, I was adding in a lot of context
> about how memory management works in Beam ML and decided to include that in
> the document as well.
>
> The end result is
> https://docs.google.com/document/d/1UB4umrtnp1Eg45fiUB3iLS7kPK3BE6pcf0YRDkA289Q/edit?usp=sharing
> with the following goals:
>
> - Describe how model sharing works in Beam today
> - Describe how we used those primitives to build out the vLLM Model Handler
> - Describe how others can add similar model handlers
>
> I hope this helps someone :) If you're interested, please give it a read,
> and please let me know if you have any questions/feedback/ideas on how we
> can keep improving our memory management story. I'll be adding this to
> https://cwiki.apache.org/confluence/display/BEAM/Design+Documents to
> serve as a general reference on this topic moving forward.
>
> Thanks,
> Danny
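
For anyone who wants a quick picture of what this looks like in a pipeline before opening the doc, here is a rough sketch of RunInference with the vLLM completions handler. This is my own minimal example, not something taken from Danny's document: the VLLMCompletionsModelHandler class and its model_name argument are my reading of apache_beam.ml.inference.vllm_inference, and the model below is just an arbitrary small model for illustration, so check the doc for the exact API.

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler

# Handler that stands up a local vLLM server for the given model and routes
# RunInference calls to it. The model_name argument is assumed here; see the
# design doc / vllm_inference module for the exact signature and options.
model_handler = VLLMCompletionsModelHandler(model_name="facebook/opt-125m")

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Prompts" >> beam.Create([
            "What is Apache Beam?",
            "Summarize the RunInference API in one sentence.",
        ])
        # Each element is a prompt string; the output is a PredictionResult
        # pairing the prompt with the completion returned by vLLM.
        | "Generate" >> RunInference(model_handler)
        | "Print" >> beam.Map(print)
    )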