Great read! Thanks for sharing. Highly recommend.

On Fri, Jan 31, 2025 at 11:57 AM Danny McCormick via dev <dev@beam.apache.org> wrote:
> Late last year, I added support for vLLM in RunInference. I ended up being
> able to go from prototyping to checked in code quickly enough that I didn't
> put together/share a full design, but in retrospect I thought it might be
> helpful to have a record of what I did since others might want to do
> similar things with other serving systems (e.g. NIM, Triton, etc...). In
> the process of writing that document, I was adding in a lot of context
> about how memory management works in Beam ML and decided to include that in
> the document as well.
>
> The end result is
> https://docs.google.com/document/d/1UB4umrtnp1Eg45fiUB3iLS7kPK3BE6pcf0YRDkA289Q/edit?usp=sharing
> with the following goals:
>
> - Describe how model sharing works in Beam today
> - Describe how we used those primitives to build out the vLLM Model Handler
> - Describe how others can add similar model handlers
>
> I hope this helps someone :) If you're interested, please give it a read,
> and please let me know if you have any questions/feedback/ideas on how we
> can keep improving our memory management story. I'll be adding this to
> https://cwiki.apache.org/confluence/display/BEAM/Design+Documents to
> serve as a general reference on this topic moving forward.
>
> Thanks,
> Danny
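
For anyone who wants a quick picture of what this looks like in a pipeline before opening the doc, here is a rough sketch of RunInference with the vLLM completions handler. This is my own minimal example, not something taken from Danny's document: the VLLMCompletionsModelHandler class and its model_name argument are my reading of apache_beam.ml.inference.vllm_inference, and the model below is just an arbitrary small model for illustration, so check the doc for the exact API.

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler

# Handler that stands up a local vLLM server for the given model and routes
# RunInference calls to it. The model_name argument is assumed here; see the
# design doc / vllm_inference module for the exact signature and options.
model_handler = VLLMCompletionsModelHandler(model_name="facebook/opt-125m")

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Prompts" >> beam.Create([
            "What is Apache Beam?",
            "Summarize the RunInference API in one sentence.",
        ])
        # Each element is a prompt string; the output is a PredictionResult
        # pairing the prompt with the completion returned by vLLM.
        | "Generate" >> RunInference(model_handler)
        | "Print" >> beam.Map(print)
    )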