Hey everyone,

Right now, using RunInference with large models or on GPUs has several performance gaps. I put together a document focusing on one of them: when running inference with large models, pipelines often OOM because several copies of the model get loaded at once (one per worker process). My document explores using the multi_process_shared.py utility to load models, provides a couple of benchmarks, and concludes that we can recommend the utility for pipelines that load a large model for inference, but not for pipelines that normally don't have memory issues.
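For quick context, here is a rough sketch (not taken from the doc) of the kind of usage I mean: loading the model through MultiProcessShared in a DoFn so every process on a machine shares one copy. I'm assuming the MultiProcessShared(constructor, tag=...) / acquire() / release() API from apache_beam.utils.multi_process_shared; load_my_model, _DummyModel, and SharedModelDoFn are made-up names for illustration.

  import apache_beam as beam
  from apache_beam.utils.multi_process_shared import MultiProcessShared


  class _DummyModel:
    """Stand-in for a real (large) model object."""
    def predict(self, x):
      return x


  def load_my_model():
    # Hypothetical loader; in a real pipeline this would be the expensive
    # load of a multi-GB model (torch.load, from_pretrained, etc.).
    return _DummyModel()


  class SharedModelDoFn(beam.DoFn):
    """Runs inference against a single per-machine copy of the model."""

    def setup(self):
      # Every process on the same machine that asks for the same tag gets a
      # proxy to one underlying model object instead of loading its own copy.
      self._shared_handle = MultiProcessShared(
          load_my_model, tag='my_large_model')
      self._model = self._shared_handle.acquire()

    def process(self, element):
      yield self._model.predict(element)

    def teardown(self):
      self._shared_handle.release(self._model)

The benchmarks and the actual recommendation are in the doc below.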
Please take a look and let me know if you have any questions or concerns!

Doc: https://docs.google.com/document/d/10xAIxu3W3wonFaLWXqneZ3CmOLaS1Z9dvn3eSDynDqE/edit?usp=sharing

Thanks,
Danny