Hey everyone,

Right now, using RunInference with large models and on GPUs has several
performance gaps. I put together a document focusing on one of them: when
running inference with large models, pipelines often hit out-of-memory errors
because multiple copies of the model are loaded at once. The document explores
using the multi_process_shared.py utility to load models, provides a couple of
benchmarks, and concludes that we can recommend the utility for pipelines that
load a large model for inference, but not for pipelines that don't normally
have memory issues.
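For anyone who hasn't looked at the utility before, here's a rough sketch of
the sharing pattern it enables (illustrative only: _FakeModel and the tag are
placeholder names, and the doc covers the actual RunInference integration):

  # Sketch of apache_beam.utils.multi_process_shared usage; the model class
  # below is a stand-in, not part of the proposal itself.
  from apache_beam.utils.multi_process_shared import MultiProcessShared

  class _FakeModel:
    """Stand-in for a large model that is expensive to load."""
    def predict(self, x):
      return x

  # Every process on a worker that acquires the same tag gets a handle to one
  # shared instance, instead of each process loading its own copy of the model.
  shared_model = MultiProcessShared(_FakeModel, tag='my_large_model')
  model = shared_model.acquire()
  print(model.predict(42))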

Please take a look and let me know if you have any questions or concerns!

Doc -
https://docs.google.com/document/d/10xAIxu3W3wonFaLWXqneZ3CmOLaS1Z9dvn3eSDynDqE/edit?usp=sharing

Thanks.
Danny
