Hey everyone,

Right now, using RunInference with large models or on GPUs has several performance gaps. I put together a document focusing on one of them: when running inference with large models, pipelines often OOM because several copies of the model get loaded at once (one per worker process). My document explores using the multi_process_shared.py utility to load models, provides a couple of benchmarks, and concludes that we can recommend the utility for pipelines that load a large model for inference, but not for pipelines that normally don't have memory issues.
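For quick context, here is a rough sketch (not taken from the doc) of the kind of usage I mean: loading the model through MultiProcessShared in a DoFn so every process on a machine shares one copy. I'm assuming the MultiProcessShared(constructor, tag=...) / acquire() / release() API from apache_beam.utils.multi_process_shared; load_my_model, _DummyModel, and SharedModelDoFn are made-up names for illustration.

  import apache_beam as beam
  from apache_beam.utils.multi_process_shared import MultiProcessShared


  class _DummyModel:
    """Stand-in for a real (large) model object."""
    def predict(self, x):
      return x


  def load_my_model():
    # Hypothetical loader; in a real pipeline this would be the expensive
    # load of a multi-GB model (torch.load, from_pretrained, etc.).
    return _DummyModel()


  class SharedModelDoFn(beam.DoFn):
    """Runs inference against a single per-machine copy of the model."""

    def setup(self):
      # Every process on the same machine that asks for the same tag gets a
      # proxy to one underlying model object instead of loading its own copy.
      self._shared_handle = MultiProcessShared(
          load_my_model, tag='my_large_model')
      self._model = self._shared_handle.acquire()

    def process(self, element):
      yield self._model.predict(element)

    def teardown(self):
      self._shared_handle.release(self._model)

The benchmarks and the actual recommendation are in the doc below.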
Please take a look and let me know if you have any questions or concerns!

Doc: https://docs.google.com/document/d/10xAIxu3W3wonFaLWXqneZ3CmOLaS1Z9dvn3eSDynDqE/edit?usp=sharing

Thanks,
Danny