https://bugs.kde.org/show_bug.cgi?id=497938
--- Comment #8 from chair-tweet-de...@duck.com ---
I've tried some experiments on my side; here are the results. They are unfortunately inconclusive, but they may help with the next steps. For your information, I'm not very experienced in machine learning, so I'm probably missing some things.

What I hadn't anticipated: before the models can be used, the inputs have to be preprocessed.

For the image part, the CLIP preprocessing seems to be the same no matter which model is used: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPImageProcessor. Handling images is fairly simple; the difficulty doesn't seem to be here (a C++ sketch of the steps is at the end of this comment).

For the text part, there is tokenization. This is more complicated and requires additional libraries and configuration files. A large number of the models available on Hugging Face use the tokenizer of https://huggingface.co/openai/clip-vit-large-patch14/tree/main, which requires vocab.json and merges.txt. The M-CLIP models (https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus/tree/main, multilingual) use the SentencePiece tokenizer instead and provide the sentencepiece.bpe.model file. In both cases, padding and truncation need to be handled by us.

Model management: I've tried to have one model for the image and another for the text so they can be used separately, and some models (M-CLIP) already ship that way. This project (https://github.com/Lednik7/CLIP-ONNX) seems to split the model when the base one combines both, and this ONNX function seems to be able to extract part of a model: https://onnx.ai/onnx/api/utils.html#extract-model. I haven't tested either of these methods, because I found this model zoo, which offers several pre-split models: https://github.com/jina-ai/clip-as-service/blob/main/server/clip_server/model/clip_onnx.py (a sketch of loading the two encoders as separate sessions is at the end of this comment).

For the preprocessing, I wanted to include it in the model, to simply have one ONNX model that handles everything and to avoid adding libraries. I tried https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/Example%20usage%20of%20the%20PrePostProcessor.md, which seems to offer what's necessary for image and text preprocessing and only requires registering a custom-operator library when the ONNX session is created (also sketched at the end of this comment). However, its tokenizers don't seem to handle padding and truncation, and I haven't managed to get a functional model.

For tokenization in C++, https://github.com/google/sentencepiece looks interesting because it would only require having the .bpe.model file (plus handling padding and truncation ourselves); that file is provided by some models, but not all. The project explains how to train a BPE model, but I don't know how to convert an existing vocab.json + merges.txt pair into such a model. A sketch of using it is at the end of this comment.

For the choice of model, the M-CLIP models seem interesting because they use the SentencePiece BPE tokenizer, which seems simpler to use; they are multilingual and have good performance. However, they are large, and I don't know whether they would work on a "lightweight" computer or what the inference speed would be. I don't have this information for the other models either.
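
Since the image side looks like the simple part, here is a minimal sketch of the CLIP preprocessing chain (resize the shortest side, center crop, rescale to [0, 1], per-channel normalization), assuming OpenCV is available. The 224 input size and the mean/std constants below are the defaults documented for CLIPImageProcessor; other checkpoints can use different values, so they should really be read from the model's preprocessor config.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Turns a BGR image into the flat NCHW float tensor the visual encoder expects.
std::vector<float> preprocessForClip(const cv::Mat& bgrImage, int size = 224)
{
    // Resize so that the shortest side equals `size` (CLIP uses bicubic).
    const double scale = static_cast<double>(size) /
                         std::min(bgrImage.cols, bgrImage.rows);
    cv::Mat resized;
    cv::resize(bgrImage, resized, cv::Size(), scale, scale, cv::INTER_CUBIC);

    // Center crop to size x size.
    const cv::Rect roi((resized.cols - size) / 2, (resized.rows - size) / 2,
                       size, size);
    cv::Mat cropped = resized(roi);

    // BGR -> RGB, then rescale to [0, 1].
    cv::Mat rgb;
    cv::cvtColor(cropped, rgb, cv::COLOR_BGR2RGB);
    rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);

    // Per-channel normalization with the CLIP defaults, flattened to NCHW.
    const float mean[3] = {0.48145466f, 0.4578275f, 0.40821073f};
    const float stdd[3] = {0.26862954f, 0.26130258f, 0.27577711f};
    std::vector<float> tensor(3 * size * size);
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < size; ++y)
            for (int x = 0; x < size; ++x)
                tensor[(c * size + y) * size + x] =
                    (rgb.at<cv::Vec3f>(y, x)[c] - mean[c]) / stdd[c];
    return tensor;
}

The result can then be wrapped with Ort::Value::CreateTensor and fed to the image model.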
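
For the split approach, a minimal sketch of what I have in mind with two separate sessions, using the ONNX Runtime C++ API (the file names are just placeholders for whichever pre-split visual/textual models get downloaded):

#include <onnxruntime_cxx_api.h>

int main()
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "clip-search");
    Ort::SessionOptions options;

    // The visual encoder is only needed while indexing images and the textual
    // encoder only when a query is entered, so the two can be loaded and
    // released independently.
    Ort::Session imageSession(env, ORT_TSTR("clip-visual.onnx"), options);
    Ort::Session textSession(env, ORT_TSTR("clip-textual.onnx"), options);

    // imageSession.Run(...) takes the preprocessed pixel tensor,
    // textSession.Run(...) takes the padded token ids.
    return 0;
}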
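
For the bundled-preprocessing approach, the only application-side change should be registering the onnxruntime-extensions custom-operator library when the session is created. A minimal sketch, assuming a recent ONNX Runtime C++ API and the prebuilt libortextensions shared library (both file names are placeholders; this is the part where I haven't managed to produce a working model yet):

#include <onnxruntime_cxx_api.h>

int main()
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "clip-e2e");
    Ort::SessionOptions options;

    // Make the extensions' custom operators available before loading a model
    // whose graph contains the pre/post-processing nodes.
    options.RegisterCustomOpsLibrary(ORT_TSTR("libortextensions.so"));

    Ort::Session session(env, ORT_TSTR("clip-with-preprocessing.onnx"), options);
    // session.Run(...) would then take raw image bytes / the query string.
    return 0;
}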
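
For the SentencePiece route, a minimal sketch of tokenizing with google/sentencepiece plus the padding and truncation we have to do ourselves. The sequence length and the special token ids (bosId, eosId, padId) below are placeholders: they depend on the checkpoint, and the raw SentencePiece ids may not match the ids the text encoder was trained with (Hugging Face tokenizers sometimes remap them), so this only illustrates the mechanics.

#include <sentencepiece_processor.h>
#include <cstdint>
#include <string>
#include <vector>

// Encodes the text and produces a fixed-length id sequence of maxLength.
std::vector<int64_t> tokenizeForClip(sentencepiece::SentencePieceProcessor& sp,
                                     const std::string& text, int maxLength,
                                     int bosId, int eosId, int padId)
{
    std::vector<int> pieces = sp.EncodeAsIds(text);

    // Truncate so that BOS + pieces + EOS still fits into maxLength.
    if (static_cast<int>(pieces.size()) > maxLength - 2)
        pieces.resize(maxLength - 2);

    std::vector<int64_t> ids;
    ids.reserve(maxLength);
    ids.push_back(bosId);
    ids.insert(ids.end(), pieces.begin(), pieces.end());
    ids.push_back(eosId);

    // Pad up to the fixed length the text encoder expects.
    while (static_cast<int>(ids.size()) < maxLength)
        ids.push_back(padId);
    return ids;
}

Loading the tokenizer model itself is just:

    sentencepiece::SentencePieceProcessor sp;
    const auto status = sp.Load("sentencepiece.bpe.model");
    // check status.ok() before using sp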