https://bugs.kde.org/show_bug.cgi?id=497938

--- Comment #8 from chair-tweet-de...@duck.com ---
I've run some experiments on my side; here are the results. They are
unfortunately inconclusive, but they may help with the next steps.

For your information, I’m not very experienced in machine learning, so I’m
probably missing some things.

What I hadn’t anticipated:

Before the models can be used, the inputs have to be preprocessed.

For the image part, no matter which model is used, the CLIP preprocessing seems
to be the same:
https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPImageProcessor
Handling images is fairly simple; the difficulty doesn't seem to be here.
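
To give an idea, here is a minimal sketch of that preprocessing in Python with
PIL and NumPy; the 224x224 size and the mean/std values are the defaults of the
OpenAI CLIP checkpoints, other models may use different values:

    import numpy as np
    from PIL import Image

    # CLIP defaults: resize the shortest side to 224 (bicubic), center-crop
    # 224x224, scale to [0, 1] and normalize with the CLIP mean/std.
    MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
    STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

    def preprocess(path, size=224):
        img = Image.open(path).convert("RGB")
        # resize so that the shortest side == size
        w, h = img.size
        scale = size / min(w, h)
        img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
        # center crop
        w, h = img.size
        left, top = (w - size) // 2, (h - size) // 2
        img = img.crop((left, top, left + size, top + size))
        # HWC uint8 -> NCHW float32, normalized
        x = np.asarray(img, dtype=np.float32) / 255.0
        x = (x - MEAN) / STD
        return x.transpose(2, 0, 1)[np.newaxis, ...]  # (1, 3, 224, 224)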

For the text part, there is tokenization.
This is more complicated and requires additional libraries and configuration
files.
A large number of models available on Hugging Face use the tokenizer of
https://huggingface.co/openai/clip-vit-large-patch14/tree/main, which requires
vocab.json and merges.txt.

The M-CLIP models
(https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus/tree/main,
multilingual) propose using the SentencePiece tokenizer and provide the
sentencepiece.bpe.model.

In both cases, padding and truncation need to be managed.
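
As an illustration, with the Hugging Face tokenizer (the vocab.json +
merges.txt case) this only takes a couple of options in Python; 77 is the
standard CLIP context length:

    from transformers import CLIPTokenizer

    # loads vocab.json and merges.txt from the Hub or a local directory
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

    tokens = tokenizer(
        ["a photo of a cat", "a photo of a dog on a beach"],
        padding="max_length",  # pad every entry to the context length
        truncation=True,       # cut longer sentences
        max_length=77,         # standard CLIP context length
        return_tensors="np",
    )
    # tokens["input_ids"] and tokens["attention_mask"] have shape (2, 77)
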
Model management:
I’ve tried to have one model for the image and another for the text to allow
them to be used separately, and some models (M-CLIP) already use this approach.

This project (https://github.com/Lednik7/CLIP-ONNX) seems to split the model
when the base one combines both parts.
This ONNX function seems to be able to extract part of a model:
https://onnx.ai/onnx/api/utils.html#extract-model.
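
As a sketch, splitting a combined CLIP ONNX file with that function would look
roughly like this; the tensor names ("pixel_values", "image_embeds", ...) are
only placeholders, the real ones have to be read from the actual model (e.g.
with Netron):

    import onnx.utils

    # image branch: keep only the graph between the listed inputs/outputs
    onnx.utils.extract_model(
        "clip.onnx", "clip_visual.onnx",
        input_names=["pixel_values"],
        output_names=["image_embeds"],
    )

    # text branch
    onnx.utils.extract_model(
        "clip.onnx", "clip_textual.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["text_embeds"],
    )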

I haven’t tested either of these methods because I found this ModelZoo, which
offers several pre-split models:
https://github.com/jina-ai/clip-as-service/blob/main/server/clip_server/model/clip_onnx.py.
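
With two separate files, usage should look roughly like this (the input names
are assumptions and depend on how each model was exported; pixel_values and
input_ids would come from the preprocessing steps above):

    import onnxruntime as ort

    # one session per model, so a text query does not need the image encoder
    visual = ort.InferenceSession("clip_visual.onnx")
    textual = ort.InferenceSession("clip_textual.onnx")

    image_embeds = visual.run(None, {"pixel_values": pixel_values})[0]
    text_embeds = textual.run(None, {"input_ids": input_ids})[0]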

For the preprocessing, I wanted to include these steps directly in the model,
so as to end up with a single ONNX model that handles everything and avoid
having to add extra libraries.

I tried with
https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/Example%20usage%20of%20the%20PrePostProcessor.md,
which seems to offer what’s necessary for image and text preprocessing and only
requires including a library for custom operators when creating the ONNX
session.
The tokenizers don’t seem to handle padding and truncation.
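
The custom-operator part at least looks small; in Python, loading a model that
uses onnxruntime-extensions operators only needs this (the C++ API seems to
have an equivalent call on the session options):

    import onnxruntime as ort
    from onnxruntime_extensions import get_library_path

    so = ort.SessionOptions()
    # make the custom operators (tokenizers, image decoding, ...) from
    # onnxruntime-extensions available to the session
    so.register_custom_ops_library(get_library_path())

    session = ort.InferenceSession("clip_with_preprocessing.onnx", so)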

I haven’t managed to get a functional model.

For tokenization in C++, https://github.com/google/sentencepiece seems
interesting because it would only require the .bpe.model file (plus handling
padding and truncation), which some models provide, but not all.
The project explains how to train a BPE model from scratch, but I don't know
how to convert an existing vocab.json + merges.txt pair into such a model.
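
For what it's worth, here is what the flow looks like with the Python bindings;
the C++ library seems to expose the same Load / EncodeAsIds calls, so the steps
should be identical. For the M-CLIP / XLM-Roberta models there may be extra
special tokens or id offsets that I haven't checked:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("sentencepiece.bpe.model")

    ids = sp.EncodeAsIds("a photo of a cat")

    # padding/truncation still has to be done by hand, e.g. to a fixed length
    max_len = 77
    pad = sp.pad_id() if sp.pad_id() >= 0 else 0  # pad id depends on the model
    ids = ids[:max_len] + [pad] * (max_len - len(ids))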

Regarding the choice of model, the M-CLIP models seem interesting because they
use a SentencePiece BPE tokenizer, which seems simpler to use. They are
multilingual and have good performance. However, they are large, and I'm not
sure whether they would work on a "lightweight" computer, or what the inference
speed would be. I also don't have this information for the other models.
