Quoting Stefano Zacchiroli (2025-02-08 14:57:18)
> Concrete examples
> -----------------
>
> Let's take two simple examples that, size-wise, could fit in the Debian
> archive together with training pipelines and datasets.
>
> First, let's take a "small" image classification model that one day
> might be included in Digikam or similar free software. Let's say the
> trained model is ~1 GiB (like Moondream [1] today) and that the training
> dataset is ~10 GiB (I've no idea if the Moondream training dataset is
> open data and I'm probably being very conservative with its size here;
> just assume it is correct for now).
>
> [1]: https://ollama.com/library/moondream:v2/blobs/e554c6b9de01
>
> For a second, even smaller example, let's consider gnubg (GNU
> backgammon), which today contains, in the Debian archive, a trained
> neural network [2] of less than 1 MiB. Its training data is *not* in
> the archive, but is available online without a license (AFAICT) [3]
> and weighs about ~80 MiB. The training code is available as well [4],
> even though still in Python 2.
>
> [2]: https://git.savannah.gnu.org/cgit/gnubg.git/log/gnubg.weights
> [3]: https://alpha.gnu.org/gnu/gnubg/nn-training
> [4]: https://git.savannah.gnu.org/cgit/gnubg/gnubg-nn.git
Another example of seemingly "small" training data is Tesseract.

The DFSG status of its training data is tracked at
https://bugs.debian.org/699609 with an optimistic view. A more
pessimistic view seems indicated by an upstream mention of the training
data being "all the WWW", and other comments mention the involvement of
non-free fonts:
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/
 * Sponsorship: https://ko-fi.com/drjones

 [x] quote me freely  [ ] ask before reusing  [ ] keep private