Quoting Stefano Zacchiroli (2025-02-08 14:57:18)
> Concrete examples
> -----------------
> 
> Let's take two simple examples that, size-wise, could fit in the Debian
> archive together with training pipelines and datasets.
> 
> First, let's take a "small" image classification model that one day
> might be included in Digikam or similar free software. Let's say the
> trained model is ~1 GiB (like Moondream [1] today) and that the training
> dataset is ~10 GiB (I've no idea if the Moondream training dataset is
> open data, and I'm probably being very conservative with its size here;
> just assume it is correct for now).
> 
> [1]: https://ollama.com/library/moondream:v2/blobs/e554c6b9de01 
> 
> For a second, even smaller example, let's consider gnubg (GNU
> backgammon) that contains today, in the Debian archive, a trained neural
> network [2] of less than 1 MiB. Its training data is *not* in the
> archive, but is available online without a license (AFAICT) [3] and
> weighs ~80 MiB. The training code is available as well [4], even
> though it is still in Python 2.
> 
> [2]: https://git.savannah.gnu.org/cgit/gnubg.git/log/gnubg.weights
> [3]: https://alpha.gnu.org/gnu/gnubg/nn-training
> [4]: https://git.savannah.gnu.org/cgit/gnubg/gnubg-nn.git

Another example of seemingly "small" training data is Tesseract. The
DFSG-freeness of its training data is tracked at
https://bugs.debian.org/699609 with an optimistic view. A more
pessimistic view seems indicated by an upstream mention of the training
data being "all the WWW", and by other comments mentioning the
involvement of non-free fonts:
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/
 * Sponsorship: https://ko-fi.com/drjones

 [x] quote me freely  [ ] ask before reusing  [ ] keep private