On Mon, 01 Feb 2021 16:19:03 +0800 Paul Wise wrote:
> It has been made clear in this Hacker News subthread that the
> RNNoise model has been trained in part using proprietary data:
This is correct. There's been some discussion about this on IRC with respect to that thread and the Debian Machine Learning and Software Freedom policy proposal:

https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst

There was some confusion because Mozilla collected public data submissions for the project. According to author Jean-Marc Valin, these data were *not used* to train the currently-published rnnoise model, which was instead trained on other free and non-free data sets. The crowdsourced data set was published under a CC0 license and is available for further work, but it needs cleaning and characterization before it can be directly useful for training.

Data download: https://media.xiph.org/rnnoise/

Original click-through license agreement presented to submitters:
https://web.archive.org/web/20171003052023/https://people.xiph.org/~jm/demo/rnnoise/donate.html

So, rnnoise falls under the "toxic candy" model classification in the policy proposal. It's good to have names for these situations, and definitely good to ask for public data for training models, but I don't think it would be reasonable to block packaging rnnoise based on this criterion.

Compression technologies, whether for voice, music, images, or video, have all been tested and tuned against source data which is not all publicly redistributable. For example, the codebooks of the speex codec, part of Debian since 2002, were trained on some of the same proprietary datasets as the default rnnoise model. Even the Linux kernel is tuned using proprietary workloads.

Recent interest in machine learning has made better tools for model training available, bringing us closer to applying the modification aspect of software freedom to parameter sets used to configure software. That's a step forward. Deciding that new models must reach a higher bar than established code would be a step back.