Hi,

Let's look at this from a somewhat different perspective.
What is the outcome of "deep learning"? It is "knowledge". If that dictionary of "knowledge" is expressed in a freely usable software format under a free license, isn't that enough? If you want more for your package, that's fine; please promote such a program for your project. (FYI: the reason I spent my time fixing "anthy" for Japanese text input is that I didn't like the way "mozc" looked like a sort of dump-ware from Google, containing a free-license dictionary of "knowledge" without free base training data.)

But placing some kind of fancy purist "Policy" wording to police other software doesn't help FREE SOFTWARE. We got rid of Netscape from Debian because we now have good, functional free alternatives. If you can make a model for your project without any reliance on non-free base training data, that's great. But I think it is dangerous and counterproductive to deprive users of access to useful software functionality by demanding that only free data be used to obtain the "knowledge".

Please note that re-training does not erase "knowledge". It usually just mixes new "knowledge" into the existing dictionary of "knowledge", so the resulting dictionary of "knowledge" is not completely free of the original training data. We really need to treat this kind of dictionary of "knowledge" in line with artwork, not as software code. The training process itself may be mathematical, but preparing the training data and iteratively providing the re-calibrating data sets involves a huge amount of human input.

> Enforcing re-training will be a painful decision...

Hmmm... this may depend on what kind of re-training. At least for unidic-mecab, re-training to add many new words to be recognized by the morphological analyzer is a comparatively easy task. People have used unidic-mecab and a web crawler to create an even bigger dictionary with minimal re-training work (mostly automated, I guess):
https://github.com/neologd/mecab-unidic-neologd/
(A rough sketch of this kind of dictionary extension follows after my signature.)

I can't imagine re-creating the original core dictionary of "knowledge" for Japanese text processing purely by training on newly provided free data, since it would take far too much human work, and I agree it is unrealistic without a serious government or corporate sponsorship project. Also, the "knowledge" for Japanese text processing should be able to cover non-free texts. Without using non-free texts as input data, how do you know it works on them?

> Isn't this checking mechanism a part of upstream work? When developing
> machine learning software, the model reproducibility (two different runs
> should produce very similar results) is important.

Do you always have the luxury of relying on such a friendly and active upstream? If so, I see no problem. But what should we do if not? Anthy's upstream is practically the Debian repo now.

Osamu
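
P.S. To make concrete what I mean by "adding new words is the easy kind of re-training": an extended dictionary such as mecab-unidic-neologd just adds entries on top of the stock unidic data, and the same analyzer picks them up. Below is a rough, untested sketch in Python using the mecab-python3 binding; the dictionary paths and the example sentence are only my assumptions about a typical install, so adjust them to your system.

  import MeCab

  STOCK_DIC = "/var/lib/mecab/dic/unidic"                  # assumed install path
  NEOLOGD_DIC = "/usr/lib/mecab/dic/mecab-unidic-neologd"  # assumed install path

  def analyze(dicdir, text):
      # "-d" selects which compiled dictionary directory to load; the neologd
      # build only adds word entries on top of unidic, nothing is re-trained.
      tagger = MeCab.Tagger(f"-d {dicdir}")
      return tagger.parse(text)

  text = "新元号は令和です。"  # contains a word an older stock dictionary may not know
  print(analyze(STOCK_DIC, text))    # the unknown word may be split into pieces
  print(analyze(NEOLOGD_DIC, text))  # the extended dictionary has it as one entry

The point is that nothing in the original "knowledge" is erased or re-derived here; the extension is layered on top of it, which is exactly why the result is not free of the original training data.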