On Friday 03 April 2009 21:54:04 Jussi Pakkanen wrote:
> Let's not allow this to fall into limbo again. I have not heard from
> Cognitive people but as far as I know:
>
> - there is no (publicly) available source for the dat files
> - they were in the original source release, which was under BSD, so they
>   should be BSD as well

The problem at the moment is that these files are blobs. Debian main should be built from free material that can be edited, compiled and so on (see the firmware discussion on the Debian blogs, mailing lists, ...). All executables and libraries must have readable sources from which they are compiled (so they must not be shipped precompiled in the orig.tar.gz and then copied over, unless they are interpreted scripts in readable form).

Images, fonts and sounds are somewhat different. It should be possible to edit them; ask a maintainer of an artwork/ttf/game package in what form they keep the raw data in the source package. I doubt that many of them have the raw tiff/bmp/pcm/psd/xcf/... inside the package, but the files are still editable with tools available in Debian. A -doc package which installs a pdf should also contain the original (tex/docbook/...) file in the source package so that it can be changed without too much hassle.

The data files from cuneiform are a mystery to me: a bunch of bytes which seem to come from nowhere. I don't know how to generate them, I don't know how to edit them, and there is no documentation of what is inside them (ok, somewhere there must be code which reads them, but someone would first have to write a tool which can extract or regenerate the data). If there is such a tool somewhere, then please document it.

My first guess is that these files hold the recognition patterns used by the OCR. This is something for which we definitely want the source. There was a discussion some time ago about statistical data generated from webpages.
In the end, most people agreed that it is not possible to redo such an analysis each time the package needs to be built, but that it has to be done in a way that lets everyone recreate the data with the same high quality.
... maybe you should ask how the tesseract guys do it. I looked at the tesseract-deu-2.00 files and these are binary blobs too, but they are in main. Some training data can be found on the tesseract homepage, but there is no information on whether the training data was, or can be, used to recreate the language-specific data. This should definitely be added to the source package in a Debian-specific readme. Recreating the files from the training data as part of the source package build would be nice too, but I am not sure whether it is too CPU intensive or why it hasn't been done yet. The training data seems to be made by hand, and I can create a training page with tools in Debian. I think they should qualify as source files.
-- 
Robert Wohlrab