Bug#496264: Keeping this one alive

Robert Wohlrab Thu, 16 Apr 2009 06:30:19 -0700

On Friday 03 April 2009 21:54:04 Jussi Pakkanen wrote:
> Let's not allow this to fall into limbo again. I have not heard from
> Cognitive people but as far as I know:
>
> - there is no (publicly) available source for the dat files
> - they were in the original source release, which was under BSD, so they
> should be BSD as well
The problem currently is that these files are blobs. Debian main should be
made from free material which can be edited, compiled and so on (see the
firmware discussion on Debian blogs, mailing lists, ...). All executables and
libraries must have readable sources from which they are compiled (so they
must not be precompiled in orig.tar.gz and then copied over, unless it is an
interpreted script in a readable form).
Images, fonts and sounds are somewhat different. It should be possible to edit
them. Ask a maintainer of an artwork/ttf/game package in what form they keep
the raw data in the source package. I doubt that many of them have the raw
tiff/bmp/pcm/psd/xcf/... inside the package, but the files are still editable
with tools inside Debian. A -doc package which installs a PDF should also
contain the original (tex/docbook/...) file in the source package so it can be
changed without too much hassle.
The data files from cuneiform are a mystery to me: a bunch of bytes which
seem to come from nowhere. I don't know how to generate them, I don't know how
to edit them, and there is no documentation of what is inside them (ok,
somewhere there must be code which reads them, but someone would first have to
write a tool which can extract or regenerate the data). If there is such a
tool somewhere then please document it. My first guess is that these files
hold the recognition patterns used by the OCR. This is something for which we
definitely want the source.
There was a discussion some time ago about statistical data generated from
web pages. In the end most people agreed that it is not possible to redo such
an analysis each time the package needs to be built, but that it has to be
done in a way that lets everyone recreate the data with the same high
quality.


... maybe you should ask how the tesseract guys handle this. I looked at the
tesseract-deu-2.00 files and these are binary blobs too, but they are in main.
Some training data can be found on the tesseract homepage, but there is no
information on whether the training data was or can be used to recreate the
language-specific data. This should definitely be added to the source package
in a Debian-specific readme. Recreating the files from the training data in
the source package would be nice too, but I am not sure whether it is too
CPU-intensive or why it hasn't been done yet. The training data seems to be
made by hand and I can create a training page with tools in Debian. I think
they should qualify as source files.
-- 
Robert Wohlrab



