On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
>
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image to only contain actual source. Then running OCR on it. Tried a few
> online versions and tesseract.
>
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. It makes the OCR go completely crazy. Source lines
> without black lines OCR ok. The others do not. The files need massive
> amount of manual intervention.
>
> Does anyone have an idea how to process files like this?
>
> A good way to remove the black lines?

Hi Mattis

Here's a first cut. Can probably be improved slightly. Let me know how
much this still confuses Tesseract.

https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif

--Toby

>
> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
>
> https://i.imgur.com/dvY973s.png
>
> /Mattis
>
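For anyone who would rather script this kind of cleanup than do it by hand,
here is a minimal sketch of one possible approach. It is not necessarily how
the edited TIFF above was produced. It assumes the black lines are horizontal,
run across most of the page width, and that the scan has already been
thresholded into rows of 0 (paper) and 1 (ink); loading and saving the image
is left out. The function names and the 0.8 fill threshold are made up for
illustration.

```python
def find_ruled_rows(bitmap, min_fill=0.8):
    """Return indices of rows where at least min_fill of the pixels are dark.

    bitmap is a list of equal-length rows; 1 means ink, 0 means paper.
    Text rows rarely fill that much of the width, ruled lines do.
    """
    return [y for y, row in enumerate(bitmap)
            if row and sum(row) / len(row) >= min_fill]


def group_runs(rows):
    """Group sorted row indices into (first, last) runs of consecutive rows."""
    runs = []
    for y in rows:
        if runs and y == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], y)   # extend the current run
        else:
            runs.append((y, y))           # start a new run
    return runs


def remove_ruled_lines(bitmap, min_fill=0.8):
    """Return a copy of bitmap with near-full-width horizontal lines erased.

    Inside a line, a column is kept dark only if the pixels directly above
    and below the line are both dark. That is a crude guess that a character
    stroke crosses the line there; everything else in the line becomes paper.
    """
    height = len(bitmap)
    out = [list(row) for row in bitmap]
    for top, bottom in group_runs(find_ruled_rows(bitmap, min_fill)):
        for x in range(len(bitmap[top])):
            above = bitmap[top - 1][x] if top > 0 else 0
            below = bitmap[bottom + 1][x] if bottom + 1 < height else 0
            keep = 1 if (above and below) else 0
            for y in range(top, bottom + 1):
                out[y][x] = keep
    return out


if __name__ == "__main__":
    # Tiny made-up page: a vertical stroke in column 2 crossed by a ruled line.
    page = [
        [0, 0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    for row in remove_ruled_lines(page):
        print("".join("#" if p else "." for p in row))
```

Strokes that merely touch a line (rather than cross it) lose the part inside
the line, and skewed scans would need deskewing first, so treat this as a
starting point only.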