On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote:
> On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
>> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
>> submitted to DECUS by Bill Seiler.
>>
>> The format is scans of the PAL-11S listing output. It is easy to crop the
>> image to only contain actual source. Then running OCR on it. Tried a few
>> online versions and tesseract.
>>
>> The problem is that the paper that the listing is printed on has lines.
>> Very black lines. It makes the OCR go completely crazy. Source lines
>> without black lines OCR ok. The others do not. The files need massive
>> amount of manual intervention.
>>
>> Does anyone have an idea how to process files like this?
>>
>> A good way to remove the black lines?
>
> Hi Mattis
>
> Here's a first cut. Can probably be improved slightly. Let me know how
> much this still confuses Tesseract.
>
> https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
>

That is a multipage TIF, and the page order key is listed below. I just
noticed that a handful of pages seem to be missing, so I'll look into that.

CHAR--0000
CHAR--0001
CHAR--0002
CHRTAB--0000
CHRTAB--0001
CHRTAB--0002
COMPAR--0000
COMPAR--0001
COMPAR--0002
COMPAR--0003
EXPLOD--0000
EXPLOD--0001
EXPLOD--0002
GRAVTY--0000
GRAVTY--0001
GRAVTY--0002
GRAVTY--0003
MULPLY--0000
MULPLY--0001
MULPLY--0002
PARM--0000
PARM--0001
PARM--0002
PARM--0003
PARM--0005
PARM--0006
PARM--0007
PARM--0008
PARM--0009
PWRUP--0000
PWRUP--0001
RESET--0000
RESET--0001
RKT1--0000
RKT1--0001
RKT2--0000
RKT2--0001
SCORE--0000
SCORE--0001
SINCOS--0000
SINCOS--0001
SINCOS--0002
SLINE--0000
SLINE--0001
SPCWAR--0000
SPCWAR--0001
SPCWAR--0002
SUN--0000
SUN--0001
SUN--0002
UPDAT1--0000
UPDAT1--0001
UPDAT1--0002
UPDAT2--0000
UPDAT2--0002
point--0000
point--0001

> --Toby
>
>>
>> There are only 19 source files with three or four pages each so I don't
>> think it makes sense to try to train tesseract to do it (training tesseract
>> seems to be a huge undertaking).
>>
>> https://i.imgur.com/dvY973s.png
>>
>> /Mattis
>>
>
>
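
For anyone who wants to check the page order key above for holes, here is a rough Python sketch. It assumes each entry has the form NAME--NNNN and that every file's pages are numbered contiguously from 0000; both are my reading of the key, not something the scans confirm. The function name find_gaps is just for illustration. Run on the key as posted, it flags PARM--0004 and UPDAT2--0001. It can only see gaps inside a numbered run, so a missing last page of a file would go unnoticed.

```python
import re
from collections import defaultdict

# Page order key copied from the message (first entry = first TIF page).
PAGE_KEY = """
CHAR--0000 CHAR--0001 CHAR--0002
CHRTAB--0000 CHRTAB--0001 CHRTAB--0002
COMPAR--0000 COMPAR--0001 COMPAR--0002 COMPAR--0003
EXPLOD--0000 EXPLOD--0001 EXPLOD--0002
GRAVTY--0000 GRAVTY--0001 GRAVTY--0002 GRAVTY--0003
MULPLY--0000 MULPLY--0001 MULPLY--0002
PARM--0000 PARM--0001 PARM--0002 PARM--0003 PARM--0005
PARM--0006 PARM--0007 PARM--0008 PARM--0009
PWRUP--0000 PWRUP--0001
RESET--0000 RESET--0001
RKT1--0000 RKT1--0001
RKT2--0000 RKT2--0001
SCORE--0000 SCORE--0001
SINCOS--0000 SINCOS--0001 SINCOS--0002
SLINE--0000 SLINE--0001
SPCWAR--0000 SPCWAR--0001 SPCWAR--0002
SUN--0000 SUN--0001 SUN--0002
UPDAT1--0000 UPDAT1--0001 UPDAT1--0002
UPDAT2--0000 UPDAT2--0002
point--0000 point--0001
""".split()


def find_gaps(names):
    """Return {file: [missing page numbers]} for holes inside each file's run.

    Assumption: pages of a file are numbered 0000, 0001, ... with no
    intentional skips, so any number below the highest one seen should exist.
    """
    pages = defaultdict(set)
    for name in names:
        m = re.fullmatch(r"(\w+)--(\d{4})", name)
        if m is None:
            raise ValueError(f"unexpected entry: {name!r}")
        pages[m.group(1)].add(int(m.group(2)))

    gaps = {}
    for stem, seen in pages.items():
        missing = sorted(set(range(max(seen) + 1)) - seen)
        if missing:
            gaps[stem] = missing
    return gaps


if __name__ == "__main__":
    for stem, missing in find_gaps(PAGE_KEY).items():
        print(stem, ["%04d" % n for n in missing])
```

The key has 57 entries across 19 file names, which matches the 19 source files Mattis mentioned.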