* Janusz S. Bień <jsb...@mimuw.edu.pl>, 2010-03-05, 06:30: [...]
ocrodjvu indeed crashes, but on the garbage-in-garbage-out principle. If you run ocrodjvu with the --debug option, you'll see that the resulting hOCR files don't contain anything legible. In fact, the hOCR for page 2 also contains some control characters, which completely break HTML parsing, leading to a crash.

I cannot do much about this, except making the error message more helpful.

You can skip the faulty page and continue processing.
No, that would be wrong. I cannot (programmatically) distinguish between exceptions caused by a faulty OCR engine and those caused by a real ocrodjvu bug. Certainly I *don't* want to continue processing when the latter ones are raised.
That said, if you insist on ignoring exceptions, you can easily achieve that with a simple shell script like:
cp in.djvu out.djvu
djvused -e remove-txt out.djvu
for p in $(seq 1 $(djvused -e n out.djvu))
do
    ocrodjvu -p $p --in-place --render=all --engine=cuneiform --language=pol out.djvu
done

-- 
Jakub Wilk