Re: [tesseract-ocr] Assessment of the difficulty in porting CPU architecture for tesseract

Ger Hobbelt Mon, 20 Nov 2023 01:06:35 -0800

When you say RISCV I assume you are targeting the new Chinese CPUs.

Tesseract is entirely written in portable C + C++, so that gives us these
assumptions / guesstimates for risk and effort analysis for a porting
project like yours (by the sound of it):


- I expect (assume) a Linux-based operating system and development
environment is already present. If not, tackle that first. I say "Linux" as
that's the mainstream "POSIX-compliant" UNIX flavor one expects to see with
modern boards (cpu, memory, I/O, all the usual stuff that you'll find on
advanced Dev boards) and tesseract doesn't *require* it *per se*, but
everything will go much easier. So prerequisite #1: you've got a working,
modern, Linux, on your rig.

- I expect a working C/C++ compiler for your hardware. It must support C++17
to be safe, preferably GNU GCC or clang/LLVM, as that's what most people
use. You have cmake, make, GNU autotools, perl, python, all working on your
rig. If not, tackle that first. The latter two are not mandatory, but when
you want to run training of any kind (i.e. a FULL PORT of tesseract),
you'll be happy to have them (perl and python, that is).
Personally, I use Microsoft's MSVC2022 and I'm the odd one out. (Though we
could argue @stweil clearly uses the same Windows-based Dev environment as
he's published tesseract installers for years ;-) )

Others may mention "cross compiling" and yes, that's an option, but I
haven't seen anyone do or try that recently with tesseract. Again, you want
to hug mainstream as much as possible here as you want to tackle *other*
issues and it's encouraging to know you have a platform that looks and
sounds a lot like the regular usage pattern out there - apart from your
particular CPU, that is.

- next: porting! As "everything" is done in portable C/C++, that job sounds
"done". It is NOT. What *your* work now entails is porting the very
important SIMD-optimized code parts (matrix calculus) that are at the core of
tesseract. Here you will have to produce a variant of optimized code that
can co-exist next to the existing variants already included in tesseract.

'grep' for "AVX", "SSE", "FMA" (all caps) in the tesseract source code to
find the spots that need your work. Tesseract comes with a non-optimized
generic portable C variant of the same code, so the algorithm you need to
provide fast code for is easy to read.
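To give an idea of what such a generic variant looks like, here is a minimal
sketch of a plain dot product, which is the kind of kernel the SIMD variants
accelerate. The name DotProductGeneric and the double-precision signature are
my assumptions for illustration; the actual function in the tesseract source
may be named and typed differently.

```cpp
#include <cstddef>

// Hypothetical reference kernel: a plain, portable dot product.
// Any optimized (e.g. RISC-V vector) variant has to return the same value,
// up to floating-point rounding, for the same inputs.
double DotProductGeneric(const double *u, const double *v, std::size_t n) {
  double total = 0.0;
  for (std::size_t k = 0; k < n; ++k) {
    total += u[k] * v[k];
  }
  return total;
}
```

Whatever you write for your CPU should be a drop-in replacement for a
function of this shape: same arguments, same result.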

DO NOT FORGET THE OTHER PART: the cpu detection and cpu capability/feature
extraction function: that one determines if tesseract will run your optimized
code, so it is important to have that bit working as well. When you've found the
SSE/AVX code chunks, you should dig *up* (instead of down) and you'll find
the important cpu detect + dispatch calls easily.
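A rough sketch of the detect + dispatch pattern, NOT tesseract's actual code:
all names here (DotFallback, DotRiscvVectorStub, HasRiscvVector,
SelectDotProduct) are made up for illustration. I am assuming a Linux kernel
that reports the single-letter RISC-V extensions as bits in AT_HWCAP (bit
index = letter minus 'A'); verify that against your kernel before relying on
it. On any other platform the detector simply answers "no".

```cpp
#include <cstddef>
#if defined(__riscv) && defined(__linux__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#endif

using DotProductFn = double (*)(const double *, const double *, std::size_t);

// Portable fallback, always available.
double DotFallback(const double *u, const double *v, std::size_t n) {
  double total = 0.0;
  for (std::size_t k = 0; k < n; ++k) total += u[k] * v[k];
  return total;
}

// Placeholder for the future RVV implementation (hypothetical).
// In this sketch it only forwards to the fallback.
double DotRiscvVectorStub(const double *u, const double *v, std::size_t n) {
  return DotFallback(u, v, n);
}

// Runtime feature check. Assumption: the 'V' extension shows up as bit
// ('V' - 'A') of AT_HWCAP on RISC-V Linux.
bool HasRiscvVector() {
#if defined(__riscv) && defined(__linux__)
  return (getauxval(AT_HWCAP) & (1UL << ('V' - 'A'))) != 0;
#else
  return false;
#endif
}

// Dispatcher: choose the kernel once, then call through the pointer.
DotProductFn SelectDotProduct(bool has_vector) {
  return has_vector ? DotRiscvVectorStub : DotFallback;
}
```

Typical use would be to call SelectDotProduct(HasRiscvVector()) once at
startup and keep the returned pointer around, so the hot loop never has to
re-check the CPU.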

This, thus far, describes the "low hanging fruit" of a regular port of
tesseract to a new cpu.
Given that you probably(?) have access to RVV/SVE on your hardware, you
might want to address the entire BLSTM engine codebase from that new
perspective. My advice: do that as a "stage 2" in your project flow and
also separate it out in your proposal as this will be a much larger, more
difficult task. I see lots of profiler runs and other investigative work in
your future. ;-)

...
That should set you up & going. Tasks:

- check the listed assumptions: make those match.
- tesseract GitHub fork, code inspection. The 'grep' bit. (Here I assume
you understand what I mean when I only use the word: grep; otherwise this
email becomes a book)
- write cpu feature detector code, patch the dispatcher code, add RISCV
optimized code variant next to the existing AVX/SSE variants. Integrate new
code in the tesseract build scripts. TEST. Run a few images through your
new tesseract. REPRODUCE (on other comparable hardware).

- if you also want to TRAIN tesseract, look for Shree's work, the tesstrain
GitHub repo and all the other tesseract organisation's repos on GitHub,
lurk, read, **a lot**.
Personally I haven't done any tesseract training yet, because I could get
around that task. I'm self-financed (paying me on my own dime), which is
ultra-rare, and from what I've seen so far, training is "no fun" unless you
have direct and unrestricted access to nice amounts of top of the line
industrial hardware. (I am very patient, but only with some people and far
fewer machines. ;-) )
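Back to the TEST step in the task list for a moment: before running real
images, it pays to check your optimized kernel against the generic one. A
minimal sketch of such a check follows. Everything in it is hypothetical
(RefDot, UnrolledDot and AgreesWithReference are names I made up), and the
4-way unrolled loop merely stands in for real SIMD code. It walks every
length from 0 up to max_n because the leftover "tail" elements are where
vector code most often goes wrong.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using DotFn = double (*)(const double *, const double *, std::size_t);

// Reference: plain sequential dot product.
double RefDot(const double *u, const double *v, std::size_t n) {
  double total = 0.0;
  for (std::size_t k = 0; k < n; ++k) total += u[k] * v[k];
  return total;
}

// Stand-in for an optimized variant: four accumulators plus a scalar tail.
double UnrolledDot(const double *u, const double *v, std::size_t n) {
  double acc[4] = {0.0, 0.0, 0.0, 0.0};
  std::size_t k = 0;
  for (; k + 4 <= n; k += 4) {
    for (std::size_t lane = 0; lane < 4; ++lane) {
      acc[lane] += u[k + lane] * v[k + lane];
    }
  }
  double total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
  for (; k < n; ++k) total += u[k] * v[k];  // leftover elements
  return total;
}

// True when candidate matches reference within a relative tolerance
// for every vector length from 0 to max_n inclusive.
bool AgreesWithReference(DotFn reference, DotFn candidate,
                         std::size_t max_n, double tol) {
  for (std::size_t n = 0; n <= max_n; ++n) {
    std::vector<double> u(n), v(n);
    for (std::size_t k = 0; k < n; ++k) {
      u[k] = 0.5 * static_cast<double>(k) - 3.0;
      v[k] = 1.0 / (static_cast<double>(k) + 1.0);
    }
    const double expected = reference(u.data(), v.data(), n);
    const double actual = candidate(u.data(), v.data(), n);
    if (std::fabs(expected - actual) > tol * (1.0 + std::fabs(expected))) {
      return false;
    }
  }
  return true;
}
```

A tolerance is used instead of exact equality because a vectorized sum adds
the terms in a different order, so the last few bits may differ.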

Hope that helps,

Cheerio,

Ger



On Thu, 16 Nov 2023, 09:50 yuxuan wang, <wangyuxuan1...@gmail.com> wrote:

>  Hello everyone! I am working on implementing a tool to assess the
> complexity of CPU architecture porting. It primarily focuses on RISC-V
> architecture porting. In fact, the tool may have an average estimate of
> various architecture porting efforts. My focus is on the overall workload
> and difficulty of porting in the past and future, even if a project
> has already been ported. As part of my dataset, I have collected the
> **tesseract** project. **I would like to gather community opinions to
> support my assessment. I appreciate your help and response!** Based on
> scanning tools, the porting complexity is determined to be simple, with a
> small amount of code related to the CPU architecture in the project. Is
> this assessment accurate? Do you have any opinions on personnel
> allocation and time required? I look forward to your help and response.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4b84ce8d-a316-4850-9240-5e2a87b50eb7n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/4b84ce8d-a316-4850-9240-5e2a87b50eb7n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

