Right then, I've created a wiki in Google Code for this collected
effort.

https://code.google.com/p/tesseract-ocr-extradocs/

I have spent some time this last week reading some of the cube code
and figuring out the purpose of the various cube training files. I
still don't know the most interesting stuff, which is exactly how
the .nn files are used, but it was taking me a while to read the
code so I thought I'd just post what I have so far.

If anyone wants to add to the wiki let me know and I'll gladly add
you to the project.

The next thing on my list to document is line segmentation, though I
should probably try to add more information on how cube works first.

I hope this looks useful to people, and inspires everyone to dig
into all of the code :)

Nick

On Mon, Jun 03, 2013 at 10:49:46AM -0400, Sven Pedersen wrote:
> Sounds good. I think we should make some attempt to reverse engineer the Cube
> engine. I imagine Google will eventually release documentation, but we don't
> know when. If we document it, they may be more inclined to give their side of
> it more quickly. It is very possible they don't have much internal documentation
> anyway.
> --Sven
> 
> 
> On Mon, Jun 3, 2013 at 10:25 AM, Nick White <[email protected]> wrote:
> 
>     I wonder, would others here be interested in figuring out and
>     documenting little bits of how the code works?
> 
>     I spent some time in the line segmentation code a little while ago,
>     to figure out better configuration parameters for line segmentation
>     for the Ancient Greek training (which ended up being pretty
>     successful), and I could certainly contribute a partial description
>     of how it works.
> 
>     If others are interested in doing this for key sections (like the
>     parts Dmitri suggested), perhaps we should set up a wiki and get to
>     work? It wouldn't be comprehensive, of course, but sharing what we
>     know could still prove pretty useful.
> 
>     What do people think? Is anyone else interested in doing this?
> 
>     I'll dig out the (very scrappy) notes I made on line segmentation,
>     clean them up, and post them here, when I get time. If anyone else
>     is interested, I'll set up a wiki somewhere.
>    
>     Nick
>    
>     On Thu, May 30, 2013 at 07:32:52PM +0400, Dmitri Silaev wrote:
>     > Excellent post, Nick! The more I read, the more I felt I had to ask
>     > these questions myself, but didn't yet. I'm afraid, though, many of
>     > them would remain unanswered.
>     >
>     > Because after several years of monitoring and asking in this forum I
>     > got used to the feeling that principal developers make only new
>     > release announcements. In the early years, they were much more active
>     > in discussions. I can suppose many of forum questions are tedious to
>     > answer over and over again, the forum search can be used, and many
>     > people just feel lazy to use it. But some of them are not like that
>     > and deserve answers.
>     >
>     > Now it looks like Google is doing us a favor making a formerly
>     > commercial engine open source and sharing its developments from time to
>     > time. The community contribution now is constrained by enhancing
>     > release packages and fixing trivial bugs. Without a proper
>     > documentation or at least clues on how all this (not only Cube) works,
>     > developers keep community contribution nominal. I personally need more
>     > info and am ready to contribute, if I begin to understand the code
>     > enough. I used to surf the code alone, but the potential of this
>     > approach is limited. Off the bat, I'm interested in segmentation,
>     > details on class pruner and integer matcher, description of Cube, best
>     > practices on training data generation. I think, there are more to
>     > come, once I get more info on these.
>     >
>     > --
>     > Dmitri
>     >
>     >
>     > On Thu, May 30, 2013 at 6:48 PM, Nick White <[email protected]> wrote:
>     > > Hi Tesseractors,
>     > >
>     > > I am feeling a bit fed up about the lack of openness with the
>     > > Tesseract project.
>     > >
>     > > The addition of the cube mode, and several trainings, with
>     > > absolutely no documentation, or (as far as I can tell) any tools to
>     > > create cube training files, is a good example of this.
>     > >
>     > > As is the lack of tif/box files for any of the core training files
>     > > in the project.
>     > >
>     > > Keeping the cube tools and documentation private sucks royally. If
>     > > they aren't perfect or polished, it doesn't matter; we could help
>     > > to fix them up!
>     > >
>     > > I suspect some of the tif/box files for training aren't being
>     > > released because of concerns about copyright of the image files. If
>     > > that's the case please work to clear them up, or create freely
>     > > reusable versions.
>     > >
>     > > I love Tesseract; having a very high quality free software OCR
>     > > package is awesome, and I'm very grateful for the amazing work being
>     > > done on it. But I find the lack of parity between those inside
>     > > Google and the wider community to be rather troubling.
>     > >
>     > > If there's anything I can do to help make cube training tools and
>     > > documentation available, or the training source files, I'd be very
>     > > happy to help. Replying offlist if appropriate is fine.
>     > >
>     > > Nick
>     > >
>     > > --
>     > > --
>     > > You received this message because you are subscribed to the Google
>     > > Groups "tesseract-ocr" group.
>     > > To post to this group, send email to [email protected]
>     > > To unsubscribe from this group, send email to
>     > > [email protected]
>     > > For more options, visit this group at
>     > > http://groups.google.com/group/tesseract-ocr?hl=en
>     > >
>     > > ---
>     > > You received this message because you are subscribed to the Google
>     > > Groups "tesseract-ocr" group.
>     > > To unsubscribe from this group and stop receiving emails from it, send
>     > > an email to [email protected].
>     > > For more options, visit https://groups.google.com/groups/opt_out.
>     > >
>     > >
>     >
>     >
>     >
> 
> 
> 
> 
> 
> 
> 
> --
> ``All that is gold does not glitter,
>   not all those who wander are lost;
> the old that is strong does not wither,
>   deep roots are not reached by the frost.
> From the ashes a fire shall be woken,
>   a light from the shadows shall spring;
> renewed shall be blade that was broken,
>   the crownless again shall be king.”
> 
