That looks right, thanks for that.
I'll try to take a proper look soon and figure out how best to
upstream stuff, and where it's worth doing so. In the meantime I'll
attach the .diff (very small; only 200 lines), in case anyone else
is interested, and so I don't forget ;)
Nick
On Wed, May 15, 2013 at 07:18:42AM -0700, Renard Wellnitz wrote:
> Hi Nick,
>
> here is the console output:
>
>
> localhost:tesseract-ocr-3.02 renard$ svn log -r COMMITTED
> ------------------------------------------------------------------------
> r705 | [email protected] | 2012-03-15 22:05:12 +0100 (Thu, 15 Mar 2012) | 1 line
>
> fixed build in java directory; create documentation package with 'make doc-pack'
> ------------------------------------------------------------------------
>
>
> Cheers
> Renard
>
>
> On Wednesday, 15 May 2013 14:28:35 UTC+2, Nick White wrote:
>
> I'm no expert with SVN, but I think this command will tell me what I
> want to know:
>
> svn log -r COMMITTED
>
> Thanks.
>
> On Wed, May 15, 2013 at 04:02:34AM -0700, Renard Wellnitz wrote:
> > Hi Nick,
> >
> > I'm not really proficient with SVN. Maybe this helps? If you want me to run a
> > specific svn command, I'll gladly do it.
> >
> >
> > localhost:tesseract-ocr-3.02 renard$ svn ls "^/tags"
> > release-2.04/
> > release-3.00/
> > release-3.00.1/
> > release-3.01/
> > release-3.02.01/
> > release-3.02.02/
> > localhost:tesseract-ocr-3.02 renard$ svnversion .
> > 705M
> > localhost:tesseract-ocr-3.02 renard$
> >
> >
> > I do not remember the exact changes. But my main goal was to get progress
> > information during the OCR process so that my app could show the bounding boxes
> > of the currently processed word.
> >
> > Cheers
> > Renard
> >
> >
> > On Wednesday, 15 May 2013 11:37:26 UTC+2, Nick White wrote:
> >
> > Ah, I see it's pretty close to 3.02.01 (now only available as an SVN
> > tag). Am I correct in thinking that's the release you used? Or was
> > it an SVN revision near it?
> >
> > Thanks again,
> >
> > Nick
> >
> > On Wed, May 15, 2013 at 10:30:29AM +0100, Nick White wrote:
> > > Hi Renard,
> > >
> > > This is awesome, great job :)
> > >
> > > I was interested to see what changes you'd made to tesseract, so I ran
> > > 'diff -r' on the tesseract-ocr-3.02 directory in github, but a quick
> > > look made it seem quite different to the
> > > tesseract-ocr-3.02.02.tar.gz currently available from Tesseract.
> > >
> > > Am I correct in thinking that? Is it based on a version from SVN? If
> > > so, which? If not, I'll just have to spend more time with diff ;-)
> > >
> > > I'd be keen to try and isolate and generalise any changes you made
> > > and get them back into the core code, if I can.
> > >
> > > Thanks for all this lovely free code!
> > >
> > > Nick
> > >
> > > On Tue, May 14, 2013 at 01:51:15PM -0700, Renard Wellnitz wrote:
> > > > Hi Tom,
> > > >
> > > > I decided to publish the code of the app under the Apache 2 licence. However,
> > > > the C++ code that deals with image processing uses the stricter GPL v3, since
> > > > that is where I put in a lot of effort.
> > > >
> > > > The project still needs a readme and instructions on how to build the binaries.
> > > > For someone with a bit of Android/NDK experience it should not be a big problem,
> > > > however.
> > > > Readme and build instructions will follow in a couple of days.
> > > >
> > > > https://github.com/renard314/textfairy
> > > >
> > > > Cheers!
> > > > Renard
> >
> > --
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en
> >
> > ---
> > You received this message because you are subscribed to the Google Groups
> > "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email
> > to [email protected].
> > For more options, visit https://groups.google.com/groups/opt_out.
> >
> >
>
diff -r tesseract-ocr-r705/api/baseapi.cpp textfairy/tesseract-ocr-3.02/api/baseapi.cpp
34a35,37
> /* Version number of package */
> #define VERSION "3.02"
>
36a40
>
849c853
< text = GetHOCRText(page_index);
---
> text = GetHOCRText(NULL, page_index);
931a936,1044
>
>
> char* TessBaseAPI::GetHTMLText(const float minConfidenceToShowColor) {
> if (page_res_ == NULL) {
> return NULL;
> }
> int lcnt = 1, bcnt = 1, pcnt = 1, wcnt = 1;
>
> STRING html_str("");
> bool isItalic = false;
> bool isBold = false;
>
>
> ResultIterator *res_it = GetIterator();
> for (; !res_it->Empty(RIL_BLOCK); wcnt++) {
> if (res_it->Empty(RIL_WORD)) {
> res_it->Next(RIL_WORD);
> continue;
> }
>
> // Open any new block/paragraph/textline.
> if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
> html_str +="<div>";
> }
> if (res_it->IsAtBeginningOf(RIL_PARA)){
> html_str += "<p>";
> }
>
> // Now, process the word...
> const char *font_name;
> bool bold, italic, underlined, monospace, serif, smallcaps;
> int pointsize, font_id;
> font_name = res_it->WordFontAttributes(&bold, &italic, &underlined,
> &monospace, &serif, &smallcaps,
> &pointsize, &font_id);
> bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
> bool last_word_in_para = res_it->IsAtFinalElement(RIL_PARA, RIL_WORD);
> bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
>
> float confidence = res_it->Confidence(RIL_WORD);
> bool addConfidence = false;
> if ( confidence<minConfidenceToShowColor && res_it->GetUTF8Text(RIL_WORD)!=" "){
> addConfidence = true;
> html_str.add_str_int("<font conf='", (int)confidence);
> html_str += "' color='#DE2222'>";
> }
>
> /*
> if (!isBold && bold) {
> html_str += "<em>";
> isBold = true;
> }
> */
>
> if (!isItalic && italic) {
> html_str += "<strong>";
> isItalic = true;
> }
> do {
> const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
> if (grapheme && grapheme[0] != 0) {
> if (grapheme[1] == 0) {
> switch (grapheme[0]) {
> case '<': html_str += "&lt;"; break;
> case '>': html_str += "&gt;"; break;
> case '&': html_str += "&amp;"; break;
> case '"': html_str += "&quot;"; break;
> case '\'': html_str += "&#39;"; break;
> default: html_str += grapheme; break;
> }
> } else {
> html_str += grapheme;
> }
> }
> delete []grapheme;
> res_it->Next(RIL_SYMBOL);
> } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
>
> if ((isItalic &&addConfidence==true) || (!italic && isItalic) || (isItalic && (last_word_in_block || last_word_in_para))){
> html_str += "</strong>";
> isItalic = false;
> }
> /*
> if ((!bold && isBold) || (isBold && (last_word_in_block || last_word_in_para))){
> html_str += "</em>";
> isBold = false;
> }
> */
> if (addConfidence==true){
> html_str += "</font>";
> }
>
> html_str += " ";
>
> if (last_word_in_para) {
> html_str += "</p>\n";
> pcnt++;
> }
> if (last_word_in_block) {
> html_str += "</div>\n";
> bcnt++;
> }
> }
> char *ret = new char[html_str.length() + 1];
> strcpy(ret, html_str.string());
> delete res_it;
> return ret;
> }
>
938,940c1051,1052
< char* TessBaseAPI::GetHOCRText(int page_number) {
< if (tesseract_ == NULL ||
< (page_res_ == NULL && Recognize(NULL) < 0))
---
> char* TessBaseAPI::GetHOCRText(struct ETEXT_DESC* monitor, int page_number) {
> if (tesseract_ == NULL || (page_res_ == NULL && Recognize(monitor) < 0)) {
942c1054
<
---
> }
944a1057
> float row_height, descenders, ascenders;
948c1061
< if (input_file_ == NULL)
---
> if (input_file_ == NULL) {
949a1063
> }
953c1067
< hocr_str += input_file_ ? *input_file_ : "unknown";
---
> hocr_str += input_file_ ? *input_file_ : "android";
982a1097,1101
> res_it->RowAttributes(&row_height,&descenders, &ascenders);
> hocr_str.add_str_int("' font='", 15);
> hocr_str.add_str_int("' size='", row_height);
> hocr_str.add_str_int("' descenders='", descenders * -1);
> hocr_str.add_str_int("' ascenders='", ascenders);
1010c1129
< default: hocr_str += grapheme;
---
> default: hocr_str += grapheme; break;
diff -r tesseract-ocr-r705/api/baseapi.h textfairy/tesseract-ocr-3.02/api/baseapi.h
494c494,498
< char* GetHOCRText(int page_number);
---
> char* GetHOCRText(struct ETEXT_DESC* monitor, int page_number);
>
> char* GetHTMLText(const float minConfidenceToShowColor);
>
>
diff -r tesseract-ocr-r705/ccmain/control.cpp textfairy/tesseract-ocr-3.02/ccmain/control.cpp
245c245,249
< monitor->progress = 30 + 50 * word_index / stats_.word_count;
---
> monitor->progress = 70 * word_index / stats_.word_count;
> if (monitor->progress_callback!=NULL){
> TBOX box = page_res_it.word()->word->bounding_box();
> (*monitor->progress_callback)(monitor->progress,box.left(), box.right(), box.top(), box.bottom());
> }
318c322,325
< monitor->progress = 80 + 10 * word_index / stats_.word_count;
---
> monitor->progress = 70 + 30 * word_index / stats_.word_count;
> if (monitor->progress_callback!=NULL){
> (*monitor->progress_callback)(monitor->progress,0,0,0,0);
> }
diff -r tesseract-ocr-r705/ccmain/ltrresultiterator.cpp textfairy/tesseract-ocr-3.02/ccmain/ltrresultiterator.cpp
163a164,171
> void LTRResultIterator::RowAttributes( float* row_height,
> float* descenders,
> float* ascenders) const{
> *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() - it_->row()->row->descenders();
> *descenders = it_->row()->row->descenders();
> *ascenders = it_->row()->row->ascenders();
> }
>
diff -r tesseract-ocr-r705/ccmain/ltrresultiterator.h textfairy/tesseract-ocr-3.02/ccmain/ltrresultiterator.h
112a113,114
> void RowAttributes(float* row_height, float* descenders, float* ascenders) const;
>
diff -r tesseract-ocr-r705/ccutil/ocrclass.h textfairy/tesseract-ocr-3.02/ccutil/ocrclass.h
110a111
> typedef bool (*PROGRESS_FUNC)(int progress, int left, int right, int top, int bottom );
119a121
> PROGRESS_FUNC progress_callback;/*called whenever progress increases*/
Binary files tesseract-ocr-r705/tessdata/chi_sim.traineddata and textfairy/tesseract-ocr-3.02/tessdata/chi_sim.traineddata differ
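
For anyone reading the diff without the tesseract sources to hand, here is a rough, self-contained sketch of how the new progress callback appears to behave. Only the PROGRESS_FUNC signature and the two progress formulas (70 * word_index / word_count for the first pass, 70 + 30 * word_index / word_count for the second) are taken from the diff. MonitorSketch, BoxSketch, ProgressCall, g_calls, RecordProgress and the two pass functions are hypothetical stand-ins, and the loop bounds are an assumption, since the diff does not show where control.cpp increments word_index.

```cpp
#include <cstddef>
#include <vector>

// Same signature as the PROGRESS_FUNC typedef the diff adds to ccutil/ocrclass.h.
typedef bool (*PROGRESS_FUNC)(int progress, int left, int right, int top, int bottom);

// Hypothetical stand-in for ETEXT_DESC, reduced to the two fields used here.
struct MonitorSketch {
  int progress;
  PROGRESS_FUNC progress_callback;
};

// Hypothetical word box; the patch reads these four values from a TBOX.
struct BoxSketch {
  int left, right, top, bottom;
};

// One recorded callback invocation, kept so a caller can inspect what arrived.
struct ProgressCall {
  int progress, left, right, top, bottom;
};

// A plain function pointer cannot capture state, so this sketch uses a global.
std::vector<ProgressCall> g_calls;

// Example callback. The patch ignores the return value, so returning true
// here carries no meaning beyond matching the typedef.
bool RecordProgress(int progress, int left, int right, int top, int bottom) {
  ProgressCall call = {progress, left, right, top, bottom};
  g_calls.push_back(call);
  return true;
}

// First pass: progress covers 0..70 and each call carries the word's box.
void FirstPassSketch(MonitorSketch* monitor, const std::vector<BoxSketch>& words) {
  const int word_count = static_cast<int>(words.size());
  for (int word_index = 0; word_index < word_count; ++word_index) {
    monitor->progress = 70 * word_index / word_count;
    if (monitor->progress_callback != NULL) {
      const BoxSketch& box = words[word_index];
      (*monitor->progress_callback)(monitor->progress, box.left, box.right,
                                    box.top, box.bottom);
    }
  }
}

// Second pass: progress covers 70..100 and the box arguments are all zero.
void SecondPassSketch(MonitorSketch* monitor, int word_count) {
  for (int word_index = 0; word_index < word_count; ++word_index) {
    monitor->progress = 70 + 30 * word_index / word_count;
    if (monitor->progress_callback != NULL) {
      (*monitor->progress_callback)(monitor->progress, 0, 0, 0, 0);
    }
  }
}
```

With the real patch, a client would presumably set progress_callback on its ETEXT_DESC and hand that to the patched GetHOCRText(monitor, page_number). A callback that wants to draw the current word has to treat an all-zero box as "no box", because the second pass reports progress only.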