See https://github.com/tesseract-ocr/tesseract/issues/318
regarding the unicharset format
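
To check quickly how many entries in a unicharset still carry placeholder values, something like the sketch below may help. It assumes the field order visible in the samples quoted below (character, properties, a comma-separated group of 10 numbers, script, ...) and treats "0", "0,255,0,255,0,0,0,0,0,0" and "NULL" as placeholder values. The function names are made up for illustration and are not part of Tesseract.

```python
# Placeholder values as they appear in the unicharset quoted in this thread.
PLACEHOLDER_STATS = "0,255,0,255,0,0,0,0,0,0"


def parse_unicharset(text):
    """Return (declared_count, entries) from the text of a unicharset file.

    Each entry is a dict with the first four whitespace-separated fields.
    Trailing fields and '# ...' comments are ignored.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0].strip())
    entries = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # not a line this sketch understands
        entries.append({
            "unichar": fields[0],
            "properties": fields[1],  # kept as text, not interpreted
            "stats": fields[2],
            "script": fields[3],
        })
    return declared_count, entries


def placeholder_entries(entries):
    """List characters whose properties, stats and script are all placeholders."""
    return [
        e["unichar"]
        for e in entries
        if e["properties"] == "0"
        and e["stats"] == PLACEHOLDER_STATS
        and e["script"] == "NULL"
    ]


if __name__ == "__main__":
    # Assumes a file named 'unicharset' in the current directory.
    with open("unicharset", encoding="utf-8") as fh:
        count, entries = parse_unicharset(fh.read())
    unset = placeholder_entries(entries)
    print(f"{len(unset)} of {count} entries have placeholder properties")
```

The short "NULL 0 NULL 0" line is not counted, because its third field is "NULL" rather than the 10-number group.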

I was able to run regular Tesseract training (not LSTM) with Tesseract
4.00.00 built from GitHub master, and to create a new unicharset and
traineddata from your box/tiff pair. The output for the same tiff file is
enclosed.

I think you will get better results if the training input text has spaces
between words.
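
One rough way to see this from the box.train log quoted below is to divide the number of boxes by the number of words on each page. The sketch below does that. It assumes the two log line formats shown in the quoted message ("Boxes read from boxfile:" and "Generated training data for N words"), and the function name is only illustrative.

```python
import re


def boxes_per_word(log_text):
    """Return the average number of boxes per word for each page in the log."""
    boxes = [int(n) for n in
             re.findall(r"Boxes read from boxfile:\s*(\d+)", log_text)]
    words = [int(n) for n in
             re.findall(r"Generated training data for (\d+) words", log_text)]
    # Pair the counts page by page; skip pages that report zero words.
    return [b / w for b, w in zip(boxes, words) if w > 0]


if __name__ == "__main__":
    log = """
    Page 1
       Boxes read from boxfile:    1559
    Generated training data for 34 words
    Page 2
       Boxes read from boxfile:    1677
    Generated training data for 34 words
    Page 3
       Boxes read from boxfile:    1362
    Generated training data for 28 words
    """
    for page, ratio in enumerate(boxes_per_word(log), start=1):
        print(f"page {page}: {ratio:.2f} boxes per word")
```

For the three pages in the log this gives roughly 46 to 49 boxes per word, which suggests that each text line was treated as a single word.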

On Monday, June 19, 2017 at 4:09:29 PM UTC+5:30, David Barishev wrote:
>
> Hello all!
> I'm trying to train Tesseract to recognize a new font in English
> (supercell-magic).
> I have created a .tif file and a matching .box file using jTessBoxEditor
> (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box), and
> trained Tesseract with them.
>
> Here is Tesseract's output:
> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
> Page 1
> row xheight=30, but median xheight = 37.5455
> APPLY_BOXES:
>    Boxes read from boxfile:    1559
>    Found 1559 good blobs.
> Generated training data for 34 words
> Page 2
> APPLY_BOXES:
>    Boxes read from boxfile:    1677
>    Found 1677 good blobs.
> Generated training data for 34 words
> Page 3
> APPLY_BOXES:
>    Boxes read from boxfile:    1362
>    Found 1362 good blobs.
> Generated training data for 28 words
>
>
> So the next step is to extract the characters using unicharset_extractor.
> Its output looked normal:
> $ unicharset_extractor eng.supercell-magic.exp0.box
> Extracting unicharset from eng.supercell-magic.exp0.box
> Wrote unicharset file ./unicharset.
>
> But when I view the file, it's mostly 0 and 255, which does not look like
> the example of the unicharset file in the wiki
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
>
> 110
> NULL 0 NULL 0
> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
> ...
>
>
> Mine looks more like this:
>
> 74
> NULL 0 NULL 0
> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # Joined [4a 6f 69 6e 65 64 ]
> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0      # Broken
> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # t [74 ]
> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # h [68 ]
> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # a [61 ]
> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # n [6e ]
> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # P [50 ]
> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # o [6f ]
> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # e [65 ]
> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # : [3a ]
> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # r [72 ]
> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # l [6c ]
> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # i [69 ]
> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # 1 [31 ]
> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # N [4e ]
>
> Why is that? Thanks in advance.
>
> I'm using Ubuntu 16.04 with Tesseract version:
>
> tesseract 3.04.01
>  leptonica-1.73
>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 
> 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
> I have attached the box and tiff files, the data file, and the unicharset
> file.
>
>

thanPhone:
herlinelNofeelgoodWa5hingtoncontentuse310ntheirFinance5tates
Shipping%
125mVocationalread6NowspecialThatMaRCHGeneralbetter336699fo
nt:
DetailsLawdidyourendPriceRSanPeopleof14801301A2EBo300Named
eWithprivatelinkJuneUnitedresultsubject
(241ittleSUCHseeordersecond270JulydetailsNhereinaddressmeook
srightsincludingMWhUpCategoriesStateRating:
Registerlookingpartnathansonda5hboard0nlyrightfrienddoeskee
plike21?andtake65,
IfedenmaphisPMdaylinksSerbicepagesinfoKidsRightsinfocurrentHe
reNodue5iganealth5hehoursmusicthinkpropertYCAor::
wayAmericatotal40CVBytiedusing2butitemprofileClickoffManagem
entyear5hand-
offsyellowdesigniGoDigitalJobWebmaNYEntertainmentprovidesWh
aterkVideohands-
off19nameThisAD7575TQ8830nlinecontrol2004ifButWHileProducth
e?CHildrennon-discriminatiQNHi!
DabidNovrate15processArticleqQarticleanothercaseuponrelatedw
ouldHelpjustdifferenthiHousemustbookbeen5erbiceslNincludewas
calledMUCHh(x)4+xaccessmadehometheseXQHigh12))
howfewPersonalInternationalreviewsstoresListingsupportFpoint
100?LastItInformationwebsiteservices(3)
Universitysiteyearyoure200nesameoutonlyosteron))))
MidnightVfind50atPolicyalreadyGianluigiZanettinilifehotel50meBK
2000versiongot(222)assignmentYIargeitfollowing:Id:
ZWhiteAfterTravelstateNetworkcompanyInternationalHavanaEqui
pmentthemPhotooctoberVaeberyAllserbiceTheresystemFrom:
alsoAvvertiseWHenPrintz6NationalhelpthreePagecouldInternetas
NextCompanybecauseGamefoundTerms5upportEducationrJohneX!
XzKeqiaoBIZz!
BoxajmillionU5down28ProductsCaliforniaNewsreservedreturnArt
sreportsoftwaresomepublic+healthHotel55heEventsEstateperson

DownloadtooCityfij0urlessuser3FORJoinsportssmallupProjectSe
lectthenoldBedford,
nbspEnglandNEWoverNot+haveits29makingShoesToysinformationB
uy26aboveReferencePostedbYTVinfobunorldInsuranceeaCHCarC
omputersnothost2001pricenewsfocusl9982004please+4hasaream
embersWhenbestFirstAprPreSSCHangessureDi-
carAsMainInMayn960ateANDnoLegalfiletherehongZHizai5hang996,
2006yoquknowstillHecallDirectory5ecuritypolicyprogramlook55P:
49900(58633LoginlistItemsmembergoEimagecommentsDVD?MAKE:d:
bXdevelopmentgreatrealstudentsfor5ystemsRealCompareseaRCH))
becomeWestadditweresiteowithout00worldhere16Totalcareother
qualityofleft3HistoryzroemailImageDesigndontMedical))27:308(B),
basedonline(4,7,3,1,9,6,2,8,5)
postCollegeZNewtwoDobusinessReportaccountlastTimeCouncilArt
9underFREEIncyetComputerputToGetAD,,EG,,HL,,MP,,
QqullealueBest5atitsUsemessagestudyQtimeMicrosoftSpecialIsC
ontact60nowmyCarelow?5hop(eur0841)IncludingCustomer%
SCHool5aveoneq!
ItsCanadaU5AstoreHsectionprovideReadlongffound-
abortingCHeck))rbha5h5hift(ha5h)Name:
anynewdigitaldistractionsanti-inflationar%
makeLibraRYIIProgramnetwor8WELTMEI5TER30Usassimilationists,
AnEastveRYfeedbackdateWAndafterjobtoNumber5PleaseBTHEPCf
ebJResultSWHereseveral31764
(1680nextWindowsPriceswithinsothingsGamesbothReturn5Howeve
r,5eaRCH,sdayst,nd,rd,th,ZX+!XBrowseEmailpayalwaysmayq))
KornLocalRelatedResourcesTeCHnologyFullalltreeXynevergivenInd
ex5alesthroughTopvideoDescription-
Wefree1999Title0therisAnuunciGayItaliani14termsplaYXbox,
around5ee5hallfromThese2,500,000,000lifeFreefamilthim13Price:
musicmoresuitcase?CanadianThephotos:
ZYjZbesaidcannotisetNorthpeopleabledatamoneyJobSHadhome5K-
10+50C,sincemight+yQur0Back1,idontunderstand?daYSHbar:
similarGuildford,

0xfordzroSouthamCheckGardenMessagesomethingpostedsystemG
reatperx-747367,by:Visitan)Blogme%productRateINTL:
itHowtaohihi!!905hngathatagainstbeforeouranti-inflationar%
8provideddoMa%
RYVzduringarticlesStockpersonalDepartment23Post!
0rdercomputerFAQPrivacyshippingreseaRCHgameCJCopyrightWHat
Novemberthose2005LyricsAccountFoodeBaytype4R55Commentsma
ingetCurrent5treetPAbout5hould))memberRe:ListReserved(2)
ifoaoaoa:)0fficemeansjAprilMovies2003faith-bufprlaceso,1,2,,t-
IthatArticlestheyworkinngenter5Weathercannumber0peninitialt
ransition))havingAMAQueo!5torecontactLearn17Business))
ReviewsAccessoriesworkstartPublicEsioff-
LonGuidepricesoqtopGroquahoo!MJournalresultswithInfooFY?
ScienceAugustpicturesneedAdvancedopenPagesl4001950(Canada)
0verallTheYWHyAmerican1?
2areSHQWSCHoolcreditused22MoreleastReseaRCHto+MostMapChang
e5alesafety-significant5tmilesDaYTools%highIFondation-
50lidaritMrtheLinkRownIteMWHiCH15416956-
702841FebruaryMy10hCountyAiranti-
authoritarianreaIIYpageEngliShCartDatacomeaboutus25untiltimes
hebackAddForLinksFindclickElectronicsthisViewBoardimportantPr
ofileDate:
BlackyQuRightsForumsgrouppoweritemssexrequiredJanuaRY%
METATITLE%saYproductsavailablewaterdoxal-5-
phosphateMedia+JanBookPacificAfghanistantext(1)
T5endUK7Replysendformsincefirstintention-in-actionbetweenPad,
Standard,14x27EagoingQintos-
TGTAACCTCTACTCCCAtransnationalizationsolovelocalImbeingT0web
