See https://github.com/tesseract-ocr/tesseract/issues/318 regarding the unicharset format
I was able to do regular tesseract training (not LSTM) using the tesseract 4.00.00 version from GitHub master, and created a new unicharset and traineddata with your box/tiff pair. The output on the same tiff file is enclosed.

I think you will get better results if the training input text has interword spaces.

On Monday, June 19, 2017 at 4:09:29 PM UTC+5:30, David Barishev wrote:
>
> Hello all!
> I'm trying to train tesseract to recognize a new font in English (supercell-magic).
> I have created a .tif file and a matching .box file using jTessBoxEditor
> (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box), and trained tesseract with them.
>
> Here is tesseract's output:
> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
> Page 1
> row xheight=30, but median xheight = 37.5455
> APPLY_BOXES:
> Boxes read from boxfile: 1559
> Found 1559 good blobs.
> Generated training data for 34 words
> Page 2
> APPLY_BOXES:
> Boxes read from boxfile: 1677
> Found 1677 good blobs.
> Generated training data for 34 words
> Page 3
> APPLY_BOXES:
> Boxes read from boxfile: 1362
> Found 1362 good blobs.
> Generated training data for 28 words
>
> So the next step is to extract the characters using unicharset_extractor.
> Its output looked normal:
> $ unicharset_extractor eng.supercell-magic.exp0.box
> Extracting unicharset from eng.supercell-magic.exp0.box
> Wrote unicharset file ./unicharset.
>
> But when I view the file, it's mostly 0 and 255, which is not like the
> example in the wiki
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
>
> An example of the unicharset file
>
> 110
> NULL 0 NULL 0
> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
> ...
>
> Mine looks more like this:
>
> 74
> NULL 0 NULL 0
> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # t [74 ]
> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # h [68 ]
> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # n [6e ]
> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # P [50 ]
> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # o [6f ]
> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # : [3a ]
> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # r [72 ]
> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # l [6c ]
> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # i [69 ]
> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # 1 [31 ]
> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # N [4e ]
>
> Why is that? Thanks in advance.
>
> I'm using Ubuntu 16.04 with tesseract version:
>
> tesseract 3.04.01
> leptonica-1.73
> libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
> I have attached the box and tiff file and the data file, and the unicharset
> file.
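In the problematic unicharset quoted in this thread, every entry has properties 0 and the same metrics string, 0,255,0,255,0,0,0,0,0,0, which suggests the per-character properties were never filled in after extraction. The following is a minimal Python sketch, not part of the Tesseract tooling, that lists such entries so a file can be checked quickly. The names find_unset_entries, DEFAULT_METRICS and SAMPLE are made up for illustration, and the field layout (character, properties, metrics, script, ...) is assumed from the two examples quoted in this thread.

```python
# Placeholder metrics seen on every entry of the problematic file quoted in this thread.
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"


def find_unset_entries(lines):
    """Return the characters whose metrics field still holds the placeholder."""
    unset = []
    for line in lines[1:]:  # lines[0] is the entry count, e.g. "74"
        fields = line.split()
        # Assumed layout: <char> <properties> <metrics> <script> ...
        # Short lines such as "NULL 0 NULL 0" carry no metrics and never match.
        if len(fields) >= 3 and fields[2] == DEFAULT_METRICS:
            unset.append(fields[0])
    return unset


# A few lines copied from the two unicharset examples quoted in this thread.
SAMPLE = [
    "4",
    "NULL 0 NULL 0",
    "Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]",
    "t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # t [74 ]",
    "N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N",
]

print(find_unset_entries(SAMPLE))  # prints ['Joined', 't']
```

To check a real file, pass its lines instead of SAMPLE, for example open("unicharset", encoding="utf-8").read().splitlines(). If most entries are flagged, the training wiki describes a separate set_unicharset_properties step that runs after unicharset_extractor; whether that step resolves this particular case is an assumption, not something confirmed in this thread.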
thanPhone: herlinelNofeelgoodWa5hingtoncontentuse310ntheirFinance5tates Shipping% 125mVocationalread6NowspecialThatMaRCHGeneralbetter336699fo nt: DetailsLawdidyourendPriceRSanPeopleof14801301A2EBo300Named eWithprivatelinkJuneUnitedresultsubject (241ittleSUCHseeordersecond270JulydetailsNhereinaddressmeook srightsincludingMWhUpCategoriesStateRating: Registerlookingpartnathansonda5hboard0nlyrightfrienddoeskee plike21?andtake65, IfedenmaphisPMdaylinksSerbicepagesinfoKidsRightsinfocurrentHe reNodue5iganealth5hehoursmusicthinkpropertYCAor:: wayAmericatotal40CVBytiedusing2butitemprofileClickoffManagem entyear5hand- offsyellowdesigniGoDigitalJobWebmaNYEntertainmentprovidesWh aterkVideohands- off19nameThisAD7575TQ8830nlinecontrol2004ifButWHileProducth e?CHildrennon-discriminatiQNHi! DabidNovrate15processArticleqQarticleanothercaseuponrelatedw ouldHelpjustdifferenthiHousemustbookbeen5erbiceslNincludewas calledMUCHh(x)4+xaccessmadehometheseXQHigh12)) howfewPersonalInternationalreviewsstoresListingsupportFpoint 100?LastItInformationwebsiteservices(3) Universitysiteyearyoure200nesameoutonlyosteron)))) MidnightVfind50atPolicyalreadyGianluigiZanettinilifehotel50meBK 2000versiongot(222)assignmentYIargeitfollowing:Id: ZWhiteAfterTravelstateNetworkcompanyInternationalHavanaEqui pmentthemPhotooctoberVaeberyAllserbiceTheresystemFrom: alsoAvvertiseWHenPrintz6NationalhelpthreePagecouldInternetas NextCompanybecauseGamefoundTerms5upportEducationrJohneX! XzKeqiaoBIZz! 
BoxajmillionU5down28ProductsCaliforniaNewsreservedreturnArt sreportsoftwaresomepublic+healthHotel55heEventsEstateperson DownloadtooCityfij0urlessuser3FORJoinsportssmallupProjectSe lectthenoldBedford, nbspEnglandNEWoverNot+haveits29makingShoesToysinformationB uy26aboveReferencePostedbYTVinfobunorldInsuranceeaCHCarC omputersnothost2001pricenewsfocusl9982004please+4hasaream embersWhenbestFirstAprPreSSCHangessureDi- carAsMainInMayn960ateANDnoLegalfiletherehongZHizai5hang996, 2006yoquknowstillHecallDirectory5ecuritypolicyprogramlook55P: 49900(58633LoginlistItemsmembergoEimagecommentsDVD?MAKE:d: bXdevelopmentgreatrealstudentsfor5ystemsRealCompareseaRCH)) becomeWestadditweresiteowithout00worldhere16Totalcareother qualityofleft3HistoryzroemailImageDesigndontMedical))27:308(B), basedonline(4,7,3,1,9,6,2,8,5) postCollegeZNewtwoDobusinessReportaccountlastTimeCouncilArt 9underFREEIncyetComputerputToGetAD,,EG,,HL,,MP,, QqullealueBest5atitsUsemessagestudyQtimeMicrosoftSpecialIsC ontact60nowmyCarelow?5hop(eur0841)IncludingCustomer% SCHool5aveoneq! 
ItsCanadaU5AstoreHsectionprovideReadlongffound- abortingCHeck))rbha5h5hift(ha5h)Name: anynewdigitaldistractionsanti-inflationar% makeLibraRYIIProgramnetwor8WELTMEI5TER30Usassimilationists, AnEastveRYfeedbackdateWAndafterjobtoNumber5PleaseBTHEPCf ebJResultSWHereseveral31764 (1680nextWindowsPriceswithinsothingsGamesbothReturn5Howeve r,5eaRCH,sdayst,nd,rd,th,ZX+!XBrowseEmailpayalwaysmayq)) KornLocalRelatedResourcesTeCHnologyFullalltreeXynevergivenInd ex5alesthroughTopvideoDescription- Wefree1999Title0therisAnuunciGayItaliani14termsplaYXbox, around5ee5hallfromThese2,500,000,000lifeFreefamilthim13Price: musicmoresuitcase?CanadianThephotos: ZYjZbesaidcannotisetNorthpeopleabledatamoneyJobSHadhome5K- 10+50C,sincemight+yQur0Back1,idontunderstand?daYSHbar: similarGuildford, 0xfordzroSouthamCheckGardenMessagesomethingpostedsystemG reatperx-747367,by:Visitan)Blogme%productRateINTL: itHowtaohihi!!905hngathatagainstbeforeouranti-inflationar% 8provideddoMa% RYVzduringarticlesStockpersonalDepartment23Post! 0rdercomputerFAQPrivacyshippingreseaRCHgameCJCopyrightWHat Novemberthose2005LyricsAccountFoodeBaytype4R55Commentsma ingetCurrent5treetPAbout5hould))memberRe:ListReserved(2) ifoaoaoa:)0fficemeansjAprilMovies2003faith-bufprlaceso,1,2,,t- IthatArticlestheyworkinngenter5Weathercannumber0peninitialt ransition))havingAMAQueo!5torecontactLearn17Business)) ReviewsAccessoriesworkstartPublicEsioff- LonGuidepricesoqtopGroquahoo!MJournalresultswithInfooFY? ScienceAugustpicturesneedAdvancedopenPagesl4001950(Canada) 0verallTheYWHyAmerican1? 
2areSHQWSCHoolcreditused22MoreleastReseaRCHto+MostMapChang e5alesafety-significant5tmilesDaYTools%highIFondation- 50lidaritMrtheLinkRownIteMWHiCH15416956- 702841FebruaryMy10hCountyAiranti- authoritarianreaIIYpageEngliShCartDatacomeaboutus25untiltimes hebackAddForLinksFindclickElectronicsthisViewBoardimportantPr ofileDate: BlackyQuRightsForumsgrouppoweritemssexrequiredJanuaRY% METATITLE%saYproductsavailablewaterdoxal-5- phosphateMedia+JanBookPacificAfghanistantext(1) T5endUK7Replysendformsincefirstintention-in-actionbetweenPad, Standard,14x27EagoingQintos- TGTAACCTCTACTCCCAtransnationalizationsolovelocalImbeingT0web