[ https://issues.apache.org/jira/browse/TIKA-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oleg Tikhonov updated TIKA-546:
-------------------------------
    Attachment: TIKA-546.tikhonov.18042011.PATCH

1. Added NGramProfile.
2. Added a --createProfile option to TikaCLI. Default values: gramsize = 3, maxlines = 1000. Currently there is no option to change them, because of the LanguageProfile implementation.
4. Added NGramProfileTest.
5. Added a TikaCLI test.

Could anybody have a look at the patch?

> Add ability to create language profiles to tika-app
> ---------------------------------------------------
>
>                 Key: TIKA-546
>                 URL: https://issues.apache.org/jira/browse/TIKA-546
>             Project: Tika
>          Issue Type: New Feature
>          Components: cli, languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>         Attachments: TIKA-546.tikhonov.18042011.PATCH
>
>
> Since TIKA-490 it is supposed to be easy to add new language profiles to
> Tika. However, currently the process involves using Nutch's NGramProfile tool
> and editing the output.
> We should port Nutch's profile builder to Tika and make it part of
> tika-app.jar:
> # See http://wiki.apache.org/nutch/LanguageIdentifier
> # java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...]
> [--maxlines=<max>] <profile-name> <filename> <encoding>
> Using --gramsizes and --maxlines, we could support both Tika-style profiles
> and Nutch-style profiles and thus deprecate the Nutch tool. Defaults should
> be --gramsizes=3 --maxlines=1000.
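
For readers unfamiliar with what a profile builder does with the gram size and line limit discussed above, here is a minimal sketch in Java. It is not the NGramProfile class from the attached patch and does not use any Tika or Nutch API; the class name NGramCountSketch and the method countNGrams are hypothetical, and the choice to count n-grams per line (never across line breaks) without any case folding or normalization is an assumption made only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the NGramProfile class from the patch.
class NGramCountSketch {

    /**
     * Counts character n-grams of length gramSize in the first maxLines lines.
     * Assumption: each line is processed on its own, so no n-gram spans a line break.
     */
    static Map<String, Integer> countNGrams(List<String> lines, int gramSize, int maxLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int limit = Math.min(maxLines, lines.size());
        for (int i = 0; i < limit; i++) {
            String line = lines.get(i);
            // Slide a window of gramSize characters over the line.
            for (int start = 0; start + gramSize <= line.length(); start++) {
                String gram = line.substring(start, start + gramSize);
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("banana", "bandana");
        // Values mentioned in the issue as defaults: gram size 3, at most 1000 lines.
        Map<String, Integer> profile = countNGrams(sample, 3, 1000);
        profile.forEach((gram, count) -> System.out.println(gram + "\t" + count));
    }
}
```

A real profile builder would presumably also normalize the text and write the counts out in the profile file format, which this sketch leaves out.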