Is there a way to stop some hyphenated terms from being tokenized

Tang, Rebecca Wed, 05 Nov 2014 14:27:56 -0800

Hi there,

I want some hyphenated terms to stay as they are instead of being split by
the tokenizer, for example e-cigarette, e-cig, and I-pad. I don't want them
to be split into "e" and "cig" or "I" and "pad", because the single letters
"e" and "I" produce too many false-positive matches.


Is there a way to tell the standard tokenizer to skip tokenizing some terms?
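
To make the desired behavior concrete, here is a minimal sketch in plain
Python rather than Lucene code. The names PROTECTED_TERMS and tokenize are
hypothetical, the regex only roughly approximates how the standard tokenizer
splits on hyphens, and lowercasing is assumed to happen as part of analysis.

```python
import re

# Hypothetical list of terms that should survive tokenization intact.
PROTECTED_TERMS = {"e-cigarette", "e-cig", "i-pad"}

# Runs of letters/digits, optionally joined by single hyphens.
_WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text, protected=PROTECTED_TERMS):
    """Split on hyphens and non-word characters, except for protected terms."""
    tokens = []
    for match in _WORD.finditer(text):
        word = match.group(0).lower()
        if word in protected:
            # Keep the whole hyphenated term as one token.
            tokens.append(word)
        else:
            # Default behavior: each hyphen-separated part is its own token.
            tokens.extend(word.split("-"))
    return tokens


print(tokenize("An e-cig and a well-known I-pad"))
# → ['an', 'e-cig', 'and', 'a', 'well', 'known', 'i-pad']
```

With this behavior, a query for "e-cig" would match only the whole term and
never the bare letter "e", while other hyphenated words are still split.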

Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: rebecca.t...@ucsf.edu
