My brute-force approach is pretty brutal: change the window size to two, annotate terms, then remove all two-letter annotations except the very few I'm interested in.

On Mon, Aug 24, 2020 at 9:07 PM Peter Abramowitsch <pabramowit...@gmail.com> wrote:

> Hello all
>
> Is there a mechanism, a lookup file, etc. which overrides the window size
> set on the term annotator or the chunker? Changing the window size from
> the default of 3 to 2 opens the floodgate to false acronym annotations. So
> my question is whether there's a place where one can register specific two
> character terms, for example BP or PT, which will be found even with a
> window size set to three.
>
> A similar question about genes. On adding the HGNC vocabulary I notice
> that there are many thousands of aliases for genes which overlap other
> common acronyms and English words such as trip, spring, plan, bed, yes,
> rip, prn etc. I'm not sure if these aliases are ever used. So I created
> a sed script with 4000 regular expressions to remove the 2 and 3 letter gene
> synonyms from a script file. I will only suppress the 4 letter synonyms
> manually where they cause trouble. But does anyone have a more elegant
> solution?
>
> Peter
>
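
For what it's worth, both workarounds in this thread (dropping two-letter annotations after the fact, and stripping short gene synonyms before loading) come down to the same rule: discard short terms unless they are on an allow-list. Below is a minimal sketch of that rule in Python. It is not cTAKES code: `Annotation`, `keep_term`, `filter_annotations` and `filter_synonyms` are hypothetical names made up for illustration, and the allow-list holds only the two examples from the question (BP, PT).

```python
from typing import AbstractSet, Iterable, List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical stand-in for a term annotation (not a cTAKES type)."""
    begin: int
    end: int
    text: str


def keep_term(text: str, min_len: int, allow: AbstractSet[str]) -> bool:
    """Keep a term if it has at least min_len characters or is allow-listed.

    The allow-list match is case-sensitive, so "BP" does not admit "bp".
    """
    term = text.strip()
    return len(term) >= min_len or term in allow


def filter_annotations(
    annotations: Iterable[Annotation],
    min_len: int = 3,
    allow: AbstractSet[str] = frozenset({"BP", "PT"}),
) -> List[Annotation]:
    """Post-filter annotations produced with a window size of two."""
    return [a for a in annotations if keep_term(a.text, min_len, allow)]


def filter_synonyms(
    synonyms: Iterable[str],
    min_len: int = 4,
    allow: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Drop 2 and 3 letter synonyms before they reach the dictionary."""
    return [s for s in synonyms if keep_term(s, min_len, allow)]
```

With these assumptions, annotations such as "is" or "of" are dropped while "BP" and "PT" survive, and synonyms such as "bed", "yes" and "prn" are removed while four-letter ones such as "trip" are left for manual review. An allow-list keeps the exceptions in one small set instead of thousands of removal patterns.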