Another question/topic likely for Sean & Tim. Happy to get others’ feedback as well.
I am trying to identify gene-related information. It appears that the PTB tokenization logic, in places like the tokenizer and dictionary building, splits a string into multiple tokens if it is not a number and contains a period.

For example, given “22q11.2 deletion syndrome”:

  PTB tokenizer: [22q11, .2, deletion, syndrome]
  POS for the above term: [CD, CD, NN, NN]
  Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]

The same string is split differently, as [22q11, ., 2, deletion, syndrome], in the new dictionary module (RareWordTermMapCreator.getTokens).

When the _rareWordTermMap gets created, it uses the first token as the key:

  22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]

The period-split difference above (period alone vs. period + number) might be irrelevant here, because for the input “22q11.2 deletion syndrome” the lookup indices are [2,3]. The new lookup will ignore the incoming token “22q11” because it is tagged CD, and “.2” because it is a number. It looks like this concept cannot be identified unless CD is allowed as a lookup token POS. Even if that is allowed, though, I think the PTB rules might not be the best fit for gene locations.

Are there any thoughts/experiences regarding addressing gene location mentions like this? Should the Fast Dict tokenization logic match the PTB tokenizer logic so that both produce the same components?

Let me know if I have misread any of these points. Since these items would likely cause large changes, I am looking to get some feedback before moving forward.

Cheers,
Britt

Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com
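
P.S. To make the example above concrete, here is a toy sketch in Python. It is not the cTAKES code: the function names (ptb_like_split, dictionary_like_split, lookup_indices) are hypothetical, and the rules are simplified assumptions chosen only to reproduce the two splits and the [2,3] lookup indices reported above. It treats both excluded tokens as CD, which is a simplification of the "CD" and "number" checks described in the message.

```python
import re

# Assumption: a "number" is a plain integer or decimal such as 22 or 11.2.
NUMBER = re.compile(r"\d+(\.\d+)?")


def ptb_like_split(text):
    """Toy rule: a word that is not a number is cut before its first period."""
    tokens = []
    for word in text.split():
        if "." in word and not NUMBER.fullmatch(word):
            head, tail = word.split(".", 1)
            tokens.extend(t for t in (head, "." + tail) if t)
        else:
            tokens.append(word)
    return tokens


def dictionary_like_split(text):
    """Toy rule: every period becomes its own token."""
    tokens = []
    for word in text.split():
        tokens.extend(t for t in re.split(r"(\.)", word) if t)
    return tokens


def lookup_indices(pos_tags, excluded=frozenset({"CD"})):
    """Indices of tokens whose POS tag is allowed to start a lookup."""
    return [i for i, tag in enumerate(pos_tags) if tag not in excluded]


term = "22q11.2 deletion syndrome"
ptb_tokens = ptb_like_split(term)          # tokens seen at lookup time
dict_tokens = dictionary_like_split(term)  # tokens seen at dictionary build time
indices = lookup_indices(["CD", "CD", "NN", "NN"])

# The dictionary entry is keyed on its first token, as in the example above.
rare_word_key = dict_tokens[0]
probed = [ptb_tokens[i] for i in indices]
key_is_probed = rare_word_key in probed

print(ptb_tokens)
print(dict_tokens)
print(indices, probed, key_is_probed)
```

In this sketch the key "22q11" sits at index 0, which is never among the probed tokens, so the entry cannot be reached regardless of how the period is split. Passing an empty excluded set to lookup_indices shows what allowing CD would change.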