Hi Sean, do you want a ticket for the PTB update?

Cheers,

Britt



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Jul 15, 2015, at 9:07 AM, britt fitch <britt.fi...@wiredinformatics.com> 
> wrote:
> 
> Thanks Sean.
> 
> The other part of the concern is whether it’s reasonable/feasible to alter 
> tokenization rules for things like gene locations. I can work around this in 
> a few ways, but if this could come up in other cases it may be worth looking 
> at a blanket change. Sadly I don’t have another example off the top of my 
> head; maybe organism names? From a few queries for UMLS terms containing 
> periods, the majority seem to be things you really would want to split on. 
> Perhaps genes are just an edge case.
> 
> I was looking at gene locations overall, not any particular gene or disorder 
> grouping. The term I mentioned was just meant to be an example.
> 
> 
>> On Jul 15, 2015, at 8:57 AM, Finan, Sean <sean.fi...@childrens.harvard.edu> 
>> wrote:
>> 
>> Hi Britt,
>> 
>> The dictionary should be using PTB tokenization, but I obviously missed a 
>> rule and separated the “.” from the following “2” in the dictionary.
>> 
>> I will double-check everything.
>> 
>> Sean
>> 
>> p.s. if you don’t mind my asking, are you looking into all connective tissue 
>> disorders or just Shprintzen?
>> 
>> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
>> Sent: Tuesday, July 14, 2015 3:58 PM
>> To: dev@ctakes.apache.org
>> Subject: periods and the interaction with PTB & Fast Dict Lookup.
>> 
>> Another question/topic likely for Sean & Tim. Happy to get others’ feedback 
>> as well.
>> 
>> I am trying to identify gene related information.
>> 
>> It appears that the PTB tokenization logic in places like the tokenizer & 
>> dictionary building will split a string into multiple tokens if it is not a 
>> number and contains a period.
>> 
>> For example, given “22q11.2 deletion syndrome”:
>> 
>> PTB tokenizer: [22q11, .2, deletion, syndrome]
>> POS for the above term: [CD, CD, NN, NN]
>> Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]
>> 
>> The same string is split differently, as [22q11, ., 2, deletion, syndrome], 
>> in the new dictionary module (RareWordTermMapCreator.getTokens).
>> When the _rareWordTermMap gets created, it uses the first token as the key: 
>> 22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
>> 
>> The period-split difference above (period alone vs. period + number) might be 
>> irrelevant here, because for the input “22q11.2 deletion syndrome” the 
>> lookup indices are [2,3].
>> The new lookup will ignore the incoming tokens “22q11” because it’s tagged CD 
>> and “.2” because it’s a number.
>> 
>> It looks like this concept cannot be identified unless CD is allowed as a 
>> lookup token POS.
>> Even if that is allowed, though, I think the PTB rules might not be the best 
>> fit for gene locations.
>> 
>> Are there any thoughts/experiences on handling gene location mentions like 
>> this?
>> Should the Fast Dict tokenization logic match the PTB tokenizer logic to 
>> produce the same components?
>> 
>> Let me know if I have misread any of these points. Since these items would 
>> likely cause large changes, I am looking to get some feedback before moving 
>> forward.
>> 
>> Cheers,
>> 
>> Britt
> 
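
For reference, the mismatch described in the quoted message can be reproduced with a small toy sketch. This is not cTAKES code. The two regular expressions are assumptions chosen only to reproduce the two splits reported in the thread, and the function names, the EXCLUDED_POS set, and the plain dict keyed on the first token are hypothetical stand-ins for the tokenizer, the lookup token filter, and _rareWordTermMap.

```python
import re

def ptb_like_tokens(text):
    # Assumption: a period followed by digits stays attached to the digits,
    # which reproduces the split reported for the PTB tokenizer.
    return re.findall(r"\.\d+|[^\s.]+|\.", text)

def dictionary_like_tokens(text):
    # Assumption: every period becomes its own token, which reproduces the
    # split reported for RareWordTermMapCreator.getTokens.
    return re.findall(r"[^\s.]+|\.", text)

# Hypothetical filter: only the CD tag mentioned in the thread is excluded.
EXCLUDED_POS = {"CD"}

def lookup_indices(pos_tags):
    # Indices of tokens that are allowed to start a dictionary lookup.
    return [i for i, tag in enumerate(pos_tags) if tag not in EXCLUDED_POS]

text = "22q11.2 deletion syndrome"
doc_tokens = ptb_like_tokens(text)
dict_tokens = dictionary_like_tokens(text)
pos_tags = ["CD", "CD", "NN", "NN"]  # tags as reported in the thread

# Toy term map keyed on the first dictionary token, as described in the thread.
rare_word_map = {dict_tokens[0]: [dict_tokens]}

# Only tokens at the allowed indices are ever used as lookup keys.
candidates = [doc_tokens[i] for i in lookup_indices(pos_tags)]
matched = [tok for tok in candidates if tok in rare_word_map]

print(doc_tokens)
print(dict_tokens)
print(candidates, matched)
```

In this sketch the key “22q11” is never tried, because the only candidate tokens are “deletion” and “syndrome”, so the term is missed regardless of how the period is split.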

