Hi Jeritt,

I checked in a change to FileTreeReader. There is good and bad. The bad is that it will ignore any encoding explicitly set by the user. The good is that it will bypass the buffer-to-String step, so as long as Java figures out the encoding there should be no problems with buffers cutting characters in half.
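For readers following along, the chunk-boundary problem and the effect of skipping the per-buffer conversion can be sketched roughly as below. This is a Python illustration of the idea, not the Java code in FileTreeReader: the function names `decode_per_chunk` and `decode_as_stream` are made up for the example, and a 4-byte chunk stands in for the real 8192-byte buffer so that the split is easy to see.

```python
import codecs
import io


def decode_per_chunk(data: bytes, chunk_size: int, encoding: str = "utf-8") -> str:
    """Decode every fixed-size chunk on its own (the problematic pattern)."""
    stream = io.BytesIO(data)
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # If a multi-byte character straddles the chunk boundary, neither
        # chunk holds a complete sequence, so the fragments are turned into
        # U+FFFD replacement characters instead of the original character.
        parts.append(chunk.decode(encoding, errors="replace"))
    return "".join(parts)


def decode_as_stream(data: bytes, chunk_size: int, encoding: str = "utf-8") -> str:
    """Decode incrementally, carrying incomplete sequences over to the next chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    stream = io.BytesIO(data)
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts.append(decoder.decode(chunk))
    # Flush the decoder; this raises if the input ends mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    text = "1 o\u2019clock position"   # U+2019 is three bytes in UTF-8
    data = text.encode("utf-8")
    # With 4-byte chunks the three bytes of U+2019 are split across two chunks.
    print(len(text), len(decode_per_chunk(data, 4)), len(decode_as_stream(data, 4)))
```

The per-chunk version returns a longer string than the original whenever a character is split, which is the kind of extra character that shifts spans; the incremental version returns the original text regardless of where the chunk boundaries fall.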
My tests have worked on 3 different encodings, but if anybody out there has problems then please let me know.

Thanks again for making me aware of a problem.

Sean

________________________________________
From: Jeritt Thayer <jeritt.tha...@gmail.com>
Sent: Tuesday, September 17, 2019 2:03 PM
To: dev@ctakes.apache.org
Subject: Re: unicode issues [EXTERNAL]

Hi Sean,

Thanks for the information. I was having a similar issue related to "spans" occasionally being off by one when running cTAKES 4.0.0 in two different modes: a modified entry point for a Spark cluster, and validation of a random subset using runClinicalPipeline.sh.

I was looking through the FileTreeReader class and noticed something that I think may have contributed to the discrepancies. The following line (https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_ctakes_blob_7f6dfd7d20253f88c25bea2fdde5cf22b004b63d_ctakes-2Dcore_src_main_java_org_apache_ctakes_core_cr_FileTreeReader.java-23L243&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=AdmrGg-g9T2SpuyempiTz8pKMeK0xDSFufT3r6bAefI&s=KWepFY7D3KjFInbdGF2_T-K-GGpfYgUmREq49VRxP_A&e= ) sets a buffer size of 8192, so the first read takes in the first 8192 bytes. At that point, this first byte array is converted into a string. What I noticed for some of our documents is that the last position in the byte array would fall in the middle of a multi-byte character. As a result, the method tries to convert the first part of the character’s bytes to a string on the first iteration, and then tries to convert the remaining bytes on the second iteration. This results in an additional character, which I think is ultimately causing our "span" discrepancy.

Does my thought process make sense with your understanding of the code?

Thanks,
Jeritt

On 2019/07/18 21:22:34, "Finan, Sean" <sean.fi...@childrens.harvard.edu> wrote:
> Hi Tim, Remy,
>
> The fake notes have non-UTF-8 formatting in the smoker/ directory.
> You can run the default pipeline on those files and look at various outputs (Pretty Text, Pretty Property, Pretty Html) and you will see that ctakes maintains offsets despite the encoding.
>
> The FileTreeReader used by the Default Clinical Pipeline has the ability to read and maintain different encodings as set by the optional parameter "Encoding". When not specified the encoding goes with the java default, normally UTF-8.
>
> The FileTreeReader actually reads a byte stream, not encoded characters. By default the -extra- bytes will be put in the document text and ctakes thinks that they are odd (non-alpha ASCII) characters. Therefore the text offsets will not be messed up. Individual engines may or may not be impacted by the non-alpha characters. For instance, I have noticed that cleartk annotators slow down when presented with these documents - e.g. smoker/doc2_*past_smoker has 137 words on 32 lines, but assertion takes 2 full seconds.
>
> I think that the problem arises because the rest interface accepts a posted string (any format / unicode) and no byte -to- UTF-8 is performed. Each annotator in the pipeline is left up to its own devices with respect to handling or not handling special characters.
>
> We can try to perform a similar conversion (string -to- raw byte, byte to string) in the CtakesRestController.
>
> Sean
>
> ________________________________
> From: Remy Sanouillet <re...@foreseemed.com>
> Sent: Thursday, July 18, 2019 5:06 PM
> To: dev@ctakes.apache.org
> Subject: Re: unicode issues [EXTERNAL]
>
> From my experience, cTakes is fully capable of dealing with Unicode input since even the default dictionary contains some diacritics and those entries are recognized. My guess is that something is getting lost in translation in the encoding/decoding occurring around the REST api. You have to be very careful with python to specify the correct encoding when doing any Unicode text transfer.
>
> Rémy Sanouillet
> NLP Engineer
> re...@foreseemed.com
>
> ForeSee Medical, Inc.
> 12555 High Bluff Drive, Suite 100
> San Diego, CA 92130
>
> NOTICE: This e-mail message and all attachments transmitted with it are intended solely for the use of the addressee and may contain legally privileged and confidential information. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution, copying, or other use of this message or its attachments is strictly prohibited. If you have received this message in error, please notify the sender immediately by replying to this message and please delete it from your computer.
>
> On Thu, Jul 18, 2019 at 1:47 PM Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
> Thanks Remy, that makes sense, but I'm wondering why I get the correct offsets in one way of accessing ctakes (the CVD) but the wrong offsets through another way (the REST interface)?
>
> I guess for the fake notes I'm fully in favor of saving as plain text/ascii files to simplify things. But there are more unicode characters than we can write smart rules for and I'd like to make sure unicode strings at least don't screw up offsets, even if we don't process them meaningfully. I'm sure we all look forward to generation Z doctor's notes that use the thumbs up/down emojis for patient prognosis :).
>
> Tim
>
> -----Original Message-----
> From: Remy Sanouillet <re...@foreseemed.com>
> Reply-to: <dev@ctakes.apache.org>
> To: dev@ctakes.apache.org
> Subject: Re: unicode issues [EXTERNAL]
> Date: Thu, 18 Jul 2019 13:37:33 -0700
>
> Hi Tim,
>
> What is happening is that your o'clock contains a smart quote (Unicode U+2019) which is encoded in UTF-8 as three bytes: 0xE2 0x80 0x99, so you have to take those two extra bytes into account when counting offsets. For that particular character, it is much easier to just preprocess the text and replace all occurrences with the simple apostrophe (ASCII 0x27). The one on your keyboard. It won't change any interpretation and it makes life simpler for everyone downstream. You will probably want to deal with all extended Unicode characters like emojis as well; otherwise, you will encounter the same offset issues.
>
> Rémy Sanouillet
> NLP Engineer
> re...@foreseemed.com
>
> On Thu, Jul 18, 2019 at 1:20 PM Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
> I'm having a weird issue with unicode characters in one of the sample notes distributed with ctakes. The sentence is:
>
> The right breast and axilla were sterilely prepped and draped in the usual standard fashion. First the right 1 o'clock position 5 cm from the nipple was targeted. Local anesthesia was obtained with 2% xylocaine. A small skin incision was made. Under ultrasound guidance from a medial approach, 2 passes with a 14 gauge biopsy device were performed and sent to pathology. A clip was placed.
>
> The unicode characters are the right single quotes in "o'clock". If I just put it in the CVD everything works fine, e.g. I find the drug "xylocaine" at location 203-212 and it's highlighted correctly. However, if I use the REST interface and send it using the python requests API, I get back the span 205:214. If we then grab that span we get the wrong string (offset by 2, so something like "locaine. ").
>
> Any thoughts on where things might be going wrong for the REST interface? Does anyone more knowledgeable than me know how UIMA and cTAKES (and java for that matter) normally handle unicode?
>
> Tim
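A shift of exactly two positions is what one would expect if one side of the exchange counts UTF-8 bytes while the other counts characters, since a single U+2019 occupies three bytes but only one character. Whether that is what actually happens in the REST path is not established in this thread, so the following Python sketch only illustrates the arithmetic. The helper names (`char_span`, `utf8_byte_span`, `byte_span_to_char_span`) and the shortened sample sentence are assumptions for the example, not part of cTAKES or its REST API.

```python
from typing import Tuple


def char_span(text: str, needle: str) -> Tuple[int, int]:
    """Span of needle counted in characters (code points)."""
    start = text.index(needle)
    return start, start + len(needle)


def utf8_byte_span(text: str, needle: str) -> Tuple[int, int]:
    """Span of needle counted in UTF-8 bytes."""
    encoded_needle = needle.encode("utf-8")
    start = text.encode("utf-8").index(encoded_needle)
    return start, start + len(encoded_needle)


def byte_span_to_char_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Map a UTF-8 byte span back to character offsets in the same text."""
    data = text.encode("utf-8")
    # Decoding the prefix up to each boundary gives the number of characters
    # that precede it; both boundaries must fall between whole characters.
    return len(data[:start].decode("utf-8")), len(data[:end].decode("utf-8"))


if __name__ == "__main__":
    # Shortened stand-in for the sample note, with a real U+2019 in "o'clock".
    sample = ("First the right 1 o\u2019clock position was targeted. "
              "Local anesthesia was obtained with 2% xylocaine.")
    chars = char_span(sample, "xylocaine")
    raw = utf8_byte_span(sample, "xylocaine")
    # One three-byte character before the match moves the byte span 2 to the right.
    print(chars, raw, raw[0] - chars[0])
    print(byte_span_to_char_span(sample, *raw) == chars)
```

If a service turns out to report byte offsets, converting them back with something like `byte_span_to_char_span` keeps the original text intact, as an alternative to replacing the smart quote before sending.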