Re: unicode issues [EXTERNAL]

Finan, Sean Sun, 22 Sep 2019 17:56:48 -0700

Hi Jeritt,

I checked in a change to FileTreeReader.  There is good and bad:  The bad is 
that it will ignore any encoding explicitly set by the user.  The good is that 
it will bypass the buffer-to-String step, so as long as Java figures out the 
encoding there should be no problems with buffers cutting characters in half.
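
For anyone who wants to see the idea in isolation, here is a minimal sketch. It is not the committed FileTreeReader change; the class and method names (StreamDecodeSketch, readAll) are made up for illustration, and unlike the checked-in change it takes the charset as a parameter. It only shows that a single stateful decoder (java.io.InputStreamReader) carries partial multi-byte sequences across its internal buffer refills, so no character is cut in half.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of cTAKES.
final class StreamDecodeSketch {

    // Decodes the whole stream with one stateful decoder instead of
    // converting each raw byte buffer to a String on its own.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            // This buffer holds decoded chars, not raw bytes; its size only
            // limits how many characters come back per read() call.
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, count);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "First the right 1 o\u2019clock position";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        // Prints whether the decoded text matches the original.
        System.out.println(decoded.equals(text));
    }
}
```

The trade-off is the same as described above: the decoding is only as good as the charset handed to (or assumed by) the reader.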


My tests have worked on 3 different encodings, but if anybody out there has 
problems then please let me know.

Thanks again for making me aware of a problem.

Sean

________________________________________
From: Jeritt Thayer <jeritt.tha...@gmail.com>
Sent: Tuesday, September 17, 2019 2:03 PM
To: dev@ctakes.apache.org
Subject: Re: unicode issues [EXTERNAL]

Hi Sean,

Thanks for the information. I was having a similar issue related to "spans" 
occasionally being off by one when running cTAKES 4.0.0 in two different modes 
- a modified entry point for a Spark cluster and validation of a random subset 
using runClinicalPipeline.sh.

I was looking through the FileTreeReader class and noticed something that I 
think may have contributed to the discrepancies. The following line 
(https://github.com/apache/ctakes/blob/7f6dfd7d20253f88c25bea2fdde5cf22b004b63d/ctakes-core/src/main/java/org/apache/ctakes/core/cr/FileTreeReader.java#L243)
sets a buffer to 8192, which will read in the first 8192 bytes. At that 
point, this first byte array is converted into a string.

What I noticed for some of our documents is that the last position in the byte 
array would occur in the middle of a multi-byte character. As a result, the 
method tries to convert the first part of the character’s bytes to a string on 
the first loop, and then tries to convert the second portion on the second 
iteration. This results in an additional character, which I think is ultimately 
causing our "span" discrepancy.
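
The effect described above can be reproduced with a minimal sketch. This is not the FileTreeReader code; the class name SplitCharDemo, the method decodeInChunks, and the tiny chunk size are assumptions chosen so that a chunk boundary falls inside the three-byte UTF-8 encoding of U+2019. Each chunk is converted to a String on its own, so the split character comes back as replacement characters and the text grows.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of cTAKES.
final class SplitCharDemo {

    // Converts each fixed-size chunk of bytes to a String independently,
    // the way a "read buffer, then new String(buffer)" loop would.
    static String decodeInChunks(byte[] bytes, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < bytes.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, bytes.length);
            byte[] chunk = Arrays.copyOfRange(bytes, start, end);
            // An incomplete multi-byte sequence is replaced by U+FFFD here.
            sb.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "1 o\u2019clock";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Chunk size 4 puts the boundary after the first byte of U+2019.
        String chunked = decodeInChunks(bytes, 4);
        System.out.println("original chars: " + text.length());
        System.out.println("chunked chars:  " + chunked.length());
    }
}
```

With a real 8192-byte buffer the same thing only happens when a multi-byte character happens to straddle a buffer boundary, which would explain why the spans are only occasionally off.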

Does my thought process make sense with your understanding of the code?

Thanks,
Jeritt

On 2019/07/18 21:22:34, "Finan, Sean" <sean.fi...@childrens.harvard.edu> wrote:
> Hi Tim, Remy,
>
> The fake notes have non-UTF-8 formatting in the smoker/ directory.  You can 
> run the default pipeline on those files and look at various outputs (Pretty 
> Text, Pretty Property, Pretty Html) and you will see that ctakes maintains 
> offsets despite the encoding.
>
> The FileTreeReader used by the Default Clinical Pipeline has the ability to 
> read and maintain different encodings as set by the optional parameter 
> "Encoding".  When not specified, the encoding falls back to the Java default, 
> normally UTF-8.
>
> The FileTreeReader actually reads a byte stream, not encoded characters.  By 
> default the -extra- bytes will be put in the document text and ctakes thinks 
> that they are odd (non-alpha ASCII) characters.   Therefore the text offsets 
> will not be messed up.  Individual engines may or may not be impacted by the 
> non-alpha characters.  For instance, I have noticed that cleartk annotators 
> slow down when presented with these documents - e.g. smoker/doc2_*past_smoker 
> has 137 words on 32 lines, but assertion takes 2 full seconds.
>
> I think that the problem arises because the REST interface accepts a posted 
> string (any format / unicode) and no byte-to-UTF-8 conversion is performed.  Each 
> annotator in the pipeline is left up to its own devices with respect to 
> handling or not handling special characters.
>
> We can try to perform a similar conversion (string -to- raw byte, byte to 
> string) in the CtakesRestController.
>
> Sean
>
>
>
> ________________________________
> From: Remy Sanouillet <re...@foreseemed.com>
> Sent: Thursday, July 18, 2019 5:06 PM
> To: dev@ctakes.apache.org
> Subject: Re: unicode issues [EXTERNAL]
>
> From my experience, cTakes is fully capable of dealing with Unicode input 
> since even the default dictionary contains some diacritics and those entries 
> are recognized. My guess is that something is getting lost in translation in 
> the encoding/decoding occurring around the REST API. You have to be very 
> careful with Python to specify the correct encoding when doing any Unicode 
> text transfer.
>
> Rémy Sanouillet
> NLP Engineer
> re...@foreseemed.com
>
>
> ForeSee Medical, Inc.
> 12555 High Bluff Drive, Suite 100
> San Diego, CA 92130
>
> NOTICE: This e-mail message and all attachments transmitted with it are 
> intended solely for the use of the addressee and may contain legally 
> privileged and confidential information. If the reader of this message is not 
> the intended recipient, or an employee or agent responsible for delivering 
> this message to the intended recipient, you are hereby notified that any 
> dissemination, distribution, copying, or other use of this message or its 
> attachments is strictly prohibited. If you have received this message in 
> error, please notify the sender immediately by replying to this message and 
> please delete it from your computer.
>
>
> On Thu, Jul 18, 2019 at 1:47 PM Miller, Timothy 
> <timothy.mil...@childrens.harvard.edu> wrote:
> Thanks Remy, that makes sense, but I'm wondering why I get the correct 
> offsets in one way of accessing ctakes (the CVD) but the wrong offsets 
> through another way (the REST interface)?
>
> I guess for the fake notes I'm fully in favor of saving as plain text/ascii 
> files to simplify things. But there are more unicode characters than we can 
> write smart rules for and I'd like to make sure unicode strings at least 
> don't screw up offsets, even if we don't process them meaningfully. I'm sure 
> we all look forward to generation Z doctor's notes that use the thumbs 
> up/down emojis for patient prognosis :).
>
> Tim
>
>
>
> -----Original Message-----
> From: Remy Sanouillet <re...@foreseemed.com>
> Reply-to: <dev@ctakes.apache.org>
> To: dev@ctakes.apache.org
> Subject: Re: unicode issues [EXTERNAL]
> Date: Thu, 18 Jul 2019 13:37:33 -0700
>
> Hi Tim,
>
> What is happening is that your o'clock contains a smart quote (Unicode 
> U+2019), which is encoded in UTF-8 as three bytes: 0xE2 0x80 0x99. You have 
> to take those two extra bytes into account when counting offsets. For that 
> particular character, it is much easier to just preprocess the text and 
> replace all occurrences with the simple apostrophe (ASCII 0x27), the one on 
> your keyboard. It won't change any interpretation and it makes life simpler 
> for everyone downstream. You will probably want to deal with all extended 
> Unicode characters, like emojis; otherwise, you will encounter the same 
> offset issues.
>
> Rémy Sanouillet
> NLP Engineer
> re...@foreseemed.com
>
>
>
> On Thu, Jul 18, 2019 at 1:20 PM Miller, Timothy 
> <timothy.mil...@childrens.harvard.edu> wrote:
> I'm having a weird issue with unicode characters in one of the sample notes 
> distributed with ctakes. The sentence is:
>
> The right breast and axilla were sterilely prepped and draped in the usual 
> standard fashion.  First the right 1 o'clock position 5 cm from the nipple 
> was targeted.  Local anesthesia was obtained with 2% xylocaine.  A small skin 
> incision was made.  Under ultrasound guidance from a medial approach, 2 
> passes with a 14 gauge biopsy device were performed and sent to pathology.  A 
> clip was placed.
>
> The unicode characters are the right single quotes in "o'clock". If I just 
> put it in the CVD everything works fine, e.g. I find the drug "xylocaine" at 
> location 203-212 and it's highlighted correctly. However, if I use the REST 
> interface and send it using the Python requests API, I get back the span 
> 205:214. If we then grab that span we get the wrong string (offset by 2, so 
> something like "locaine. ").
>
> Any thoughts on where things might be going wrong for the REST interface? 
> Does anyone more knowledgeable than me know how UIMA and cTAKES (and java for 
> that matter) normally handle unicode?
>
> Tim
>
>
>
