Hi Greg,

I created a class based on https://stackoverflow.com/a/93029 (see attached). 
The usage could be:

df_text = df['TEXT_FIELD'].tolist()
cleaned = [XMLcleaner(x).xmlstring for x in df_text]
df['TEXT_FIELD'] = cleaned
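
As for why the x.replace(...) call quoted below was not enough, here is a small sketch of what I think is going on. The sample string is made up for illustration. Python's str.replace treats its first argument as a literal substring rather than a regular expression, and even a real regex for non-ASCII characters would leave 0x1c in place, since 0x1c sits inside the ASCII range.

```python
import re

# Hypothetical sample: one control character (0x1c) and one non-ASCII character.
sample = "note\x1ctext\u00e9"

# str.replace looks for the literal substring "[^\x00-\x7F]", which does not
# occur in the sample, so the string comes back unchanged.
literal = sample.replace('[^\x00-\x7F]', '')

# A real regex substitution removes the non-ASCII character but keeps 0x1c,
# because 0x1c lies within 0x00-0x7F.
ascii_only = re.sub(r'[^\x00-\x7F]', '', sample)

print(repr(literal))
print(repr(ascii_only))
```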

Best,
John

From: Greg Silverman <g...@umn.edu.INVALID>
Date: Sunday, March 6, 2022 at 5:10 PM
To: jrcas...@medicine.wisc.edu.invalid <jrcas...@medicine.wisc.edu.invalid>
Cc: dev@ctakes.apache.org <dev@ctakes.apache.org>
Subject: Re: Issue with serializable XML
Hi John,
I thought I did. I'm using a pandas dataframe and passing it through this:
files['note_text'] = files['note_text'].apply(lambda x:
x.replace('[^\x00-\x7F]','')) ... obviously it wasn't enough.
Any suggestions?

Thanks!

Greg--

On Sun, Mar 6, 2022 at 2:46 PM JOHN R CASKEY
<jrcas...@medicine.wisc.edu.invalid> wrote:

> I’ve encountered that when the input text file has control characters, for
> example ^M
>
> The fix I used was to remove all control characters from the input text
> files ahead of time via python.
>
> Best,
> John Caskey
> UW-Madison
> jrcas...@wisc.edu
> ________________________________
> From: Greg Silverman <g...@umn.edu.INVALID>
> Sent: Sunday, March 6, 2022 12:40:00 PM
> To: dev@ctakes.apache.org <dev@ctakes.apache.org>
> Subject: Issue with serializable XML
>
> Got the error during processing of a large set of documents about mid-way
> through:
> org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: ,
> 0x1c
>
> I encountered this once before, but I don't remember what the fix was.
> Running apache-ctakes-4.0.1-SNAPSHOT.
>
> Thanks!
>
> Greg--
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
> Department of Surgery
> University of Minnesota
> g...@umn.edu
>


--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
g...@umn.edu
import itertools
import re

# C0 controls (0x00-0x1F), DEL (0x7F), and C1 controls (0x80-0x9F).
# Note that this range includes tab, newline, and carriage return.
CONTROL_CHARS = ''.join(
    map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
CONTROL_CHAR_RE = re.compile('[%s]' % re.escape(CONTROL_CHARS))

class XMLcleaner:
  def __init__(self, xmlstring=''):
    self.xmlstring = xmlstring

  @property
  def xmlstring(self):
    return self._xmlstring

  @xmlstring.setter
  def xmlstring(self, s):
    try:
      # Assign to the backing attribute. Assigning to self.xmlstring here
      # would re-enter this setter and recurse until RecursionError.
      self._xmlstring = CONTROL_CHAR_RE.sub('', s)
    except TypeError:
      # Non-string input (for example a float NaN) is stored unchanged.
      self._xmlstring = s
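
The class above strips every control character, including tab, newline, and carriage return, which XML 1.0 does allow. If line breaks in the notes matter, a narrower alternative is to remove only the characters outside the XML 1.0 Char production. The following is a minimal sketch under that assumption; the names INVALID_XML10_RE and strip_invalid_xml10 are hypothetical and are not part of the attachment or of cTAKES.

```python
import re

# Everything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML10_RE = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def strip_invalid_xml10(text):
    """Hypothetical helper: drop characters that XML 1.0 cannot represent."""
    return INVALID_XML10_RE.sub('', text)

# 0x1c (the character from the SAXParseException) is removed;
# tab and newline are kept.
print(repr(strip_invalid_xml10("a\x1cb\tc\n")))
```

With pandas this could probably be applied as df['TEXT_FIELD'].map(strip_invalid_xml10), provided the column holds only strings; missing values would need to be handled separately because re raises TypeError on non-string input.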
