Hi Greg, I created a class based on https://stackoverflow.com/a/93029 (see attached). The usage could be:
df_text = df['TEXT_FIELD'].tolist()
cleaned = [XMLcleaner(x).xmlstring for x in df_text]
df['TEXT_FIELD'] = cleaned

Best,
John

From: Greg Silverman <g...@umn.edu.INVALID>
Date: Sunday, March 6, 2022 at 5:10 PM
To: jrcas...@medicine.wisc.edu.invalid <jrcas...@medicine.wisc.edu.invalid>
Cc: dev@ctakes.apache.org <dev@ctakes.apache.org>
Subject: Re: Issue with serializable XML

Hi John,
I thought I did. I'm using a pandas dataframe and passing it through this:

files['note_text'] = files['note_text'].apply(lambda x: x.replace('[^\x00-\x7F]',''))

... obviously it wasn't enough. Any suggestions?

Thanks!

Greg--

On Sun, Mar 6, 2022 at 2:46 PM JOHN R CASKEY <jrcas...@medicine.wisc.edu.invalid> wrote:

> I’ve encountered that when the input text file has control characters, for
> example ^M
>
> The fix I used was to remove all control characters from the input text
> files ahead of time via python.
>
> Best,
> John Caskey
> UW-Madison
> jrcas...@wisc.edu
> ________________________________
> From: Greg Silverman <g...@umn.edu.INVALID>
> Sent: Sunday, March 6, 2022 12:40:00 PM
> To: dev@ctakes.apache.org <dev@ctakes.apache.org>
> Subject: Issue with serializable XML
>
> Got the error during processing of a large set of documents about mid-way
> through:
> org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: ,
> 0x1c
>
> I encountered this once before, but I don't remember what the fix was.
> Running apache-ctakes-4.0.1-SNAPSHOT.
>
> Thanks!
>
> Greg--
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
> Department of Surgery
> University of Minnesota
> g...@umn.edu
>

--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
g...@umn.edu
import itertools
import re

# C0 controls (0x00-0x1F) plus DEL and C1 controls (0x7F-0x9F).
# Note: this range also covers tab, newline, and carriage return.
_CONTROL_CHARS = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
_CONTROL_CHAR_RE = re.compile('[%s]' % re.escape(_CONTROL_CHARS))


class XMLcleaner:
    def __init__(self, xmlstring=''):
        self.xmlstring = xmlstring

    @property
    def xmlstring(self):
        return self._xmlstring

    @xmlstring.setter
    def xmlstring(self, s):
        try:
            # Write to the backing attribute. Assigning to self.xmlstring
            # here would re-enter this setter and recurse without end.
            self._xmlstring = _CONTROL_CHAR_RE.sub('', s)
        except TypeError:
            # Non-string input (e.g. NaN or None) is stored unchanged.
            self._xmlstring = s
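
In case it helps, here is a narrower variant, offered only as a sketch and not part of the attached class. It assumes the goal is just to satisfy the serializer, so it drops only characters outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) and keeps tabs and line breaks. The name strip_invalid_xml10 is hypothetical.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10_RE = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml10(text):
    """Hypothetical helper: remove characters XML 1.0 cannot represent."""
    if not isinstance(text, str):
        # Leave non-string values (e.g. NaN, None) untouched.
        return text
    return _INVALID_XML10_RE.sub('', text)
```

With pandas this could be applied as files['note_text'] = files['note_text'].map(strip_invalid_xml10). Two things probably explain why the earlier attempt had no effect: str.replace treats its first argument as a literal substring rather than a regex, and the pattern [^\x00-\x7F] targets non-ASCII characters, whereas 0x1c is an ASCII control character inside that range.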