I got the following error about midway through processing a large set of
documents:
org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: , 0x1c
I encountered this once before, but I don't remember what the fix was.
Running apache-ctakes-4.0.1-SNAPSHOT.
Thanks!
Greg--
I've encountered that when the input text file contains control characters,
for example ^M.
The fix I used was to strip all control characters from the input text files
ahead of time with Python.
Best,
John Caskey
UW-Madison
jrcas...@wisc.edu
Hi John,
I thought I did. I'm using a pandas DataFrame and passing it through this:
files['note_text'] = files['note_text'].apply(lambda x: x.replace('[^\x00-\x7F]',''))
... obviously it wasn't enough.
Any suggestions?
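One likely reason the snippet above was not enough: Python's str.replace treats its first argument as a literal string, not a regular expression, so the call leaves the text unchanged. Even when used as a regex, [^\x00-\x7F] matches only non-ASCII characters, and 0x1c is an ASCII control character. The sketch below shows both effects and one possible alternative. The helper name strip_xml_invalid is hypothetical, and the character ranges are taken from the Char production of the XML 1.0 specification.

```python
import re

# Everything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_INVALID = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_invalid(text: str) -> str:
    """Remove characters XML 1.0 does not allow (hypothetical helper)."""
    return _XML_INVALID.sub("", text)


sample = "chest pain\x1c noted"

# str.replace searches for the literal text "[^\x00-\x7F]", so nothing changes.
literal_attempt = sample.replace('[^\x00-\x7F]', '')

# As a regex, the class matches only non-ASCII characters; 0x1c survives.
ascii_only_attempt = re.sub(r'[^\x00-\x7F]', '', sample)

# Targeting the XML-invalid ranges removes 0x1c and keeps ordinary text.
cleaned = strip_xml_invalid(sample)
```

With pandas, the same helper could be applied as files['note_text'].map(strip_xml_invalid), assuming the column holds only strings.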
Thanks!
Greg--
Hi Greg,
I created a class based on https://stackoverflow.com/a/93029 (see attached).
The usage could be:
df_text = df['TEXT_FIELD'].tolist()
cleaned = [XMLcleaner(x).xmlstring for x in df_text]
df['TEXT_FIELD'] = cleaned
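The attachment itself is not reproduced in this thread. As a rough guess at its shape, a class exposing an xmlstring attribute could look like the sketch below. Only the names XMLcleaner and xmlstring come from the usage above; the body is an assumption. It drops characters in Unicode category Cc (control characters) and keeps tab, newline, and carriage return, which XML 1.0 allows.

```python
import unicodedata


class XMLcleaner:
    """Hypothetical reconstruction; the original attachment is not shown."""

    # Control characters that XML 1.0 permits, so they are kept.
    _KEEP = {"\t", "\n", "\r"}

    def __init__(self, text: str) -> None:
        # Keep every character that is either explicitly allowed or
        # not a control character (Unicode category "Cc").
        self.xmlstring = "".join(
            ch for ch in text
            if ch in self._KEEP or unicodedata.category(ch) != "Cc"
        )


# Stand-in for df['TEXT_FIELD'].tolist(), so the sketch runs without pandas.
df_text = ["note one\x1c", "note\ttwo"]
cleaned = [XMLcleaner(x).xmlstring for x in df_text]
```

This version handles control characters such as 0x1c, but it does not remove other code points that XML 1.0 rejects, such as U+FFFE.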
Best,
John
Awesome! Thanks a ton.
Greg--