Issue with serializable XML

2022-03-06 Thread Greg Silverman
I got the following error about midway through processing a large set of documents: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: , 0x1c. I encountered this once before, but I don't remember what the fix was. I'm running apache-ctakes-4.0.1-SNAPSHOT. Thanks! Greg

Re: Issue with serializable XML

2022-03-06 Thread JOHN R CASKEY
I've encountered that when the input text file has control characters, for example ^M. The fix I used was to remove all control characters from the input text files ahead of time via Python. Best, John Caskey UW-Madison jrcas...@wisc.edu

Re: Issue with serializable XML

2022-03-06 Thread Greg Silverman
Hi John, I thought I did. I'm using a pandas dataframe and passing it through this: files['note_text'] = files['note_text'].apply(lambda x: x.replace('[^\x00-\x7F]','')) ... obviously it wasn't enough. Any suggestions? Thanks! Greg
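
A note on why the snippet above likely had no effect: Python's str.replace matches a literal substring, so the bracketed pattern is never interpreted as a regular expression. Even as a regex, [^\x00-\x7F] targets non-ASCII characters, whereas 0x1c is an ASCII control character. A minimal sketch of one alternative follows, assuming the goal is to drop everything outside the XML 1.0 Char range (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The names INVALID_XML_CHARS and strip_invalid_xml_chars are hypothetical and not part of cTAKES or pandas.

```python
import re

# Anything NOT in the XML 1.0 "Char" production is treated as invalid.
# (Hypothetical helper; the name and approach are assumptions for illustration.)
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 cannot represent removed."""
    return INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    sample = "blood pressure\x1c 120/80\r\nnext line"
    # repr() makes the removed control character visible in the "before" line.
    print(repr(sample))
    print(repr(strip_invalid_xml_chars(sample)))
```

With the dataframe from the message, this could be applied as files['note_text'] = files['note_text'].apply(strip_invalid_xml_chars). Tabs and line breaks are kept on purpose, since removing them would change the layout of the note text.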

Re: Issue with serializable XML

2022-03-06 Thread JOHN R CASKEY
Hi Greg, I created a class based on https://stackoverflow.com/a/93029 (see attached). The usage could be:

    df_text = df['TEXT_FIELD'].tolist()
    cleaned = [XMLcleaner(x).xmlstring for x in df_text]
    df['TEXT_FIELD'] = cleaned

Best, John
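
The attachment is not preserved in this archive, so the class itself is not shown. The following is a hypothetical stand-in, not John's actual code: it only assumes the interface visible in the usage above (a constructor taking a string and an xmlstring attribute) and removes Unicode control characters (category Cc) while keeping tab, LF, and CR. It does not remove other characters that XML 1.0 disallows, such as U+FFFE and U+FFFF.

```python
import sys
import unicodedata

# Control characters to preserve, since they are legal in XML 1.0.
_KEEP = {"\t", "\n", "\r"}

# Translation table mapping every other control character (category "Cc")
# to None, which str.translate treats as "delete this character".
_DELETE_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc" and chr(cp) not in _KEEP
}


class XMLcleaner:
    """Hypothetical reconstruction: exposes the cleaned text as .xmlstring."""

    def __init__(self, text: str) -> None:
        self.xmlstring = text.translate(_DELETE_TABLE)


if __name__ == "__main__":
    notes = ["first\x1c note", "second note\r\n"]
    cleaned = [XMLcleaner(x).xmlstring for x in notes]
    print([repr(c) for c in cleaned])
```

Building the table once at import time keeps the per-document cost to a single translate call, which matters when cleaning a large set of notes.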

Re: Issue with serializable XML

2022-03-06 Thread Greg Silverman
Awesome! Thanks a ton. Greg