From a user perspective, just stripping the characters seems best to me, but finding out what those characters are seems obnoxious. Neither a quick search nor skimming the ODT doc specification[1][2] gives any insight into a set of illegal characters. Does elisp have anything similar to Java's "isWhitespace"[3] that could be used to check character features?
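
For what it's worth, an ODT's content.xml is ordinary XML, so one plausible candidate for the "illegal" set is everything outside the Char production of XML 1.0 (section 2.2 of that spec). That production excludes form feed (#xC) along with most other C0 control characters. Below is a minimal sketch of the stripping approach, written in Python purely for illustration. The name strip_xml_illegal is made up here, and the sketch assumes the exporter targets XML 1.0 rather than 1.1; it is not what ox-odt actually does.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Assumption: the target document is XML 1.0 (XML 1.1 allows more).
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text: str) -> str:
    """Return `text` with every character that XML 1.0 forbids removed."""
    return _XML_ILLEGAL.sub("", text)


# A form feed (C-l, #xc) between two "pages", as pdftotext tends to emit.
print(repr(strip_xml_illegal("page one\x0cpage two\n")))
# → 'page onepage two\n'
```

Tab, newline and carriage return survive; the form feed is dropped. Whether silently dropping is better than substituting something visible is a separate question.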

Rasmus <ras...@gmx.us> writes:

> torys.ander...@gmail.com (Tory S. Anderson) writes:
>
>> While we're on the topic of ODT export problems: I was in the process
>> of converting PDF to Text to Org to ODT/DocX and discovered that
>> certain characters seem to break exported odt documents, which fail
>> with a line and col number. So far the only one I know for sure is the
>> "^L" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
>> all such cases.
>>
>> You probably don't need it, but I verified with the following file:
>> http://toryanderson.com/files/breakorg.org
>
> The export is fine, but the produced XML is invalid since it contains an
> illegal character. But how to resolve this? Should ox strip illegal
> characters (if so what are they)? If so, could they be used for entities?
>
> --Rasmus

Footnotes:
[1] https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
[2] http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415196_253892949
[3] http://www.fileformat.info/info/unicode/char/000c/index.htm