Ihor Radchenko <yanta...@posteo.net> writes:

[...]

> +(defconst org-odt-forbidden-char-re
> +  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
> +             (?\N{U+20} . ?\N{U+D7FF})
> +             (?\N{U+E000} . ?\N{U+FFFD})
> +             (?\N{U+10000} . ?\N{U+10FFFF}))))

Indentation mismatch ^

> +  "Regexp matching forbidden XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
>  (defconst org-odt-schema-dir-list
>    (list (expand-file-name "./schema/" org-odt-data-dir))
>    "List of directories to search for OpenDocument schema files.
> @@ -364,6 +374,19 @@ (defgroup org-export-odt nil
>    :tag "Org Export ODT"
>    :group 'org-export)
>
> +(defcustom org-odt-with-forbidden-chars ""
> +  "String to replace forbidden XML characters.
> +When set to t, forbidden characters are retained.
> +When set to nil, an error is thrown.
> +See `org-odt-forbidden-char-re' for the list of forbidden characters
> +that cannot occur inside ODT documents.
> +
> +You may also consider export filters to perform more fine-grained
> +replacements.  See info node `(org)Advanced Export Configuration'."
> +  :package-version '(Org . "9.8")
> +  :type '(choice (const :tag "Strip forbidden characters" t)

According to the docstring, the above tag should say "Leave forbidden
characters as-is".  See the patch below, which also slightly rewords the
docstring.

> +                 (const :tag "Err when forbidden characters encountered" nil)
> +                 (string :tag "Replacement string")))
>
>  ;;;; Debugging
>
> @@ -2892,6 +2915,24 @@ (defun org-odt--encode-tabs-and-spaces (line)
>         (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
>     line))
>
> +(defun org-odt--remove-forbidden (text _backend info)
> +  "Remove forbidden and discouraged characters from TEXT.
> +INFO is the communication plist"
> +  (pcase (plist-get info :odt-with-forbidden-chars)

Should we use pcase-exhaustive?
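
For context, here is a minimal sketch of the difference.  The function
name my/odt-forbidden-action and the symbols it returns are made up for
illustration and are not part of ox-odt.  Plain pcase returns nil when
no clause matches, while pcase-exhaustive signals an error, so an
unexpected option value would be noticed instead of silently ignored.

```elisp
;; Hypothetical helper, not part of ox-odt: classify an option value
;; along the same three branches the filter uses.
(defun my/odt-forbidden-action (value)
  "Return a symbol describing how VALUE would be handled."
  (pcase-exhaustive value
    ((pred stringp) 'replace)   ; replacement string, possibly empty
    ('nil 'signal-error)        ; nil: refuse to export
    ('t 'keep)))                ; t: leave characters as-is

(my/odt-forbidden-action "")    ; replace
(my/odt-forbidden-action nil)   ; signal-error
(my/odt-forbidden-action t)     ; keep
;; (my/odt-forbidden-action 42) would signal an error because no clause
;; matches; with plain `pcase' the same call would just return nil.
```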

> +    ((and (pred stringp) rep)
> +     (prog1 (replace-regexp-in-string org-odt-forbidden-char-re rep text)
> +       (when (match-string 0 text)

The replacement appears to work well on my machine, but there are
unnecessary warnings.  Run org-odt-export-to-odt on a buffer containing:

--8<---------------cut here---------------start------------->8---
* foo

bar
--8<---------------cut here---------------end--------------->8---

the (match-string 0 text) form inside org-odt--remove-forbidden evals to

"<?xml version=\"1.0\" "

which causes the incorrect warning message

"Warning (ox-odt): Replacing forbidden character '' with ''"

Confusingly, `text' and the replacement text are string-equal, so it
appears that no replacement has been made.

I suspect that match-string and replace-regexp-in-string do not play
well together.  Try this out:

(let* ((text "bar")
       (new (replace-regexp-in-string "r" "z" text)))
  new                    ; "baz", as expected
  (match-string 0 new)   ; signals error
  (match-string 0 text)) ; signals error

I get the following stack trace (for the first error):

Debugger entered--Lisp error: (args-out-of-range "baz" 402 403)
substring("baz" 402 403)
(if string (substring string (match-beginning num) (match-end num)) (buffer-substring (match-beginning num) (match-end num)))
(if (match-beginning num) (if string (substring string (match-beginning num) (match-end num)) (buffer-substring (match-beginning num) (match-end num))))
match-string(0 "baz")
(let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text))
(progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text)))
(let ((print-level nil) (print-length nil)) (progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text))))
(setq elisp--eval-defun-result (let ((print-level nil) (print-length nil)) (progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text)))))
elisp--eval-defun()
#<subr eval-defun>(nil)
edebug--eval-defun(#<subr eval-defun> nil)
apply(edebug--eval-defun #<subr eval-defun> nil)
eval-defun(nil)
funcall-interactively(eval-defun nil)
command-execute(eval-defun)
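
As far as I can tell, the cause is that replace-regexp-in-string does
its searching inside save-match-data, so whatever match data was set
before the call is what match-string sees afterwards.  A minimal sketch
of one way around this, assuming we only need the first match: call
string-match explicitly and read the match before replacing.  The name
my/replace-and-report is hypothetical.

```elisp
;; Hypothetical helper: return (FIRST-MATCH . NEW-TEXT).
;; FIRST-MATCH is nil when REGEXP does not occur in TEXT.
(defun my/replace-and-report (regexp rep text)
  "Replace REGEXP with REP in TEXT and report the first match."
  (let* (;; An explicit search sets the match data for TEXT...
         (pos (string-match regexp text))
         ;; ...so the match is read here, before any other call can
         ;; overwrite the match data.
         (found (and pos (match-string 0 text))))
    ;; FIXEDCASE and LITERAL are t so REP is inserted verbatim.
    (cons found (replace-regexp-in-string regexp rep text t t))))

(my/replace-and-report "r" "z" "bar")   ; ("r" . "baz")
(my/replace-and-report "x" "z" "bar")   ; (nil . "bar")
```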


Also with the replace-regexp-in-string design, there will only be one
warning even with multiple forbidden characters.  See patch below.
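
If staying with replace-regexp-in-string were preferred, one possible
alternative (a sketch only, not what the patch does) is to pass a
function as the replacement argument.  It is called with the text of
each match, so it can count occurrences per character.  The name
my/replace-counting is hypothetical.

```elisp
;; Hypothetical helper: return (NEW-TEXT . COUNTS), where COUNTS is a
;; hash table mapping each matched string to its number of occurrences.
(defun my/replace-counting (regexp rep text)
  "Replace REGEXP with REP in TEXT, counting each distinct match."
  (let ((counts (make-hash-table :test #'equal)))
    (cons (replace-regexp-in-string
           regexp
           (lambda (match)
             ;; Called once per match with the matched text.
             (puthash match (1+ (gethash match counts 0)) counts)
             rep)
           text t t)
          counts)))

;; Digits stand in for forbidden characters to keep the example readable.
(let ((result (my/replace-counting "[0-9]" "_" "a1b1c2")))
  (list (car result)                     ; "a_b_c_"
        (gethash "1" (cdr result))       ; 2
        (gethash "2" (cdr result))))     ; 1
```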

> +         (display-warning
> +          '(ox-odt ox-odt-with-forbidden-chars)
> +          (format "Replacing forbidden character '%s' with '%s'"
> +                  (match-string 0 text) rep)))))
> +    (`nil
> +     (if (string-match org-odt-forbidden-char-re text)
> +         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
> +                (match-string 0 text))
> +       text))
> +    (_ text)))
> +
>  (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
>    (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
>      (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
> --
> 2.47.1

From ce506caa0bffbd243a2aba384f75f7aaac7fdc4b Mon Sep 17 00:00:00 2001
From: Ihor Radchenko <yanta...@posteo.net>
Date: Fri, 27 Dec 2024 10:21:02 +0000
Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to
control how to handle forbidden XML characters.
(org-odt--remove-forbidden): New filter removing/replacing forbidden
characters.

Co-authored-by: Joseph Turner <jos...@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4....@christianmoe.com
---
 lisp/ox-odt.el | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef..960bab286 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -94,7 +94,8 @@ (org-export-define-backend 'odt
 		    . (org-odt--translate-latex-fragments
 		       org-odt--translate-description-lists
 		       org-odt--translate-list-tables
-		       org-odt--translate-image-links)))
+		       org-odt--translate-image-links))
+                   (:filter-final-output . org-odt--remove-forbidden))
   :menu-entry
   '(?o "Export to ODT"
        ((?o "As ODT file" org-odt-export-to-odt)
@@ -108,6 +109,7 @@ (org-export-define-backend 'odt
     (:keywords "KEYWORDS" nil nil space)
     (:subtitle "SUBTITLE" nil nil parse)
     ;; Other variables.
+    (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars)
     (:odt-content-template-file nil nil org-odt-content-template-file)
     (:odt-display-outline-level nil nil org-odt-display-outline-level)
     (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks)
@@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+               (?\N{U+20} . ?\N{U+D7FF})
+               (?\N{U+E000} . ?\N{U+FFFD})
+               (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -364,6 +374,19 @@ (defgroup org-export-odt nil
   :tag "Org Export ODT"
   :group 'org-export)
 
+(defcustom org-odt-with-forbidden-chars ""
+  "String to replace forbidden XML characters.
+When set to t, forbidden characters are left as-is.
+When set to nil, an error is thrown.
+See `org-odt-forbidden-char-re' for the list of forbidden characters
+that cannot occur inside ODT documents.
+
+You may also consider export filters to perform more fine-grained
+replacements.  See info node `(org)Advanced Export Configuration'."
+  :package-version '(Org . "9.8")
+  :type '(choice (const :tag "Leave forbidden characters as-is" t)
+                 (const :tag "Err when forbidden characters encountered" nil)
+                 (string :tag "Replacement string")))
 
 ;;;; Debugging
 
@@ -2892,6 +2915,32 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text _backend info)
+  "Remove forbidden and discouraged characters from TEXT.
+INFO is the communication plist."
+  (pcase-exhaustive (plist-get info :odt-with-forbidden-chars)
+    ((and (pred stringp) rep)
+     (let ((replacements (make-hash-table :test 'equal)))
+       (with-temp-buffer
+         (insert text)
+         (goto-char (point-min))
+         (while (re-search-forward org-odt-forbidden-char-re nil t)
+           (cl-incf (gethash (match-string 0) replacements 0))
+           (replace-match rep))
+         (cl-loop for forbidden being the hash-keys of replacements
+                  using (hash-values count)
+                  do (display-warning
+                      '(ox-odt ox-odt-with-forbidden-chars)
+                      (format "Replaced forbidden character '%s' with '%s' %d times"
+                              forbidden rep count)))
+         (buffer-string))))
+    (`nil
+     (if (string-match org-odt-forbidden-char-re text)
+         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
+                (match-string 0 text))
+       text))
+    ('t text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-- 
2.46.0

Thank you!!

Joseph
