Dear LyX developers,

the attached patch cleans up lyx2lyx issues after the recent fixes to
dash handling in LyX. Committing requires a +1 from at least one other
developer.

* Backwards compatibility for both documents containing literal dashes and
  documents containing ligature dashes.
  
  Currently, "\use_dash_ligatures" is set based on the original file version.
  If you used literal em- and en-dashes in pre-2.2 documents,
  you must manually unselect "Output em- and en-dash as ligatures" to
  ensure unchanged behaviour.
  
  With the patch, lyx2lyx scans the content for literal and ligature dashes
  and sets "\use_dash_ligatures" so that line breaks stay unchanged.

  Pre-LyX 2.2 documents with both literal AND ligature dashes
  trigger a warning and use the default value for "\use_dash_ligatures".
  We could also consider ERT in these (rare) cases.
  
* Round-trip 2.3 -> <older format> -> 2.3 keeps "\use_dash_ligatures"
  value.
  
  Currently, the original value of the setting is lost:
  2.3 -> 2.2 -> 2.3 forces "\use_dash_ligatures false" and 
  2.3 -> 2.1 (and older) -> 2.3 forces "\use_dash_ligatures true".

* Backwards compatibility for 2.2 via preamble code.
  
  Currently, the 2.2 workaround uses zero width space (ZWSP) characters.
  
  Re-defining \textemdash and \textendash ensures unchanged output,
  including the hyphenation of words adjacent to the dashes. The preamble
  code is removed when converting from 2.2 (both directions).

* Conversion 2.3 -> 2.1 (or older) produces ligature dashes if
  "\use_dash_ligatures true".
  
  Currently, the 2.2 workaround with literal dash + ZWSP is also used for
  export to 2.1 and older with suboptimal results and problems with the
  ZWSP character in 2.0 and earlier.
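
  As a rough illustration of the content scan from the first item (a
  simplified sketch, not the patch code): the function name
  `guess_dash_ligatures` is hypothetical, the body is assumed to be a plain
  list of lines, and the skipping of insets such as ERT or Formula that the
  patch performs is left out here.

```python
import re

# Literal en/em dash (U+2013/U+2014) directly followed by a word character,
# a no-break space, or the end of the line.
LITERAL_DASH = re.compile("[\u2013\u2014](?:[\\w\u00A0]|$)")
# Interim representation of ligature dashes in 2.2-format files.
LIGATURE_DASH = re.compile(r"\\twohyphens|\\threehyphens")
# The line after a ligature dash starts with a word char or no-break space.
NEXT_STARTS_WORD = re.compile("[\\w\u00A0]")


def guess_dash_ligatures(body):
    """Return True, False, or None (no preference) for a list of body lines."""
    has_literal = False
    has_ligature = False
    for i, line in enumerate(body):
        if LITERAL_DASH.search(line):
            has_literal = True
        # Guard against a ligature dash on the very last line.
        next_line = body[i + 1] if i + 1 < len(body) else ""
        if LIGATURE_DASH.search(line) and NEXT_STARTS_WORD.match(next_line):
            has_ligature = True
    if has_literal and has_ligature:
        return None  # mixed usage: the caller warns and keeps the default
    if has_literal:
        return False
    if has_ligature:
        return True
    return None  # no relevant dashes found


print(guess_dash_ligatures(["a\u2013b"]))
print(guess_dash_ligatures(["pages 1", "\\twohyphens", "5"]))
```

  A return value of None corresponds to the case where the setting is not
  inserted and the default applies.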


The patch allows removal of all dash-related caveats in the 2.3 RELEASE NOTES.

Please try it out and give a +1 or improvement suggestions.

Günter


----- End forwarded message -----
From 0691a3537cbadc0336edd9f47b14e8047a39cad2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?G=C3=BCnter=20Milde?= <mi...@lyx.org>
Date: Sat, 30 Sep 2017 23:26:02 +0200
Subject: [PATCH] Fix lyx2lyx conversion of dashes.

---
 lib/RELEASE-NOTES      |  32 ++---------
 lib/lyx2lyx/lyx_2_2.py |   6 ++
 lib/lyx2lyx/lyx_2_3.py | 153 ++++++++++++++++++++++++-------------------------
 3 files changed, 87 insertions(+), 104 deletions(-)

diff --git a/lib/RELEASE-NOTES b/lib/RELEASE-NOTES
index 440b93e71a..007ae575ee 100644
--- a/lib/RELEASE-NOTES
+++ b/lib/RELEASE-NOTES
@@ -13,13 +13,12 @@
   be safely dissolved, as it will be automatically inserted at export time
   if needed, as usual.
 
-* The new setting
-  "Document->Settings->Fonts->Output em- and en-dash as ligatures" forces
-  output of en- and em-dashes as -- and --- when exporting to LaTeX.
-  It is is "true" by default but "false" when opening documents edited
-  with LyX 2.2.
-  See chapter 3.9.1.1 "Dashes and line breaks" of the User Guide and
-  "Caveats when upgrading from earlier versions to 2.3.x" below.
+* The new setting "Output em- and en-dash as ligatures" under
+  "Document->Settings->Fonts" forces output of en and em dashes as -- and
+  --- when exporting to LaTeX. The default is "true". When opening old
+  documents, the setting is "false" if literal dashes were used and
+  different line breaks might occur. See chapter 3.9.1.1 "Dashes and line
+  breaks" of the User Guide for details.
 
 * The following UI translations were dropped, because the lack of translation
   maintenance:  Russian, Danish, Greek, Serbian, Galician, Catalan, Romanian,
@@ -209,25 +208,6 @@
   the external_templates file, you will have to move the modifications to
   the respective *.xtemplate file manually.
 
-* If you used literal em- and en-dashes in pre-2.2 documents,
-  you must manually unselect
-  "Document->Settings->Fonts->Output em- and en-dash as ligatures"
-  to ensure unchanged behaviour.
-
-* ZWSP characters (u200b) following literal em- and en-dashes are deleted by
-  lyx2lyx when converting to 2.3 format. If you used them as optional line
-  breaks after dashes, convert them to space insets before opening your
-  document with LyX 2.3 or the optional line breaks will be lost!
-
-* If using TeX fonts and en- and em-dashes are output as font ligatures,
-  when exporting documents containing en- and em-dashes to the format of
-  LyX 2.0 or earlier, the following line has to be manually added to the
-  unicodesymbols file of that LyX version:<br>
-  0x200b "\\hspace{0pt}" "" "" "" "" # ZERO WIDTH SPACE<br>
-  This avoids "uncodable character" issues if the document is actually
-  loaded by that LyX version. LyX 2.1 and later versions already have the
-  necessary definition in their unicodesymbols file.
-
 * If trying to compile documents using R scripts and sweave/knitr, LyX
   2.3.x would not allow for re-running the R scripts, unless the user:
   1) explicitly disables the "Forbid use of needauth converters"
diff --git a/lib/lyx2lyx/lyx_2_2.py b/lib/lyx2lyx/lyx_2_2.py
index 996c22684e..2f4ef3ac2a 100644
--- a/lib/lyx2lyx/lyx_2_2.py
+++ b/lib/lyx2lyx/lyx_2_2.py
@@ -659,6 +659,12 @@ def convert_dashes(document):
 def revert_dashes(document):
     "convert \\twohyphens and \\threehyphens to -- and ---"
 
+    # remove preamble code from a 2.3->2.2 conversion, if present:
+    for i, line in enumerate(document.preamble):
+        if i > 1 and line == r'\renewcommand{\textemdash}{---}':
+            if (document.preamble[i-1] == r'\renewcommand{\textendash}{--}'
+                and document.preamble[i-2] == '% Added by lyx2lyx'):
+                del document.preamble[i-2:i+1]
     i = 0
     while i < len(document.body):
         words = document.body[i].split()
diff --git a/lib/lyx2lyx/lyx_2_3.py b/lib/lyx2lyx/lyx_2_3.py
index edc5b1ffa9..735a34f54a 100644
--- a/lib/lyx2lyx/lyx_2_3.py
+++ b/lib/lyx2lyx/lyx_2_3.py
@@ -1841,58 +1841,63 @@ def revert_chapterbib(document):
 
 
 def convert_dashligatures(document):
-    " Remove a zero-length space (U+200B) after en- and em-dashes. "
-
-    i = find_token(document.header, "\\use_microtype", 0)
-    if i != -1:
-        if document.initial_format > 474 and document.initial_format < 509:
-            # This was created by LyX 2.2
-            document.header[i+1:i+1] = ["\\use_dash_ligatures false"]
-        else:
-            # This was created by LyX 2.1 or earlier
-            document.header[i+1:i+1] = ["\\use_dash_ligatures true"]
-
-    i = 0
-    while i < len(document.body):
-        words = document.body[i].split()
-        # Skip some document parts where dashes are not converted
-        if len(words) > 1 and words[0] == "\\begin_inset" and \
-           words[1] in ["CommandInset", "ERT", "External", "Formula", \
-                        "FormulaMacro", "Graphics", "IPA", "listings"]:
-            j = find_end_of_inset(document.body, i)
-            if j == -1:
-                document.warning("Malformed LyX document: Can't find end of " \
-                                 + words[1] + " inset at line " + str(i))
-                i += 1
-            else:
-                i = j
-            continue
-        if len(words) > 0 and words[0] in ["\\leftindent", \
-                "\\paragraph_spacing", "\\align", "\\labelwidthstring"]:
-            i += 1
-            continue
-
-        start = 0
-        while True:
-            j = document.body[i].find(u"\u2013", start) # en-dash
-            k = document.body[i].find(u"\u2014", start) # em-dash
-            if j == -1 and k == -1:
-                break
-            if j == -1 or (k != -1 and k < j):
-                j = k
-            after = document.body[i][j+1:]
-            if after.startswith(u"\u200B"):
-                document.body[i] = document.body[i][:j+1] + after[1:]
-            else:
-                if len(after) == 0 and document.body[i+1].startswith(u"\u200B"):
-                    document.body[i+1] = document.body[i+1][1:]
-                    break
-            start = j+1
-        i += 1
-
+    "Set 'use_dash_ligatures' according to content."
+    use_dash_ligatures = None
+    # remove preamble code from a 2.3->2.2 conversion, if present:
+    for i, line in enumerate(document.preamble):
+        if i > 1 and line == r'\renewcommand{\textemdash}{---}':
+            if (document.preamble[i-1] == r'\renewcommand{\textendash}{--}'
+                and document.preamble[i-2] == '% Added by lyx2lyx'):
+                del document.preamble[i-2:i+1]
+                use_dash_ligatures = True
+    if use_dash_ligatures is None:
+        # Look for dashes:
+        # (Documents by LyX 2.1 or older have "\twohyphens\n" or "\threehyphens\n"
+        # as interim representation for dash ligatures in 2.2.)
+        has_literal_dashes = False
+        has_ligature_dashes = False
+        j = 0
+        for i, line in enumerate(document.body):
+            # Skip some document parts where dashes are not converted
+            if (i < j) or line.startswith("\\labelwidthstring"):
+                continue
+            words = line.split()
+            if len(words) > 1 and words[0] == "\\begin_inset" and \
+            words[1] in ["CommandInset", "ERT", "External", "Formula",
+                         "FormulaMacro", "Graphics", "IPA", "listings"]:
+                j = find_end_of_inset(document.body, i)
+                if j == -1:
+                    document.warning("Malformed LyX document: "
+                        "Can't find end of %s inset at line %d" % (words[1],i))
+                continue 
+            # literal dash followed by a word or no-break space:
+            if re.search(u"[\u2013\u2014]([\w\u00A0]|$)", line, 
+                         flags=re.UNICODE):
+                has_literal_dashes = True
+            # ligature dash followed by word or no-break space on next line:
+            if re.search(r"(\\twohyphens|\\threehyphens)", line,
+                            flags=re.UNICODE) and re.match(u"[\w\u00A0]", 
+                            document.body[i+1], flags=re.UNICODE):
+                has_ligature_dashes = True
+        if has_literal_dashes and has_ligature_dashes:
+            # TODO: insert a warning note in the document?
+            document.warning('This document contained both literal and '
+                '"ligature" dashes.\n Line breaks may have changed. '
+                'See UserGuide chapter 3.9.1 for details.')
+        elif has_literal_dashes:
+            use_dash_ligatures = False
+        elif has_ligature_dashes:
+            use_dash_ligatures = True
+    # insert the setting if there is a preferred value
+    if use_dash_ligatures is not None:
+        i = find_token(document.header, "\\use_microtype", 0)
+        if i != -1:
+            document.header.insert(i+1, "\\use_dash_ligatures %s"
+                                % str(use_dash_ligatures).lower())
 
 def revert_dashligatures(document):
-    " Remove font ligature settings for en- and em-dashes. "
+    r"""Remove font ligature settings for en- and em-dashes.
+    Revert conversion of \twohyphens or \threehyphens to literal dashes."""
     i = find_token(document.header, "\\use_dash_ligatures", 0)
     if i == -1:
         return
@@ -1902,42 +1907,34 @@ def revert_dashligatures(document):
     i = find_token(document.header, "\\use_non_tex_fonts", 0)
     if i != -1:
         use_non_tex_fonts = get_bool_value(document.header, "\\use_non_tex_fonts", i)
-    if not use_dash_ligatures or use_non_tex_fonts:
+    if not use_dash_ligatures or use_non_tex_fonts or document.backend != "latex":
         return
 
-    # Add a zero-length space (U+200B) after en- and em-dashes
-    i = 0
-    while i < len(document.body):
-        words = document.body[i].split()
+    j = 0
+    new_body = []
+    for i, line in enumerate(document.body):
         # Skip some document parts where dashes are not converted
+        if (i < j) or line.startswith("\\labelwidthstring"):
+            new_body.append(line)
+            continue
+        words = line.split()
         if len(words) > 1 and words[0] == "\\begin_inset" and \
-           words[1] in ["CommandInset", "ERT", "External", "Formula", \
+           words[1] in ["CommandInset", "ERT", "External", "Formula",
                         "FormulaMacro", "Graphics", "IPA", "listings"]:
             j = find_end_of_inset(document.body, i)
             if j == -1:
-                document.warning("Malformed LyX document: Can't find end of " \
+                document.warning("Malformed LyX document: Can't find end of "
                                  + words[1] + " inset at line " + str(i))
-                i += 1
-            else:
-                i = j
-            continue
-        if len(words) > 0 and words[0] in ["\\leftindent", \
-                "\\paragraph_spacing", "\\align", "\\labelwidthstring"]:
-            i += 1
+            new_body.append(line)
             continue
-
-        start = 0
-        while True:
-            j = document.body[i].find(u"\u2013", start) # en-dash
-            k = document.body[i].find(u"\u2014", start) # em-dash
-            if j == -1 and k == -1:
-                break
-            if j == -1 or (k != -1 and k < j):
-                j = k
-            after = document.body[i][j+1:]
-            document.body[i] = document.body[i][:j+1] + u"\u200B" + after
-            start = j+1
-        i += 1
+        line = line.replace(u'\u2013', '\\twohyphens\n')
+        line = line.replace(u'\u2014', '\\threehyphens\n')
+        lines = line.split('\n')
+        new_body.extend(lines)
+    document.body = new_body
+    # redefine the dash LICRs to use ligature dashes:
+    add_to_preamble(document, [r'\renewcommand{\textendash}{--}',
+                               r'\renewcommand{\textemdash}{---}'])
 
 
 def revert_noto(document):
@@ -2228,7 +2225,7 @@ def revert_mathnumberingname(document):
         else:
             l = find_token(document.header, "\\use_default_options", 0)
             document.header.insert(l, "\\options reqno")
-    # add the math_number_before tag   
+    # add the math_number_before tag
     regexp = re.compile(r'(\\math_numbering_side default)')
     i = find_re(document.header, regexp, 0)
     if i != -1:
-- 
2.11.0
