Re: [RP] [PATCH 1/3 v2] Limit width of formatted text by characters rather than bytes

Jeremie Courreges-Anglas Mon, 28 Aug 2017 11:55:39 -0700

Hi Will,

On Sun, Aug 27 2017, Will Storey <w...@summercat.com> wrote:
> When formatting text for display in the window list, it is possible to
> specify a limit to truncate at. This is useful for example with %t when
> you have a long title in the window.
>
> The prior implementation truncated counting by bytes. This was
> problematic if the limit happened to be in the middle of a multibyte
> character. When that happened, the window list text cut off starting at
> the invalid character.
>
> We now count by characters rather than bytes. This ensures we always
> include a full multibyte character.
>
> It is possible to see the problem with this test case:
>
>     set winfmt %n%s%10t
>     set winliststyle row
>     set winname title
>
> Then create a window such that we truncate in the middle of a multibyte
> character. This is possible with the following HTML document:
>
>     <!DOCTYPE html>
>     <meta charset="utf-8">
>     <title>testing &trade; 1 2 3</title>
>
> Assuming you are using UTF-8 encoding, if your browser's title has only
> this text, then truncating at 10 will truncate on the second of the
> three bytes in the trademark symbol.


First, thanks for your submission.  You're dealing with a known problem.

The direction taken so far in ratpoison was: don't deal with wide
characters, only handle UTF-8 in a rather dumb but at least simple way.

Rationale:
- the wide characters API has a lot of gotchas.  I won't detail them
  here but what to do in case of an invalid sequence often remains an
  open question.  Here, I can see that you return a partial length
  early.  I'm not sure this is desirable.
- UTF-8 is easy and looks like the sanest choice for a multibyte locale.
  No offense, but other less commonly used locales are just a pain to
  handle.  Think state-dependent encodings.

So while technically speaking the wide characters API looks like the
obvious choice, I think its cost is a bit high.  Consistency is good.
If we start using the wide chars API somewhere, it should be used in all
places where it makes sense.  I'm not sure this is an easy task even in
ratpoison. :)

Handling only UTF-8 as a multibyte locale, the tentative diff below
seems to do the job.  *WARNING*: I have barely tested it with your HTML
test case.

Feedback / test reports welcome.


diff --git a/src/format.c b/src/format.c
index caf8781..fa8b068 100644
--- a/src/format.c
+++ b/src/format.c
@@ -82,11 +82,18 @@ concat_width (struct sbuf *buf, char *s, int width)
 {
   if (width >= 0)
     {
-      char *s1 = xsprintf ("%%.%ds", width);
-      char *s2 = xsprintf (s1, s);
-      sbuf_concat (buf, s2);
-      free (s1);
-      free (s2);
+      int len = 0;
+
+      while (s[len] != '\0' && len < width)
+        {
+          if (RP_IS_UTF8_START (s[len]))
+            do
+              len++;
+            while (RP_IS_UTF8_CONT (s[len]));
+          else
+            len++;
+        }
+      sbuf_printf_concat (buf, "%.*s", len, s);
     }
   else
     sbuf_concat (buf, s);


-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE
