I debugged dwm, adding a logging helper to drw.c:

    #include <stdarg.h> /* needed for va_list */

    static void
    log_msg(const char *fmt, ...)
    {
        char buf[4096];
        va_list args;

        va_start(args, fmt);
        vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);
        fprintf(logfile, "%s\n", buf);
    }
and calls to it in utf8decodebyte:

    static long
    utf8decodebyte(const char c, size_t *i)
    {
        for (*i = 0; *i < (UTF_SIZ + 1); ++(*i))
            if (((unsigned char)c & utfmask[*i]) == utfbyte[*i]) {
                log_msg("*i = %lu, for '%c' returning '%c'",
                        *i, c, (unsigned char)c & ~utfmask[*i]);
                return (unsigned char)c & ~utfmask[*i];
            }
        return 0;
    }

and in drw_text:

    utf8str = text;
    nextfont = NULL;
    while (*text) {
        log_msg("*text == '0x%X' == '%c'", *text, *text);
        utf8charlen = utf8decode(text, &utf8codepoint, UTF_SIZ);
        for (curfont = drw->fonts; curfont; curfont = curfont->next) {
            charexists = charexists || XftCharExists(drw->dpy, curfont->xfont, utf8codepoint);

I got the following output from "thisátest.odt":

    // á
    *text == '0xFFFFFFE1' == '<E1>'
    *i = 3, for '<E1>' returning '^A'
    *i = 1, for 't' returning 't'
    *text == '0x74' == 't'
    *i = 1, for 't' returning 't'

and the following from "thisátestњ.odt":

    // á
    *text == '0xFFFFFFC3' == '<C3>'
    *i = 2, for '<C3>' returning '^C'
    *i = 0, for '<A1>' returning '!'
    [...]
    // њ
    *text == '0xFFFFFFD1' == '<D1>'
    *i = 2, for '<D1>' returning '^Q'
    *i = 0, for '<9A>' returning '^Z'

From this it seems that for "thisátestњ.odt" dwm receives the correct UTF-8 representations of "á" (0xC3 0xA1) and "њ" (0xD1 0x9A), but for "thisátest.odt" it receives the ISO 8859-1 representation of "á", 0xE1 (no wonder, given that it is passed a STRING instead of a UTF8_STRING or COMPOUND_TEXT), followed by the next ASCII character, 0x74 ("t"), and still interprets the two as a UTF-8 sequence, even though those two bytes form invalid UTF-8. That invalid UTF-8 is then passed on to libfreetype (or whatever does the rendering), which simply stops output at that point.