On Mon 2017-08-28 20:50:19 +0200, Jeremie Courreges-Anglas wrote: > > Hi Will,
Hi! Thank you for looking at this. > First, thanks for your submission. You're dealing with a known problem. > > The direction taken so far in ratpoison was: don't deal with wide > characters, only handle UTF-8 in a rather dumb but at least simple way. > > Rationale: > - the wide characters API has a lot of gotchas. I won't detail them > here but what to do in case of an invalid sequence often remains an > open question. Here, I can see that you return a partial length > early. I'm not sure this is desirable. I see. I'm not super familiar with the wchar.h API. I was not aware ratpoison had functionality for this! Regarding returning on invalid characters: Another option we could do would be to replace them with U+FFFD. > - UTF-8 is easy and looks like the sanest choice for a multibyte locale. > No offense, but other less commonly used locales are just a pain to > handle. Think state-dependant encodings. Well, even with UTF-8 it is not so easy to do everything perfectly! I'm not sure how wchar.h deals with there being combining characters in weird spots for instance. That might be something to look at if we ever revisited using it. > So while technically speaking the wide characters API looks like the > obvious choice, I think its cost is a bit high. Consistency is good. > If we start using the wide chars API somewhere, it should be used in all > places where it makes sense. I'm not sure this is an easy task even in > ratpoison. :) > > Handling only UTF-8 as a multibyte locale, the tentative diff below > seems to do the job. *WARNING*: I have barely tested it with your html > testcase. > > Feedback / test reports welcome. Cool! Thanks for writing that. I've tried it out and it works well. I'm in agreement about only worrying about support for UTF-8. After looking at the UTF8 macros, one thought I have is we could improve this to be more conservative about what we accept. For example, only consuming two, three, or four bytes when the first byte indicates that is appropriate, rather than having no limit. I suppose it depends how far we want to go in writing UTF-8 decoding. Anyway, I think it is a big improvement as is. It also still might be good to have a few unit tests. Would you be okay with tests in the form of my third patch? Thanks again! _______________________________________________ Ratpoison-devel mailing list Ratpoison-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/ratpoison-devel