On Sat, Apr 12, 2014 at 07:19:45PM +0200, Hiltjo Posthuma wrote: > On Sat, Apr 12, 2014 at 6:58 PM, Silvan Jegen <s.je...@gmail.com> wrote: > > > >> I'll also probably rewrite the mmap code to use malloc since it causes > >> issues on some machines. > > > > The reason we used mmap was that it allocates memory only on use. So > > even if we mmap space for 1'114'112 ints (one for each unicode point) > > to do the mapping, we do not use > > > > (1'114'112 * 4) / (1024 * 1024) = 8.5MB > > > > of memory but only the few characters we are actually mapping to set2. > > > > Using malloc would result in the memory being allocated regardless > > of whether we actually map all unicode characters or not, leading to > > 8.5MB of memory being used in any case (additionally we would need to > > initialise the memory to zero to make the current algorithm work IIRC, > > since mmap already does that automatically on access). > > > > If there are issues that cannot be worked around though we may not have > > another choice. What kind of issues are you experiencing? > > > > Yeah I'm aware of how mmap works and in that way it's quite clever :). > In the future we might want to support character classes ([:lower:], > [:alnum:] etc). I'm not sure how well it would fit in with the current > parsemapping code. For now I'll leave it as is and focus on more > important things or please feel free to help and improve :)
I think it is one of those 80/20 (pareto principle) cases. The current approach would likely have to be reworked in order to implement the character classes. In my experience the current solution covers more than 99% of my use cases while staying stupidly simple and thus "good enough." To cover the remaining 1% of use cases would be probably be quite a lot of effort but I encourage anyone to do so :-) > Offtopic but related to tr: mbtowc() return code isn't checked at the > moment, but can be -1 if an invalid character is given as an argument. True. I would suggest just adding checks and bailing out when the return code is <0. Maybe something like the following? From: Silvan Jegen <s.je...@gmail.com> Date: Sat, 12 Apr 2014 20:50:51 +0200 Subject: [PATCH] Wrap mbtowc to check for errors --- tr.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/tr.c b/tr.c index aad2245..7579b6a 100644 --- a/tr.c +++ b/tr.c @@ -46,6 +46,17 @@ handleescapes(char *s) } } +static int +xmbtowc(wchar_t *unicodep, const char *s) +{ + int rv; + + rv = mbtowc(unicodep, s, 4); + if (rv < 0) + eprintf("mbtowc:"); + return rv; +} + static void parsemapping(const char *set1, const char *set2, wchar_t *mappings) { @@ -64,12 +75,12 @@ parsemapping(const char *set1, const char *set2, wchar_t *mappings) while(*s1) { if(*s1 == '\\') handleescapes(++s1); - leftbytes = mbtowc(&runeleft, s1, 4); + leftbytes = xmbtowc(&runeleft, s1); s1 += leftbytes; if(*s2 == '\\') handleescapes(++s2); if(*s2 != '\0') { - rightbytes = mbtowc(&runeright, s2, 4); + rightbytes = xmbtowc(&runeright, s2); s2 += rightbytes; } mappings[runeleft] = runeright; @@ -85,7 +96,7 @@ maptonull(const wchar_t *mappings, char *in) s = in; while(*s) { - leftbytes = mbtowc(&runeleft, s, 4); + leftbytes = xmbtowc(&runeleft, s); if(!mappings[runeleft]) putwchar(runeleft); s += leftbytes; @@ -101,7 +112,7 @@ maptoset(const wchar_t *mappings, char *in) s = in; while(*s) { - leftbytes = mbtowc(&runeleft, s, 4); + leftbytes = xmbtowc(&runeleft, s); if(!mappings[runeleft]) putwchar(runeleft); else -- 1.9.2