On Sat, Apr 12, 2014 at 07:19:45PM +0200, Hiltjo Posthuma wrote:
> On Sat, Apr 12, 2014 at 6:58 PM, Silvan Jegen <s.je...@gmail.com> wrote:
> >
> >> I'll also probably rewrite the mmap code to use malloc since it causes
> >> issues on some machines.
> >
> > The reason we used mmap was that it allocates memory only on use. So
> > even if we mmap space for 1'114'112 ints (one for each unicode point)
> > to do the mapping, we do not use
> >
> > (1'114'112 * 4) / (1024 * 1024) = 8.5MB
> >
> > of memory but only the few characters we are actually mapping to set2.
> >
> > Using malloc would result in the memory being allocated regardless
> > of whether we actually map all unicode characters or not, leading to
> > 8.5MB of memory being used in any case (additionally we would need to
> > initialise the memory to zero to make the current algorithm work IIRC,
> > since mmap already does that automatically on access).
> >
> > If there are issues that cannot be worked around though we may not have
> > another choice. What kind of issues are you experiencing?
> >
> 
> Yeah I'm aware of how mmap works and in that way it's quite clever :).
> In the future we might want to support character classes ([:lower:],
> [:alnum:] etc). I'm not sure how well it would fit in with the current
> parsemapping code. For now I'll leave it as is and focus on more
> important things or please feel free to help and improve :)

I think it is one of those 80/20 (pareto principle) cases. The current
approach would likely have to be reworked in order to implement the
character classes. In my experience the current solution covers more
than 99% of my use cases while staying stupidly simple and thus "good
enough." To cover the remaining 1% of use cases would be probably be
quite a lot of effort but I encourage anyone to do so :-)


> Offtopic but related to tr: mbtowc() return code isn't checked at the
> moment, but can be -1 if an invalid character is given as an argument.

True. I would suggest just adding checks and bailing out when the return
code is <0. Maybe something like the following?


From: Silvan Jegen <s.je...@gmail.com>
Date: Sat, 12 Apr 2014 20:50:51 +0200
Subject: [PATCH] Wrap mbtowc to check for errors

---
 tr.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/tr.c b/tr.c
index aad2245..7579b6a 100644
--- a/tr.c
+++ b/tr.c
@@ -46,6 +46,17 @@ handleescapes(char *s)
        }
 }
 
+static int
+xmbtowc(wchar_t *unicodep, const char *s)
+{
+       int rv;
+
+       rv = mbtowc(unicodep, s, 4);
+       if (rv < 0)
+                       eprintf("mbtowc:");
+       return rv;
+}
+
 static void
 parsemapping(const char *set1, const char *set2, wchar_t *mappings)
 {
@@ -64,12 +75,12 @@ parsemapping(const char *set1, const char *set2, wchar_t 
*mappings)
        while(*s1) {
                if(*s1 == '\\')
                        handleescapes(++s1);
-               leftbytes = mbtowc(&runeleft, s1, 4);
+               leftbytes = xmbtowc(&runeleft, s1);
                s1 += leftbytes;
                if(*s2 == '\\')
                        handleescapes(++s2);
                if(*s2 != '\0') {
-                       rightbytes = mbtowc(&runeright, s2, 4);
+                       rightbytes = xmbtowc(&runeright, s2);
                        s2 += rightbytes;
                }
                mappings[runeleft] = runeright;
@@ -85,7 +96,7 @@ maptonull(const wchar_t *mappings, char *in)
 
        s = in;
        while(*s) {
-               leftbytes = mbtowc(&runeleft, s, 4);
+               leftbytes = xmbtowc(&runeleft, s);
                if(!mappings[runeleft])
                        putwchar(runeleft);
                s += leftbytes;
@@ -101,7 +112,7 @@ maptoset(const wchar_t *mappings, char *in)
 
        s = in;
        while(*s) {
-               leftbytes = mbtowc(&runeleft, s, 4);
+               leftbytes = xmbtowc(&runeleft, s);
                if(!mappings[runeleft])
                        putwchar(runeleft);
                else
-- 
1.9.2


Reply via email to