On Thu, Nov 28, 2013 at 12:45:40PM +0200, sin wrote:
> On Tue, Nov 26, 2013 at 12:01:01PM -0800, Silvan Jegen wrote:
> > If you would rather not take this version, what approach would
> > you take for the character set mapping when using UTF-8? A hashmap-,
> > or B-tree-based solution or something else entirely?
>
> I am not knowledgeable enough about UTF-8 so I can't answer this.
> A B-tree is I think an overkill for sbase. We do not have a nice
> implementation of a hash table in sbase as we did not need it but
> if we go down that path it makes sense to put this in util/ so other
> programs can benefit. Currently we don't have an implementation of
> a singly linked list that we can reuse, but that is trivial enough and
> we've re-implemented it wherever needed (with the minimum set of
> operations needed for each tool). I can send an implementation of
> a hash table that I've used for my own programs, MIT/X licensed and it is
> simple enough.

I played around with the mmap-based approach suggested in this thread
and as far as I can tell it works beautifully. I will post the code as
soon as I have finished testing the program with more diverse inputs.

> Regarding UTF-8, some other programs in sbase also lack proper handling
> of UTF-8. Do you think we could embed libutf8 from suckless.org and
> use it?

In my current implementation I use libutf to convert from UTF-8 to the
corresponding Unicode code points. I just realized that I use putwchar
to print the converted code points, which raises the question of
whether we should drop libutf and use the locale-dependent wchar.h
functions like mbtowc and wctomb instead. IIRC their functionality is
equivalent to libutf's as far as the conversion is concerned, though
according to [1] the POSIX locale seems to suck.

So I guess the question boils down to whether you would rather use
libutf or the standardized, POSIX-locale-dependent wchar.h functions
for the UTF-8 conversion.

I see one advantage of the wchar.h functions: if we use them we could
avoid adding an external dependency to sbase. The disadvantage is that
we would depend on the whole POSIX locale thing, which seems
unnecessarily complicated in places.

What are your thoughts?

(a happy Christmas eve/Hanukkah/Spaghettimonster day btw!)

[1] http://harmful.cat-v.org/software/
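
For comparison, the wchar.h route could look roughly like the sketch
below. It is not the implementation mentioned above; the function name
decode and its return convention are made up for illustration. It uses
mbrtowc (the restartable variant of mbtowc), and it only decodes UTF-8
if the caller has already selected a UTF-8 locale with setlocale, which
is exactly the locale dependency in question. Whether the resulting
wchar_t values are Unicode code points is also platform-dependent (it
holds where __STDC_ISO_10646__ is defined, e.g. with glibc).

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode the NUL-terminated multibyte string s
 * into out, storing at most max wide characters. The encoding is
 * whatever the current LC_CTYPE locale says, so the caller must have
 * called setlocale() with a UTF-8 locale for this to decode UTF-8.
 * Returns the number of characters stored, or (size_t)-1 if the
 * input is invalid or truncated in the current encoding.
 */
size_t
decode(const char *s, wchar_t *out, size_t max)
{
	mbstate_t st;
	size_t len = strlen(s);
	size_t n = 0;
	size_t r;

	memset(&st, 0, sizeof st);	/* initial conversion state */
	while (len > 0 && n < max) {
		r = mbrtowc(&out[n], s, len, &st);
		if (r == (size_t)-1 || r == (size_t)-2)
			return (size_t)-1;	/* invalid or incomplete sequence */
		s += r;
		len -= r;
		n++;
	}
	return n;
}
```

A caller would typically run setlocale(LC_CTYPE, "") once at startup,
then call decode on each input string and print the results with
putwchar. With libutf none of the setlocale part would be needed, at
the cost of the extra dependency.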