Re: 8-bit transparency in the C locale vs. UTF-8 support (was Re: [dev] [sbase][RFC] Add a simplistic version of tr)

2013-12-25 Thread Silvan Jegen
On Tue, Dec 24, 2013 at 10:31:37PM +, Thorsten Glaser wrote: > Strake dixit: > > >Use wchar.h functions and a sane libc, e.g. musl, which has a pure > >UTF-8 C locale, which ISO C explicitly allows [1]. > > > >The 8-bit clarity what POSIX wants [1] seems nonsense to me, as one > >can use byte

8-bit transparency in the C locale vs. UTF-8 support (was Re: [dev] [sbase][RFC] Add a simplistic version of tr)

2013-12-24 Thread Thorsten Glaser
Strake dixit: >Use wchar.h functions and a sane libc, e.g. musl, which has a pure >UTF-8 C locale, which ISO C explicitly allows [1]. > >The 8-bit clarity what POSIX wants [1] seems nonsense to me, as one >can use byte functions for that, but I may be wrong. ^^ Not always, see

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-12-24 Thread sin
On Tue, Dec 24, 2013 at 01:07:10PM -0500, Strake wrote: > On 24/12/2013, Silvan Jegen wrote: > > So I guess the question boils down to whether you would rather use > > libutf or the standardized, POSIX-locale-dependent wchar.h functions for > > the UTF-8 conversion. I see one advantage of the wch

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-12-24 Thread Strake
On 24/12/2013, Silvan Jegen wrote: > So I guess the question boils down to whether you would rather use > libutf or the standardized, POSIX-locale-dependent wchar.h functions for > the UTF-8 conversion. I see one advantage of the wchar.h functions: > If we use them we could avoid adding an extern

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-12-24 Thread sin
On Tue, Dec 24, 2013 at 05:20:08PM +0100, Silvan Jegen wrote: > So I guess the question boils down to whether you would rather use > libutf or the standardized, POSIX-locale-dependent wchar.h functions for > the UTF-8 conversion. I see one advantage of the wchar.h functions: > If we use them we co

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-12-24 Thread Silvan Jegen
On Thu, Nov 28, 2013 at 12:45:40PM +0200, sin wrote: > On Tue, Nov 26, 2013 at 12:01:01PM -0800, Silvan Jegen wrote: > > If you would rather not take this version, what approach would > > you take for the character set mapping when using UTF-8? A hashmap-, > > or B-tree-based solution or someth

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-30 Thread sin
On Sat, Nov 30, 2013 at 12:38:21PM +0100, Silvan Jegen wrote: > BTW, the most recently updated version of > the library seems to be at https://github.com/cls/libutf/commits/master > and not at http://git.suckless.org/libutf/ for some reason. I'll rebase the github repo and push it at some point so

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-30 Thread Thorsten Glaser
Silvan Jegen dixit: >That sounds reasonable but requires that we convert UTF-8 to UTF-32 >which should not be strictly necessary when we only map one UTF-8 value >to another. Arrgh, no. UTF-8 and UTF-32/UCS-4 are encodings of numerical Unicode codepoints. When working with text documents, you alw
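A minimal sketch of the point made here, that UTF-8 is only an encoding of numerical code points: the decoder below turns one UTF-8 sequence into its code point, which can then be compared or used as an index. It is hand-rolled for illustration and is neither libutf nor the wchar.h interface discussed in the thread; the name utf8_decode and its signature are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from s (at most len bytes)
 * into *cp. Returns the number of bytes consumed, or 0 if the input is
 * empty, truncated, overlong, a surrogate, or above U+10FFFF.
 */
static size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	/* smallest code point that legitimately needs n bytes */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t n, i;
	uint32_t c;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {               /* plain ASCII */
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; c = s[0] & 0x07;
	} else {
		return 0;                /* stray continuation or invalid lead byte */
	}
	if (len < n)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;        /* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return n;
}
```

With this, the two bytes C3 A4 decode to the single code point U+00E4, so a tr-style mapping can be keyed on the code point rather than on the byte sequence.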

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-30 Thread Silvan Jegen
On Thu, Nov 28, 2013 at 12:45:40PM +0200, sin wrote: > On Tue, Nov 26, 2013 at 12:01:01PM -0800, Silvan Jegen wrote: > > Hi > > > > This is a braindead and incomplete implementation of tr that only > > works for one-byte encodings. Do you think it makes sense to use this > > implementation as some

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-30 Thread Silvan Jegen
On Thu, Nov 28, 2013 at 01:24:40PM -0500, Strake wrote: > [..] > > > UTF-32 is an encoding that is identical to the unicode point as far as > > I know. So what I am thinking is that one would either use the UTF-8 > > representation of the Unicode point as an index, or the unicode point > > itself.

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-30 Thread Silvan Jegen
On Thu, Nov 28, 2013 at 07:01:17PM +, Thorsten Glaser wrote: > Silvan Jegen dixit: > > >If I understand correctly you would use mmap to allocate a sparse > >memory area into which we could then directly index (either using > >UTF-8 or UTF-32 indices), right? Since mmap needs a file descriptor

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-29 Thread Silvan Jegen
On Thu, Nov 28, 2013 at 8:21 PM, Gregor Best wrote: >> [...] >> anon = (char*)mmap(NULL, 4096, PROT_READ|PROT_WRITE, >> MAP_ANON|MAP_SHARED, -1, 0); >> >> that probably means it may not be that portable after all. Thanks for >> making me aware of it in any case. >> [...] > > *BSD has

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-28 Thread Gregor Best
> [...] > anon = (char*)mmap(NULL, 4096, PROT_READ|PROT_WRITE, > MAP_ANON|MAP_SHARED, -1, 0); > > that probably means it may not be that portable after all. Thanks for > making me aware of it in any case. > [...] *BSD has it, and one of the Gentoo machines I have access to has it to

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-28 Thread Thorsten Glaser
Silvan Jegen dixit: >If I understand correctly you would use mmap to allocate a sparse >memory area into which we could then directly index (either using >UTF-8 or UTF-32 indices), right? Since mmap needs a file descriptor I think that wouldn’t help much. >Sadly, I do not follow. I recognize tha

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-28 Thread Strake
On 28/11/2013, Silvan Jegen wrote: > On Thu, Nov 28, 2013 at 11:45:33AM -0500, Strake wrote: >> > (either using UTF-8 or UTF-32 indices), right? >> >> I meant Unicodepoints; those are just Unicodecs. > > UTF-32 is an encoding that is identical to the unicode point as far as > I know. So what I am

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-28 Thread Silvan Jegen
On Thu, Nov 28, 2013 at 11:45:33AM -0500, Strake wrote: > > (either using UTF-8 or UTF-32 indices), right? > > I meant Unicodepoints; those are just Unicodecs. UTF-32 is an encoding that is identical to the unicode point as far as I know. So what I am thinking is that one would either use the UTF

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-28 Thread Strake
On 28/11/2013, Silvan Jegen wrote: > If I understand correctly you would use mmap to allocate a sparse > memory area into which we could then directly index Yes. > (either using UTF-8 or UTF-32 indices), right? I meant Unicodepoints; those are just Unicodecs. > Since mmap needs a file descript

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-28 Thread Silvan Jegen
Thanks for the comments! On Tue, Nov 26, 2013 at 11:40 PM, Thorsten Glaser wrote: > Strake dixit: >>On 26/11/2013, Silvan Jegen wrote: >>> If you would rather not take this version, what approach would >>> you take for the character set mapping when using UTF-8? >> >>On Linux, one can easily

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-28 Thread sin
On Tue, Nov 26, 2013 at 12:01:01PM -0800, Silvan Jegen wrote: > Hi > > This is a braindead and incomplete implementation of tr that only > works for one-byte encodings. Do you think it makes sense to use this > implementation as some kind of stopgap-measure until we have a more > robust version of

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-26 Thread Thorsten Glaser
Strake dixit: >On 26/11/2013, Silvan Jegen wrote: >> If you would rather not take this version, what approach would >> you take for the character set mapping when using UTF-8? > >On Linux, one can easily make a sparse array with 1-page granularity >with mmap, and so simply use a (wchar_t [])

Re: [dev] [sbase][RFC] Add a simplistic version of tr

2013-11-26 Thread Strake
On 26/11/2013, Silvan Jegen wrote: > If you would rather not take this version, what approach would > you take for the character set mapping when using UTF-8? On Linux, one can easily make a sparse array with 1-page granularity with mmap, and so simply use a (wchar_t []) or (Rune []), but I'm
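A rough sketch of the sparse-array idea suggested here, under stated assumptions: an anonymous private mapping covers every code point up to U+10FFFF, and the kernel only backs the pages that are actually written. MAP_ANONYMOUS (or MAP_ANON) is not in POSIX, as the replies in this thread note, so this is Linux/BSD-flavoured rather than portable. The names cpmap_new, cpmap_get and cpmap_free are hypothetical, and an entry of 0 is taken to mean "no mapping", so mapping to U+0000 is not representable in this sketch.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON   /* BSD spelling */
#endif

#define NCODEPOINTS 0x110000UL   /* U+0000 .. U+10FFFF */

/* Reserve a zero-filled table with one slot per code point. */
static uint32_t *
cpmap_new(void)
{
	void *p = mmap(NULL, NCODEPOINTS * sizeof(uint32_t),
	               PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Look up cp; untouched (zero) slots fall back to the identity mapping. */
static uint32_t
cpmap_get(const uint32_t *map, uint32_t cp)
{
	if (cp >= NCODEPOINTS || map[cp] == 0)
		return cp;
	return map[cp];
}

static void
cpmap_free(uint32_t *map)
{
	munmap(map, NCODEPOINTS * sizeof(uint32_t));
}
```

A caller would fill in entries directly, for example map[0x00E4] = 0x0061, and pass every decoded code point through cpmap_get. The mapping reserves roughly 4.4 MB of address space, but only the touched pages consume memory.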

[dev] [sbase][RFC] Add a simplistic version of tr

2013-11-26 Thread Silvan Jegen
Hi This is a braindead and incomplete implementation of tr that only works for one-byte encodings. Do you think it makes sense to use this implementation as some kind of stopgap-measure until we have a more robust version of tr? If you would rather not take this version, what approach would y
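For readers without the patch at hand, a tr restricted to one-byte encodings can be built around a 256-entry lookup table. The following is a minimal sketch of that general technique and not the code submitted in this thread; tr_build and tr_apply are hypothetical names. Padding a short second set with its last character is an assumption (POSIX leaves that case unspecified).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: fill table so that table[from[i]] == to[i].
 * Bytes not listed in from map to themselves. If to is shorter than
 * from, its last byte is reused (an assumption, see above).
 */
static void
tr_build(unsigned char table[256], const char *from, const char *to)
{
	size_t i, nfrom = strlen(from), nto = strlen(to);

	for (i = 0; i < 256; i++)
		table[i] = (unsigned char)i;     /* identity by default */
	if (nto == 0)
		return;                          /* nothing to map to */
	for (i = 0; i < nfrom; i++) {
		size_t j = i < nto ? i : nto - 1;
		table[(unsigned char)from[i]] = (unsigned char)to[j];
	}
}

/* Translate len bytes of buf in place through the table. */
static void
tr_apply(const unsigned char table[256], unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		buf[i] = table[buf[i]];
}
```

Because the table is indexed by a single byte, this breaks on multi-byte UTF-8 input, which is the limitation the rest of the thread discusses.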